Chaos engineering is the practice of injecting failures into your production systems in a controlled manner to identify weaknesses in your applications. In order to build, run, and report on chaos experiments Sylvain Hellegouarch created the Chaos Toolkit. In this episode he explains his motivation for creating the toolkit, how to use it for improving the resiliency of your systems, and his plans for the future. He also discusses best practices for building, running, and learning from your own experiments.
This episode of Podcast.__init__ is brought to you by Clubhouse, the first project management platform for software development that brings everyone together so that teams can focus on what matters – creating products their customers love. Clubhouse provides the perfect balance of simplicity and structure for better cross-functional collaboration. Its fast, intuitive interface makes it easy for people on any team to focus-in on their work on a specific task or project, while also being able to “zoom out” to see how that work is contributing towards the bigger picture. With a simple API and robust set of integrations, Clubhouse also seamlessly integrates with the tools you use everyday, getting out of your way so that you can deliver quality software on time.
Listeners of Podcast.__init__ can sign up for two free months of Clubhouse by visiting pythonpodcast.com/clubhouse.
Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected])
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Sylvain Hellegouarch about Chaos Toolkit, a framework for building and automating chaos engineering experiments
- How did you get introduced to Python?
- Can you start by explaining what Chaos Engineering is?
- What is the Chaos Toolkit and what motivated you to create it?
- How does it compare to the Gremlin platform?
- What is the workflow for using Chos Toolkit to build and run an experiment?
- What are the best practices for building a useful experiment?
- Once you have an experiment created, how often should it be executed?
- When running an experiment, what are some strategies for identifying points of failure, particularly if they are unexpected?
- What kinds of reporting and statistics are captured during a test run?
- Can you describe how Chaos Toolkit is implemented and how it has evolved since you began working on it?
- What are some of the most challenging aspects of ensuring that the experiments run via the Chaos Toolkit are safe and have a reliable rollback available?
- What have been some of the most interesting/useful/unexpected lessons that you have learned in the process of building and maintaining the Chaos Toolkit project and community?
- What do you have planned for the future of the project?
Keep In Touch
- Chaos Toolkit
- Chaos IQ
- Gremlin chaos engineering service
- Russ Miles Chaos IQ co-founder
- CherryPy minimalist Python web framework
- Cherrypy Essentials book
- Chaos Engineering
- Chaos Engineering Book
- SRE (Site Reliability Engineering)
- Dark Debt
- Netflix Simian Army
- Chaos Monkey
- Istio service mesh
- Chaos Platform
- Composition vs Inheritance
- Open Chaos Initiative