What is chaos engineering?
Chaos engineering is the practice of experimenting on a system to build confidence in the system’s ability to weather turbulent conditions in production. In other words, chaos experiments are designed to put systems under enormous strain, intentionally breaking them to discover where their weaknesses are. Armed with this knowledge, developers can fix flaws before they cause the system to break while in production, thus preventing outages and ensuring better user experiences.
The rise of chaos engineering is often attributed to Netflix’s move to an AWS cloud-based infrastructure in 2010. To protect the experience of their customers, Netflix engineers began conducting chaos experiments to ensure they could continue to deliver quality streaming services even when Amazon servers experienced downtime.
Chaos engineering is designed to answer specific questions about the resiliency and functionality of systems, such as:
- What happens when a system has too much traffic or when it’s not available to users?
- What types of cascading errors occur when a single point of failure crashes an application?
- What happens when there are problems with networking?
- What happens when a specific service can’t be accessed or when specific applications go down?
As a result of chaos testing, IT teams can see how systems respond to a variety of pressures in real time. Chaos testing can reveal bugs and weaknesses that other testing methodologies may miss. Chaos experiments also better prepare IT teams to deal with real-world failures, reducing response times when problems occur in production environments.
Benefits of chaos engineering
Chaos engineering offers a number of critical benefits over other types of testing.
- Build confidence in a system’s ability to withstand complex, real-world issues. Chaos testing allows IT and DevOps teams to more accurately identify and fix issues that might not be captured with other types of manual or automated software testing.
- Improve knowledge of system design. Chaos experiments are especially useful for helping IT teams understand and find weaknesses in large, complex systems such as cloud-based applications and services that must often scale rapidly. With this knowledge, IT and DevOps teams can design and build more robust systems.
- Enhance service availability. Through proactive and continual chaos testing, IT teams can reduce unplanned downtime and outages to deliver a better customer experience.
- Protect revenues and improve scalability. By minimizing outages, chaos engineering helps to prevent revenue losses. Chaos tests also provide IT teams with the knowledge they need to design systems to scale rapidly to meet spikes in demand.
Principles of chaos engineering
There are several core principles for chaos engineering that represent best practices.
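- Build a hypothesis. Start by identifying a “steady state”: a control that defines a measurable output of the system and represents typical behavior. Then hypothesize that this steady state will continue in both the control group and the experiment group.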
- Simulate real-world occurrences. Using both a control group and an experiment group, introduce variables based on real-world events like malfunctioning hard drives, severed network connections, spikes in traffic, or servers that crash.
- Run automated experiments continuously. To improve results while minimizing costs, chaos engineering uses automation to continually orchestrate experiments and analyze results.
- Experiment with systems in production. Because systems behave differently in staging vs. production environments, chaos engineering must focus on experimenting with production traffic.
- Minimize the blast radius. Because experimenting in production can adversely impact the customer experience, initial chaos experiments should start with a small area of impact and grow as confidence in the system increases.
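The principles above can be sketched as a small simulated experiment. This is a minimal illustration and not a real chaos tool: `handle_request`, the success probabilities, `blast_radius`, and `tolerance` are all hypothetical names and numbers chosen for the example. The steady state is the request success rate, only a small fraction of traffic is exposed to the fault, and the hypothesis holds if the experiment group stays within the tolerance of the control group.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_success_rate: float
    experiment_success_rate: float
    hypothesis_holds: bool


def handle_request(fault_injected: bool, rng: random.Random) -> bool:
    """Hypothetical service call; returns True when the request succeeds."""
    # Assumed numbers: the fallback path used during a fault succeeds
    # slightly less often than the normal path.
    success_probability = 0.97 if fault_injected else 0.99
    return rng.random() < success_probability


def run_experiment(total_requests: int, blast_radius: float,
                   tolerance: float, seed: int = 0) -> ExperimentResult:
    """Route a small share of traffic through the fault and compare groups."""
    if not 0.0 < blast_radius < 1.0:
        raise ValueError("blast_radius must be strictly between 0 and 1")
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    outcomes = {"control": [], "experiment": []}
    for _ in range(total_requests):
        # Minimize the blast radius: only this fraction sees the fault.
        in_experiment = rng.random() < blast_radius
        group = "experiment" if in_experiment else "control"
        outcomes[group].append(handle_request(in_experiment, rng))
    if not outcomes["control"] or not outcomes["experiment"]:
        raise ValueError("too few requests to fill both groups")
    control_rate = sum(outcomes["control"]) / len(outcomes["control"])
    experiment_rate = sum(outcomes["experiment"]) / len(outcomes["experiment"])
    # Hypothesis: the steady state (success rate) holds despite the fault.
    holds = (control_rate - experiment_rate) <= tolerance
    return ExperimentResult(control_rate, experiment_rate, holds)


if __name__ == "__main__":
    print(run_experiment(total_requests=10_000, blast_radius=0.05,
                         tolerance=0.05))
```

In a real system the simulated `handle_request` would be replaced by production traffic and a real fault, and the blast radius would be widened only after the hypothesis has held at smaller values.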
Typical chaos experiments
The types of chaos tests conducted on a system depend on its architecture and on the goals of the business. Some of the most common tests include:
- Creating sudden spikes in traffic
- Shutting down a virtual machine to see how the system reacts
- Simulating a high CPU load
- Simulating resource exhaustion
- Creating latency between services
- Introducing unreliability in the network
- Simulating a saturation in data storage
- Causing DNS to become unavailable
- Emulating I/O errors
- Breaking the connection between systems and the data center
- Simulating failed components
- Introducing function-based chaos by randomly causing functions to throw exceptions
- Preventing system clocks from syncing
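Two of the tests listed above, function-based chaos and added latency, can be approximated in application code with a small wrapper. The sketch below is one possible approach and not the API of any particular chaos tool: the `chaos` decorator, the `InjectedFault` exception, and `fetch_profile` are hypothetical names introduced for illustration.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the chaos wrapper instead of calling the real function."""


def chaos(failure_rate=0.0, latency_seconds=0.0, enabled=True, rng=None):
    """Decorator that randomly delays or fails calls to the wrapped function."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                if latency_seconds > 0:
                    # Simulate a slow dependency before doing any work.
                    time.sleep(latency_seconds)
                if rng.random() < failure_rate:
                    # Simulate a failed component by raising an exception.
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.1, latency_seconds=0.05)
def fetch_profile(user_id):
    """Hypothetical function standing in for a call to another service."""
    return {"id": user_id, "name": "example"}


if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(fetch_profile(42))
        except InjectedFault as exc:
            print(f"attempt {attempt}: {exc}")
```

The `enabled` flag is there so the wrapper can be switched off without removing it from the code, which keeps the area of impact under control.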
Solutions for chaos engineering from Tricentis
Applications have become more complex and distributed. Combining performance testing with chaos engineering is a powerful way to prepare complex, distributed systems for peak traffic under degraded as well as ideal conditions. Performing load tests in dev/test environments only tests how your application will handle traffic in ideal conditions. However, things break and dependencies fail. Ensure your applications can perform to expectations in both ideal and degraded environments, so that even if something fails, your customers will remain unimpacted and happy with your product.
Using NeoLoad and Gremlin, you can easily simulate large amounts of traffic during common failure scenarios. Identify and improve parts of your system that are prone to failure or are unable to scale efficiently. Monitor how your system degrades during failure scenarios to decide on areas for investment to improve customer uptime.
Together, Tricentis NeoLoad and Gremlin enable testing teams to answer the questions:
- Did autoscaling kick in and handle the extra load?
- Did a small amount of backend latency cascade to a large amount of frontend latency?
- Does a non-critical service experiencing an outage lead to frontend errors or slow performance for end users?
- How would an outage from a third-party provider impact end users?
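The question about backend latency cascading to the frontend can be illustrated with simple arithmetic. The sketch below is a toy model under stated assumptions and does not use NeoLoad or Gremlin: it assumes a frontend request makes several sequential backend calls, and that any call slower than the timeout is retried until every attempt has timed out. The function name, parameters, and numbers are all hypothetical.

```python
def frontend_latency_ms(backend_latency_ms, sequential_calls, timeout_ms,
                        max_attempts, frontend_overhead_ms=20.0):
    """Estimate the latency of one frontend request under a toy model."""
    if backend_latency_ms <= timeout_ms:
        # The call completes within the timeout on the first attempt.
        per_call_ms = backend_latency_ms
    else:
        # Every attempt times out, so the caller waits for all of them.
        per_call_ms = timeout_ms * max_attempts
    return frontend_overhead_ms + sequential_calls * per_call_ms


if __name__ == "__main__":
    for backend_ms in (50, 150, 250):
        total = frontend_latency_ms(backend_ms, sequential_calls=5,
                                    timeout_ms=200, max_attempts=3)
        print(f"backend {backend_ms} ms -> frontend {total} ms")
```

Under these assumptions, raising backend latency from 50 ms to 150 ms adds 500 ms to the frontend request, and crossing the 200 ms timeout pushes it to roughly three seconds. A combined load and chaos test measures this kind of amplification directly instead of estimating it.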
Launch Gremlin attacks from NeoLoad to synchronize performance tests with chaos experiments, automating both and getting more value from combining the two testing suites. The integration between NeoLoad and Gremlin offers an automated way to verify the performance and reliability of systems. Along with NeoLoad, Tricentis offers a suite of testing solutions and test management tools that support continuous integration throughout the software testing lifecycle.