Resiliency Engineering

Resiliency engineering is a methodology for increasing overall system stability, improving how systems respond to component failures, and ensuring that adequate mitigation techniques are implemented and functioning as expected.

Reason for Topic

In modern, distributed software systems, you never know where your weakest link will be. It could be an existing service running low on resources, poor auto-scaling rules, unplanned restarts, or a system patch to infrastructure. Even a newly introduced code change can lie dormant until, under specific conditions, it surfaces and grinds the end-user experience to a halt. Some of the biggest digital companies, not just SMBs and resource-constrained enterprises, suffer from these unknowns.

 

What Could Possibly Go Wrong?

(Ref: https://engineering.grab.com/chaos-engineering) 

 

What do you do about the ‘unknown unknowns’ lurking in your apps and services? How can you separate what’s really ‘unknowable’ from what is simply ‘unknown’ today? 

Resiliency Engineering is a discipline that has emerged as an accepted, often even required, practice in microservice architectures and cloud-native systems. It aims to prove and improve the resilience of a system by causing errors to occur within the system and monitoring for downstream issues. It does not replace other traditional forms of verifying correctness, performance, or security; rather, it adds to the overall readiness of systems when integrated with them, helping teams deliver changes with confidence.

 

Introduction / Definition

Resiliency engineering is a methodology for increasing overall system stability, improving how systems respond to component failures, and ensuring that adequate mitigation techniques are implemented and functioning as expected. Chaos testing is one practice within resiliency engineering, meant to test the resilience of software systems by intentionally introducing chaos or failure scenarios. The objective is to identify vulnerabilities and weaknesses in the system before they cause an outage in the production environment.

Benefits & Examples

A well-managed resiliency program, as part of a broader quality engineering practice, often results in: 

  • Identifying weak spots and scalability problem areas 
  • Building more resilient systems to minimize risk of disaster 
  • Improving operational visibility and promoting collaboration 
  • Enhancing existing testing and infrastructure management processes 

Resiliency engineering can help identify weak spots in a system that may be difficult to discover through other means. With the right planning and guardrails in place, introducing controlled chaos into the system helps engineers observe how it reacts and identify potential areas of concern. Chaos testing can also be used to verify the scalability of a system. By introducing chaos and failure scenarios, engineers can observe how the system responds and identify any bottlenecks or limitations that may prevent the system from scaling effectively. 

 

-> How Is Chaos Testing Related to Resiliency Engineering?

 

As terminology has changed over the past decade, some people may use “Chaos Testing”, “Chaos Engineering”, and “Resiliency Engineering” interchangeably. These days, Resiliency is considered the higher-level, goal-oriented topic, whereas Chaos is one of multiple complementary practices for achieving desired levels of Resiliency.

A great, simple definition of ‘Chaos Engineering’ is: “…thoughtful, planned experiments designed to reveal improvements in our systems”, as Tammy Butow (Statype) puts it.


(Ref: “Bringing Chaos to DevOps – DevOps Unbound EP 23”)

 

More formally, Chaos Testing is a methodology for intentionally introducing controlled chaos or failure scenarios into a system to verify its resilience and identify potential vulnerabilities. The objective is to proactively identify weaknesses in the system before they cause an outage in the production environment. Chaos testing involves simulating real-world scenarios, such as network failures, server crashes, or sudden spikes in traffic, to see how the system responds. The results of these tests can be used to make improvements to the system and enhance its overall resilience. Like the scientific method, the process is hypothesis-driven: each experiment is scoped narrowly to prove or disprove whether a specific reliability factor satisfies the technical requirements of the larger system.

(Ref: Navya Dwarakanath, Performance Engineer, Catchpoint, May 2018) 
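The hypothesis-driven loop described above can be sketched in a few lines of Python. This is a minimal illustration only, not a reference to any particular chaos tool: `run_experiment`, `ExperimentResult`, and the toy `replicas` dictionary are hypothetical names introduced here, and a real experiment would replace the lambdas with calls into actual monitoring and fault-injection tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_before: bool
    held_during_fault: bool
    steady_after: bool

    @property
    def hypothesis_supported(self) -> bool:
        # Supported only if the system was healthy before the fault,
        # stayed within tolerance during it, and was healthy afterwards.
        return self.steady_before and self.held_during_fault and self.steady_after


def run_experiment(
    hypothesis: str,
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one controlled experiment: verify, inject, observe, roll back."""
    if not steady_state():
        # Guardrail: never inject a failure into a system that is already unhealthy.
        return ExperimentResult(hypothesis, False, False, False)

    inject_fault()
    try:
        held_during_fault = steady_state()
    finally:
        # Always undo the fault, even if the observation itself raises.
        rollback()

    return ExperimentResult(hypothesis, True, held_during_fault, steady_state())


# Toy system under test (hypothetical): the service counts as "steady"
# while at least one of its two replicas is up.
replicas = {"a": True, "b": True}

result = run_experiment(
    hypothesis="Losing one replica does not take the service down",
    steady_state=lambda: any(replicas.values()),
    inject_fault=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
print(result.hypothesis_supported)
```

Stating the hypothesis up front and checking steady state before, during, and after the fault is what keeps the exercise an experiment rather than random breakage.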

Resiliency engineering, on the other hand, is a broader approach to engineering reliable and resilient systems. Resiliency engineering focuses on designing and implementing systems that are able to recover quickly from disruptions and continue to function in the face of unexpected events. Resiliency engineering emphasizes the importance of designing systems with built-in redundancy, fault tolerance, and self-healing capabilities. The objective of resiliency engineering is to minimize the impact of disruptions and ensure that the system can continue to operate despite the presence of failures. 
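As one small illustration of built-in fault tolerance and redundancy, the sketch below retries a failing dependency with exponential backoff and then degrades to a redundant source. All names (`call_with_fallback`, `flaky_primary`, `always_down`, `cached_copy`) are hypothetical, and treating `ConnectionError` as the only retryable failure is an assumption made for brevity.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.01,
) -> Tuple[T, str]:
    """Try the primary dependency a few times, then use a redundant fallback."""
    for attempt in range(attempts):
        try:
            return primary(), "primary"
        except ConnectionError:
            # Exponential backoff between retries: base, 2x base, 4x base, ...
            if attempt < attempts - 1:
                time.sleep(base_delay_s * (2 ** attempt))
    return fallback(), "fallback"


calls = {"n": 0}


def flaky_primary() -> str:
    # Simulated dependency that fails twice, then recovers.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated outage")
    return "fresh data"


def always_down() -> str:
    raise ConnectionError("simulated outage")


def cached_copy() -> str:
    return "cached data"


print(call_with_fallback(flaky_primary, cached_copy))  # recovers on the third attempt
print(call_with_fallback(always_down, cached_copy))    # degrades to the fallback
```

Returning which path served the request makes degraded operation visible to monitoring, which is the kind of behavior a chaos experiment can then verify.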

Unlike chaos testing, resiliency engineering also focuses on people systems and processes, not simply their outputs (the systems under test). As Erik Hollnagel puts it, “The focus of resilience engineering is thus resilient performance, rather than resilience as a property (or quality) or resilience in a ‘X versus Y’ dichotomy.” (ref) Hollnagel also proposes that resiliency engineering helps teams “…respond appropriately to both disturbances and opportunities.”

In practice, resiliency engineering for mature software organizations includes: 

  • System Optima and Business Outcome Definitions 
  • Chaos Testing, Detection, and Monitoring 
  • Capacity Planning and Auto-scale Rule Design 
  • Experimentation as a Normative Practice 
  • Identification of System Kinds, Ratings Methodologies, and Mitigation Techniques 
  • Known Paths to Resource Coordination and Clear Communications Procedures 
  • Education, Training and Advocacy of the Above 

In essence, chaos testing is a specific technique that can be used as part of a broader resiliency engineering approach. Injected chaos events (such as terminating instances, interrupting connectivity, or simulating functional faults) help to identify specific points of weakness in a system, while resiliency engineering is focused on designing systems that are inherently resilient and able to withstand unexpected events. Both approaches are important for building reliable and resilient software systems that can continue to operate in the face of disruptions and failures.
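To make “simulating functional faults” concrete, here is a minimal sketch of a fault-injection wrapper. The decorator `inject_faults`, the exception `InjectedFault`, and the function `lookup_price` are hypothetical names used only for illustration. The random source is seeded so the experiment is repeatable.

```python
import random
from functools import wraps
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised in place of a real failure so injected faults are easy to spot."""


def inject_faults(
    failure_rate: float, rng: random.Random
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Wrap a function so that roughly `failure_rate` of its calls fail."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so a rate of 0 never fails and 1 always fails.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


rng = random.Random(42)  # seeded so repeated runs inject the same faults


@inject_faults(failure_rate=0.3, rng=rng)
def lookup_price(sku: str) -> float:
    return 9.99


outcomes = {"ok": 0, "failed": 0}
for _ in range(1000):
    try:
        lookup_price("sku-1")
        outcomes["ok"] += 1
    except InjectedFault:
        outcomes["failed"] += 1

print(outcomes)
```

Using a dedicated exception type keeps injected failures distinguishable from genuine ones when reviewing logs after an experiment.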

 

-> Intentionally Driving Out ‘Predictable Unreliability’

 

Principled engineers know that digital systems, especially those built on complex stacks of other people’s technology, often include a host of unchecked assumptions, constraints, tolerances, and edge cases. Connect these together (as modern app development teams do) and you get ‘predictable unreliability’, an incredibly uncomfortable side-effect of the modern software landscape and one that’s dangerous to ignore. These problems are predictable because we know we are going to run into reliability issues in any complex, distributed system…we just don’t know when or where. Should we simply sit back and comfortably wait until these problems occur? 

Predictable Unreliability – lots to do before getting to true “Unknown-Unknowns” 

(Ref: “5 Myths and Anti-patterns to Refactor Out with a Continuous Performance Mindset”, Paul Bruce, EuroStar 2022)
 

By testing a system’s resilience to chaos and failure, we (engineers) can build more resilient systems that withstand unexpected events. Chaos testing enhances existing functional and non-functional testing processes by introducing new scenarios and failure modes that may not have been considered before. This improves the overall quality of the system and reduces the likelihood of outages due to unforeseen ‘contributing factors’. Chaos testing can be used to validate that disaster recovery procedures are effective and can be executed quickly in the event of a real disaster. 

Resiliency engineering practices often help teams better understand and interpret system telemetry (e.g., monitoring and observability) by providing greater context into how the system behaves under stress. This applies to production environments, but also to pre-release testing and development work in lower environments. These practices can also promote collaboration between teams by providing a common goal and a shared understanding of the system’s behavior, which helps break down silos and encourages cross-functional work.

 

-> Which Systems Are (More) Resilient to Change?

 

Systems built around key principles such as failover/fallback scenarios, scalability, recovery time, and performance are far more likely to have the proper preventative mechanisms and team competencies in place, so as not to fall prey to instabilities and emergent [bad] behaviors.

Not only do the system architectures and technologies have to change, the development and delivery practices must as well…but this often takes time. Legacy “big bang” deployment processes and highly coordinated (often manual) rollouts are signs that things likely need to be decomposed to smaller and more frequently changeable constituent parts. Loosely coupled architecture approaches such as asynchronous event-driven messaging and serverless frameworks may help to break apart monoliths, but also lead to untracked or poorly diagnosable interdependence over time.  

In either case, a well-formed Resiliency Engineering approach helps to ensure that systems built in either fashion have as few ‘knowable unknowns’ as possible when delivered to production users.

(Ref: Considerations on Resiliency Chaos Engineering) 

 

Drawbacks / Gotchas

As with all approaches to problems in technology, there are challenges both when introducing and when maturing your resiliency engineering practices. A few common issues are:

  • Big Splash: Overcommitting to a big, new ‘Chaos Engineering’ initiative without technical maturity can affect production users and lead to downtime and loss of reputation
  • Poor Mapping: Lack of authority, authorization, cooperation, and trust when investigating resiliency issues slows or even blocks improvement efforts 
  • Inflammation: Injecting failures into environments that already have known problems doesn’t help
  • No ‘normal’ state: Not understanding what a system’s “normal” looks like leads to misunderstandings, confusion across teams, and ultimately non-optimal use of resources to improve resiliency
  • Poor communication and collaboration: While you may eventually want to test your engineers and their response time to incidents, you are likely to build more trust by starting with planned scenarios that are well coordinated, controlled, and communicated about before and afterwards

(Ref: “Let’s Get Ready for Chaotic Engineering”, Suzan Mahboob, DevOpsDays Boston 2019) 

In a 2019 presentation, TD Cloud Site Reliability Engineer Suzan Mahboob said, “Getting to a healthy, normal state is what I consider pre-chaotic engineering.” In essence, identifying and addressing known and fixable problems already encountered isn’t just a practical approach to improving resiliency; it’s critical to do before introducing (even controlled) chaos into the situation so that you’re not confounding contributing factors.
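One way to make a “healthy, normal state” measurable before any experiment is to summarize a key metric from a quiet period and compare new readings against it. The sketch below is a simplified assumption rather than a recommended standard. `build_baseline` and `is_normal` are hypothetical helpers, the latency samples are made up, and the tolerance of three standard deviations is an arbitrary illustrative choice.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize 'normal' as the mean and sample standard deviation of a metric."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to describe normal")
    return mean(samples), stdev(samples)


def is_normal(value: float, baseline: Tuple[float, float], tolerance: float = 3.0) -> bool:
    """A reading is normal if it lies within `tolerance` standard deviations of the mean."""
    center, spread = baseline
    return abs(value - center) <= tolerance * spread


# Made-up request latencies (ms) collected during a quiet period.
latencies_ms = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
baseline = build_baseline(latencies_ms)

print(baseline)
print(is_normal(104.0, baseline))  # small deviation, still within tolerance
print(is_normal(150.0, baseline))  # far outside tolerance: fix this before injecting chaos
```

If current readings already fall outside the baseline, the system is not in a normal state, and any failure injected on top would make contributing factors harder to separate.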

Another proposed talk title was “You Are Not Ready for Chaos Engineering”, which, coming from an expert on the topic, highlights how often teams go overboard on new initiatives. In other words, when injecting failures into critical points in complex distributed system architectures, you probably want to start in lower environments, or on duplicates in parity with production, rather than immediately applying chaos testing tactics to end-user-facing systems. This allows teams to build up competence with the behaviors of the system as a whole, both in its ‘normal state’ and under controlled chaos experiments.

 

Summary

Resiliency engineering is a critical practice in modern software quality engineering because it helps to ensure that both systems and processes can continue to function even in the face of unexpected events or failures. It promotes a proactive approach to mitigating these issues and can help organizations ensure that their systems remain reliable and functional over time. While it does not come without its own considerations and costs, a well-formed approach to resiliency can help minimize the impact of incidents, reduce downtime and maintenance costs, reduce technical debt, and improve team velocity…it can even lead to increased customer satisfaction. Organizations are encouraged to implement resiliency improvement practices with proper expertise, management, and internal advocacy to extend their benefits to all software teams.