Organisations are increasingly concerned about reliability. To improve it, they must ensure resiliency alongside other factors such as functionality, performance, stability, usability, and accessibility. Reliability is now a top priority because it ensures a superior end-user experience, which in turn builds trust and confidence. As a result, organisations can establish their own brand identity and survive for a longer period.
Organisations now conduct continuous chaos testing to ensure resilience and to understand how the system handles failure, which leads to better reliability. Continuous chaos testing not only ensures resiliency but also helps teams learn from failures, and that learning improves reliability in turn. In this blog, I will talk about how continuous resiliency testing helps application and product owners learn steadily from failures.
Resiliency Testing- deliberately introducing failures:
Resilience is the ability to recover quickly and return to the original state after experiencing sudden difficulties. Resilience testing observes how the system behaves under extreme conditions, how quickly it bounces back to its normal state, and how gracefully it recovers from failures.
The overall objective is to deliberately introduce failures into the system, test its response, and discover flaws well before they cause downtime. This means executing failure (chaotic) scenarios and observing the end-to-end system thoroughly to identify potential errors before they affect real users.
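To make the idea of deliberate failure injection concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `fetch_price`, `fail_first`, and `call_with_retry` are hypothetical names, and the "failure" is a simple exception rather than a real outage. The wrapper forces the first few calls to fail so that we can observe whether the caller recovers.

```python
class InjectedFailure(Exception):
    """Raised by the injector in place of a real outage."""


def fail_first(func, n):
    """Wrap func so that its first n calls fail deliberately."""
    state = {"calls": 0}

    def wrapper(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] <= n:
            raise InjectedFailure(f"injected failure on call {state['calls']}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts):
    """Call func up to `attempts` times; return (result, failures_seen)."""
    failures = 0
    for _ in range(attempts):
        try:
            return func(), failures
        except InjectedFailure:
            failures += 1
    raise RuntimeError(f"no recovery after {attempts} attempts")


def fetch_price():
    """Hypothetical dependency call used only for this illustration."""
    return 42


# Inject two failures and observe whether the caller bounces back.
result, failures = call_with_retry(fail_first(fetch_price, 2), attempts=5)
print(result, failures)  # prints: 42 2
```

If the retry budget were smaller than the number of injected failures, `call_with_retry` would raise instead, which is exactly the kind of flaw this testing is meant to surface before real users see it.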
Continuous Resiliency Testing:
Resilience testing should be a continuous process that starts at requirement gathering and continues even after deployment. Continuous resiliency testing helps find and fix issues, errors, and defects before they reach production, which helps reduce downtime. It is also highly cost-effective when both its tangible and intangible impacts are considered. In a nutshell, continuous resiliency testing helps create a reliable system with an enhanced end-user experience.
Continuous Resiliency Testing- finding & fixing from failures:
A typical process that every organisation should follow is to find potential issues through continuous resiliency testing, using planned and careful failure scenarios with continuous monitoring of the system, then fix those issues, and later re-execute the scenarios to confirm the fix before the system goes live. Repeating this cycle of testing, analysing, and fixing creates a reliable, fault-tolerant, resilient system that can handle unexpected real-life events.
Identifying planned and thoughtful failure scenarios is critical to this process. The scenarios should be finalised with a whole-team approach, through several rounds of discussion based on the system architecture, interfaces, dependencies, and so on. Typical examples of failure scenarios are:
Resource exhaustion failures:
- CPU- Creates high load on one or multiple CPU cores
- Memory- Allocates varying amounts of RAM
- I/O- Increases reads/writes on I/O devices such as hard disks
- Disk- Writes files to disk until a specified percentage is filled
- User Load- Creates user load beyond stress levels
- Spike- Creates a massive transactional volume
State failures:
- Shutdown- Performs a shutdown on one or multiple nodes (fail-over)
- Time Travel- Modifies the host’s system time
- Process Killer- Kills any specified process
Network failures:
- Network latency- Injects latency in network traffic
- DNS failure- Blocks access to DNS servers
- Blackhole- Drops all matching network traffic
- Packet Loss- Causes packet loss in network traffic
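One way to make such a list reviewable by the whole team is to keep it as a small catalogue in code. The Python sketch below is an assumption on my part, not a prescribed format: the `FailureScenario` class, the category labels, and every parameter value are hypothetical and only show the shape such a catalogue could take.

```python
from dataclasses import dataclass, field


@dataclass
class FailureScenario:
    name: str
    category: str
    effect: str
    params: dict = field(default_factory=dict)  # illustrative values only


CATALOG = [
    FailureScenario("cpu", "resource", "high load on CPU cores", {"cores": 2}),
    FailureScenario("memory", "resource", "allocates RAM", {"megabytes": 512}),
    FailureScenario("io", "resource", "extra reads/writes on I/O devices"),
    FailureScenario("disk", "resource", "fills disk", {"fill_percent": 80}),
    FailureScenario("user_load", "resource", "user load beyond stress levels"),
    FailureScenario("spike", "resource", "massive transactional volume"),
    FailureScenario("shutdown", "state", "shuts down nodes", {"nodes": 1}),
    FailureScenario("time_travel", "state", "modifies host system time"),
    FailureScenario("process_killer", "state", "kills a specified process"),
    FailureScenario("latency", "network", "adds latency", {"delay_ms": 200}),
    FailureScenario("dns", "network", "blocks access to DNS servers"),
    FailureScenario("blackhole", "network", "drops matching traffic"),
    FailureScenario("packet_loss", "network", "drops packets", {"percent": 10}),
]


def by_category(catalog):
    """Group scenario names by category, preserving catalogue order."""
    groups = {}
    for scenario in catalog:
        groups.setdefault(scenario.category, []).append(scenario.name)
    return groups


for category, names in by_category(CATALOG).items():
    print(f"{category}: {', '.join(names)}")
```

Keeping the scenarios as data makes it easier to discuss them against the architecture and to decide which ones to run in each round.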
By integrating resilience testing into CI/CD pipelines, organisations can build more resilient systems and gain confidence in the face of unforeseen scenarios. Continuous resiliency testing should also be conducted in production, in a more controlled and careful way, to achieve a resilient, fault-tolerant, and reliable system with increased availability.
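As a rough illustration of pipeline integration, a resilience stage can reduce its findings to a pass/fail exit code that the CI/CD tool understands. The Python sketch below assumes a hypothetical `gate` function and made-up check names; how the checks themselves are run depends on your chaos tooling and is not shown here.

```python
import sys


def gate(results):
    """Return 0 if every resilience check passed, 1 otherwise."""
    failed = [name for name, passed in results.items() if not passed]
    for name in failed:
        print(f"resilience check failed: {name}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical outcomes collected from an earlier chaos run.
results = {
    "survives_node_shutdown": True,
    "tolerates_200ms_latency": True,
    "recovers_from_dns_failure": False,
}

exit_code = gate(results)
print(exit_code)  # prints: 1
# In a real pipeline step, the script would end with sys.exit(exit_code)
# so that a failed check stops the build.
```

A non-zero exit code is the conventional way for a pipeline step to signal failure, so this pattern works with most CI/CD tools without tool-specific code.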
Continuous Resiliency Testing- learning from failures:
Continuous resiliency testing helps find and resolve issues well in advance by executing thoughtful failure (chaotic) scenarios. It also helps teams learn from failures while executing those scenarios. Typical steps for failure injection are:
- Create a hypothesis
- Inject failures
- Measure impact
- Verify hypothesis
- Learn from failures
And do this continuously.
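The five steps above can be sketched as a small Python loop. This is a toy model under stated assumptions: `handle_request` is a hypothetical service that falls back to a cache when its dependency is down, and the thresholds and cache sizes are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    hypothesis: str
    error_rate: float
    hypothesis_held: bool
    lesson: str


def handle_request(request_id, dependency_up, cached_ids):
    """Toy service: falls back to a cache when its dependency is down."""
    if dependency_up:
        return True
    return request_id in cached_ids


def run_experiment(max_error_rate, cached_ids, total_requests=100):
    # 1. Create a hypothesis.
    hypothesis = (
        f"error rate stays at or below {max_error_rate:.0%} "
        "while the dependency is down"
    )
    # 2. Inject the failure: the dependency is forced down for every request.
    outcomes = [
        handle_request(i, dependency_up=False, cached_ids=cached_ids)
        for i in range(total_requests)
    ]
    # 3. Measure the impact.
    error_rate = outcomes.count(False) / total_requests
    # 4. Verify the hypothesis.
    held = error_rate <= max_error_rate
    # 5. Learn from the failure.
    lesson = (
        "the fallback absorbs the outage"
        if held
        else "fallback coverage is too low; widen the cache and re-run"
    )
    return ExperimentResult(hypothesis, error_rate, held, lesson)


# Round 1: the cache covers 60 of 100 requests, so the hypothesis fails.
first = run_experiment(max_error_rate=0.10, cached_ids=set(range(60)))
# Round 2: after the fix the cache covers 95 requests, so it holds.
second = run_experiment(max_error_rate=0.10, cached_ids=set(range(95)))

for round_result in (first, second):
    print(round_result.hypothesis_held, round_result.lesson)
```

Running the same experiment again after the fix is what turns a one-off test into the continuous cycle described above.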
Continuous resiliency testing tells whether a system has a problem that requires immediate attention or is resilient under a given failure (chaotic) scenario. In a nutshell, learning from these failures gives a better understanding of the system’s resiliency and reliability. Conducting continuous resiliency testing seriously contributes to a high-quality product.
Continuous resiliency testing not only provides more insight into the application but also helps formulate the failure (chaotic) scenarios and the blast radius for future rounds. A typical process involves:
- Plan the experiment/failures
- Execute with minimized blast radius
- Observe and fix issues, if any
- Increase blast radius for future rounds
- Learn from failures
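The process above can be sketched in Python as a loop that widens the blast radius only while the system stays healthy. Everything here is an assumption for illustration: the host names, the percentage steps, and the `is_healthy_after` check, which in practice would come from your monitoring.

```python
def pick_targets(hosts, percent):
    """Select at least one host, rounding the percentage up."""
    count = max(1, (len(hosts) * percent + 99) // 100)
    return hosts[:count]


def escalate(hosts, percents, is_healthy_after):
    """Run rounds with a growing blast radius; stop at the first unhealthy one.

    Returns (completed_percents, stopped_at), where stopped_at is None
    when every round passed.
    """
    completed = []
    for percent in percents:
        targets = pick_targets(hosts, percent)
        if not is_healthy_after(targets):
            return completed, percent  # observe, then fix before going wider
        completed.append(percent)
    return completed, None


hosts = [f"node-{i}" for i in range(20)]


def is_healthy_after(targets):
    """Hypothetical check: the system tolerates losing up to 5 nodes."""
    return len(targets) <= 5


completed, stopped_at = escalate(hosts, [5, 25, 50], is_healthy_after)
print(completed, stopped_at)  # prints: [5, 25] 50
```

Starting small limits the damage if the hypothesis is wrong, and the round at which escalation stops shows where the next fix is needed.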
In a complex system, failure can occur in any component, and executing all permutations and combinations of failures is impossible. Continuous resilience testing nevertheless reveals many details about the system. Learning from the test observations and analysis helps identify the components most likely to fail. Overall, these learnings add value to the organisation, which in turn helps it endure in the market for longer.
Continuous Resiliency Testing- Mandatory:
Outages can happen at any time, so you need to prepare for them. Making continuous resiliency testing mandatory in the overall software development life cycle helps prepare for outages. First, it is necessary to continuously inject failures (even randomly) into the system and its services in a production-like environment, such as a performance, staging, or pre-production environment. Integrating resilience testing into CI/CD pipelines is better still. As a result, mandatory continuous resiliency testing leads to a more resilient production environment with fewer outages.
In a nutshell, continuous resilience testing helps predict failures well in advance through continuous failure injection, thorough monitoring of the whole system, and resolution of any issues found. It also helps teams learn from failures, prepare for the worst, and stay in business in the long run.
Know Our Writer
ARUN KUMAR DUTTA
Associate Principal- Performance & Resilience Engineering
Arun has 16 years of experience managing end-to-end performance testing delivery. He has been selected for multiple international testing conferences and global webinars. Several of his blogs have been published on global testing forums and have won global awards. He currently works as Associate Principal- Performance & Resiliency Engineering at Larsen & Toubro Infotech Ltd. He also participated in Super Reads 2020, and his blog article was also published by Synapse QA - Learn | Share | Grow.