In today’s digital era, where downtime can mean a shutdown of the business, it is imperative to build resilient cloud structures. During the pandemic, for example, IT maintenance teams could no longer be on-premises to reboot a server in the datacenter. If on-premises hardware goes down, access to data and software is blocked, productivity halts, and the business loses money. One solution is to move your IT operations to cloud infrastructure, where remote teams provide secure, round-the-clock technical support. The cloud, essentially, acts as a savior here.

Companies are now making full use of the cloud’s potential, so observability and resilience of cloud operations have become imperative: downtime now equates to disconnection and business loss.

A cloud failure in today’s technology-driven business economy would be disastrous. Any fault or disruption can set off a domino effect that hampers the performance of the company’s systems. Hence, it becomes essential for organizations to build resilience into their cloud structures through chaotic and systematic testing. In this blog, I will take you through what resilience and observability mean, and why resilience and chaos testing are vital to avoid downtime.

To avoid cloud failure, enterprises must build resilience into their cloud architecture by testing it in continuous and chaotic ways.

1.) Observability

Observability can be understood through two lenses. One would be through Control Theory, that explains observability as the process of understanding the state of a system through the inference of its external outputs. Another lens explains the discipline and the approach of observability as being built to gauge uncertainties and unknowns.

Observability helps you understand the internal properties of a system or an application. For cloud computing it is a prerequisite, and it builds on end-to-end monitoring across various domains, scales, and services. Observability shouldn’t be confused with monitoring. Monitoring detects problems and anomalies in applications and tells you when something goes wrong, whereas observability helps you understand why it went wrong, that is, the root cause. They each serve a different purpose but certainly complement one another.
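
To make the distinction concrete, here is a minimal Python sketch. The event records, field names, and the 25% threshold are all assumptions made up for illustration; they do not come from any particular monitoring product. The threshold check plays the role of monitoring, while slicing the same events by an arbitrary field is the kind of question observability lets you ask.

```python
from collections import Counter

# Hypothetical request events; the field names are illustrative only.
EVENTS = [
    {"service": "checkout", "region": "eu-west", "status": 200},
    {"service": "checkout", "region": "us-east", "status": 503},
    {"service": "checkout", "region": "us-east", "status": 503},
    {"service": "search", "region": "eu-west", "status": 200},
]


def error_rate(events):
    """Fraction of events that ended with a 5xx status."""
    if not events:
        return 0.0
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)


def alert_fires(events, threshold=0.25):
    """Monitoring: a fixed threshold tells you THAT something is wrong."""
    return error_rate(events) > threshold


def errors_by(events, field):
    """Observability: slice failures by any recorded field to ask WHY."""
    return Counter(e[field] for e in events if e["status"] >= 500)
```

With this sample data the alert fires, and grouping the failures by `region` or `service` narrows the problem down to one region and one service. The richer the recorded context, the more of these questions you can answer without shipping new code.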

Observability and resilience are both needed for cloud systems to ensure less downtime and faster delivery of applications.

2.) Resilience

Every enterprise migrating to cloud infrastructure should test its systems for stability, reliability, availability, and resilience, with resilience at the top of the hierarchy. Stability ensures that systems and servers do not crash often; availability ensures system uptime by distributing applications across different locations to spread the workload; reliability ensures that cloud systems function correctly and consistently. But if the enterprise wants to tackle unforeseen problems, then constantly testing resilience becomes indispensable.

Resilience starts from the expectation that something will go wrong, and means the system has been tested so that it can adapt and recover when it does. Resilience isn’t achieved automatically. A resilient system acknowledges the complexity of its environment and progressively takes steps to counter errors. It requires constant testing to reduce the impact of a problem or a failure. Continuous testing helps avoid cloud failure and supports higher performance and efficiency.

Resilience can be achieved through Site Resilient Design and by leveraging systematic testing approaches such as chaos testing.

What is conventional testing and why is it not enough?

Conventional testing ensures a seamless setup and migration of applications into cloud systems, and additionally checks that they perform and work efficiently. This is adequate to ensure that the cloud system does not change application performance and that it functions in accordance with design considerations.

However, conventional testing is not enough because it rarely uncovers hidden architectural issues and anomalies. Some faults lie dormant and only become visible when specific conditions are triggered.

High availability promises of cloud

“We see a faster rate of evolution in the digital space. Cloud lets us scale up at the pace of Moore’s Law, but also scale out rapidly and use less infrastructure,” says Scott Guthrie on the future and high promises of cloud. With the pandemic forcing everyone to work from home, there has been a surge in cloud investments. But this unprecedented demand also led hyperscalers to bring in throttling and prioritization controls, which goes against the on-demand elasticity principle of public cloud.

Public cloud isn’t invincible when it comes to outages and downtime. For example, the recent Google outage that halted multiple Google services such as Gmail and YouTube shows that public cloud isn’t free of system downtime either. Hence, I would say the pandemic has added a few perspectives to resilient cloud systems:

  1. The system must keep operating smoothly, with unchanged behavior, even when it receives an unexpected surge in online traffic.
  2. The system must find alternate ways to manage its functionality and resource pool when additional resource allocation requests are declined or throttled by the cloud provider.
  3. The system should remain accessible and secure for users in unknown locations and during the shift to hybrid work environments (a number of endpoints may sit outside the network firewall).
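
The second point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ThrottledError`, `request_with_fallback`, and both callables are hypothetical names I made up, not part of any cloud provider’s SDK. The idea is to retry a declined scale-out request with exponential backoff and, if the provider keeps throttling, fall back to an alternate way of handling the load.

```python
import time


class ThrottledError(Exception):
    """Hypothetical error: the provider declined or throttled a request."""


def request_with_fallback(request_capacity, fallback, retries=3,
                          base_delay=0.01, sleep=time.sleep):
    """Try to obtain capacity; back off between attempts, then degrade."""
    for attempt in range(retries):
        try:
            return request_capacity()
        except ThrottledError:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay...
            sleep(base_delay * (2 ** attempt))
    # Every attempt was throttled: use the alternate path instead of failing.
    return fallback()


def always_throttled():
    """Stand-in for a scale-out call that the provider keeps declining."""
    raise ThrottledError("quota exceeded")


def shed_low_priority_traffic():
    """Stand-in for an alternate strategy using the existing resource pool."""
    return "degraded: serving priority traffic from existing pool"


result = request_with_fallback(always_throttled, shed_low_priority_traffic)
```

Passing `sleep` as a parameter is a design choice that keeps the function easy to test without real delays. What the fallback actually does (shedding load, queueing, switching region) depends entirely on your application.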

The pandemic has highlighted the value of continuous and chaotic testing, even for resilient cloud systems. A resilient and thoroughly tested system will be able to manage that extra traffic in a secure, seamless, and stable way. To detect the unknowns, chaos testing and chaos engineering are needed.

Why Cloud Native Application design alone is not sufficient for Resiliency

In the public cloud world, architecting for application resiliency is more critical due to the gaps in base capabilities provided by cloud providers, multi-tier/multiple technology infrastructure and distributed nature of cloud systems. This can cause cloud applications to fail in unpredictable ways even though the underlying infrastructure availability and resiliency is provided by the cloud provider.

To establish a good base for application resiliency, the cloud engineers during design should adopt the following strategies to test, evaluate and characterize application layer resilience:

  1. Leverage Well Architected Framework for overall Solution Architecture and adopt the Cloud native capabilities for availability and disaster recovery.
  2. Collaborate with Cloud architects and Technology architects to define availability goals and derive application and database layer resilience attributes.
  3. Along with Threat modelling, define hypothetical failure modes based on expected or observed usage patterns and establish a testing plan for these failure modes based on business impact.

By adopting an architecture-driven testing approach, organizations can gain insights into the base level of cloud application resiliency well before going live, and allot sufficient time for performance remediation activities. But you would still need to test the application for unknown failures and for combinations of multiple failure points in the cloud native application design.

Chaos Testing and Engineering

Chaos testing is an approach that intentionally induces stress and anomalies into the cloud structure to systematically test the resilience of the system.

Firstly, let me make it clear that chaos testing is not a replacement for conventional testing. It’s just another way to uncover errors. By introducing degradations into the system, IT teams can see what happens and how it reacts. Most importantly, it helps them find the gaps in the observability and resilience of the system, that is, the things that initially went under the radar.

This robust testing approach was pioneered by Netflix during its migration to cloud systems back in 2011, and the method has since become well established. Chaos testing brings inefficiencies to light and pushes the development team to change, measure, and improve resilience; it also helps cloud architects better understand and change their design.

Constant, systematic, and chaotic testing increases the resilience of cloud infrastructure, ultimately boosting the confidence of managerial and operational teams in the systems that they’re building.

A resilient enterprise must create resilient IT systems partly or entirely on cloud infrastructure.

Using chaos engineering and site reliability engineering helps enterprises to be resilient across:

a.) Cloud and infrastructure resilience

b.) Data resilience via continuous monitoring.

c.) User and Customer experience resilience by ensuring user interfaces hold up under high stress conditions

d.) Resilient cybersecurity by integrating security with governance and control mechanisms.

e.) Resilient support for Infra, App and Data

To establish complete application resiliency, in addition to the cloud application design aspects mentioned earlier, the solution architect needs to:

  • Adopt architecture patterns that allow you to inject specific faults to trigger internal errors which simulate failures during the development and testing phase.

Some of the common examples of fault triggers are delay in response, resource hogging, network outages, transient conditions, extreme actions by users and many more.

  1. Plan for continuous monitoring and management, and automate the incident response for commonly identified scenarios
  2. Establish Chaos testing framework and environment
  3. Inject faults with varying severity and combination and monitor application layer behavior.
  4. Identify anomalous behavior and iterate the above steps to confirm criticality.
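
One way to make an application “injectable” during development and testing is a small wrapper around calls to a dependency. The sketch below is an assumption-laden illustration, not a reference to any chaos tool’s API: `inject_faults`, `InjectedFault`, and `fetch_profile` are hypothetical names. It covers two of the triggers listed above, a delayed response and a transient failure, each with adjustable severity.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during a chaos test."""


def inject_faults(latency_s=0.0, error_rate=0.0, enabled=lambda: True,
                  rng=random.random, sleep=time.sleep):
    """Decorator that adds optional delay and random failures to a call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_s > 0:
                    sleep(latency_s)  # simulate a delayed response
                if rng() < error_rate:
                    # simulate a transient error from the dependency
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical dependency call: 50 ms of extra latency, 20% failure rate.
@inject_faults(latency_s=0.05, error_rate=0.2)
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}
```

The `enabled` callable is there so the faults can be switched off outside test environments, for example by reading a feature flag. Varying `latency_s` and `error_rate`, alone and in combination, corresponds to step 3 in the list above.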

How To Perform The Chaos Test?

Chaos testing can be done by introducing an anomaly into any of the seven layers of the cloud structure, which helps you assess the impact on resilience.

When Netflix announced its resiliency tool, Chaos Monkey, in 2011, many development teams adopted it for chaos engineering. Another tool, Gremlin, essentially does the same thing. But if you’re looking to perform a chaos test in the current context of COVID-19, you can do so with a GameDay. This simulates an anomaly in which there is a sudden increase in traffic; for example, many customers accessing a mobile application at the same time. The goal of a GameDay is not just to test the resilience but also to enhance the reliability of the system.

The steps you need to take for successful chaos testing are the following:

  1. Identify: Identify key weaknesses within your system and create a hypothesis along with an expected outcome. Engineers need to identify and assess what kind of failures to inject within the hypothesis framework.
  2. Simulate: Inject anomalies in production based on real-life events. This ensures that you include situations that may actually happen within your systems. This could entail an application or network disruption or a node failure.
  3. Automate: You need to automate these experiments so they run on a schedule, for example every hour or every week. This ensures continuity, a determining factor in chaos engineering.
  4. Continuous Feedback & Refine: There are two outcomes to your experiment. It could either assure resilience or detect a problem that needs to be solved. Both are good results from which you can take feedback to refine your system.
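
The four steps above can be sketched as a tiny experiment runner. Everything here is a toy under stated assumptions: the `Experiment` class, `run_experiment`, and the two-replica “system” are hypothetical and exist only in this example. A real setup would inject faults into actual infrastructure and measure a real steady-state metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    inject: Callable[[], None]        # Simulate: introduce the fault
    rollback: Callable[[], None]      # always undo the fault afterwards
    steady_state: Callable[[], bool]  # Identify: the hypothesis to verify


def run_experiment(exp):
    """Run one experiment and return a small report for feedback."""
    if not exp.steady_state():
        # Do not add chaos to a system that is already unhealthy.
        return {"name": exp.name, "outcome": "skipped: unhealthy before injection"}
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        exp.rollback()
    return {"name": exp.name, "outcome": "resilient" if held else "weakness found"}


# Toy system: two replicas behind a hypothetical load balancer.
replicas = {"a": True, "b": True}


def service_available():
    """Hypothesis: the service stays up while at least one replica is alive."""
    return any(replicas.values())


kill_a = Experiment(
    name="lose replica a",
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
    steady_state=service_available,
)
report = run_experiment(kill_a)
```

Here the hypothesis holds, so the report says the system is resilient to losing one replica. For the Automate step, a function like `run_experiment` could be triggered from a scheduler or CI job, and the reports collected as the feedback that drives the next refinement.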

Other specific ways to inject faults into the system include:

  1. Adding network latency
  2. Cutting off scheduled tasks
  3. Cutting off microservices
  4. Disconnecting system from the datacenter


In today’s digital age, where cloud transition and cloud usage are surging, it becomes imperative to enhance cloud resilience for effective performance of your applications. Continuous and systematic testing is imperative throughout the life cycle of a project, and also to ensure cloud resiliency at a time when even public cloud is over-burdened. By preventing lengthy outages and future disruptions, businesses save significant costs, protect goodwill, and assure service durability for customers. Chaos engineering, hence, becomes a must for large-scale distributed systems.