Cloud Observability Frameworks for Non-Functional Testing


In today's complex, hybrid, constantly evolving cloud infrastructures, built on microservices, serverless and automation as code, it is critical to be able to see what is happening, where it is happening, and why. Observability addresses a specific need: developers and SREs debugging production systems from high-fidelity telemetry data.

Observability makes it possible to detect and investigate problems in a complex system without having to re-instrument it or build new code. Adding new code or new instrumentation isn't wrong, but a system is not appropriately observable until we can find out anything we need to know about it without adding more. Observability is thus the practice of deriving actionable insights from the data generated by instrumented IT and software systems. The goal is to understand both when an event or issue happened and why.

Full Stack Observability

Application performance monitoring (APM) tools address the challenge of efficiently pinpointing the root cause of an issue in such complex systems. With APM, monitoring data is collected and displayed through customizable dashboards that show when and where an issue may have occurred. APM tools such as New Relic, Dynatrace and AppDynamics integrate with cloud services (both compute and serverless) for full-stack, real-time monitoring and AI-driven analytics. They enable application performance monitoring, infrastructure monitoring, digital experience monitoring, serverless monitoring, log management, AIOps and alerting.


Full Stack or Enterprise Observability can thus be defined as the ability to monitor the entire IT stack, from customer-facing applications right down to the cloud infrastructure and core network, with metrics, traces, span logs, and analytics in real time.

Implementing Observability – Instrumentation and Telemetry

To achieve observability, we need to instrument everything and then collect, ingest and view all the telemetry data in one place. This can be achieved by building your own tools, by using open-source software such as OpenTelemetry, or by using a commercial observability solution such as an APM tool.


  • – Instrumentation: These are measuring tools that collect telemetry data from a container, service, application, host, or any other component of your system, enabling visibility across your entire infrastructure.
  • – Data ingestion and correlation: The telemetry data collected from across your system is processed and correlated by the backend, which creates context and enables automated or custom data curation for time series visualizations.

OpenTelemetry is an open-source observability framework and standard for generating, capturing, and collecting telemetry data from cloud-native software. It is a set of APIs, SDKs, tooling, and integrations designed for instrumentation, that is, the creation and management of telemetry data such as traces, metrics, and logs.

Manual Instrumentation  


  1. Import the OpenTelemetry API and SDK into the service code.
  2. Configure the OpenTelemetry API and SDK for the language in use.
  3. Create and export telemetry data: create traces and metric events through the tracer and meter objects.

OpenTelemetry supports a wire protocol known as OTLP, which is supported by all OpenTelemetry SDKs. This protocol can be used to send data to the OpenTelemetry Collector.

Auto-instrumentation has become a standard way to monitor, maintain, and measure the performance of modern, complex applications. Instrumenting applications automatically removes the need to add lines of code by hand to send trace data. To enable it, language-specific dependencies need to be added to the application.

Observability Tools


Observability Engineering Strategy for AWS

The strategy for implementing observability in non-functional testing (NFT) services can be classified into three approaches. We shall look at each of them in detail.


1. E2E Observability Approach using custom Services

Fig: Observability using custom services (Designed using Lucid Chart)

This approach analyses performance metrics and performance issues using multiple custom tools and services (non-APM tools). Its key functions include:


  • – Proactive Monitoring
  • – Metrics Collection and Retrieval
  • – Processing and correlation
  • – Visualization and alerting


The proactive monitoring tools perform synthetic monitoring and uptime monitoring, whereas metrics, logs and traces are collected by custom services through OpenTelemetry agents installed on the cloud components. The metrics are visualized in Grafana, and PagerDuty is integrated for alerting.

2. Enterprise Observability Approach using APM

Fig: Observability using APM Tools (Designed using Lucid Chart)

This approach analyses performance metrics and issues using APM tools. The high-level steps include:


  • – Instrument the application service with OpenTelemetry using its APIs and SDKs. OpenTelemetry supports both manual and auto-instrumentation, depending on the language.
  • – Export the telemetry data to the APM tool through an OTLP exporter.
  • – View the data in the APM tool, which receives it through its OTLP endpoint.
  • – The APM tools can also be configured to receive telemetry data from external integrations such as Prometheus, or from cloud integrations with AWS, Azure and GCP.

3. Observability Approach using AWS Services

This approach involves creating an observability pipeline using AWS managed services. It consists of four elements:


  1. Receivers
  2. Processors
  3. Exporters
  4. Service

  • – Receivers get the telemetry data (application and platform metrics) from services running on ECS through the AWS Distro for OpenTelemetry collector.
  • – Exporters forward the data to one or more destinations. The awsprometheusremotewrite exporter sends the metrics to an Amazon Managed Service for Prometheus (AMP) workspace.
  • – The metrics collected in an AMP workspace can be visualized using Grafana.

Fig: Observability using AWS Services (Designed using Lucid Chart)
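To make the roles of the four pipeline elements concrete, here is a small conceptual model in plain Python. It is not the AWS Distro for OpenTelemetry collector or its configuration format; the function names, metric names and the "perf-test" label are all hypothetical and only illustrate how receivers, processors and exporters are wired together by a service definition.

```python
from typing import Callable, Dict, List

Metric = Dict[str, object]


def ecs_receiver() -> List[Metric]:
    """Stand-in for a receiver: returns hard-coded sample metrics."""
    return [
        {"name": "cpu_utilization", "value": 71.5, "task": "orders"},
        {"name": "memory_utilization", "value": 48.0, "task": "orders"},
        {"name": "debug_heartbeat", "value": 1.0, "task": "orders"},
    ]


def drop_debug(batch: List[Metric]) -> List[Metric]:
    """Processor: filter out metrics that are not worth exporting."""
    return [m for m in batch if not str(m["name"]).startswith("debug_")]


def add_environment(batch: List[Metric]) -> List[Metric]:
    """Processor: enrich every metric with an environment label."""
    return [{**m, "env": "perf-test"} for m in batch]


class ListExporter:
    """Stand-in for an exporter: stores metrics instead of sending them."""

    def __init__(self) -> None:
        self.sent: List[Metric] = []

    def export(self, batch: List[Metric]) -> None:
        self.sent.extend(batch)


def run_pipeline(
    receivers: List[Callable[[], List[Metric]]],
    processors: List[Callable[[List[Metric]], List[Metric]]],
    exporters: List[ListExporter],
) -> None:
    """Service: decides which receivers, processors and exporters are active."""
    batch = [metric for receive in receivers for metric in receive()]
    for process in processors:
        batch = process(batch)
    for exporter in exporters:
        exporter.export(batch)


exporter = ListExporter()
run_pipeline([ecs_receiver], [drop_debug, add_environment], [exporter])
print(exporter.sent)
```

In the real pipeline the exporter would write to the AMP workspace rather than to an in-memory list; the ordering of stages is the part this sketch is meant to show.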

Now we shall look into other methods of implementing observability for the different non-functional testing types. 

Synthetic Monitoring for Performance Testing

Synthetic monitoring is an approach to monitoring application performance by simulating users across the application's business flows and APIs. It provides metrics on the uptime and performance of critical business transactions and APIs.


Synthetic monitoring scripts for the critical business flows are created using a UI recorder or Selenium, and the scripts are executed against the applications at scheduled times.

Fig: Synthetic Monitoring for Performance

Sample Metrics 


  • – Average Duration by Region
  • – Average Duration by Monitor ID
  • – IP Duration
  • – Domain Duration
  • – Request Time Breakdown (Durations)
  • – Failures by Monitor Name
  • – Failures by Error Message
  • – Failures by Monitor Type


Note: Duration is the response time of the application
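As an illustration of how a few of the sample metrics above can be derived, the sketch below aggregates raw synthetic check results with the Python standard library. The result records, monitor names, regions and error text are made-up sample data, not output from any particular monitoring tool.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical raw results, one record per synthetic check execution.
results = [
    {"monitor": "login-flow", "region": "us-east-1", "duration_ms": 820, "error": None},
    {"monitor": "login-flow", "region": "eu-west-1", "duration_ms": 1040, "error": None},
    {"monitor": "checkout-api", "region": "us-east-1", "duration_ms": 1180, "error": "HTTP 503"},
    {"monitor": "checkout-api", "region": "eu-west-1", "duration_ms": 900, "error": None},
]


def average_duration_by(records, key):
    """Average response time (duration) grouped by the given field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record["duration_ms"])
    return {group: mean(durations) for group, durations in grouped.items()}


def failures_by(records, key):
    """Count failed checks grouped by the given field."""
    return Counter(record[key] for record in records if record["error"] is not None)


print(average_duration_by(results, "region"))   # Average Duration by Region
print(failures_by(results, "monitor"))          # Failures by Monitor Name
print(failures_by(results, "error"))            # Failures by Error Message
```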

Observability for Resilience Testing

Resilience testing is a form of non-functional testing that checks a system's resiliency by introducing failures and verifying that the system recovers fully. Observability is a key factor in resilience testing, since it is how the system is observed under chaos.


In the sample approach illustrated below, Jenkins triggers the JMeter performance test executions using AWS Spot Instances as load generators. The setup shows how APM tools, Grafana and CloudWatch can be used for observability in resilience testing.

  • – Platform: AWS
  • – Target: Microservice deployed in an ECS cluster
  • – Observability Tool: New Relic APM
  • – Chaos Tool: AWS Fault Injection Simulator & AWS Systems Manager
  • – Load Generator: Apache JMeter and AWS Spot Instances
  • – Attack Types: Infrastructure, Application (Resource/Network) and Database
  • – Blast Radius: 50% (2 out of 4 containers)
  • – Duration: 15 Minutes

Fig: Observability for Resilience Testing (Designed using Lucid Chart)

Chaos Observability Metrics to be collected

  1. MTTR (Mean Time to Recover)
  2. MTTD (Mean Time to Detect)
  3. Turnaround Time (Autoscaling Time)
  4. Traffic (Volume)
  5. Latency (Response Times)
  6. Error rate
  7. Saturation
     • – CPU Usage
     • – Memory Usage
     • – Network IN/OUT
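The first two metrics and the blast radius are simple arithmetic once the timestamps of each chaos experiment are recorded. The sketch below assumes that both MTTD and MTTR are measured from the moment the fault is injected (some teams measure MTTR from detection instead); the `ChaosRun` type and the timestamps are hypothetical sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChaosRun:
    injected_at: datetime   # when the fault was injected
    detected_at: datetime   # when monitoring raised the alert
    recovered_at: datetime  # when the service was healthy again


def mean_seconds(deltas: List[timedelta]) -> float:
    return sum(delta.total_seconds() for delta in deltas) / len(deltas)


def mttd_seconds(runs: List[ChaosRun]) -> float:
    """Mean time to detect, measured from fault injection."""
    return mean_seconds([run.detected_at - run.injected_at for run in runs])


def mttr_seconds(runs: List[ChaosRun]) -> float:
    """Mean time to recover, measured from fault injection (an assumption)."""
    return mean_seconds([run.recovered_at - run.injected_at for run in runs])


def blast_radius_percent(affected: int, total: int) -> float:
    """Share of targets hit by the experiment, e.g. 2 of 4 containers."""
    return 100.0 * affected / total


runs = [
    ChaosRun(datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 0, 40), datetime(2024, 1, 1, 10, 4, 0)),
    ChaosRun(datetime(2024, 1, 1, 10, 20, 0), datetime(2024, 1, 1, 10, 21, 20), datetime(2024, 1, 1, 10, 26, 0)),
]

print(mttd_seconds(runs), mttr_seconds(runs), blast_radius_percent(2, 4))
```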

Observability for Customer Experience Testing

Customer experience testing focuses on analysing the performance of the application from the end user's point of view by collecting performance metrics from the users' browsers themselves. The model uses APM tools, Graphite and Grafana for observability.


When it comes to real user experience, browser metrics (also known as RUM, Real User Monitoring) give insight into how end users experience the application when they visit a page. These are to be contrasted with synthetic APM measurements: RUM metrics capture a website's or app's performance straight from the end user's browser as well as from the server side. RUM metrics are used for profiling the web application's front-end performance.

In the sample customer experience testing approach illustrated below, Jenkins and sitespeed.io are used to trigger the performance test executions. The client-side performance data is sent to Graphite by sitespeed.io and can be visualized in Grafana.

Fig: Observability for Customer Experience Testing (Designed using Lucid Chart)

RUM Metrics to be collected


  1. Last Visual Change
  2. Fully Loaded Time
  3. First Contentful Paint
  4. Largest Contentful Paint
  5. Total Blocking Time
  6. Max Potential FID
  7. Speed Index
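For readers curious what reaches Graphite, the sketch below formats a few of these metrics in Graphite's plaintext protocol, where each line is a metric path, a value and a Unix timestamp. sitespeed.io handles this itself; the metric prefix, metric names and values here are hypothetical, and the send function is shown for completeness but not called.

```python
import socket
import time
from typing import Dict, List, Optional


def to_graphite_lines(
    prefix: str, metrics: Dict[str, float], timestamp: Optional[int] = None
) -> List[str]:
    """Format metrics as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else timestamp
    return [f"{prefix}.{name} {value} {ts}" for name, value in metrics.items()]


def send_to_graphite(lines: List[str], host: str, port: int = 2003) -> None:
    """Send lines to a Graphite plaintext listener (2003 is its usual default port)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)


# Hypothetical values in milliseconds for a single page load.
rum_metrics = {
    "firstContentfulPaint": 1180,
    "largestContentfulPaint": 2450,
    "totalBlockingTime": 310,
}

lines = to_graphite_lines("webperf.checkout", rum_metrics, timestamp=1700000000)
for line in lines:
    print(line)
```

Once stored this way, each metric path becomes a time series that a Grafana dashboard can query and plot across test runs.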

Know our Super Writer:

Kavin Arvind Ragavan

cloud architect - performance

Cloud Performance Architect with over a decade of experience in non-functional testing, with strong expertise in Performance Testing & Engineering, Chaos Engineering and Site Reliability Engineering in the Resiliency & Observability areas. Specialized in AWS & GCP cloud performance testing and in designing and implementing cloud test frameworks for performance, resiliency and observability. Has been involved in creating automated Performance & Resilience Engineering frameworks and in implementing continuous integration and continuous delivery to perform early performance, resilience and accessibility testing and identify potential performance bottlenecks during the development phase. Has presented four whitepapers on Cloud Performance Testing, Chaos Engineering and Microservices at software conferences.
