In today’s complex, hybrid, constantly evolving cloud infrastructures, built on microservices, serverless computing, and automation as code, it is critical to be able to easily observe what is happening, where it is happening, and why it is happening. Observability is a specific use case for developers and SREs who need to perform production debugging from high-fidelity collected telemetry data.
Observability enables teams to detect and investigate problems in a complex system without having to re-instrument it or build new code. Adding new code or new instrumentation isn’t wrong, but the system isn’t appropriately observable until we don’t have to add more to figure out anything we need to know about it. Observability is thus the practice of achieving actionable insights from the data generated by instrumented IT and software systems. The goal is to understand both when an event or issue happened and why.
Full-stack or enterprise observability can thus be defined as the ability to monitor the entire IT stack, from customer-facing applications right down to the cloud infrastructure and core network, with metrics, traces, spans, logs, and analytics in real time.
To achieve observability, we need to instrument everything and collect, ingest, and view all the telemetry data in one place. This can be done by building our own tools, using open-source software such as OpenTelemetry, or using a commercial observability solution such as an APM tool.
OpenTelemetry is an open-source observability framework and standard for generating, capturing, and collecting telemetry data from cloud-native software. It is a set of APIs, SDKs, tooling, and integrations designed for instrumentation: the creation and management of telemetry data such as traces, metrics, and logs.
Manual Instrumentation
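With manual instrumentation, developers call the OpenTelemetry API directly in application code to create spans and attach attributes. A minimal sketch using the OpenTelemetry Python SDK (the function, span name, and attribute are illustrative assumptions, and the console exporter stands in for a real backend):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to stdout
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("nft.demo")

def process_order(order_id: str) -> None:
    # Span created by hand around the business operation
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)  # custom attribute for debugging
        # ... business logic ...

process_order("ord-42")
```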
Auto Instrumentation
Auto instrumentation has become a standard for monitoring, maintaining, and measuring the performance of modern, more complex applications. Instrumenting applications automatically eliminates the need to add lines of code manually to send trace data. To enable it, language-specific dependencies need to be added to the application, as shown in the sketch below.
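For example, with OpenTelemetry in Python, installing per-library packages such as opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests is enough to have spans emitted without touching handler code. A hedged sketch (the app and route are hypothetical):

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# One call per library: inbound requests and outbound HTTP calls are now traced
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/health")
def health():
    return "ok"

if __name__ == "__main__":
    app.run(port=8080)
```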
Strategies for implementing observability in NFT (non-functional testing) services can be classified into three approaches. We shall look into each of these approaches in detail.
This approach is used to analyse performance metrics and performance issues using multiple custom tools and services (non-APM tools). The key functionalities of this approach include:
The proactive monitoring tools are used for synthetic monitoring and uptime monitoring, while metrics, logs, and traces are collected by custom services through the OpenTelemetry agents installed on the cloud components. The metrics are visualized through Grafana, and PagerDuty is integrated for alerts.
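As a hedged sketch of the custom-service side (the collector endpoint, meter name, and metric names are assumptions), the OpenTelemetry Python SDK can push metrics over OTLP to a collector that Grafana then reads:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Periodically export metrics to a local OpenTelemetry Collector (assumed endpoint)
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("nft.custom-service")
latency = meter.create_histogram(
    "checkout.latency", unit="ms", description="End-to-end checkout time"
)
latency.record(182.5, {"flow": "checkout"})  # attributes become queryable labels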
This approach is used to analyse performance metrics and issues using APM tools. The high-level steps include:
This approach involves creating an observability pipeline using AWS managed services. It consists of four elements:
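As one illustrative building block of such a pipeline (an assumption for illustration, not the article's specific four elements), a test harness can publish custom metrics into CloudWatch with boto3:

```python
import boto3

# Publish a custom metric to CloudWatch (namespace and metric name are illustrative)
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_data(
    Namespace="NFT/Observability",
    MetricData=[{
        "MetricName": "TransactionLatency",
        "Dimensions": [{"Name": "Flow", "Value": "checkout"}],
        "Value": 142.0,
        "Unit": "Milliseconds",
    }],
)
```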
Now we shall look into other methods of implementing observability for the different non-functional testing types.
Synthetic monitoring is an approach to monitoring the performance of applications by simulating users across the business user flows and APIs of the applications. Synthetic monitoring provides performance metrics related to the uptime and performance of critical business transactions and APIs.
Synthetic monitoring scripts are created for the critical business flows using a UI recorder or Selenium, and the scripts are executed against the applications at specified timings, as in the sketch below.
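A minimal Selenium sketch of such a script (the URL, element IDs, and credentials are hypothetical):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    start = time.monotonic()
    driver.get("https://example.com/login")  # hypothetical target application
    driver.find_element(By.ID, "username").send_keys("synthetic-user")
    driver.find_element(By.ID, "password").send_keys("********")
    driver.find_element(By.ID, "login-button").click()
    duration_ms = (time.monotonic() - start) * 1000
    print(f"login flow duration: {duration_ms:.0f} ms")  # fed to the monitoring backend
finally:
    driver.quit()
```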
Note: Duration here refers to the response time of the application.
Resilience testing is a form of non-functional testing that tests a system for resiliency by introducing failures and ensuring that the system recovers fully. Observability is a key factor in resilience testing, as it allows the system to be observed while under chaos.
In the sample approach illustrated below, Jenkins is used to trigger JMeter performance test executions using AWS Spot Instances as load generators, and it shows how APM tools, Grafana, and CloudWatch can be used for observability in resilience testing.
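Recovery during the chaos window can also be observed programmatically. A hedged boto3 sketch that pulls CPU utilisation for an Auto Scaling group from CloudWatch (the group name and time window are assumptions):

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
resp = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "app-asg"}],
    StartTime=end - timedelta(minutes=30),  # window covering the chaos experiment
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"].isoformat(), round(dp["Average"], 1))
```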
Chaos Observability Metrics to be collected
Customer experience testing focuses on analysing the performance of the application from the end user’s point of view by collecting performance metrics from the users’ browsers themselves. The model uses APM tools, Graphite, and Grafana for observability.
When it comes to real user experience, browser metrics (also known as RUM, Real User Monitoring) give insight into how the end user experiences the application when visiting a page. These are to be contrasted with synthetic APM measurements: RUM collects a website’s or app’s performance straight from the end user’s browser rather than from the server side. These RUM metrics are used to profile the web application’s front-end performance.
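In a real RUM setup this collection runs as JavaScript inside real users' browsers. Purely for illustration, the same W3C Navigation Timing data can be read through Selenium (the page URL is hypothetical):

```python
from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")  # hypothetical page under test
    # Read W3C Navigation Timing entries directly from the browser
    nav = driver.execute_script(
        "const e = performance.getEntriesByType('navigation')[0];"
        "return {ttfb: e.responseStart, domComplete: e.domComplete,"
        " load: e.loadEventEnd};"
    )
    print(nav)  # e.g. {'ttfb': ..., 'domComplete': ..., 'load': ...}
finally:
    driver.quit()
```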
In the sample customer experience testing approach illustrated below, Jenkins and sitespeed.io are used to trigger the performance test executions. The client-side performance data is sent to Graphite through sitespeed.io and can be visualized in Grafana.
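A hedged sketch of how client-side numbers reach Graphite, using its plaintext protocol directly (sitespeed.io handles this for you; the metric path and host are hypothetical):

```python
import socket
import time

def send_to_graphite(metric: str, value: float,
                     host: str = "localhost", port: int = 2003) -> None:
    # Graphite's plaintext protocol: "<path> <value> <unix-timestamp>\n"
    line = f"{metric} {value} {int(time.time())}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

send_to_graphite("rum.homepage.domComplete", 1834)  # hypothetical metric path
```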
RUM Metrics to be collected
Cloud Performance Architect with over a decade of experience in non-functional testing and strong expertise in performance testing and engineering, chaos engineering, and site reliability engineering in the resiliency and observability areas. Specialized in AWS and GCP cloud performance testing and in designing and implementing cloud test frameworks for performance, resiliency, and observability. Has been involved in the creation of automated performance and resilience engineering frameworks and in implementing continuous integration and continuous delivery to perform early performance, resilience, and accessibility testing and to identify potential performance bottlenecks during the development phase. Has presented four whitepapers related to cloud performance testing, chaos engineering, and microservices at software conferences.