Testing Data Lake

We have all heard about data lakes and their importance, but we rarely talk about a data lake for testing results and how it can help us. Companies have always struggled with not having a single place to hold all testing results & related data. Testing data has been scattered and hard to consolidate or move around, which has been one of the biggest barriers to testing success.

In many companies, the automation testing team has its own way of storing, managing, and maintaining results data, and the performance testing, security testing & other testing teams each have theirs. This pattern is seen across teams, companies & the industry.

For example, in performance testing & engineering, load testing tools store their data in one way, while profiling & analysis tools store theirs in another. As a result, it’s highly challenging to port or move data from one store to another. There is no simple way to connect these different data stores and give the testing team, and the wider testing community across the organization, the flexibility and power they need.

Data is the new oil today, and that holds for the testing domain as well.

A data lake is a centralized repository that stores structured and unstructured data. A testing data lake can be created by pushing test results & related data from all types of testing into such a repository. For example, AWS S3, which is an object storage service, can be used as a data lake.

How a testing data lake can be created

Let us consider a company that has a testing team named “TCOE”. “TCOE” conducts automation testing, performance testing, exploratory testing, security testing, and accessibility testing across the company’s products.

After “TCOE” executes performance test runs, the team pushes all test results and data into the data lake it has created. The data can be stored as Parquet, XML, JSON, or any other file format before it is pushed to the data lake. For example, the files could hold response times, hits per second, errors, and CPU and memory utilization.
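As a rough illustration of the step above, the sketch below serializes a few performance measurements as JSON Lines and builds a partition-style object key for them. The `PerfSample` fields, the `object_key` layout, and the product and run names are all hypothetical choices made for this example, and the actual upload (for instance to an S3 bucket) is left out.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class PerfSample:
    """One aggregated measurement from a performance test run (hypothetical schema)."""
    transaction: str
    response_time_ms: float
    hits_per_second: float
    error_count: int
    cpu_percent: float
    memory_percent: float


def object_key(product: str, test_type: str, run_date: date, run_id: str) -> str:
    """Build a partition-style key so results can later be filtered by product, test type, and date."""
    return (
        f"product={product}/test_type={test_type}/"
        f"date={run_date.isoformat()}/{run_id}.jsonl"
    )


def to_json_lines(samples) -> str:
    """Serialize samples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in samples)


# Made-up sample values, for illustration only.
samples = [
    PerfSample("login", 212.5, 48.0, 0, 61.2, 70.4),
    PerfSample("checkout", 840.0, 12.5, 3, 88.9, 74.1),
]
key = object_key("webshop", "performance", date(2021, 3, 1), "run-001")
payload = to_json_lines(samples)
```

Keeping product, test type, and date in the key is one possible convention; it tends to make later filtering cheaper, but any layout that suits the chosen storage and query tools would do.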

Automation testing results such as script pass/fail status, validations, screenshots, errors & other details can be pushed to the data lake as well. Similarly, data from other types of testing can be pushed to the data lake.

Teams can choose the technology and services for storing, structuring & pushing data to the data lake based on their requirements. Similarly, the technology, services & hosting (cloud or on-premises) for the data lake itself can be chosen based on the requirements.

There is a plethora of options to choose from when building the entire ecosystem.

Data Lake ecosystem

Benefits of building a data lake:

  1. Live feed of the overall quality of the product. If a product is getting ready for release, a live dashboard built on the data lake can show, second by second, the overall quality and how the product is doing in performance, security, accessibility, and other testing. If there are 50 different products, the data lake can consolidate the overall quality for the company and break it down by product and by testing type. The live dashboard can be flexible enough to meet the needs of different teams and people across the organization, from a CEO wanting a holistic view of the quality of all products to an engineer wanting to look at the details of a specific defect.
  2. The tools, frameworks, and reports used across the various types of testing (performance, automation, security, and others) are all different. Companies have always struggled to consolidate results and reports and provide a single, unified view of quality. A data lake can help address this challenge.
  3. A data lake provides the flexibility to build any kind of visualization, dashboard, report, or metric.
  4. Perform predictive analysis, real-time analytics, etc.
  5. Build AI solutions & ML models to make better decisions.
  6. Foster innovation around testing and quality.
  7. Drive the overall success of the product.
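To make the consolidated view in the first benefit more concrete, here is a minimal sketch of rolling up pass rates for the whole company, by product, and by product and testing type. The record fields (`product`, `test_type`, `passed`) and the sample values are assumptions for illustration; a real dashboard would read such records from the data lake through whatever query service is in use.

```python
from collections import defaultdict

# Hypothetical records as they might be read back from the data lake.
results = [
    {"product": "webshop", "test_type": "automation", "passed": True},
    {"product": "webshop", "test_type": "automation", "passed": False},
    {"product": "webshop", "test_type": "security", "passed": True},
    {"product": "billing", "test_type": "performance", "passed": True},
]


def pass_rates(records, *keys):
    """Return {group: pass rate}, grouping records by the given field names."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group][0] += int(rec["passed"])
        totals[group][1] += 1
    return {group: passed / total for group, (passed, total) in totals.items()}


company_wide = pass_rates(results)  # no keys: a single group, ()
by_product = pass_rates(results, "product")
by_product_and_type = pass_rates(results, "product", "test_type")
```

The same grouping function serves the CEO-level view (no keys) and the more detailed breakdowns, which is the kind of flexibility a single store of results makes possible.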

How testing teams can reap the benefits of a data lake

  • Automatically provide details on where the focus should and should not be for testing the current release and what kinds of tests are beneficial.
  • Provide input on which tests or other factors are missing that would improve the quality of a product.
  • If quality across products is deteriorating, automatically point to why, when, how, and where.
  • Help to quickly identify duplicate defects and issues.
  • Learn from the data and surface trends that are vital to product success.
  • Predictive analysis, such as how a performance defect or issue can impact the whole system and its quality.
  • Enhance test coverage.
  • Identify the critical areas of focus for testing based on historical data.
  • Analyze the impact of defects on product and customer success.
  • Detect failures in the product early & help formulate preventive actions.
  • Predict the probability of finding defects in a feature of a product & the risks for each feature delivered.
  • Help drive and optimize all testing efforts and improve the testing process.
  • When changes are made to a module, highlight the impact and the affected areas.
  • Recommend improvements to the test data to be used and identify what would likely contribute to efficient testing.
  • Improve the regression suite by identifying the most effective business use case scenarios to include & test.
  • Automatically create test cases based on patterns.
  • Create a risk-based testing strategy.
  • Provide insight into test suite efficiency and help improve it
  • Save effort & cost across overall testing over time
  • Identify and remove duplicate and redundant test cases and efforts
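As one small example of the items above, duplicate defects raised by different testing teams could be flagged by comparing defect titles once they sit in the same store. The sketch below uses a simple word-overlap (Jaccard) score. The defect IDs, titles, and the 0.6 threshold are made up for illustration, and a real solution would likely need a stronger text-similarity method.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two titles, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def likely_duplicates(defects, threshold=0.6):
    """Return (id_a, id_b, score) for every pair of titles scoring at or above the threshold."""
    pairs = []
    for (id_a, title_a), (id_b, title_b) in combinations(defects.items(), 2):
        score = jaccard(title_a, title_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs


# Hypothetical defects logged by different testing teams.
defects = {
    "PERF-101": "Checkout page times out under load",
    "AUTO-207": "Checkout page times out under heavy load",
    "SEC-033": "Session cookie missing Secure flag",
}
duplicates = likely_duplicates(defects)
```

Here the performance and automation defects share six of seven distinct words, so they are flagged as a likely duplicate pair, while the security defect is not.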

The whole idea of a testing data lake is to push all test results & test-related data into a data lake. The data could be across:

  • Multiple products
  • Multiple test types (for example performance testing, automation testing, security testing, functional testing, and other testing types)
  • Anything & everything related to testing across the company
data lake

The above image depicts how a company could build a testing data lake. Once the data lake is built, the sky is the limit for what we can do with it.

Future of testing data lakes

Companies can also look at building a data lake from production data. The production data lake and the testing data lake can be integrated to build a robust solution that provides insight into many things. Some of the advantages are:

  • What is being tested in R&D vs. what is being used by real end-users in production? This would help testing teams identify where the gaps in testing are and how to fix them
  • Realistic cross-browser and cross-platform testing requirements can be identified
  • The most heavily used business-critical scenarios & other performance testing requirements can be derived
  • Accessibility, automation, functional, and mobile testing requirements can be derived and improved
  • ML / AI solutions can be built to compare the testing and production data lakes, analyze the predictions, and improve the models
  • Learn from the production data lake
  • Improve shift-left testing
  • Improve product success & customer experience
  • Higher product quality
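The first advantage in the list, comparing what is tested with what end-users actually do, can be sketched as a simple comparison between production usage counts and the set of tested scenarios. The event names and the `coverage_gaps` helper below are hypothetical; real production data would need to be mapped to test scenarios first.

```python
from collections import Counter

# Hypothetical scenario names extracted from production usage logs.
production_events = (
    ["search"] * 3 + ["checkout"] * 2 + ["login"] * 2 + ["export_report"] * 4
)
# Hypothetical scenarios currently covered by the test suites.
tested_scenarios = {"login", "search", "profile_update"}


def coverage_gaps(events, tested, top_n=3):
    """Return (most-used untested scenarios, tested scenarios never seen in production)."""
    usage = Counter(events)
    untested = [
        (name, count) for name, count in usage.most_common() if name not in tested
    ]
    unused = sorted(tested - set(usage))
    return untested[:top_n], unused


untested, unused = coverage_gaps(production_events, tested_scenarios)
```

In this made-up data, `export_report` and `checkout` are heavily used but untested, while `profile_update` is tested but never seen in production, which is the kind of gap the integrated data lakes are meant to expose.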

The list of other advantages goes on; there are plenty of them.

We could have an open-source testing data lake where companies can subscribe, retrieve, and make use of this data for the larger benefit of the software industry. We could have an open-source data model with APIs exposed so that companies across the industry can contribute and benefit equally.

Industry Testing Data Lake

The above image depicts how we could have an open-source testing data lake to which companies across the industry can contribute and from which they can benefit. An open-source data lake has benefits such as identifying similar patterns & kinds of issues and helping each other. The testing data lake can be created using different technologies. One example is using AWS S3 as the data lake and other AWS services such as Kinesis, Athena, and QuickSight to build dashboards and analytics.

Predictive analytics and machine learning can be done using AWS Deep Learning and SageMaker.

Another way is to use Sumo Logic, Jenkins, Grafana, and a combination of other tools. There are many other ways you could build this. Teams need to figure out the best way to build it based on the requirements and needs of the organization.

Building and owning a testing data lake is going to be very important for companies to succeed, so companies need to start thinking about it and laying out the building blocks to put it in place. The advantages and benefits for testing teams across the industry are many, and the list is long.


About the Author:

Mahesh M | Senior Software Development Manager

I have never run away from challenges and have always embraced them with open arms, since I believe challenges are what shape a person and provide opportunities to grow. My journey spanning many years has taken me across various companies such as Accenture, Sony, and Ellucian, providing me an opportunity to learn, contribute, and grow. I have played various leadership roles in many large-scale engagements across these companies and have been involved in leading various testing operations and engagements.

I have managed challenging and critical engagements that demanded building, managing, and nurturing high-performance, result-oriented teams to support delivery and contribute towards the success of the organization. I have always enjoyed the opportunities, challenges, successes, and failures that were thrown at me in my career and have received them equally. Each of them has taught me a lesson and made me stronger.
