Machine Learning & Quality Assurance

Lately, Machine Learning seems to be everywhere, in every sector: fitness tracker applications, intelligent home assistants like Google Home or Amazon Alexa, shopping applications that recommend new products based on our behavior as buyers, and streaming applications that recommend music or movies based on our data.

Let’s go over some basic concepts to better understand ML (I recommend the following book if you are interested in ML). Machine Learning is a subfield of computer science concerned with building algorithms that, to be useful, rely on a collection of examples of some phenomenon rather than on explicitly programmed rules. These examples can come from nature, be handcrafted by humans, or be generated by another algorithm.

Machine learning approaches are divided into the following categories, depending on the type of learning.

Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. The algorithm examines the training data and produces an inferred function, which can be used for mapping new examples.

Some traditional types of problems built on top of classification and regression include recommendation and time series prediction. A well-known example of a supervised machine learning algorithm is linear regression, which is used for regression problems.
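To make the idea of learning a function from input-output pairs concrete, here is a minimal sketch of linear regression fitted by ordinary least squares. The toy data and the function names are assumptions made for illustration; a real project would more likely rely on a library such as scikit-learn.

```python
def fit_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the inferred function to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical labeled examples (input, output) that follow y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

model = fit_linear_regression(train_x, train_y)
prediction = predict(model, 10.0)  # an input that was not in the training data
```

The training step produces the inferred function (here just a slope and an intercept), and `predict` maps a new example with it.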

Semi-supervised Learning

In semi-supervised learning, the dataset contains both labeled and unlabeled examples; usually, the number of unlabeled examples is much higher than the number of labeled examples. The goal of a semi-supervised learning algorithm is the same as the goal of a supervised learning algorithm.

It may seem that semi-supervised learning adds more uncertainty to the problem. Nevertheless, when you add unlabeled examples, you add more information about your problem: a larger sample better reflects the probability distribution that the labeled data came from. In theory, a learning algorithm can leverage this additional information.

Unsupervised Learning

An unsupervised learning algorithm aims to create a model that takes a feature vector x as input and either transforms it into another vector or into a value used to solve a practical problem.

In unsupervised learning, we deal with data that doesn’t have labels. Without labels that represent the desired behavior of our model, we have no trustworthy reference against which to judge its quality.
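As an illustration of a model that turns an unlabeled feature vector into a value, here is a minimal k-means clustering sketch. K-means is just one example technique chosen for this sketch, and the points, the starting centroids, and the function names are assumptions.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Hypothetical unlabeled feature vectors forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = k_means(points, initial_centroids=[points[0], points[3]])
cluster_ids = [nearest(p, centroids) for p in points]
```

Note that nothing in the data says which grouping is "correct", which is exactly why judging the quality of such a model is harder than in the supervised case.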

Reinforcement Learning 

Reinforcement learning is a subfield of Machine learning where the machine lives in an environment and can perceive the state of the environment as a vector of features. The purpose of a reinforcement learning algorithm is to learn a policy. A policy is a function (similar to a supervised learning model) that takes the feature vector of a state as input and outputs an optimal action to execute in that state.

Reinforcement learning solves a particular kind of problem where decision making is sequential and the goal is long term, such as game playing, robotics, resource management, or logistics.

Scikit-learn algorithm cheat-sheet
Credits: The hundred-page machine learning book – Machine Learning algorithm selection diagram for scikit-learn.

Machine Learning Testing

In my experience, many Data Scientists and Data Engineers perform model evaluation by computing a few metrics on a test dataset and summarizing performance. I don’t think that is sufficient: summary metrics alone don’t show how a new model’s behavior differs from the one it replaces, and they don’t let us track or prevent behavioral regressions for specific failure modes.

Machine Learning failing as image is displaying Roses out of stock and ML is recommending pepper
Credits: Image from twitter

When it comes to Machine Learning Testing, we must include different types of checks to ensure the learned logic consistently produces our desired behavior. Many automated and functional tests can add value and enhance the overall quality of our ML models.

There is a difference between developing the path and testing the path. Both must coexist; otherwise, it is easy to end up off track.


Data Unit test 

Data unit tests allow us to quantify model performance for specific cases in our data; this helps identify critical scenarios where prediction errors have serious consequences, and it gives error analysis a concrete starting point. Tools like Snorkel introduce slicing functions, which let us identify subsets of a dataset that meet specific criteria.
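The idea of slicing can be sketched in plain Python without any library. The following is not Snorkel's API; the slice names, the example records, and the minimum accuracy are all assumptions made for illustration.

```python
def is_short_text(example):
    """Slice: inputs of at most three words."""
    return len(example["text"].split()) <= 3


def mentions_refund(example):
    """Slice: inputs that talk about refunds, assumed here to be business critical."""
    return "refund" in example["text"].lower()


SLICES = {"short_text": is_short_text, "mentions_refund": mentions_refund}


def accuracy_by_slice(examples, predictions, slices):
    """Compute accuracy separately for every slice that has at least one example."""
    report = {}
    for name, belongs in slices.items():
        hits = [pred == ex["label"]
                for ex, pred in zip(examples, predictions) if belongs(ex)]
        if hits:
            report[name] = sum(hits) / len(hits)
    return report


# Hypothetical evaluation set and model predictions.
examples = [
    {"text": "Refund please", "label": "complaint"},
    {"text": "Great product, fast delivery", "label": "praise"},
    {"text": "I want a refund for my broken order", "label": "complaint"},
    {"text": "Love it", "label": "praise"},
]
predictions = ["complaint", "praise", "praise", "praise"]

MIN_SLICE_ACCURACY = 0.8  # assumed quality bar per slice
report = accuracy_by_slice(examples, predictions, SLICES)
failing_slices = [name for name, acc in report.items() if acc < MIN_SLICE_ACCURACY]
```

Overall accuracy here is 75%, which hides the fact that the model is wrong on half of the refund-related inputs; the per-slice report makes that failure mode visible.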

Credits: Janelle Shane –

Component integration test

If you know some Data Scientists or Data Engineers, you have probably heard that the pain point here is integration, that is, working more end to end. By working end to end, Data Scientists will have the full context to identify the right problems and develop usable solutions.

When integrating different services, Contract Testing can validate that the model’s interface is compatible with the consuming application.
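A contract test for a model interface could look roughly like the sketch below. The field names, the expected types, and the stand-in service are hypothetical; in practice the contract would come from the consuming application, and a dedicated contract testing tool might be used instead.

```python
# Contract agreed with the consumer: field name -> expected type (assumed for this sketch).
CONTRACT = {"label": str, "score": float, "model_version": str}


def contract_violations(response, contract):
    """Return a list of human-readable violations; an empty list means compatible."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}")
    score = response.get("score")
    if isinstance(score, float) and not 0.0 <= score <= 1.0:
        violations.append("score outside [0, 1]")
    return violations


def fake_model_service(text):
    """Stand-in for the real prediction endpoint (hypothetical)."""
    return {"label": "praise", "score": 0.93, "model_version": "2024-01"}


ok = contract_violations(fake_model_service("Love it"), CONTRACT)
broken = contract_violations({"label": "praise", "score": "0.93"}, CONTRACT)
```

The second call simulates a model release that silently changed the score to a string and dropped the version field, the kind of change that breaks a consumer even when the model itself is accurate.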

Model quality

For Model Quality, we can use Threshold Tests or ratcheting to ensure that new models don’t degrade against a known performance baseline. (A straightforward way to pick a threshold is to take the median of the predicted values of the positive cases in a test set.) Please keep in mind that model performance is non-deterministic, but it is crucial to ensure our models don’t exceed the error rate we defined beforehand.
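One way to express ratcheting in code is sketched below: a candidate model is accepted only if its error rate is no worse than the current baseline, and the baseline tightens whenever a better model is accepted. The labels, predictions, and the starting baseline of 0.2 are hypothetical values chosen for illustration.

```python
def error_rate(labels, predictions):
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


def ratchet(baseline_error, candidate_error, tolerance=0.0):
    """Return (accepted, new_baseline).

    The tolerance leaves room for run-to-run noise, since training is
    non-deterministic; the baseline only ever moves toward lower error.
    """
    if candidate_error <= baseline_error + tolerance:
        return True, min(baseline_error, candidate_error)
    return False, baseline_error


# Hypothetical test labels and predictions from two candidate models.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
candidate_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # one mistake
regressed_preds = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]   # three mistakes

baseline = 0.2  # assumed error rate of the model currently in production
accepted, baseline = ratchet(baseline, error_rate(labels, candidate_preds))
accepted_again, baseline = ratchet(baseline, error_rate(labels, regressed_preds))
```

After the first candidate is accepted, the bar moves from 0.2 to 0.1, so the second candidate is rejected even though a fixed threshold of 0.3 or looser would have let it through.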

Model bias and fairness

It is also essential to check how the model performs against baselines for specific data sets. You might have an inherent bias in the training data, where there are many more data points for a particular value of a feature (e.g., gender, age, or region).

“Most current theory of machine learning rests on the crucial assumption that the distribution of training examples is identical to the distribution of test examples. Despite our need to make this assumption in order to obtain theoretical results, it is important to keep in mind that this assumption must often be violated in practice.”

Tom Mitchell

Because the training data can differ from the original distribution in the real world, it’s imperative to check performance across different data slices. A tool like Facets can help with understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.
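Before reaching for a visual tool, a quick check for the kind of imbalance described above can be sketched in a few lines. The feature name, the rows, and the minimum share of 20% are assumptions for illustration only; what counts as under-represented depends on the problem.

```python
from collections import Counter


def group_shares(rows, feature):
    """Share of training rows for each value of the given feature."""
    counts = Counter(row[feature] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, min_share):
    """Groups whose share of the data falls below the minimum."""
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical training set that is heavily skewed toward one region.
training_rows = (
    [{"region": "north"}] * 8
    + [{"region": "south"}] * 1
    + [{"region": "east"}] * 1
)

shares = group_shares(training_rows, "region")
flagged = underrepresented(shares, min_share=0.2)
```

Groups flagged this way are good candidates for their own data slices when measuring model performance, since an aggregate metric is dominated by the majority group.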

Monitoring our Model

Building an intelligent system that chooses and improves ML models over time can also be seen as a meta-learning problem. Much of the state-of-the-art research in this area focuses on these types of problems. For example, the use of reinforcement learning techniques, such as multi-armed bandits, or of online learning in production has been proposed. Furthermore, we must continue to evolve and adapt our experience and knowledge to provide better ML systems.
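To give a feel for the multi-armed bandit idea, here is a minimal epsilon-greedy sketch that routes requests between two models and gradually favors the one with better observed feedback. Epsilon-greedy is only one of several bandit strategies, and the class name, model names, success rates, and simulation loop are all hypothetical.

```python
import random


class EpsilonGreedyRouter:
    """Chooses which model serves the next request (epsilon-greedy bandit)."""

    def __init__(self, model_names, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in model_names}
        self.mean_reward = {name: 0.0 for name in model_names}

    def choose(self):
        # Explore a random model with probability epsilon; otherwise exploit
        # the model with the best observed mean reward so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, name, reward):
        # Incremental running mean of the rewards seen for this model.
        self.pulls[name] += 1
        self.mean_reward[name] += (reward - self.mean_reward[name]) / self.pulls[name]


# Hypothetical success rates, unknown to the router, used only to simulate feedback.
TRUE_SUCCESS_RATE = {"model_a": 0.2, "model_b": 0.8}

router = EpsilonGreedyRouter(TRUE_SUCCESS_RATE, epsilon=0.2, seed=0)
feedback = random.Random(1)
for _ in range(2000):
    chosen = router.choose()
    reward = 1.0 if feedback.random() < TRUE_SUCCESS_RATE[chosen] else 0.0
    router.update(chosen, reward)
```

In production the reward would come from a real signal, such as a click or a confirmed correct prediction, and the exploration rate trades off learning about weaker models against serving the best known one.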

Final Thoughts

Machine learning models are more complicated to test since we’re not explicitly programming the system’s logic. Nevertheless, Testing is still essential for high-quality software systems. These tests can give us a behavioral report of trained models, which can serve as a well-organized strategy towards error analysis.

Machine learning models also rely on a large amount of “traditional software development” to process data inputs, create feature representations, perform data augmentation, orchestrate model training, and expose interfaces to external systems, among others.

Effective testing for machine learning systems requires both traditional software testing and Machine Learning model testing; we must work directly with Data Scientists and Data Engineers to provide high-quality ML models. Quality does not kill innovation; quality improves those innovations.

Thanks to Data Scientist Liliana Badillo, for sharing your knowledge about Machine Learning and encouraging me to write this post; I’d like to thank Nithin for reading early drafts and inviting me to contribute to this fantastic community.

Happy Bug Hunting,

Let’s keep learning, sharing, and growing together!


Behavioral Testing of NLP Models with CheckList

Leading Quality – Ronald Cummings-John & Owais Peer

Artificial Intelligence: A Modern Approach, Stuart J. Russell, Peter Norvig

The hundred-page machine learning book – Andriy Burkov

Open AI Blog Adversarial examples
