Craig Risi
- Nov 25, 2020

Software Testing in an AI-driven world – Part 2 – Testing AI systems

So, in my last article, I introduced you to the most popular branches of artificial intelligence, but now we need to know how to test them. As mentioned in the previous article, each of these different methods will require different types of software design and testing approaches, though we can still pull together some basic criteria that will apply to all forms of AI systems.

But before we go into more detail on testing, we need to look at why AI systems need testing at all. Yes, they are just code like any other form of software, but surely a computer that learns and adapts to make decisions should be able to keep itself in check? Well, if you have been following developments in AI lately, you will know that AI, or any form of machine learning, is only as effective as the data we feed it and the decisions we let it make, so even the best AI will make the wrong decision if we allow it to.

Secondly, if we only test the AI once it is already making decisions we rely on, it’s too late. Before we allow any form of AI into our systems, we need to ensure that it is capable of doing the job we need it to do, and that means testing it properly. Similarly, we need to ensure that the other systems that interact with these different forms of AI all integrate with each other.

So, what approach should we take to testing AI systems? While there is no one way, the points below should help you get a good handle on what needs to be done to test your AI system:

Identifying the exact Use Cases

What is the purpose of the AI system, and what are its decision trees? While the data we could feed a learning system appears endless, the truth is that if we understand its learning processes, we can still test for the correct outputs. That means ensuring we have the right sample data sets and comparison images, and that the system produces the predicted outputs, even on a much smaller subset of the data than it would see in production. AI might be making decisions based on data, but the processes by which it should reach those decisions are decipherable, and we should focus on the required inputs and outputs for these specific use cases.

Trying Out the Algorithm

Once the network has been busy optimizing for some time, you will want to check how well it’s doing with its newly learned formulas. Your training algorithm already outputs how well it’s doing on the training examples, meaning the data you’ve been feeding it all this time. However, using this number to evaluate the algorithm is not a good idea.

Take a neural network trained to detect cancer in medical images as an example. Chances are the network will detect cancer correctly in the images it has seen many times, but that’s no indicator of how it will perform on other images like the ones it will see in production. Your cancer detection algorithm will only get one chance to assess each image it sees, and it needs to predict cancer reliably based on that.

So, the real question is, how does the algorithm perform when presented with completely new data that it hasn’t been trained on?

This new data set is called the development set because you tweak your neural network model based on how well the trained network performs on this set. Simply put, if the network performs well on both the training set and the development set (which consists of images it isn’t optimized for because they were not part of the training set), that’s a good indicator that it will also do well on the images it will face day to day in production.

If it performs worse on the development set, your network model needs some fine-tuning, followed by some more training using the training set and, finally, an evaluation of the new, hopefully, improved performance using the development set. Often you will also train several different networks and decide which one to use in your released product based on the models’ performances on the development set.
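To make the gap between training and development performance concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy data, the two stand-in "models" (plain functions rather than real neural networks), and the helper names accuracy and pick_best_model. It only shows the idea of scoring candidates on data they were not trained on and choosing between them on that basis.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def pick_best_model(candidates, dev_set):
    """Return the name of the candidate with the highest dev-set accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], dev_set))


# Toy data (hypothetical): a single measurement, labelled 1 when it exceeds 5.
train_set = [(1, 0), (3, 0), (6, 1), (8, 1)]
dev_set = [(2, 0), (4, 0), (7, 1), (9, 1)]

# A "model" that has only memorized the training examples.
memorized = dict(train_set)


def memorizer(measurement):
    return memorized.get(measurement, 0)


# A "model" that has learned the underlying rule.
def threshold(measurement):
    return 1 if measurement > 5 else 0


candidates = {"memorizer": memorizer, "threshold": threshold}

for name, model in candidates.items():
    print(name, "train:", accuracy(model, train_set), "dev:", accuracy(model, dev_set))

print("selected:", pick_best_model(candidates, dev_set))
```

Both candidates look perfect on the training set, but only one of them holds up on the development set, which is why the training number alone is not a good basis for evaluation.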

Choosing Dev and Test Data Sets

At this point you might ask yourself, isn’t that testing? Well, not really.

Feeding the development set into your neural network can be compared to a developer trying out the new features they’ve built on their machine to see if they seem to work. To thoroughly test a feature, though, a fresh pair of eyes—most commonly a test engineer—is required to avoid biases. Similarly, you’ll want to use a fresh, never-used data set to verify the performance of your machine learning system, as these systems become biased as well.

How does a computer become biased? As described above, during development you tweak your model based on the results it gets on the development set, so by definition, you will choose the model that works best with this specific data set. For our cancer detection example, if the development set coincidentally consisted mostly of images showing earlier stages of cancer and healthy patients, the network would have trouble dealing with images showing later stages of cancer, because the model you chose was selected for its performance on the development set, not on those cases.

Of course, you should try to use well-balanced training and development sets, but you won’t really know if you managed to do that without using a completely new data set to test the final algorithm. The network’s performance on the test set is the most reliable indicator of how it will perform out in the real world.
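One simple way to keep a never-used test set aside is to split the labelled data once, up front, into three disjoint parts. The sketch below is only an illustration: the 80/10/10 proportions, the fixed seed, and the split_dataset helper are my assumptions, not a recommendation from any particular framework.

```python
import random


def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle once, then cut into disjoint training, development, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains is held back for testing
    return train, dev, test


# Hypothetical image identifiers standing in for labelled scans.
image_ids = [f"image_{i:03d}" for i in range(100)]
train_set, dev_set, test_set = split_dataset(image_ids)

print(len(train_set), len(dev_set), len(test_set))
```

A random split like this only guarantees that the three sets do not overlap. Whether the test set is well balanced and resembles production data is something you still have to check yourself.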

For that reason, it’s important to choose a test set that resembles the data your AI will receive in production as closely as possible. For the cancer detection algorithm, that means choosing a variety of images of different qualities, with different sections of the body, from different patients. These images have to be labelled as correctly as possible as cancerous or not cancerous. Now, for the test, you simply have to let the algorithm assess all the test examples and compare the algorithm’s output to the expected output. If the percentage of correctly assessed images is satisfying, the test is successful.

Defining Requirements

Those of you who are experienced testers will certainly ask, what does “satisfying” mean in terms of those results? In traditional testing, the answer is usually quite clear: The output should be correct for all test cases. However, this will hardly be possible when it comes to machine learning algorithms, especially for complex problems such as cancer detection. So to come up with a concrete number, the best place to start is to look at how qualified humans perform at that exact task.

For our cancer detection example, you’ll want to assess the performance of trained doctors (or, if you want to aim even higher, of a team of world-renowned experts) and use that as your goal. If your AI detects cancer as well as or better than that, we can consider the test results satisfactory.

Managing Risk

Up until now, we’ve been talking about the percentage of correctly assessed images as the metric to look at in the test results. In other words, you’d evaluate your deep-learning algorithm based on how many healthy patients were diagnosed as cancerous and how many ill patients as healthy. However, these two things are not the same in the real world.

If the AI decides that a healthy patient has cancer, more tests will be performed, and the patient will eventually be sent home if the other tests don’t indicate any problems. Apart from a major health scare, all will be well. If, on the other hand, a patient who indeed has cancer is sent home based on an incorrect assessment, they will lose invaluable time to start their treatment. Their chances of being cured might be much worse when the cancer is finally detected than they would have been had the algorithm assessed their X-ray correctly in the first place.

For that reason, you will need to decide which weight to place on false positives and false negatives. Similar to risk-based testing of non-AI tools, the decision on whether to release your product in its current state even though some test cases might fail depends on the risk associated with the failing test case. Sending a healthy patient in for more tests is low risk; sending a sick patient home is a potentially deadly risk.
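The effect of weighting the two kinds of error differently can be sketched in a few lines of Python. The cost values (1 for a false positive, 10 for a false negative) are arbitrary assumptions chosen only to show the mechanics, and the two sets of predictions are invented.

```python
def confusion_counts(predictions, expected):
    """Count true/false positives and negatives (1 = cancerous, 0 = healthy)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for predicted, actual in zip(predictions, expected):
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1  # healthy patient flagged as cancerous
        elif not predicted and actual:
            counts["fn"] += 1  # sick patient sent home
        else:
            counts["tn"] += 1
    return counts


def weighted_risk(counts, fp_cost=1.0, fn_cost=10.0):
    """Total risk score; the cost weights are illustrative assumptions."""
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost


expected = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # two false positives
model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two false negatives

for name, predictions in (("model_a", model_a), ("model_b", model_b)):
    counts = confusion_counts(predictions, expected)
    print(name, counts, "risk:", weighted_risk(counts))
```

Both models get eight out of ten images right, yet the second one carries ten times the risk under these weights, because its mistakes are the dangerous kind.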

Ruling Out Data Biases

Another important part of testing deep-learning systems is bias testing. Because neural networks base their decisions strictly on the data they are trained on, they run a risk of mimicking biases we see when humans make decisions since these biases are often reflected in data sets that were collected.

Let’s go back to our cancer detection example. When doctors assess X-ray images, they also know the patient’s history, so they might unconsciously pay more attention to a lifelong smoker’s image than to a young, non-smoking patient’s. As a result, they might be more likely to miss lung cancer in the latter patient’s X-ray.

If you use the doctor’s diagnosis to label the expected results for your data set, this bias will likely be transferred to your algorithm. Even though the network won’t get any additional information about the patient, lungs of smokers and non-smokers certainly have differences, so the network might link the look of a non-smoker’s lung to a negative cancer test result and fail to detect cancer in these images.

To rule out biases in neural networks, you’ll need to carefully analyze the test results—especially the failures—and try to find patterns. For example, you could compare the algorithm’s success rate for smokers’ and non-smokers’ images. If there is a noticeable difference, the algorithm might have become biased during training. If there is any reason to suspect a bias, you’ll need to perform additional exploratory tests with tailored data sets to confirm or disprove your suspicion.
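A first pass at this kind of analysis can be as simple as breaking the test results down by group. In the sketch below, the records, the group names, and the helpers accuracy_by_group and largest_gap are all hypothetical. It assumes you have recorded, for each test image, which group the patient belongs to alongside the predicted and expected labels.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, predicted_label, expected_label) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted, expected in records:
        totals[group] += 1
        if predicted == expected:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


def largest_gap(rates):
    """Difference between the best and worst per-group success rates."""
    return max(rates.values()) - min(rates.values())


# Invented test results: (group, predicted, expected), 1 = cancerous.
records = [
    ("smoker", 1, 1), ("smoker", 0, 0), ("smoker", 1, 1), ("smoker", 0, 0),
    ("non-smoker", 0, 1), ("non-smoker", 0, 0), ("non-smoker", 0, 1), ("non-smoker", 0, 0),
]

rates = accuracy_by_group(records)
print(rates)
print("gap:", largest_gap(rates))
```

A large gap does not prove bias on its own, particularly with small groups, but it tells you where to point your exploratory tests.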

The Right Tools

These complexities might lead you to conclude that you’ll need highly specialized tooling to test your deep-learning system. However, rest assured that most of the hard work is taken over by the AI developers.

Weight calculations, data processing, and result evaluation are already woven into the neural network during the development process, as they are required right from the beginning. Once the neural network is built, you can pass any data set into it and it will output its results, along with its overall accuracy on that data set. All that’s left to do is to swap your development set for your test set and look at your network’s performance. No new tools are required for that.

It’s All Still Testing

Testing AI systems is not that different from testing deterministic tools. While there are big differences in the details, it’s still the same process: Define your requirements, assess the risk associated with failure for each test case, run your tests, and evaluate whether the weighted, aggregated results are at or above a defined level. Then add some exploratory testing into the mix to find bugs in the form of biased results. It’s not magic; it’s just testing. The same applies to non-functional testing as to functional testing: AI code should still be tested for performance, for its ability to handle load as data sets grow, and especially for security, as we cannot afford for these algorithms to leak access to the data they process and the decisions they make.
