How to leverage machine learning screenshot recognition to identify an application's failed state

Using AI machine learning image tracking to detect app failures 

Image classification is one of the classic use cases for machine learning (ML). Current AI systems are very good at determining whether a picture shows a cat or a dog. Can the same approach tell whether an application is working or has failed, given only a screenshot? It appears so, and that has implications for software testing.

For example, you can add an extra assertion after every test in a UI automation test suite. It works like visual regression testing, except that it is less flaky and cheaper to maintain. It does not catch subtle UI or functionality bugs, but it does recognize when the application crashes or shows an error, which matters a great deal for software quality assurance.

Say you have a test that checks whether the title of a page is correct. It passes even when the rest of the page fails to load. Adding a screenshot comparison to every test case would be flaky and costly to maintain, whereas a second-pass check by AI provides the best of both worlds: tests check only what they need to, and serious crashes do not go unnoticed.
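As a rough illustration of that second pass, the sketch below wraps a screenshot classifier in a helper that a test teardown could call. The `classify` callable, the `Prediction` type, and the 0.5 threshold are assumptions made for this example; they are not part of the original project, which is written in C#.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Prediction:
    label: str    # "pass" or "fail"
    score: float  # classifier confidence in the predicted label, 0..1


def assert_app_not_failed(
    screenshot_path: str,
    classify: Callable[[str], Prediction],
    min_score: float = 0.5,
) -> None:
    """Raise AssertionError if the classifier sees a failed state.

    `classify` is a hypothetical hook around whatever trained model is used.
    """
    prediction = classify(screenshot_path)
    if prediction.label == "fail" and prediction.score >= min_score:
        raise AssertionError(
            f"{screenshot_path} looks like a failed state "
            f"(score {prediction.score:.2f})"
        )
```

Called after each test with the last screenshot taken, a helper like this leaves the test's own checks untouched and only adds the failed-state check on top.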

Retraining an ML image classification model to recognize a failure  

NOTE: This proof-of-concept project is based on a tutorial on transfer learning from Microsoft. If you want to try this exercise on your own, you can either read the article or skip ahead and download the source code. (As you might guess, C# was used here, not Python. Machine learning is literally everywhere now.)

In this article, I show how a pre-trained image classification model from TensorFlow is reused and retrained to recognize new image classes. The model is already good at image recognition, so there’s no need for huge amounts of data to retrain it. I still used quite a lot of data, which might not have been necessary; it’s up to the tester to experiment.

The 7-step process 

1. Update the preset images with information that’s relevant to tracking failure.  

The example Microsoft offers includes a few pictures of pizza, toasters, and teddy bears. Replace these preset images with screenshots that are relevant to your product, labeled in whatever way is most meaningful for it. To track failure, use screenshots from the application and divide them into two classes: pass and fail.

2. Create a screenshot database.  

The screenshots illustrated below come from an extensive suite of UI automation tests, executed by a framework that takes a screenshot on every click in a test. Creating a screenshot database is therefore as simple as running all the tests: ~1.5k test cases generate ~70,000 screenshots.

3. Dedupe screenshots.  

Using AntiDupl (an open-source tool), remove duplicates and near-duplicates from the screenshots. This leaves about 700 screenshots.
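AntiDupl also detects near-duplicates by comparing image content, which is what brings the collection down so far. As a much simpler illustration of the idea, the sketch below flags only byte-identical files by hashing them; it is not a replacement for AntiDupl, and the function name and the `*.png` pattern are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def find_exact_duplicates(folder: str) -> List[Path]:
    """Return files whose bytes match an earlier file in the folder."""
    seen: Dict[str, Path] = {}
    duplicates: List[Path] = []
    for path in sorted(Path(folder).glob("*.png")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)  # same content as seen[digest]
        else:
            seen[digest] = path
    return duplicates
```

The returned paths can be reviewed or deleted; the first file with each hash is always kept.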

4. Review the deduplicated screenshots manually and move all failed-state screenshots into a separate folder.

Move every screenshot that shows a failed state (error messages, frozen loading, etc.) into a separate folder, which then serves as training data for the failure class. In addition to running tests, it’s possible to collect many more failed-state screenshots from a long history of debugging failed tests in pipelines.

The images below show examples of a good state after removing duplicates.

Samples of a failed state below:

5. Carefully review and follow the 20-80 split rule.  

After careful review, ~600 good and ~100 bad screenshots remain. Following the 20-80 split rule, move a random 20% of the screenshots from each group into separate folders for testing predictions. To choose screenshots randomly, try this simple PowerShell script:

Get-ChildItem TrainingScreenshotsFolder | Get-Random -Count <number of screens divided by 5> | Move-Item -Destination TestingScreenshotsFolder

These held-out screenshots are then used to test the model after training.
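For readers who prefer a cross-platform alternative to the PowerShell one-liner, the 20% hold-out step might look like the following Python sketch. The function name, the fixed seed, and the folder arguments are assumptions made for this example; the fixed seed is there only to make the split reproducible.

```python
import random
import shutil
from pathlib import Path
from typing import List


def move_random_fraction(
    src: str, dst: str, fraction: float = 0.2, seed: int = 0
) -> List[str]:
    """Move a random fraction of the files in `src` into `dst`.

    Returns the sorted names of the files that were moved.
    """
    files = sorted(p for p in Path(src).iterdir() if p.is_file())
    count = round(len(files) * fraction)
    chosen = random.Random(seed).sample(files, count)
    Path(dst).mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.move(str(path), str(Path(dst) / path.name))
    return sorted(p.name for p in chosen)
```

Run it once per class (once for the pass folder, once for the fail folder) so that both classes are represented in the test set in the same proportion.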

6. Create .tsv (tab-separated values) files. 

Create .tsv files for the training and testing sets so that all the data is in place to run the sample program.
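Generating these files by hand is tedious for hundreds of screenshots, so a small script helps. The sketch below assumes each row holds an image file name followed by its label, separated by a tab, and that screenshots are sorted into one folder per class; check the tutorial's sample data for the exact layout it expects. The function name and folder names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict


def write_label_tsv(folders_by_label: Dict[str, str], tsv_path: str) -> int:
    """Write one "<file name><TAB><label>" row per PNG and return the row count."""
    rows = []
    for label, folder in folders_by_label.items():
        for image in sorted(Path(folder).glob("*.png")):
            rows.append((image.name, label))
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return len(rows)
```

A call such as `write_label_tsv({"pass": "train/pass", "fail": "train/fail"}, "train.tsv")` would produce the training file; the same call against the held-out folders produces the testing file.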

7. Run the sample program. 

The training only takes a few minutes. The prediction results of ~100 screenshots look like this: 

Image: fail (1).png predicted as: fail with score: 0.6008172 
Image: fail (10).png predicted as: fail with score: 0.97729 
Image: fail (5).png predicted as: fail with score: 0.9377902 
Image: fail (6).png predicted as: fail with score: 0.9007339 
Image: fail (7).png predicted as: pass with score: 0.7122028 
Image: fail (8).png predicted as: fail with score: 0.9788796 
Image: fail (9).png predicted as: fail with score: 0.8952426 
Image: pass (1).png predicted as: pass with score: 0.9713982 
Image: pass (10).png predicted as: pass with score: 0.99999 
Image: pass (100).png predicted as: pass with score: 0.8799981 
Image: pass (101).png predicted as: pass with score: 0.9698668 
Image: pass (95).png predicted as: pass with score: 0.9999971 
Image: pass (96).png predicted as: pass with score: 0.9996496 
Image: pass (97).png predicted as: pass with score: 0.9871352 
Image: pass (98).png predicted as: pass with score: 0.8122959 
Image: pass (99).png predicted as: pass with score: 0.9991549

Notice that just one of the results is wrong: fail (7).png was predicted as pass, which means a real failure slipped through. All of the results omitted from the listing above were predicted correctly.
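Scanning a hundred lines by eye is error-prone, so the output can be tallied with a few lines of code. The sketch below parses prediction lines in the format shown above and assumes, as in this data set, that each file name starts with its true label (pass or fail). The function and key names are made up for this example.

```python
import re
from typing import Dict, List

LINE = re.compile(
    r"Image: (?P<name>.+?) predicted as: (?P<predicted>\w+) "
    r"with score: (?P<score>[\d.]+)"
)


def summarize(lines: List[str]) -> Dict[str, object]:
    """Count correct predictions and list the misclassified files."""
    total = 0
    correct = 0
    missed_failures: List[str] = []  # failed state predicted as pass
    false_alarms: List[str] = []     # good state predicted as fail
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # skip anything that is not a prediction line
        total += 1
        expected = match["name"].split(" ")[0]  # "fail (7).png" -> "fail"
        if match["predicted"] == expected:
            correct += 1
        elif expected == "fail":
            missed_failures.append(match["name"])
        else:
            false_alarms.append(match["name"])
    return {
        "total": total,
        "accuracy": correct / total if total else 0.0,
        "missed_failures": missed_failures,
        "false_alarms": false_alarms,
    }
```

Separating missed failures from false alarms is a deliberate choice: for a safety-net check, a failed screen classified as pass is usually the more costly mistake.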

By completing these tasks, testers can now programmatically and easily recognize whether an application has failed. Use this power responsibly!

Implications of ML on testing 

There are many interesting potential uses for failed-state recognition in software testing. One could imagine an autotest that clicks on UI elements in a random sequence and checks whether the app has failed after each interaction. Such a test could report the failure-producing sequences, if there are any. Possibilities like this raise important questions about the role of automation in the QA process.

Machine learning is undoubtedly a powerful tool that will allow testers to do much more in the coming decades. The prospects for software quality are excellent, and testers have an opportunity to become pioneers in this interesting new world of artificial intelligence in software testing.
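The random-click idea can be sketched in a few lines. Everything here is hypothetical: `click` stands in for whatever UI driver performs the interaction, and `app_failed` stands in for taking a screenshot and running it through the trained classifier.

```python
import random
from typing import Callable, List, Optional, Sequence


def find_failing_sequence(
    elements: Sequence[str],
    click: Callable[[str], None],
    app_failed: Callable[[], bool],
    max_steps: int = 50,
    seed: int = 0,
) -> Optional[List[str]]:
    """Click random elements until the app fails.

    Returns the click sequence that led to the failure,
    or None if no failure occurred within `max_steps` clicks.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    history: List[str] = []
    for _ in range(max_steps):
        element = rng.choice(elements)
        click(element)
        history.append(element)
        if app_failed():
            return history
    return None
```

Because the generator is seeded, a sequence that produced a failure can be reproduced by rerunning with the same seed, which makes the resulting bug report actionable.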