This blog is based on a talk, “What to do when tests fail!”, that I presented a few weeks back at Selenium Conference 2020 along with my teammate Sandeep Yadav. We discussed everything about test failures: how they impact overall app quality, common causes of failures & how to prevent them. We also went over some of the best practices for functional test automation, and finished by covering techniques for detecting failing tests at the earliest phase & fixing them faster.
So let’s start by understanding a little about test failures & their impact. Tests are meant to detect application issues & should fail only when the functionality doesn’t work as expected. A flaky test, on the other hand, is one that passes or fails without any change in the application. Flaky tests are the Achilles’ heel of test automation and can lead to its downfall.
According to Greek mythology, Achilles was dipped into the river Styx by his mother Thetis to make him invulnerable. The only portion of his body not immersed in the water was his heel, by which his mother held him. As a result, the heel was the only vulnerable part of his body. He was later killed by an arrow that struck his heel.
Test Failures – What To Do When Tests Fail?
- Impact of Test Failures
- Why Do Tests Fail & How to Prevent Tests from Failing
- Automation Best Practices – Let’s Reduce Test Failures Further
- Fail Fast – Detect Failing Tests Earlier
- Fix Faster – How to Fix your Tests Faster
- Reduced stability of the Product: Test failure analysis is a time-consuming process. If the number of test failures is high there is a high probability that proper test failure analysis is not done every single time. This ultimately results in reduced stability of the product.
- Reduced productivity of automation: Flaky tests can also prove quite costly, since they often require engineers to re-trigger entire builds on CI and waste a lot of time waiting for new builds to complete successfully.
- Impacts time, money & trust: Test failures increase test creation & execution costs, and impact the business, as releases are frequently delayed by the excessive time required to analyze flaky tests. This can result in a complete loss of trust in automation, with people abandoning it and going back to testing manually.
Test flakiness is such a common problem that it impacts all organizations, big or small. According to a publication by Google based on their test execution results:
- 84% of transitions from Pass -> Fail were due to “flaky” tests.
- Only 1.23% of tests ever found a breakage.
- Almost 16% of 4.2 million tests had some level of flakiness.
- They spend between 2–16% of their computing resources in re-running these flaky tests.
Failures Due to Changes in Application
The first cause on our list of why tests fail is not keeping automation updated. New features are constantly being developed and need to be covered by your automated tests, and existing test cases need to be updated whenever a flow changes.
Failures Due to Browser Upgrades
Your perfectly running tests might start failing after the browser gets updated to a newer version. This needs to be taken care of by updating the version of the browser driver being used, since these drivers are tied to specific browser versions: each Chrome version, for example, has a corresponding ChromeDriver version.
Similarly, geckodriver has a minimum recommended Firefox version.
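One way to keep the driver binary in sync automatically is the open-source WebDriverManager library (io.github.bonigarcia); a minimal sketch, assuming that dependency and Selenium are on your classpath:

```java
// Sketch: resolving a matching driver automatically so browser upgrades
// no longer break runs. WebDriverManager downloads the chromedriver
// version that matches the locally installed Chrome.
import io.github.bonigarcia.wdm.WebDriverManager;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DriverSetup {
    public static WebDriver createChromeDriver() {
        WebDriverManager.chromedriver().setup(); // resolves & downloads a matching driver
        return new ChromeDriver();
    }
}
```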
Failures Due to Poorly Written Locators
- Not using IDs: Not using IDs for locating elements is a mistake I have seen many people make. This sometimes happens due to a framework-level restriction that mandates using XPaths to keep things generic. Finding elements by ID is both fast and stable, so if an element has one, use it. Otherwise, if possible, ask your dev team to add one; IDs are easy to add, and assigning them to elements is good practice. Fall back to CSS or XPath locators only when getting IDs added is not feasible.
- Using absolute XPaths: One of the worst things you can do to your tests is use absolute XPaths. Always prefer relative XPaths over absolute ones, as they are more robust.
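To make the preference concrete: an absolute XPath anchors at the document root and breaks whenever anything above the target element changes, while a relative one searches by attributes. Below is a hypothetical lint-style helper a team could run over its locator definitions; the locator strings are illustrative placeholders:

```java
// Sketch: a guard that rejects absolute XPaths in locator definitions.
public class LocatorLint {
    // An XPath is "absolute" when it anchors at the document root
    // ("/html/...") rather than searching relatively ("//...").
    public static boolean isAbsoluteXPath(String xpath) {
        return xpath.startsWith("/") && !xpath.startsWith("//");
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteXPath("/html/body/div[2]/form/input[1]")); // true - brittle
        System.out.println(isAbsoluteXPath("//input[@id='loginemail']"));       // false - preferred
    }
}
```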
Failures Due to Coding Mistakes
- Not covering all possible application states: It’s super important to have good knowledge of all the flows & possible conditions of your app. It will help you write robust scripts covering all the possible states of the application under test, preventing your tests from failing intermittently.
- Writing dependent tests: Writing dependent tests is like lining up dominoes: one failure knocks down all the rest. Instead, write independent tests to make your suite more robust and less prone to cascading failures.
- Not using waits: A major reason for tests being flaky is using sleeps instead of waits. Avoid sleeps, as you can never be sure whether the page is in the expected state; your test will pass sometimes & fail at other times, making it appear flaky. Instead, use an implicit wait to set a timeout at the driver level:
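A minimal sketch of setting an implicit wait (Selenium 3 API, matching the snippet style in this post; the URL is a placeholder):

```java
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class ImplicitWaitExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        // Every findElement call will now poll for up to 10 seconds
        // before throwing NoSuchElementException.
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        driver.get("https://example.com");
        driver.quit();
    }
}
```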
Or make use of explicit waits to check for a particular condition to become true:

```java
WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.elementToBeClickable(By.id("loginemail")));
```

Using these waits correctly will help you achieve stable test results.
Failures Due to Execution Environments
Tests can also fail due to differences in execution environments. You should always keep in mind the environments you are going to run your scripts on. Test or staging environments may have different hardware configurations & can be slower or faster than you expected. Similarly, your automation execution infrastructure can have slower machines than your laptop. Network conditions might also differ: network latency issues are especially likely if you execute your cases on a cloud testing provider.
Failures Due to Parallel Execution
Parallel execution is no doubt a great feature to have, but you should always remember the following:
- Hardware configuration of machines: Grid machines might again be slower than your laptop, and there is a limit to how many tests you can run concurrently; beyond it, tests start failing more often.
- Software configurations also impact your tests: a different browser version, or an OS version that differs from your local system or varies across grid machines, can make a passing test fail.
- You also need to take care of data sharing between tests running in parallel. For example, your tests might use the same user credentials. This is fine when tests run sequentially, but once tests run concurrently you need to check whether your application allows multiple simultaneous sessions.
- Another case is race conditions, where multiple tests make changes on the same page and validate the entered info at the same time. These tests will start failing & appear flaky.
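One simple way to avoid data collisions between concurrent tests is to give each test its own data instead of shared credentials. A minimal sketch (the prefix-based naming scheme is a hypothetical placeholder):

```java
// Sketch: generating per-test unique identifiers so parallel runs never
// collide on shared data such as user credentials.
import java.util.UUID;

public class TestDataFactory {
    // Each call returns a username that no other concurrently running
    // test can be using, avoiding simultaneous-session conflicts.
    public static String uniqueUsername(String prefix) {
        return prefix + "_" + UUID.randomUUID().toString().substring(0, 8);
    }
}
```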
Till now, we have looked at some common causes of test failures & how we can prevent them. Now let’s look at some best practices that will help us reduce the test failures further.
Know How Selenium Works
The first best practice is knowing how Selenium works inside & out. There are some great sites on the internet where you can learn about it, but I highly recommend going through the documentation on the official site, selenium.dev.
You can also have a look at the Selenium repo on GitHub for more details about any Selenium class or function you are using. If you think you have found a Selenium bug, look for any existing issues that people might already have filed under GitHub Issues.
Use Source Control
Use source control to manage your code better, especially when multiple people are working on the same codebase. You can use Git or SVN for this.
Know What to Automate
We need to know what to automate. Generally speaking, the more repetitive the test, the better a candidate it is for automation. So look at business-critical paths, tests that need to run against a variety of data, or flows that are tedious to repeat manually. Also, remember that tests are not the only candidates for automation: tasks such as setting up or creating test data for manual testing are also great candidates.
Know Where to Automate
Here we have the famous testing pyramid, which shows us that unit tests are fast & UI tests are slow. Unit tests also have a lower cost attached to them than UI tests, both because they detect issues earlier and because fixes require less effort. So our aim should be to have a good chunk of our tests at the unit or service/API level & fewer tests at the UI level.
I highly recommend learning more about the testing pyramid.
Spend Time on Designing your Scripts
Another important practice is to not jump into coding directly. Understand more about your app’s functionality & think about the best way to design your scripts.
Now let’s discuss some techniques of getting to know details about failing tests faster, so you can take corrective action at the earliest stage.
Use Static Code Analysis
Static code analysis is a method of debugging by examining source code before a program is run. It’s done by analyzing a set of code against a set (or multiple sets) of coding rules. Some static code analysis tools which you can use based on your requirement are:
- PMD: identifies potential problems mainly unused and duplicated code, unused variables, empty catch blocks, unnecessary object creation, and so on.
- Checkstyle: analyzes source code and looks to improve the coding standard. It verifies the source code for coding conventions like headers, imports, whitespaces, formatting, etc.
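As an illustration of what these tools catch, here is a sketch of the kind of defect PMD’s EmptyCatchBlock rule flags before any test is ever run (the price-parsing scenario is a made-up example):

```java
// Sketch: a silently swallowed exception versus a properly surfaced one.
public class ParseHelpers {
    // PMD would flag this: the parse failure disappears and the test
    // later fails far away from the real cause.
    public static int parsePriceBad(String raw) {
        int price = 0;
        try {
            price = Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // empty catch block -- flagged by EmptyCatchBlock
        }
        return price;
    }

    // The fixed version fails loudly at the point of the problem.
    public static int parsePriceGood(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Unexpected price text: " + raw, e);
        }
    }
}
```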
Running Tests on Under-Development Builds
Automation should be started as early as possible and run as often as needed. The earlier you start automation in the life cycle of the project, the better: it will help you catch issues early, & additionally you will learn of any impact on your existing automation scripts. Bugs detected early are a lot cheaper to fix than those discovered later in the development cycle or in production.
Triggering Slack Notification with Failure Details
You can send a notification about a test case failure on your Slack channel. This can be done in two ways:
- You can use the “Slack Notification” plugin in Jenkins.
- If you want customized messages on your Slack channel, you can try using Incoming webhooks. A good use case for using this can be informing of any configuration failures or any P1 issues so that they can be acted upon quickly.
- TestProject‘s free automation platform also easily enables integrating your (automatically created) reports directly to your Slack channel.
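A minimal sketch of posting a failure notice through a Slack incoming webhook with Java 11’s built-in HTTP client. The webhook URL is a placeholder you generate in Slack’s settings; only the `{"text": ...}` payload shape comes from Slack’s documented webhook format:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SlackNotifier {
    // Build the minimal JSON payload Slack's incoming webhooks accept,
    // escaping backslashes and quotes in the message body.
    static String buildPayload(String message) {
        String escaped = message.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"text\": \"" + escaped + "\"}";
    }

    // POST the failure details to the (placeholder) webhook URL.
    static void notifyFailure(String webhookUrl, String testName, String error)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(webhookUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        buildPayload("Test failed: " + testName + " -- " + error)))
                .build();
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```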
Triggering Email Notification with Failure Details
You can also trigger an email notification for high-priority tests in case they fail. For this, you can write a custom function for sending emails using the JavaMail jar in your project, or utilize the built-in capability by TestProject for setting email notifications.
Triggering SMS with failure details
Similarly, you can implement an SMS notification service for test case failures. You can send messages reporting the status of a single test, or even an aggregate summary of the total number of passes/fails. Many SMS service providers offer API support for sending messages; if your organization is already using one for a business use case, you should be able to reuse the same service quite easily to learn the status of your tests faster.
Understanding Common Exceptions
Knowledge of some common Selenium exceptions in advance can help you to understand the root cause of test case failure and help fix them quickly.
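For exceptions that are known to be transient (a stale element reference, for example), a generic retry helper can separate genuine failures from timing noise. A sketch in plain Java; which exception type counts as transient is up to the caller:

```java
import java.util.function.Supplier;

public class Retry {
    // Re-run an action up to maxAttempts times while it throws the given
    // transient exception type; any other exception, or the final failed
    // attempt, is rethrown so real failures still surface.
    public static <T> T withRetry(Supplier<T> action,
                                  Class<? extends RuntimeException> transientType,
                                  int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (!transientType.isInstance(e)) throw e;
                last = e;
            }
        }
        throw last;
    }
}
```

Note that retrying is a mitigation: a test that needs the retry is still worth investigating for its root cause.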
Visualization of Test Executions Over Time
You can build up a visualization of your test executions over time using tools like Kibana & Elasticsearch. This will give you insights such as which cases tend to fail more often or are flaky, which exceptions most commonly cause your scripts to fail, or which test cases take the longest to execute. You can use this info to drill down and identify the root causes of your test flakiness quickly. For more info, you can refer to this elastic blog or this Selenium Conf talk.
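Even without a full Kibana setup, the core flakiness signal is simple to compute from stored run history: count how often a test flips between pass and fail, the same transition metric the Google numbers above are based on. A sketch:

```java
import java.util.List;

public class FlakinessStats {
    // Count pass<->fail transitions in a test's execution history.
    // A test that flips often without code changes is a strong
    // flakiness candidate.
    public static int transitions(List<Boolean> passHistory) {
        int flips = 0;
        for (int i = 1; i < passHistory.size(); i++) {
            if (!passHistory.get(i).equals(passHistory.get(i - 1))) flips++;
        }
        return flips;
    }
}
```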
If coding standards are followed, the code stays consistent and can be easily maintained. The need arises because multiple people might be working on the same codebase, in parallel or over a period of time. Using a framework for automation testing will increase a team’s test speed and efficiency. To avoid hard-coding locators, try using the POM (Page Object Model).
Reporting not only makes you aware of the status (success or failure) of your automation runs, it also helps you find the root cause of bugs quickly. So you should spend time deciding how you are going to generate reports for your Selenium WebDriver project. There are many reporting options available, such as TestNG reports, Extent Reports (more readable & has pie charts) & Allure Reports (provides annotations such as @Severity, @Step, etc.). You can also use Selenium-based solutions such as TestProject, which provides out-of-the-box automatic reports (including screenshots!). You can read more about the comparison between various reporting tools in this article: The 8 Leading Test Reporting Tools for Selenium.
Logging is the process of writing log messages during the execution of a program to a central place. By using logging properly, you will be able to save a lot of debugging time. There are many options for enabling logging: you can use log4j, or the ExtentTest methods available in Extent Reports.
Using Screenshots and Video Recording
If a test case fails in an automation run, it can be time-consuming to re-test the entire flow in order to reproduce the issue and find the failure cause. This is where screenshots & video recordings come in handy, as they help you inspect test failures instantly. Tracking the execution becomes much easier, especially if you are working with a headless browser. To capture a screenshot in Selenium, we can make use of an interface called TakesScreenshot (or use TestProject’s built-in capability). For video recording, you can use the open-source ffmpeg tool. Using ffmpeg, it’s also possible to take multiple screenshots of your execution at small intervals and then merge them into an .mp4 video file for you to review the execution.
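A minimal sketch of a screenshot helper using Selenium’s TakesScreenshot interface; the output directory is an arbitrary choice, and the method is meant to be called from a test listener when a case fails:

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;

public class ScreenshotUtil {
    // Capture the current page to target/screenshots/<name>.png.
    public static void capture(WebDriver driver, String name) throws Exception {
        File src = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        Path dest = Path.of("target/screenshots/" + name + ".png");
        Files.createDirectories(dest.getParent());
        Files.copy(src.toPath(), dest, StandardCopyOption.REPLACE_EXISTING);
    }
}
```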
What is a Flaky Test in Software Testing?
A flaky test is one that fails to produce the same result each time it is run against unchanged application code. Sometimes it will show that the code passed the test and the application worked as planned, and other times it will show that the code failed the test and didn’t work as planned.
What Should I Do When a Test Fails?
Tests usually fail due to server and network issues, an unresponsive application or validation failure, or scripting issues. When failures occur, it is necessary to handle these test cases and rerun them to get the desired output. An efficient way to do this is to use the TestNG suite.
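With TestNG specifically, the rerun can be automated with an IRetryAnalyzer. A minimal sketch, assuming TestNG is on your classpath; the retry count is an arbitrary choice, and retrying should supplement, not replace, fixing the underlying flakiness:

```java
import org.testng.IRetryAnalyzer;
import org.testng.ITestResult;

public class RerunFailed implements IRetryAnalyzer {
    private static final int MAX_RETRIES = 2;
    private int attempts = 0;

    @Override
    public boolean retry(ITestResult result) {
        // Returning true asks TestNG to re-run the failed test.
        return attempts++ < MAX_RETRIES;
    }
}
```

You would then attach it to a test with `@Test(retryAnalyzer = RerunFailed.class)`.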
What are the Causes of Failures in Software Testing?
- Environmental conditions, which might cause hardware failures or changes in any of the environmental variables.
- Human Error while interacting with the software by keying in wrong inputs.
- Failures may occur if the user tries to perform some operation with the intention of breaking the system.
And we are done 😎
I hope this blog has given you a good overview of what to do when your tests fail!
Thanks for reading and Happy Testing! 🚀