How to Automate Data Quality Tests in Your Database?

In this article, I will be focusing on implementing test automation on data quality, meaning testing the data structure and the data already stored in the database. Rather than, terms such as primary, unique and foreign key constraints. To understand the reasons for this procedure,  learn more about the importance of database unit test automation.

As a database developer, spending time on coding and new requirements, often doesn’t leave enough time to perform as many tests as I would like to. The DBMS has built-in checks that reduce the chance to make mistakes, although errors can still occur for 2 main reasons:

  1. Errors in our database could be the result of missing checks at the moment the database went live. Maybe you’ve jumped on the project when it already went live or you simply need to continue the work somebody already messed up.
  2. Errors could also be the result of data transformation. Just think of building a data warehouse when you want to “compact” data from the operational database. The transform process could be coded poorly and thus led to number of errors and decrease the data quality significantly. 

So, the question is how to find semantic errors and not syntax errors (the DBMS will do this part for you).

From my experience, I would go with a few controls before adding or modifying the data:

  1. Checking the database schema using values from the information schema database. This is possible if we use naming convention in our database. E.g. if we’ll use “client_id” as the name for the foreign key attribute that references the “id” attribute in the “client” table, then it is relatively easy to create queries that will check all such foreign key attributes and to determine if the FK constraint is defined where it should be.
  2. In order to prevent unwanted entries in the database, we should use checks on the front-end but also on the back-end. Using stored procedures to test values before inserting is a good practice because stored procedures provide us with ability to write IF … THEN statements. We can use these statements not only test values before inserts or updates but also to prevent SQL injection. Triggers could also help here but personally I don’t love to use them. The reason is that in some cases maybe you want to perform checks and in others you don’t want to (e.g. for testing purposes). If you use stored procedures, you would create two separate procedures, first one containing all checks and the other without them.

In case the data is already in the database, we can:

  1. Create procedures that will perform checks on all attributes we want to. This approach would require creating one table with list of all attributes we want to test and another list with the set of rules we want to check. For each attribute we would need to define all checks we’ll perform. E.g. If we want to test the postal code attribute we’ll need to state it should be integer number with 5 digits (in Croatia that’s the case). We could use these same checks for any attribute in our model that will be used to store postal code. For email we could also define the REGEX condition this attribute should fulfill.
  2. Another check I would perform, is counting the number of new data that was added to the table in regular intervals. This could point to an error but also that doesn’t need to be the case; if we insert new data in our DWH we can expect that we’ll have the same number of records added to the fact table. If that is not the case, this could be the sign of an error

Generally, my advice is to build a database structure and a set of procedures you could use on any database. Using the same structure will simplify your work tremendously, leaving much more time for other tasks. Make procedures your daily routine to store results of your checks and print down results of all checks. E.g. you could print the message “Everything is OK with the database, you clever handsome guy“ or list of errors in case you were not so clever 🙂

Please always be aware, that  in order to produce nice (and correct!) reports you need to take proper care of the data quality.

Good luck with your coding and testing!