“Take nothing on its looks; take everything on evidence. There’s no better rule.”
— Charles Dickens, Great Expectations
Back in April we ran our first East Anglia Quality Assurance meetup, where Technical Architect Brendon Bezuidenhout-Green gave a great talk on the importance of database testing. The recording can be found here. Since then I’ve been on the lookout for tools that can assist with database testing, and I came across Great Expectations after listening to a podcast.
Great Expectations is a tool that I think can be used by testers to help automate the testing of databases & data. What is it? I hear you ask. Well, shall we take a closer look? 🧐
Table of Contents – Database Testing with Great Expectations
- What is Great Expectations?
- How does Great Expectations work?
- How can we use Great Expectations?
- Getting started with Great Expectations
- Adding & Editing Expectations
Great Expectations is an open source Python framework for writing automated data pipeline tests. It integrates with many commonly used data sources, including MySQL, Postgres, Pandas, SQLAlchemy and many others (check out their website for the full list).
As testers, we’re used to writing assertions for our APIs and UIs to check the outcome aligns with what we expect. Well, Great Expectations lets us do the same for our data. Cool, yeah? 😎
Great Expectations houses an impressive list of built-in assertions, known as expectations (hence the name). Expectations can be run against the structure of the data source or against specific columns of data. For example, running the expect_column_to_exist expectation will check that the UserName column exists in the selected data source, while expect_column_values_to_not_be_null checks that there are no null values in the UserName column.
You can also validate string patterns using regex, check if values fall inside a range or set, confirm that all values are unique, and much more. Check out the full list of expectations here.
Not only that, but you can also use the ‘mostly’ parameter to set tolerance levels on your expectations. This is really handy if you know some of your data won’t meet the expectation but you don’t want your test to fail. For example, expect_column_value_lengths_to_be_between with mostly=0.8 asserts that the AccountNumber values in the data source should be between 5 and 6 characters in length for at least 80% of rows. If the data falls within this tolerance, the expectation will pass.
The great thing is, once you’ve set up your expectations, you can run them against new data sources.
Now, I’m a huge advocate of making automation testing more accessible to all members of the team, which is why I think Great Expectations is so, err, well, great!
Following installation (through pip), team members can write, manage and execute expectations directly from a Jupyter notebook. This is a nice feature, as it removes the need for an external IDE and gives less technical team members a platform to explore the data, write expectations and measure data quality.
Once an expectation suite has been run, an HTML report known as a data doc gets generated. Data docs detail available expectation suites, previous runs and allow team members to drill down into passed & failed expectations.
Looks good, right? We hear a lot about the ‘whole team approach’ to testing these days, and it’s tools like this which really make it possible. And for those who want a bit more, no need to worry: Great Expectations can be run from the command line or through Python and can be integrated into your data pipeline.
Want to find out more? Keep reading and we’ll look at how to set up and run our first expectation suite on a small dataset.
Right, now we know a bit about Great Expectations. Let’s get started and create our first set of expectations. I’m using a Windows machine, so some of the commands may be slightly different if you’re on Linux or a Mac.
To get started, let’s create a directory from the command line.
Now let’s set up a virtual environment and install Great Expectations by running pipenv install great_expectations from inside our new directory.
Cool! Now our virtual environment has been created and Great Expectations is installed. To start our virtual environment we just need to run pipenv shell. Now that we’re inside our environment, we can initialise Great Expectations. The initialisation process will allow us to specify a data source. Once selected, Great Expectations will profile some of the data and produce a sample data doc. This is where we want to get to. It’s going to involve a lot of entering ‘y’ into the command line, but bear with it, you’re nearly there.
We’ll start by running great_expectations init in our command line.
First, Great Expectations will ask for permission to create a new structure inside of your directory so just enter ‘y’ to proceed.
Next, we will be asked if we want to configure a data source. We’ll answer ‘y’ to this too.
We’re going to be using a simple .csv file which contains a few users, but you can set your data source to be a database such as MySQL, or anything else SQLAlchemy supports.
Our data source looks like this:
AccNumber,UserName,EmailAddress,PhoneNumber,DoB
12345,User 1,[email protected],07784521225,2000-01-01
21254,User 2,[email protected],01254123210,1990-12-23
96541,User 3,[email protected],01254778963,1987-05-16
66332,User 4,[email protected],07985451236,1975-02-21
12354,User 5,[email protected],01236321456,1978-11-16
45215,User 6,[email protected],01478522114,1964-06-14
Given that we’re using a .csv file, we can select option 1 to use our file as the data source.
Now we can select 1 for Pandas (we don’t need to worry too much about this at the moment) and then provide the path and filename for our file.
Getting close now. Enter ‘y’ to add your data source to your configuration. A configuration file can contain multiple data sources.
Now that we’ve added our data source, Great Expectations will profile our data in order to create an example data document with some sample expectations.
We’ll choose our data asset by entering 1 and select our user file by entering 1 again.
Feel free to change the name of the suite, but we’ll hit enter to use the default name users.warning.
Done! Great Expectations has created our suite. Let’s create our data document now by, yep you’ve guessed it, entering ‘y’ into our command line.
Great, we’re all set up. We have a suite of expectations and a data document which gives us a nice html output of all our suites and a history of previous runs. Open the data doc by entering ‘y’.
The data doc will then open in your browser.
Before we go on, let’s just recap. So far, we have installed Great Expectations, added our data source to our project and let Great Expectations profile our data and create some example expectations.
We can edit our expectations directly from Jupyter notebooks. To open our notebook, just enter great_expectations suite edit users.warning into the command line.
This will open our notebook. If we scroll down in our notebook, we’ll see the auto-generated expectations.
Let’s add a new expectation to the AccNumber section by adding a new cell via the Insert menu and adding expect_column_values_to_be_unique, which will check that each value in our AccNumber column is unique. Cool, now that’s added, we can run the entire notebook by selecting ‘Run All’ from the Cell menu.
Awesome! We now have an output for each of our expectations. Let’s take a closer look at the expectation we added.
The output has returned success. That’s great! And our data doc has been updated to show our latest run, which includes our new expectation, awesome! 💪
Well, that’s it! 🎉 You’ve made it to the end of the database testing demo, but in reality we’ve only just scratched the surface: we’ve only used a small dataset and the auto-generated suite. We’ve covered the basics, but now it’s time to go explore… Perhaps add some more data to your dataset and rerun your suite, or maybe generate the Python code so you can incorporate it into your existing test process. Most of all, play around and have some fun ✨
To find out more visit the Great Expectations docs.