Testing in DBT: Data and Unit Tests

Testing is a process that is always necessary to ensure that the behavior we expect from an application is also the actual behavior. Without proper testing, uploading a change can cause some uncertainty, as it is difficult to confirm that all logic continues to operate in the same way. Every change, no matter how minor, alters the implemented logic and therefore it is necessary to have safeguards. 

It is often the case that the entire testing process is done manually: from the creation of the scenarios to the validation of the results. However, this can be a problem if another person altering the same code has to recreate the scenarios. It may not even reproduce all the logic to be tested for that code. Therefore, it is always important to have tests integrated within the code itself allowing for constant testing.

To implement these tests, DBT offers several tools that we will mention throughout this post. Although, mainly, we will see two types: tests on data quality and unit tests, which are tested by using the dbt test command.

Data tests

This type of test does not check whether the implemented logic is correct or not, or whether it knows how to deal with anomalous cases correctly. It is the quality of the data that is being tested. Therefore, they are simply assumptions about the data, such as that a column can never be null, that an ID must be unique, etc.

There are two types of data tests in DBT: singular and generic.

Singular data tests

Singular tests are a simple statement made with SQL. They are useful for testing assertions specific to the logic of a singular model. If this logic is shared, it is not recommended to use singular tests, since it would be necessary to write a different test for each model.

SELECT *
FROM {{ ref(model_name) }}
WHERE random_column IS NULL AND random_column_2 = 10 

This test will fail when, in our model, we have that random_column is null and at the same time, random_column_2 is 10 in any record.

Generic data test

Generic tests are tests that are defined with parameters to allow their reuse in several models. This is extremely useful when, for example, several columns share the same logic in different models.

The essential parameters are model and column_name, referring to the model and column respectively.

{% test test_between_valid_ranges(model, column_name) %}
   select *
   from {{ model }}
   where {{ column_name }} < 0 and {{ column_name }} > 100
{% endtest %}

To use this test, it is necessary to define in the .yml of the model on which column it is applied and to indicate its name.

models:
 - name: model_name
   columns:
     - name: percent
       tests:
         - test_between_valid_ranges

To add more arguments, it is necessary to pass them as dictionary in the test definition in the .yml.

models:
 - name: model_name
   columns:
     - name: random_column
       tests:
         - test_name:
             other_argument1: valid_value_1
             other_argument2: valid_value_2

The new declared parameters can be included in the test.

{% test test_name(model, column_name, other_argument1, other_argument2) %}
   select *
   from {{ model }}
   where {{ column_name }} not in (other_argument1,other_argument2)
{% endtest %}

Unit tests

Unit tests are useful in cases where we want to check that the logic implemented in the model represents the desired logic. Since we are testing the logic, the current state of the models is not taken into account. What matters to us is that those models are transformed as we want them to be. Therefore, unlike data testing, we only need the input data and the output data of the model. 

The data can be given in various formats: SQL, CSV, dictionary, etc. However, in the case of testing ephemeral models, only SQL can be used as input data.

CSV

id, location, price
'random_id', amount, 25

SQL

select 'random_id' as id, 'random_location' as location, 25 as amount

Dictionary

{id: 'random_id', location: 'random_location', amount: 25}

In this example, we indicate the input in JSON format.

unit_tests:
 - name: test_metric_average
   description: "Check if average metric is computed correctly"
   model: dim_customers
   given:
     - input: ref('stg_customers')
       rows:
         - {id:1, amount:10}
         - {id:1, amount:15}
         - {id:2, amount:10}
         - {id:2, amount:14}
   expect:
     rows:
       - {id:1, amount:12.5}
       - {id:1, amount:12}
Conclusion

In this article, we have seen different ways of implementing tests in DBT: data quality tests and unit tests. These tests allow us to have greater security over the process, making sure that some behaviours always have to be fulfilled. In addition, they facilitate the change and development of processes without the worry that some element has changed its desired behaviour.

If you found this article interesting, visit the Data Engineering category of our blog to see posts similar to this one and share it with all your contacts. Don’t forget to mention us to let us know what you think @Damavisstudio, see you soon!

Antoni Casas
Antoni Casas
Articles: 18