In this post we are going to talk about how DBT integrates with Spark and how this integration can be useful for us.
DBT is a framework that facilitates data modeling throughout the different stages of the modeling cycle, and it is normally used for analytical purposes. With DBT we add a development layer that adopts a DSL very close to traditional SQL, which speeds up ELT development.
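For example, a DBT model is simply a SQL file with a little Jinja templating on top. A minimal sketch (the model and table names here are hypothetical and not part of the project we use later) could look like this:

-- models/staging/stg_orders.sql (hypothetical model)
-- DBT materializes the result of this query as a view or table
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date
from {{ ref('raw_orders') }}

The {{ ref() }} function is what allows DBT to resolve dependencies between models and build the lineage graph.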
This approach, together with DBT's ability to work across different data warehouse engines, is its main advantage over the alternatives.
Data Transformations with DBT and Spark
Next, we will see a case study of how to use Spark as an engine for transformations modeled with DBT.
The first thing we need is to set up the right environment to develop this POC. In our case, we will use the following binary dependencies:
- spark-3.0.2-bin-hadoop3.2
- python 3.8
Once Spark is unpacked, we run:

spark/sbin/start-thriftserver.sh
With this command, we launch a Thrift Server service in Spark which, in turn, allows us to execute interactive SQL queries on Spark through a JDBC/ODBC connection. The service exposes port 10000 by default, which is what we will use to connect DBT to Spark. If we connect via a database development IDE to the URI ‘jdbc:hive2://localhost:10000’, we can run queries directly on Spark; this is the same principle that DBT leverages for this integration.
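As a quick sanity check before involving DBT, we can point any JDBC-capable SQL client at that URI and run plain Spark SQL. A minimal smoke test (assuming we want a ‘test’ schema, which matches the one configured in profiles.yml below) could be:

-- Run from a JDBC client connected to jdbc:hive2://localhost:10000
show databases;
create database if not exists test;
select 1 as ping;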
The data modeling process is similar to other DBT developments, but it is important to highlight the capabilities that are specific to the Spark integration.
First of all, I would like to highlight that it allows us to develop models that are not only persisted in a database-like, indexed store, but can also be written directly to a file system in all the usual formats (parquet, csv, orc, etc.). This gives us the flexibility to move towards an ETL paradigm, which is particularly interesting since, on many occasions, we may need to work with a data model that is not hosted in a Data Warehouse.
However, in most DBT integrations the Data Warehouse is an indispensable piece of the architecture.
Data modeling project
In this Damavis GitHub repository we can find a very simple codebase that allows us to set up a DBT environment with the necessary dependencies, together with a simple modeling project. So, we clone the repository locally and install the Python dependencies using Poetry. I recommend doing this inside a virtualenv.
pip install poetry
poetry install
Once the dependencies are installed, we must create our profiles.yml, which defines the connection between DBT and the Spark service listening on port 10000.
damavis_dbt_spark:
  target: local
  outputs:
    local:
      type: spark
      method: thrift
      host: localhost
      schema: test
      port: 10000
DBT has several connection methods for Spark (ODBC, Thrift, HTTP and Session):
- ODBC allows us to connect via the Databricks ODBC driver.
- Thrift allows us to connect to a Spark Thrift Server, for example one running on Amazon EMR or Azure HDInsight.
- HTTP allows us to connect to a cluster through an HTTP service; currently (v1.6) it can only be used with Databricks interactive clusters.
- Session allows us to connect to a pySpark session launched locally.
We have now configured the connection between DBT and our local Spark. Let's look at the source code. In the dbt_project.yml file we find the first peculiarity of developing with Spark:
models:
  damavis_dbt_spark:
    +file_format: csv
    +location_root: /hive/damavis
    +materialized: table
Here we see that we can set a default persistence format for our models using “file_format”, and a default storage path using “location_root”. With Spark, DBT supports multiple formats, such as parquet, delta, iceberg, hudi, csv, json, text, jdbc, orc, hive or libsvm.
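These project-level defaults can also be overridden from each model's own config block. A minimal sketch (the model name and parquet location are hypothetical; only ‘example_view’ comes from the example project) could be:

-- models/example_parquet_model.sql (hypothetical model)
-- Overrides the project-wide defaults for this model only
{{ config(
    materialized='table',
    file_format='parquet',
    location_root='/hive/damavis/parquet'
) }}

select *
from {{ ref('example_view') }}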
If we execute dbt run, we will see that the following folder structure is created:
We can also see that Spark's own metadata, such as the _SUCCESS file, is created when the Spark stage finishes successfully.
In the example_partition_model we see some of the options we have for persisting the data:
{{ config(
    partition_by=['year','month','day'],
    file_format='csv',
    options={'header': 'true'},
    buckets=100
) }}
select
id,
2023 as year,
10 as month,
id as day
from {{ ref('example_view') }}
In this way, we can specify persistence options just as we would in Spark with “write.option("header", True).csv("path")”. We can also partition the output using ‘partition_by’, in our case by ‘year’, ‘month’ and ‘day’. Finally, the ‘buckets’ option does not set the number of records per file: it sets the number of buckets the data is split into, and in dbt-spark it only takes effect when combined with the ‘clustered_by’ option.
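Under the hood, for a table materialization dbt-spark ends up issuing a CREATE TABLE ... AS SELECT statement over the Thrift connection. A rough, hand-written approximation of the Spark SQL generated for this model (the exact statement DBT emits may differ) would be:

-- Approximation of the SQL dbt-spark runs for example_partition_model
-- (no bucketing clause is emitted here, since 'clustered_by' is not set)
create table test.example_partition_model
using csv
options ('header' = 'true')
partitioned by (year, month, day)
location '/hive/damavis/example_partition_model'
as
select
    id,
    2023 as year,
    10 as month,
    id as day
from test.example_view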
Conclusion
DBT allows us to develop models directly on HDFS using Spark as the execution engine, letting us leverage the strengths of both frameworks.
On the one hand, DBT helps us model quickly, document our work, ease the learning curve and handle operations that are relatively complicated to build in plain Spark, such as data quality and integrity tests on every run, snapshots, seeds, etc. On the other hand, Spark gives us the muscle of a cluster to deal with large volumes of data without the need for a Data Warehouse system.