We know that the data engineering ecosystem offers a large number of products for processing a company's data, and that most of them provide the tools needed to perform many kinds of tasks (type conversion, data cleansing, validation, etc.). But when is it better to use one or another? What are the advantages and disadvantages of each? Do they have an associated cost? Can someone of any technical profile use them? We have many more questions!
In what follows, we will answer the questions we have just asked, exploring some of the most popular tools today and compiling their key aspects in a comparative table.
Pentaho Data Integration (PDI)
Within the Pentaho suite of tools we find Pentaho Data Integration (PDI), also known as Kettle, which provides utilities for integrating and transforming data (ETL) from different sources. It is aimed at a data analyst profile, although any user with Business Intelligence skills can work with it, since it provides a graphical user interface.
PDI is an open source system designed to process static (batch) data using row-based transformation models. It can collect data from a cloud storage system such as AWS, from databases, or from local files in various formats (TXT, JSON, CSV, etc.), transform it, and load it into a storage system such as a data warehouse or a data mart.
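PDI pipelines are built graphically rather than in code, but the row-based transformation model it applies can be sketched in plain Python. The input data, field names and cleansing rules below are invented for the example; each row flows through a transformation step that converts types and drops invalid records, as a PDI step would:

```python
import csv
import io
import json

# Hypothetical raw CSV, as extracted from a local file or cloud bucket.
raw_csv = """id,name,amount
1,alice,10.5
2,bob,not_a_number
3,carol,7.25
"""

def transform_row(row):
    """One row-based transformation step: type conversion plus validation."""
    try:
        row["amount"] = float(row["amount"])  # type conversion
    except ValueError:
        return None  # data cleansing: drop rows with invalid amounts
    row["name"] = row["name"].strip().title()
    return row

rows = csv.DictReader(io.StringIO(raw_csv))
clean = [r for r in (transform_row(dict(row)) for row in rows) if r is not None]

# "Load" step: serialize the cleaned rows for a downstream warehouse table.
print(json.dumps(clean))
```

In PDI each of these operations (type conversion, filtering, output) would be a separate step in the graphical pipeline, connected by hops that pass rows from one step to the next.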
Some of the disadvantages of Pentaho include the fact that, although it has a graphical interface, its aesthetics are not to everyone's taste, and its performance is low compared with distributed ETL tools. In addition, each tool in the Pentaho suite is built by a different team, so their interfaces look radically different from one another. PDI also comes in two editions, the Pentaho Community Project and the Pentaho Enterprise Edition. The former is completely free, but offers a smaller set of utilities and requires more pipeline steps to accomplish the same tasks as the Enterprise version; the latter provides the complete utility pack, but its price is not public and must be requested from the company.
Apache Spark
The Apache Spark distributed computing framework gives us utilities to analyze and transform large volumes of data on a cluster of machines. Since it has no application with a graphical user interface, to use it we need a system that runs the code, such as a notebook on a local or virtual machine, or a platform like Databricks. Instead of a GUI, it provides APIs for Java, Scala, Python and R, which means its use is aimed at a developer profile in any of these languages.
Spark lets us use data parallelism to optimize processing speed, at the cost of heavy RAM and disk usage that the developer distributes manually: you can select which data stays in memory and which is spilled to the hard disk. It also has the advantage that all its utilities are unified, so that, unlike Pentaho, Spark can apply all the ETL processes to a data set and generate charts from it within the same task pipeline. In addition, it can work with real-time data. It is free software, being an open source system with an active community of developers maintaining it.
On the other hand, since it has no graphical interface and offers a very flexible project configuration system, it is drastically more complex to use than Pentaho: it requires programming skills in one of the languages mentioned above and the knowledge to set up development environments. In addition, the heavy memory usage that gives Spark its speed means that what is saved on software licensing may end up being spent on hardware or cloud resources.
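Spark's own API is not shown here, but the core idea behind its data parallelism (splitting a data set into partitions and transforming each partition independently before combining the results) can be sketched in plain Python with the standard library. The partitioning scheme and the per-partition function below are illustrative, not Spark code:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal partitions, as Spark distributes
    a data set across the executors of a cluster."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    """Transform one partition independently (here: square each value and sum)."""
    return sum(x * x for x in part)

data = list(range(1, 101))
parts = partition(data, 4)

# Each partition is processed in parallel; the partial results are then reduced.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(process_partition, parts))

print(total)  # sum of squares of 1..100
```

In real Spark the same pattern is expressed through transformations such as `map` and `reduce` on a distributed data set, and the memory/disk trade-off mentioned above is controlled with persistence settings rather than by hand.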
Data Build Tool (DBT)
The last tool we will look at is DBT, an open source system that works exclusively with SQL to perform data transformation, so it targets a profile somewhere between those described for the two previous tools. This removes much of Spark's complexity of use, while providing greater flexibility than Pentaho in building transformation pipelines and customizing the models they contain. In addition, its default architecture standardizes the data transformation workflow, which makes the code easier to maintain, favors scalability and improves usability. To this end, it uses Jinja-templated models, snapshots and macros, which give flexibility when programming functionality to solve specific needs.
DBT Core is the part that lives locally on our machine, alongside the code of our transformation pipeline. To make life easier for the developer, its architecture defines a directory layout with a place for each type of resource: analyses, macros, models, seeds and snapshots. It also makes it simple to define tests, either predefined in its libraries or customized by the user.
On the other hand, there is the DBT Cloud platform, a remote system that connects to the git repository where we store the code, runs the project and monitors the data transformation pipeline processes, which lets us track down errors in the code in an intuitive way. Its most characteristic section is the documentation area, where we can find a description of each of our tables, macros and snapshots, as well as inspect their columns, the tables they depend on and other complementary details. Finally, it offers a color-customizable graph giving a global view of the implemented solution.
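As an illustration of the Jinja-templated models mentioned above, a minimal hypothetical dbt model might look like the following. The model, table and column names (`customer_orders`, `stg_customers`, `stg_orders`) are invented for the example; `config()` and `ref()` are real dbt constructs:

```sql
-- models/customer_orders.sql (hypothetical model)
{{ config(materialized='table') }}

select
    c.customer_id,
    c.customer_name,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by 1, 2
```

The `ref()` calls are what let dbt resolve table names per environment and build the dependency graph that DBT Cloud renders in its documentation section.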
Comparative table between DBT, Pentaho and Spark
| Utilities / Tool | PDI | Spark | DBT |
|---|---|---|---|
| Who is it intended for? | Data Analyst | Developer | Mixed |
| Scalability | Horizontal * | Horizontal | Horizontal * |
| Open Source | Community Edition | Yes | Yes ** |
| Graphical interface | Yes | No | DBT Cloud |
| ETL or ELT strategy | ETL | ETL | ELT |
* It is usable in distributed computing environments, but may require additional configuration.
** DBT Core, and DBT Cloud for a single developer, are free. However, DBT Cloud for teams and the Enterprise edition have an associated cost, which can be found on their website.
In this article we have looked at three tools with certain similarities and differences, all of which address the task of transforming the data collected in the Big Data space. When choosing one of them for a project, you have to weigh the numerous factors we have summarized in the table above.
In general, PDI is the most focused on a data analyst profile and has the gentlest learning curve, so it is a good fit for ETL projects with less complex data transformations.
If, on the other hand, we are experienced developers in one of the languages mentioned above and we work with real-time data requiring very complex transformations, Spark is probably the best alternative.
Finally, if our data is highly complex but not real-time, and we want to save development time by using SQL, the best alternative is DBT.