Category Data Engineering

Data Engineering

Apache Parquet: Introduction and key concepts

When dealing with significant amounts of data, the way you store it can make the difference between success and failure. In this post, we’re going to take a look at a file format that may not be the most popular,…

Paul Sasieta
2024-04-19

Data Engineering

Custom Data Source in Spark 3

In 2020, Apache Spark released version 3.0.0, which introduced some changes to the API for defining custom data sources, known within the Spark environment as Custom Data Source. These were previously used through DataSourceV2, which generated confusion and an unintuitive…

Ocean Berlinghieri
2024-04-02

Data Engineering

DBT Models, Snapshots and Materializations

In a previous post we compared DBT, Pentaho and Spark for performing data transformations. In this post, we will look at some of the key concepts of DBT: models, snapshots and materializations. Within the context of DBT,…

Vanessa Pradas
2024-01-25

Data Engineering

Cost optimization best practices in BigQuery

In recent years, BigQuery has become a powerful tool for data storage and analysis in the cloud. Its size, scalability and all the features it offers would be difficult for a user or company to replicate from scratch due to…

Miguel Acedo
2023-12-15

Data Engineering

Differences between DBT, Pentaho and Spark for transforming data

The data engineering ecosystem offers a large number of products for processing a company's data, and most of them provide the necessary tools to perform many types of…

Vanessa Pradas
2023-11-17

Data Engineering

How to Integrate DBT with Apache Spark: Step-by-Step Guide

In this post we are going to talk about how DBT integrates with Spark and how this integration can be useful for us. DBT is a framework that facilitates data model design throughout the different data modeling cycles.…

Óscar García
2023-10-27

Data Engineering

Blocking calls and asynchronous programming with Scala

This post is about the impact that blocking code can have on an application and the importance of using libraries that natively support asynchronous programming. The goal is to understand that a blocking call is always blocking. In the article,…

Paul Sasieta
2023-05-12

Data Engineering, Software

Apache Hop plugin for Apache Airflow

Since the release of our Pentaho PDI plugin for Apache Airflow, we have seen an industry shift towards using Apache Hop for data processing. What is Apache Hop? Apache Hop started (late 2019) as a fork of Kettle PDI, is…

Pere Alzamora
2022-09-30

Data Engineering

Machine Learning in Docker containers

If you have ever shared code, it is quite likely that you have said “well, it works on my machine” on seeing others struggle to run it. Incorrect configuration, version differences or missing dependencies are often some of…

Sergio Pérez
2022-08-11

Data Engineering

Data Governance using Apache Atlas

Today I would like to deal with a topic that, from my point of view, is very important and is probably the holy grail of data engineering projects. However, we rarely reach the necessary level of maturity to be able…

Óscar García
2022-05-13