Category Data Engineering

Data Engineering

Apache Spark Structured Streaming: Practical examples

In a previous article, we gave a theoretical introduction to Spark Structured Streaming, analysing in depth the high-level API that Spark provides for processing massive real-time data streams. There, we looked at the essential theoretical concepts…

Agustín Mora
2024-08-29

Data Engineering

What is Minikube in Kubernetes and how does it work?

In the world of application development, container-based deployment environments are becoming more and more common. Kubernetes has established itself as the standard for container-based deployment. However, for many developers, setting up and managing a complete Kubernetes cluster can be a…

Óscar García
2024-08-08

Data Engineering

OData v4 protocol: Metadata and basic queries

Data processing and consumption are essential elements of the contemporary business world. That is why certain mysterious pieces of software, commonly abbreviated as APIs, play a fundamental role in this traffic of information. APIs (Application Programming Interfaces) are mechanisms for integration…

Lluc Sementé
2024-07-25

Data Engineering

Theoretical introduction to Spark Structured Streaming

In recent years, low-latency data processing, practically in real time, has become an increasingly common requirement for companies in their big data processes. It is in this context that the concept of stream processing is introduced, which refers…

Agustín Mora
2024-07-17

Data Engineering

Data Relationship Models in a Data Warehouse

In the field of Data Engineering, efficient database design is essential to handle large volumes of data and provide effective analysis. Throughout my experience as a Data Engineer, I have worked with the main data relationship systems and have observed…

Óscar García
2024-06-28

Data Engineering

Vector database: What it is and how it works

This article assumes basic knowledge of embeddings of objects, either text or images. If you are not familiar with the subject, the post on Text Embeddings: the basis of modern NLP explains this concept.…

Antoni Casas
2024-05-30

Data Engineering

Apache Parquet: Introduction and key concepts

When dealing with significant amounts of data, the way you store it can make the difference between success and failure. In this post, we’re going to take a look at a file format that may not be the most popular,…

Paul Sasieta
2024-04-19

Data Engineering

Custom Data Source in Spark 3

In 2020, Apache Spark released version 3.0.0, which introduced some changes to the API for defining custom data sources, known within the Spark environment as Custom Data Source. These were previously used through DataSourceV2, which generated confusion and an unintuitive…

Ocean Berlinghieri
2024-04-02

Data Engineering

DBT Models, Snapshots and Materializations

Elsewhere, we compared DBT, Pentaho and Spark as tools for performing data transformations. In this post, we will look at some of the key concepts of DBT: models, snapshots and materializations. Within the context of DBT,…

Vanessa Pradas
2024-01-25

Data Engineering

Cost optimization best practices in BigQuery

In recent years, BigQuery has become a powerful tool for data storage and analysis in the cloud. Its size, scalability and all the features it offers would be difficult for a user or company to replicate from scratch due to…

Miguel Acedo
2023-12-15