Category: Data Engineering

Data Engineering

Data Governance using Apache Atlas

Today I would like to deal with a topic that, from my point of view, is very important and is probably the holy grail of data engineering projects. However, we rarely reach the necessary level of maturity to be able…

Óscar García
2022-05-13

Data Engineering

Concurrency through Futures in Scala

When we imagine a simple programming algorithm, it is logical to think about a succession of instructions that are executed sequentially, where the next instruction will not be executed until the one immediately preceding it has been completed. However, depending…

Nadal Comparini
2022-01-13

Data Engineering

Kafka + Spark for Batch processing

How to leverage streaming technologies like Apache Kafka and Apache Spark for batch processing. The ETL process: the central piece of a Big Data project. Collecting, ingesting, integrating, processing, storing and analyzing large volumes of information are the fundamental activities of a…

Antonio Boutaour
2021-12-16

Data Engineering

Avoiding UDFs in Apache Spark

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, especially when using the Python API, can compromise our application's performance. For this reason, at Damavis we try to avoid their use as much…

Cristòfol Torrens
2021-05-26

Data Engineering

Advanced Airflow: Cross-DAG tasks and sensor dependencies

Advanced Apache Airflow: Cross-DAG tasks and sensor dependencies

In this article we are going to tell you some ways to solve problems related to the complexity of data engineering itself. An Airflow DAG can become very complex if we start including all dependencies in it, and furthermore, this…

Cristòfol Torrens
2021-05-05

Data Engineering

First steps with Apache YARN customization

Basic Single Node Configuration. Note: the code in this post has been tested using Apache Hadoop 2.10.1. Please check out our previous post, Introduction to Apache Hadoop, to configure this version of Hadoop, in case you have not done it…

Daniel Bestard
2021-04-07

Data Engineering

Clean Code with Alpakka Kafka

At Damavis we are very aware of how important it is for our clients to have access to their data in real time. For this reason, one of our strengths is the development of tools and technologies that can move, transform and…

Joan Martín
2021-03-16

Data Engineering

Introduction to Apache YARN

Daniel Bestard
2021-03-03

Data Engineering

Deploying Airflow: CeleryExecutor on Kubernetes

How to deploy Airflow Celery Executor on Kubernetes

What is Apache Airflow and how does it work? One of the work processes of a data engineer is called ETL (Extract, Transform, Load), which allows organisations to have the capacity to load data from different sources, apply an appropriate…

Antonio Boutaour
2021-02-17

Data Engineering

Introduction to Apache Hadoop

Single Node Configuration Without YARN. Sometimes it might be a bit overwhelming to understand the role of the most common open source technologies used in big data contexts. For example, probably most of you have heard about tools such as…

Daniel Bestard
2021-02-12