Category: Data Engineering

Kafka + Spark for Batch processing

How to leverage Streaming technologies like Apache Kafka and Apache Spark for Batch processing

ETL process: the central piece of a Big Data project. Collecting, ingesting, integrating, processing, storing and analyzing large volumes of information are the fundamental activities of a…

Introduction to Apache YARN

Basic Single Node Configuration. Note: the code in this post has been tested using Apache Hadoop 2.10.1. Please check out our previous post, Introduction to Apache Hadoop, to configure this version of Hadoop, in case you have not done it…

Introduction to Apache Hadoop

Single Node Configuration Without YARN. Sometimes it can be overwhelming to understand the role of the most common open source technologies used in big data contexts. For example, most of you have probably heard about tools such as…

Pentaho PDI Plugin for Airflow

Schedule, orchestrate and monitor your Kettle tasks in Airflow with this Pentaho plugin. At Damavis we know the importance of data processing. Extracting, cleaning, transforming, aggregating, loading or cross-referencing multiple data sources allows our clients to have Insights or Predictive…