Introduction to Apache Hadoop

How to configure and run one of the most common open source tools used in big data contexts.

Single Node Configuration Without Yarn


Sometimes it might be a bit overwhelming to understand the role of the most common open source technologies used in big data contexts.

For example, probably most of you have heard about tools such as Apache Hadoop, Apache Spark, Apache Hive, Apache Sqoop and so on. But how do you know when to use one or the other? Does one tool depend on another one?

Well, in most cases the answer is yes. With a post series written by Damavis Studio about the Hadoop Ecosystem, we will try to make everything clearer to you, not only by providing theoretical definitions but also by getting our hands dirty configuring and running the tools explained.

To introduce you to the big data ecosystem we have to start from the bottom. What is the base tool in big data? Of course, the base tool is Apache Hadoop.

Apache Hadoop

Apache Hadoop is an open source framework used to store and process big data in a distributed and fault-tolerant way. Apache Hadoop is composed of several modules. The ones we want to highlight are:

  • Hadoop Distributed File System, also known as HDFS, which is the way Hadoop stores data in a distributed and fault-tolerant way. How is fault tolerance achieved? Simply by replication: data is copied across different nodes, so if a node fails, its data can be read from the replicas on other nodes. Note that this module is about data storage.
  • Hadoop MapReduce, which is the module in charge of applying the MapReduce programming model to process data in a distributed way. Therefore, the goal of this module is to perform data processing.
  • Apache Yarn, which is the component in charge of managing the jobs and resources of the cluster. It answers questions such as: how should a task be distributed across the cluster? Which nodes must be involved in this task? How must the work be redistributed in case of node failure? Therefore, the goal of this module is to perform cluster management.
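To make the MapReduce programming model a bit more tangible, here is a minimal single-machine analogy built with ordinary shell tools rather than Hadoop itself. The map step emits one word per line, sort plays the role of the shuffle that groups equal keys, and uniq -c acts as the reduce step that counts each group. The function name word_count and the sample sentence are purely illustrative.

```shell
#!/usr/bin/env bash
# Single-machine analogy of a MapReduce word count (no Hadoop involved).
# word_count is an illustrative name, not a Hadoop command.
word_count() {
  tr -s '[:space:]' '\n' |   # map: emit one word (the key) per line
    sort |                   # shuffle: bring identical keys together
    uniq -c |                # reduce: count the lines in each group
    awk '{print $2, $1}'     # format each result as "word count"
}

printf 'big data big cluster\n' | word_count
```

For the sample sentence this prints big 2, cluster 1 and data 1, one pair per line. Hadoop MapReduce follows the same three phases, but runs them in parallel over data blocks spread across the nodes.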

In this post we will not introduce you to the configuration and use of Apache Yarn. That is because in order to launch an HDFS and perform MapReduce jobs in a single node configuration there is no need to set up a cluster manager such as Apache Yarn. Do not miss our next post of this series, where we will explain how to use Hadoop with Yarn!

Configuration Set Up

First, let’s download Hadoop, uncompress it and create a symbolic link (the link simply makes a future Hadoop upgrade easier, since only the link has to be repointed). We save the distribution in ~/apache. Feel free to use whatever location best fits your setup.

To get Apache Hadoop, execute the following commands:

wget -P ~/apache
tar -xf ~/apache/hadoop-2.10.1.tar.gz -C ~/apache
rm ~/apache/hadoop-2.10.1.tar.gz
ln -s ~/apache/hadoop-2.10.1/ ~/apache/hadoop

In order to use Apache Hadoop we need to make sure Java 8 is installed on the machine. Execute java -version to check it. If it is not installed, install Java 8 by executing the following commands:

sudo apt update
sudo apt install openjdk-8-jdk -y

In order for Hadoop to run, only two environment variables have to be set, which are JAVA_HOME and HADOOP_CONF_DIR. These two assignments can be done in the file ~/apache/hadoop/etc/hadoop/

The value of these two variables must be:

export HADOOP_CONF_DIR=~/apache/hadoop/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
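The JAVA_HOME value above matches the default OpenJDK 8 location on Debian and Ubuntu for amd64; on other systems the directory may differ. As a rough sketch, the JDK home can be derived from the real location of the javac binary. The helper name jdk_home_of is hypothetical, and the snippet assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/usr/bin/env bash
# Derive a JDK home from the path of one of its binaries, for example
# /usr/lib/jvm/java-8-openjdk-amd64/bin/javac -> /usr/lib/jvm/java-8-openjdk-amd64
# jdk_home_of is a hypothetical helper, not part of Hadoop or the JDK.
jdk_home_of() {
  local real
  real="$(readlink -f "$1")" || return 1   # follow symlinks such as /usr/bin/javac
  dirname "$(dirname "$real")"             # strip the trailing /bin/<binary>
}

if command -v javac >/dev/null 2>&1; then
  echo "Suggested JAVA_HOME: $(jdk_home_of "$(command -v javac)")"
else
  echo "javac not found on PATH; install a JDK first"
fi
```

The sketch resolves javac rather than java because, in a Java 8 JDK, the java launcher usually lives one level deeper (under jre/bin), which would yield the JRE directory instead of the JDK home.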

By default, HDFS replicates data across nodes. Because we are running Hadoop on a single node, there is nothing to replicate to. To disable replication, set the dfs.replication property to 1 in the configuration file ~/apache/hadoop/etc/hadoop/hdfs-site.xml.
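As a minimal sketch of what that file could look like, the snippet below writes an hdfs-site.xml whose only property is dfs.replication set to 1. The variable conf_dir is hypothetical: it defaults to a scratch directory so that the example can be tried without touching a real installation.

```shell
#!/usr/bin/env bash
# conf_dir is a hypothetical variable. It defaults to a scratch directory so
# this sketch never overwrites a real Hadoop configuration by accident.
conf_dir="${conf_dir:-$(mktemp -d)}"

# Replication factor 1: every block is stored once, on the only node we have.
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "Wrote $conf_dir/hdfs-site.xml"
```

To apply it for real, set conf_dir to ~/apache/hadoop/etc/hadoop before running the snippet, keeping in mind that this replaces the existing hdfs-site.xml.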


The next step is to tell Hadoop where HDFS has to be created. Because we are running everything locally, we will create HDFS on localhost.

To make this happen, set the fs.defaultFS property in the configuration file ~/apache/hadoop/etc/hadoop/core-site.xml so that it points to an HDFS URI on localhost.
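A possible core-site.xml for this setup is sketched below. The property fs.defaultFS tells Hadoop clients which file system to use by default. Note that port 9000 is an assumption (a commonly used value in single-node examples), and both conf_dir and namenode_uri are hypothetical variables introduced only for this sketch.

```shell
#!/usr/bin/env bash
# conf_dir is a hypothetical variable. It defaults to a scratch directory so
# this sketch never overwrites a real Hadoop configuration by accident.
conf_dir="${conf_dir:-$(mktemp -d)}"

# ASSUMPTION: port 9000 is a conventional choice, not something this post fixes.
namenode_uri="${namenode_uri:-hdfs://localhost:9000}"

# fs.defaultFS is the default file system URI used by Hadoop clients.
cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>${namenode_uri}</value>
  </property>
</configuration>
EOF

echo "Wrote $conf_dir/core-site.xml"
```

Whatever URI is configured here is the one to use later when addressing HDFS paths from the command line.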


Finally, because starting HDFS opens a connection to localhost, we need to be able to run ssh localhost without being asked for a password. To do that, append your public ssh key to the file that lists the authorized ssh keys:

cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Note that for the previous command to work, you must already have generated ssh keys (run ssh-keygen if no ssh key is present on your machine).
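Running the append command more than once leaves duplicate entries in authorized_keys. The following sketch shows one way to make the step repeatable: it appends the key only if that exact line is missing, and tightens the file permissions. The helper name authorize_key is hypothetical, and the key file is passed as an argument because its name depends on how the key was generated.

```shell
#!/usr/bin/env bash
# authorize_key is a hypothetical helper: it appends a public key to an
# authorized_keys file only if that exact line is not already present.
authorize_key() {
  local pub="$1" auth="$2"
  [ -s "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
  touch "$auth"
  # -x: match whole lines, -F: literal text, -f: read the pattern from the key file
  grep -qxF -f "$pub" "$auth" || cat "$pub" >> "$auth"
  # With StrictModes (the sshd default), files writable by group or others are refused
  chmod 600 "$auth"
}
```

Call it with the path of your public key as the first argument and $HOME/.ssh/authorized_keys as the second, then run ssh localhost to confirm that no password is requested.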

We recommend modifying the environment variable PATH to be able to run Hadoop commands from any location. To do so, add the following line to ~/.bashrc:

export PATH=~/apache/hadoop/bin:$PATH

Remember to run source ~/.bashrc to apply the change in the current session.

Apache Hadoop is configured!

Test your configuration

To make sure that everything has been done properly, let’s launch HDFS and copy a file into it.

To launch HDFS, first execute hdfs namenode -format. This command does not start anything yet: it initializes the storage of the NameNode, the master server that keeps the directory tree of all files. Then execute the script located at ~/apache/hadoop/sbin/ to start the HDFS daemons. Subsequently, open http://localhost:50070 in your browser to see the Hadoop UI.
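Besides the UI, the JDK tool jps lists the running Java processes, which gives a quick way to check that HDFS is up. On a single node, HDFS typically runs three daemons: NameNode, DataNode and SecondaryNameNode. The sketch below reports which of them are missing from a jps-style listing; missing_daemons is a hypothetical helper and the process ids in the example are made up.

```shell
#!/usr/bin/env bash
# missing_daemons is a hypothetical helper. It reads jps-style lines
# ("<pid> <ClassName>") on stdin and prints each expected HDFS daemon
# that does not appear in them.
missing_daemons() {
  local listing daemon
  listing="$(cat)"
  for daemon in NameNode DataNode SecondaryNameNode; do
    # -w matches whole words, so SecondaryNameNode does not count as NameNode
    printf '%s\n' "$listing" | grep -qw "$daemon" || echo "$daemon"
  done
}

# Example with canned input; on a real machine use: jps | missing_daemons
printf '4242 NameNode\n4301 Jps\n' | missing_daemons
```

With the canned input, the helper prints DataNode and SecondaryNameNode. Empty output means all three daemons were found.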

Let’s now create a file called tmp and copy it into our HDFS:

touch ~/tmp
hdfs dfs -copyFromLocal ~/tmp hdfs://

There are two ways to check that the tmp file has been copied to HDFS. You can list the files from the command line:

hdfs dfs -ls hdfs://

Or you can use the UI, by clicking on Utilities/Browse the file system.

Next steps

Soon we will post the second part of this article, where Apache Yarn will be used in our Hadoop Ecosystem, which is the right way to use Hadoop.

If you found this post useful, share it with your contacts so that they can also read it and give their opinion. See you on social networks!
Daniel Bestard