
In today’s digital age, the ability to process and analyze large data sets is crucial for businesses to stay competitive. Apache Hadoop, an open-source software framework, is a powerful tool that allows for distributed processing of large data sets across clusters of computers. This article will guide you through the steps of installing Apache Hadoop on Ubuntu 22.04, a popular choice for a Linux operating system due to its robustness and user-friendly interface.

Requirements

Make sure you meet the following requirements before beginning the installation:

  1. A VPS running Ubuntu 22.04 from Virtono.
  2. A user account with sudo privileges.
  3. Stable internet connection for downloading the necessary files.
  4. Basic knowledge of Linux commands and the terminal.

Let’s begin the process of installing Apache Hadoop on Ubuntu 22.04.

Step 1: Update Your System

First, update your Ubuntu system to ensure you have the latest packages and security patches. Open your terminal and enter the following commands:

sudo apt update && sudo apt upgrade -y

Step 2: Install Java

Apache Hadoop requires Java to run, so we will install OpenJDK, an open-source implementation of Java. Run the following command:

sudo apt-get install default-jdk -y

After the installation, verify the Java version with this command:

java -version

Step 3: Create a New User for Hadoop

For security reasons, it’s recommended to create a separate user for Apache Hadoop on Ubuntu. Use the following commands to create a new user and switch to it:

sudo adduser hadoopuser
su - hadoopuser

Generate an SSH key for hadoopuser so that Hadoop’s startup scripts can log in to localhost without a password. While logged in as hadoopuser, run:

ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
chmod 0700 ~/.ssh

You can confirm that the passwordless login works with ssh localhost.
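The chmod values are not arbitrary: sshd refuses to use an authorized_keys file that is group- or world-readable. The effect of the two modes can be illustrated on a throwaway directory (a demonstration only; the real files live under ~/.ssh):

```shell
# Demonstrate the permission modes SSH expects, using a temporary
# directory so the real ~/.ssh is untouched.
demo=$(mktemp -d)
touch "$demo/authorized_keys"
chmod 0600 "$demo/authorized_keys"  # owner may read/write; no one else
chmod 0700 "$demo"                  # owner may enter/list; no one else
stat -c '%a %n' "$demo/authorized_keys" "$demo"
rm -rf "$demo"
```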

Step 4: Download and Extract Apache Hadoop

Now, as the hadoop user, download a stable Apache Hadoop release from its official website. This guide uses version 3.3.1; check the downloads page for the most recent release. Use wget to download it:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

After downloading, extract the tar file with the following command:

tar xvf hadoop-3.3.1.tar.gz

Step 5: Configure Hadoop Environment Variables

Next, we need to set up the environment variables. Open the .bashrc file with a text editor like nano:

nano ~/.bashrc

Add the following lines at the end of the file:

export HADOOP_HOME=/home/hadoopuser/hadoop-3.3.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Save and close the file. Then, apply the changes with the source command:

source ~/.bashrc
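Before moving on, it is worth checking that the paths in the export lines actually exist; a wrong HADOOP_HOME or JAVA_HOME is a common cause of startup failures. A minimal sanity-check sketch, assuming the paths used above:

```shell
# Sanity-check the directories referenced by ~/.bashrc. The paths are
# the ones assumed in this guide; adjust them if you unpacked Hadoop
# elsewhere or apt installed a different JDK version.
export HADOOP_HOME=/home/hadoopuser/hadoop-3.3.1
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
for d in "$HADOOP_HOME" "$JAVA_HOME"; do
    if [ -d "$d" ]; then
        echo "found:   $d"
    else
        echo "missing: $d (fix the export lines in ~/.bashrc)"
    fi
done
```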

Step 6: Configure Apache Hadoop on Ubuntu

We will now edit Hadoop’s XML configuration files. Open the core-site.xml file:

nano ~/hadoop-3.3.1/etc/hadoop/core-site.xml

Add the following lines inside the <configuration> tag:
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
</property>

Save and close the file. Repeat this step for hdfs-site.xml, mapred-site.xml, and yarn-site.xml files, adding the appropriate properties for each.

nano ~/hadoop-3.3.1/etc/hadoop/hdfs-site.xml

nano ~/hadoop-3.3.1/etc/hadoop/mapred-site.xml

nano ~/hadoop-3.3.1/etc/hadoop/yarn-site.xml
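For a single-node setup, the following minimal properties are commonly used; they are a sketch rather than the only valid configuration, and each snippet goes inside the <configuration> tag of its file. In hdfs-site.xml, set the replication factor to 1, since there is only one DataNode:

```xml
<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
```

In mapred-site.xml, tell MapReduce to run on YARN:

```xml
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
```

In yarn-site.xml, enable the shuffle service that MapReduce jobs rely on:

```xml
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
```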

Step 7: Format the Hadoop Filesystem

Before starting Hadoop for the first time, format the HDFS NameNode. Do this only once; reformatting later will erase the filesystem’s metadata:

hdfs namenode -format

Step 8: Start Hadoop

Finally, start Apache Hadoop on Ubuntu with the start-all.sh script (recent releases deprecate it in favor of start-dfs.sh and start-yarn.sh, but it still works):

start-all.sh

If everything has been set up correctly, Hadoop should now be running on your Ubuntu 22.04 system.

You can check the status of your Hadoop cluster and its components using several methods:

Hadoop Daemon Status – The jps command lists all running Java processes, which lets you check the status of the Hadoop daemons (NameNode, DataNode, ResourceManager, and NodeManager). Run the following command:

jps

The output should list the running Hadoop daemons, such as NameNode, DataNode, ResourceManager, and NodeManager.

HDFS Status – You can check the status of HDFS with the hdfs dfsadmin -report command, which reports the cluster’s capacity, the amount of data stored, the number of DataNodes, and more:

hdfs dfsadmin -report

Web Interface – Hadoop also provides web interfaces that you can use to check the status of your cluster:

Use http://<resourcemanager-host>:8088/ for the YARN ResourceManager UI, and http://<namenode-host>:9870/ for the HDFS NameNode UI (Hadoop 3.x).

Replace <resourcemanager-host> and <namenode-host> with the hostname or IP address of your server.


Final thoughts

Congratulations! You have successfully installed and configured Apache Hadoop on Ubuntu 22.04. With this powerful tool, you can now process and analyze large data sets efficiently. Remember, learning how to use Hadoop effectively can be a game-changer for your data processing needs. Happy data crunching!

