How to Install Hadoop on Linux

Hadoop is a powerful, open-source framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. Setting up Hadoop on a Linux system is a critical skill for data engineers and developers involved in big data analytics. This guide provides a step-by-step tutorial for installing Hadoop on a Linux environment.

Prerequisites

Before installing Hadoop, ensure your system meets the following prerequisites:

A Linux-based operating system (such as Ubuntu, CentOS, or Red Hat)
Java Development Kit (JDK) 8, or a later version supported by your Hadoop release (Hadoop 3.3 and later also run on Java 11)
SSH installed and configured

Step 1: Install Java

First, make sure Java is installed, because Hadoop runs on the Java Virtual Machine. On Debian or Ubuntu, install OpenJDK 8 with the following commands (on CentOS or Red Hat, use the distribution's own package manager instead):

sudo apt-get update
sudo apt-get install openjdk-8-jdk

Verify the Java installation:

java -version
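
Step 4 needs the JDK's installation directory for JAVA_HOME. As a rough sketch, that directory can usually be derived from the resolved path of the java binary. The helper name java_home_from_bin is hypothetical, and the result should be checked against your own system:

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the full path of the java binary.
java_home_from_bin() {
    java_bin="$1"
    candidate="${java_bin%/bin/java}"   # strip the trailing /bin/java
    candidate="${candidate%/jre}"       # JDK 8 packages often nest java under jre/
    printf '%s\n' "$candidate"
}

# Resolve symlinks (such as /usr/bin/java) before deriving the directory.
if command -v java >/dev/null 2>&1; then
    java_home_from_bin "$(readlink -f "$(command -v java)")"
else
    echo "java not found on PATH" >&2
fi
```

The printed directory is a candidate value for JAVA_HOME; confirm that it contains bin/java before using it.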

Step 2: Configure SSH

Hadoop's start and stop scripts use SSH to launch daemons on every node, including localhost on a single-machine setup. Install OpenSSH:

sudo apt-get install openssh-server openssh-client

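Set up key-based authentication so that logins do not prompt for a password, and restrict the permissions on the authorized_keys file:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys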
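
To confirm that key-based login actually works, one option is to attempt a non-interactive connection. This is a minimal sketch: check_passwordless_ssh and the SSH_CMD override are hypothetical names introduced here, and the accept-new setting assumes OpenSSH 7.6 or later:

```shell
# Allow the ssh binary to be overridden (for example in tests); defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

check_passwordless_ssh() {
    host="$1"
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    # StrictHostKeyChecking=accept-new records an unknown host key on first contact.
    "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 \
        -o StrictHostKeyChecking=accept-new "$host" true
}

# Usage:
#   check_passwordless_ssh localhost && echo "passwordless SSH works"
```

A non-zero exit status means the login would have required a password or could not connect at all.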

Step 3: Download and Extract Hadoop

Download a Hadoop binary release from the official Apache Hadoop website and extract it. In the commands below, replace X.Y.Z with the version number of the release you want:

wget https://downloads.apache.org/hadoop/common/hadoop-X.Y.Z/hadoop-X.Y.Z.tar.gz
tar -xzf hadoop-X.Y.Z.tar.gz
sudo mv hadoop-X.Y.Z /usr/local/hadoop
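
Apache publishes a checksum file alongside each release archive, so the downloaded tarball can be checked for corruption. A minimal sketch, assuming GNU sha512sum is available; verify_sha512 is a hypothetical helper, and the expected digest has to be copied from the .sha512 file on the download page:

```shell
# Hypothetical helper: succeed only if the file's SHA-512 digest matches the expected hex string.
verify_sha512() {
    file="$1"
    # Normalize the expected digest to lowercase, which is what sha512sum prints.
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    actual="$(sha512sum "$file" | cut -d' ' -f1)"
    [ "$actual" = "$expected" ]
}

# Usage (the digest is a placeholder, not a real value):
#   verify_sha512 hadoop-X.Y.Z.tar.gz "<digest from the .sha512 file>" \
#       && echo "checksum OK" || echo "checksum MISMATCH"
```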

Step 4: Configure Hadoop Environment Variables

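Add the following lines to ~/.bashrc to set the Hadoop environment variables. The sbin directory must be on the PATH as well, because the start and stop scripts live there. Adjust JAVA_HOME if your JDK is installed in a different directory:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64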

Apply the changes:

source ~/.bashrc
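
A quick sanity check can catch a mistyped path before any daemon is started. The following is a sketch only; check_hadoop_env is a hypothetical helper that tests whether the two variables point at directories containing the expected executables:

```shell
# Hypothetical helper: report whether HADOOP_HOME and JAVA_HOME look usable.
check_hadoop_env() {
    status=0
    if [ ! -x "${HADOOP_HOME:-}/bin/hadoop" ]; then
        echo "HADOOP_HOME does not point to a Hadoop install: '${HADOOP_HOME:-}'" >&2
        status=1
    fi
    if [ ! -x "${JAVA_HOME:-}/bin/java" ]; then
        echo "JAVA_HOME does not point to a JDK: '${JAVA_HOME:-}'" >&2
        status=1
    fi
    return "$status"
}

# Print the Hadoop version only when both variables check out.
if check_hadoop_env; then
    "$HADOOP_HOME/bin/hadoop" version
fi
```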

Step 5: Configure Hadoop

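Edit the following configuration files, which are located in $HADOOP_HOME/etc/hadoop:

hadoop-env.sh – Set JAVA_HOME
core-site.xml – Set the default file system URI (fs.defaultFS)
hdfs-site.xml – Set the replication factor, the block size, and the name node and data node storage paths
mapred-site.xml – Set up MapReduce properties, such as running jobs on YARN (mapreduce.framework.name)
yarn-site.xml – Configure YARN settings, such as the NodeManager auxiliary services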
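
As an illustration, the two HDFS-related files for a single-node setup can be written as follows. This is a sketch under stated assumptions: write_single_node_config is a hypothetical helper, the port 9000 and the replication factor of 1 are common single-node choices rather than required values, and the function overwrites any existing files in the target directory:

```shell
# Hypothetical helper: write minimal single-node core-site.xml and hdfs-site.xml.
# WARNING: overwrites existing files in the given directory.
write_single_node_config() {
    conf_dir="$1"

    # core-site.xml: the default file system URI that clients connect to.
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

    # hdfs-site.xml: one replica per block, since there is only one data node.
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

# Usage:
#   write_single_node_config "$HADOOP_HOME/etc/hadoop"
```

For a multi-node cluster, the host in fs.defaultFS would name the name node machine and the replication factor would normally be higher.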

Step 6: Format the Name Node

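Before starting the Hadoop services for the first time, format the name node. This initializes the HDFS metadata; running it again on an existing installation discards that metadata, so do it only once: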

$HADOOP_HOME/bin/hdfs namenode -format
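
Formatting an already formatted name node discards its existing metadata, so it can help to check first. A minimal sketch: namenode_is_formatted is a hypothetical helper, and the default path below assumes that dfs.namenode.name.dir and hadoop.tmp.dir have not been changed from their defaults:

```shell
# Hypothetical helper: a formatted name node directory contains current/VERSION.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Assumed default location: dfs/name under hadoop.tmp.dir (/tmp/hadoop-<user>).
# Set NAME_DIR yourself if hdfs-site.xml configures a different path.
NAME_DIR="${NAME_DIR:-/tmp/hadoop-$(id -un)/dfs/name}"

if namenode_is_formatted "$NAME_DIR"; then
    echo "Name node at $NAME_DIR is already formatted; skipping."
else
    echo "Name node at $NAME_DIR is not formatted; run: hdfs namenode -format"
fi
```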

Step 7: Start Hadoop Services

Start the HDFS and YARN daemons with the scripts in the sbin directory:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

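Verify that the services are running by opening the name node web interface at http://localhost:9870 (Hadoop 3.x; Hadoop 2.x uses port 50070). The YARN ResourceManager interface is available at http://localhost:8088.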
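
The running daemons can also be checked from the command line with jps, which ships with the JDK and lists Java processes. The sketch below assumes a single-node setup where all five daemons run on one machine; missing_daemons is a hypothetical helper that reads jps output and prints the names it did not find:

```shell
# Hypothetical helper: read jps output on stdin and print each expected daemon that is absent.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <name>", so compare the second field exactly.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Run the check only when jps is available.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

No output means every expected daemon was found; any name printed points to a daemon whose log file is worth inspecting.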

Troubleshooting Tips

Ensure all environment variables are correctly set in .bashrc.
Check the SSH configuration and ensure all involved nodes can communicate without a password.
Consult the Hadoop logs in the $HADOOP_HOME/logs directory for specific error messages; each daemon writes its own log file there.
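
Scanning the log files for error entries is often the quickest way to find the cause. A rough sketch, assuming the default log layout where each line carries its level (such as ERROR or FATAL) as a separate word; show_recent_errors is a hypothetical helper:

```shell
# Hypothetical helper: print the last few ERROR/FATAL lines from each .log file in a directory.
show_recent_errors() {
    log_dir="$1"
    for log_file in "$log_dir"/*.log; do
        # Skip the unexpanded pattern when the directory has no .log files.
        [ -f "$log_file" ] || continue
        matches="$(grep -E ' (ERROR|FATAL) ' "$log_file" | tail -n 5)"
        if [ -n "$matches" ]; then
            printf '== %s ==\n%s\n' "$log_file" "$matches"
        fi
    done
}

# Usage:
#   show_recent_errors "$HADOOP_HOME/logs"
```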

Conclusion

By following these steps, you can successfully install Hadoop on a Linux system. This setup enables you to harness the power of distributed computing and manage large datasets efficiently. For further customization and optimization, refer to the Hadoop documentation and community forums.

For additional information about managing big data frameworks, consider checking our guide on installing Apache Spark.