How to Install Apache Hadoop: A Step-by-Step Tutorial
Apache Hadoop is a key technology in big data processing and distributed storage, widely used in IT and data science environments. This tutorial walks you through the process of installing Hadoop in a single-node configuration, ideal for learners and developers who want to explore the Hadoop ecosystem on their own machines.
Prerequisites
- Operating System: Linux-based OS recommended (Ubuntu, CentOS, etc.)
- Java Development Kit (JDK): Hadoop 3.3 runs on Java 8 or 11; this tutorial installs OpenJDK 11.
- SSH: Secure Shell configured and running for Hadoop daemon communication (a quick check follows this list).
- Basic knowledge: Familiarity with Linux commands and networking basics.
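If you are unsure whether an SSH server is installed, a quick check on Ubuntu is the following (package and service names may differ on other distributions):
sudo apt install openssh-server -y
systemctl status ssh    # should report "active (running)"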
Step 1: Install Java
Apache Hadoop requires Java. Install OpenJDK with:
sudo apt update
sudo apt install openjdk-11-jdk -y
Verify the installation:
java -version
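If java -version works but you are not sure where the JDK actually lives (you will need this path for JAVA_HOME in Step 5), the following resolves it on Debian/Ubuntu:
readlink -f $(which java)    # e.g. /usr/lib/jvm/java-11-openjdk-amd64/bin/java; JAVA_HOME is this path minus /bin/java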
Step 2: Create Hadoop User
Create a new user to run Hadoop services securely:
sudo adduser hadoopuser            # on Debian/Ubuntu this already prompts for a password and user details
sudo passwd hadoopuser             # sets or resets the password (needed on distributions where adduser does not prompt)
sudo usermod -aG sudo hadoopuser   # grant sudo rights for the install steps below
Step 3: Configure SSH for Hadoop User
Hadoop uses SSH for managing nodes:
su - hadoopuser
ssh-keygen -t rsa -P "" # empty passphrase; press Enter to accept the default key location
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost # test passwordless ssh
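If ssh localhost still prompts for a password, overly permissive file permissions are a common cause, since sshd rejects group- or world-writable key files by default:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys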
Step 4: Download and Install Hadoop
Download a stable Hadoop release from the official Apache Hadoop downloads page; this tutorial uses version 3.3.6.
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xvzf hadoop-3.3.6.tar.gz -C /usr/local/
sudo mv /usr/local/hadoop-3.3.6 /usr/local/hadoop
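It is good practice to verify the download against the checksum Apache publishes alongside the tarball, and the Hadoop user should own the installation directory (the URL below assumes the same 3.3.6 release as above):
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz    # compare the digest against the contents of the .sha512 file
sudo chown -R hadoopuser:hadoopuser /usr/local/hadoop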
Step 5: Configure Environment Variables
Edit the hadoopuser's ~/.bashrc (or ~/.profile) to include the Hadoop and Java paths:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Apply changes:
source ~/.bashrc
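Hadoop's startup scripts do not always inherit JAVA_HOME from your shell, so set it in hadoop-env.sh as well (the JDK path below matches the one used above; adjust it if yours differs):
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh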
Step 6: Configure Hadoop Core Files
Edit configuration files inside $HADOOP_HOME/etc/hadoop:
- core-site.xml: Set your HDFS URI.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- hdfs-site.xml: Set the replication factor and the NameNode/DataNode storage directories.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
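Make sure the directories referenced above exist and are writable by the Hadoop user before formatting:
mkdir -p /usr/local/hadoop/hadoopdata/hdfs/namenode
mkdir -p /usr/local/hadoop/hadoopdata/hdfs/datanode
sudo chown -R hadoopuser:hadoopuser /usr/local/hadoop/hadoopdata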
Step 7: Format the NameNode and Start Hadoop
First, format the HDFS filesystem (run this once as hadoopuser; reformatting later wipes all HDFS metadata):
hdfs namenode -format
Start Hadoop daemons:
start-dfs.sh
start-yarn.sh
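To confirm the daemons came up, jps lists the running Java processes; on a healthy single-node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:
jps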
Step 8: Verify Hadoop Installation
Check HDFS status:
hdfs dfs -ls /
You can also access the Hadoop web UI at http://localhost:9870 to see NameNode status.
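A quick smoke test is to create a home directory in HDFS and round-trip a small file through it (the path below simply mirrors the hadoopuser account created earlier):
hdfs dfs -mkdir -p /user/hadoopuser
echo "hello hadoop" > test.txt
hdfs dfs -put test.txt /user/hadoopuser/
hdfs dfs -cat /user/hadoopuser/test.txt    # should print "hello hadoop"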
Troubleshooting Tips
- SSH issues: Ensure passwordless SSH is working for the Hadoop user.
- Java errors: Verify that the JAVA_HOME path is set correctly.
- Permission problems: Ensure the Hadoop user has permissions on all configured directories.
- Daemon start failure: Consult the logs in $HADOOP_HOME/logs for details; see the snippet after this list.
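Hadoop's HDFS daemon logs follow the pattern hadoop-<user>-<daemon>-<hostname>.log, so tailing the relevant file usually pinpoints the problem (the filename below assumes the hadoopuser account from Step 2):
tail -n 50 $HADOOP_HOME/logs/hadoop-hadoopuser-namenode-*.log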
Summary Checklist
- Install Java and set up environment variables
- Create and configure Hadoop user with SSH
- Download and install Hadoop binaries
- Configure core-site.xml and hdfs-site.xml properly
- Format NameNode and start Hadoop services
- Verify HDFS is working and accessible
For further reading on managing clusters and monitoring, refer to our detailed guide.