How to Install Pig on Hadoop: Complete Tutorial
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. It simplifies writing complex MapReduce jobs by providing a scripting language, Pig Latin, that abstracts away the details of the underlying MapReduce tasks.
Prerequisites
- Installed and configured Apache Hadoop cluster (local or distributed) with Hadoop services running.
- Java installed on your system (Java 8 or later recommended).
- Basic knowledge of Hadoop ecosystem and command line usage.
- Internet connection to download Pig binaries.
Step 1: Check Your Hadoop Installation
Before installing Pig, verify that Hadoop is properly installed and running:
hadoop version
This command should return the version of Hadoop along with other details. Also, make sure the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) are running.
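Beyond hadoop version, the jps tool that ships with the JDK lists running JVM processes, which is a quick way to confirm the daemons above are up. A guarded sketch (the daemon names are what a typical single-node setup shows; your output will differ on a distributed cluster):

```shell
# List Java processes; on a healthy single-node cluster you would expect
# entries such as NameNode, DataNode, ResourceManager, and NodeManager.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found -- is a JDK bin directory on your PATH?"
fi
```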
Step 2: Download Apache Pig
Visit the official Apache Pig website and download the latest stable release of Pig in binary format.
Alternatively, you can use wget on your Hadoop server:
wget https://downloads.apache.org/pig/pig-x.y.z/pig-x.y.z.tar.gz
Replace x.y.z with the latest version number.
Step 3: Extract and Configure Pig
Extract the downloaded tarball:
tar -xvzf pig-x.y.z.tar.gz
Move the extracted folder to a suitable directory, for example:
sudo mv pig-x.y.z /usr/local/pig
Set Environment Variables
Edit your shell profile (e.g., ~/.bashrc or ~/.bash_profile) and add the following lines:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
Apply the changes:
source ~/.bashrc
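The exports above can be written so that re-sourcing your profile does not duplicate PATH entries. A minimal sketch, assuming Pig was moved to /usr/local/pig as in the previous step (adjust the path to your install):

```shell
# Idempotent setup for ~/.bashrc -- safe to source repeatedly.
# /usr/local/pig is an assumed location; change it to where you moved Pig.
export PIG_HOME="${PIG_HOME:-/usr/local/pig}"
case ":$PATH:" in
  *":$PIG_HOME/bin:"*) ;;                  # already present, skip
  *) export PATH="$PATH:$PIG_HOME/bin" ;;  # append exactly once
esac
```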
Step 4: Verify Pig Installation
Run:
pig -version
This should output the installed Pig version.
Step 5: Configure Pig to Use Hadoop
By default, Pig runs on Hadoop in MapReduce mode. Ensure that your HADOOP_HOME is set and that Pig’s environment can access Hadoop libraries.
Check your environment variables for Hadoop:
echo $HADOOP_HOME
If not set, add it in your shell profile similarly to the Pig environment variables.
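If it is not set, a hedged sketch of the lines to add follows; the /usr/local/hadoop path is an assumption, so substitute your actual install directory. Setting PIG_CLASSPATH to Hadoop's configuration directory lets Pig pick up your cluster settings:

```shell
# Assumed install path -- replace /usr/local/hadoop with your own.
export HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
# Point Pig at Hadoop's configuration so it targets your cluster:
export PIG_CLASSPATH="$HADOOP_HOME/etc/hadoop"
```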
Step 6: Running a Pig Script
Create a simple Pig script called wordcount.pig:
-- Load data from HDFS
words = LOAD '/input/textfile.txt' AS (line:chararray);
-- Split lines into words
wordlist = FOREACH words GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group the words
grouped = GROUP wordlist BY word;
-- Count the words
wordcount = FOREACH grouped GENERATE group, COUNT(wordlist);
-- Store the results
STORE wordcount INTO '/output/wordcount';
Run your script on the cluster:
pig wordcount.pig
After execution, check the output directory in HDFS for results.
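Before running on the cluster, it can help to sanity-check what the script computes. The same word count can be reproduced locally with standard Unix tools; the sample file below is purely illustrative:

```shell
# Local illustration of what the Pig script computes -- no cluster needed.
printf 'hello world\nhello pig\n' > /tmp/textfile.txt
# Split on spaces, then group and count, mirroring TOKENIZE + GROUP + COUNT:
tr -s ' ' '\n' < /tmp/textfile.txt | sort | uniq -c | sort -rn
```

Here "hello" appears twice, so it sorts to the top, just as the most frequent word would dominate the Pig output.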
Troubleshooting
- Pig command not found: Confirm that PIG_HOME and PATH are correctly set and your terminal session is refreshed.
- Hadoop connection errors: Ensure Hadoop daemons are running and that HADOOP_HOME is set.
- Permission issues: Check file and directory permissions on HDFS for input and output paths.
- Version compatibility: Use compatible versions of Hadoop and Pig to avoid runtime errors.
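When something fails, a quick diagnostic can narrow it down. The sketch below only checks the environment variables this guide sets and whether pig is on the PATH; nothing in it is Pig-specific API:

```shell
# Report which of the variables this guide relies on are actually set.
check_env() {
  for v in JAVA_HOME HADOOP_HOME PIG_HOME; do
    eval "val=\${$v:-}"
    if [ -n "$val" ]; then echo "$v=$val"; else echo "$v is NOT set"; fi
  done
  if command -v pig >/dev/null 2>&1; then
    echo "pig found on PATH"
  else
    echo "pig NOT on PATH"
  fi
}
check_env
```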
Summary Checklist
- ✔️ Hadoop installed and running
- ✔️ Java installed
- ✔️ Apache Pig downloaded and extracted
- ✔️ Environment variables set for Pig and Hadoop
- ✔️ Pig installation verified
- ✔️ Pig runs scripts successfully on Hadoop cluster
In this tutorial, you installed Pig and ran a script on Hadoop. For more advanced uses and optimizations, explore Pig's official documentation and our article on How to Install Hive on Hadoop to expand your big data toolkit.
