How to Install Pig on Hadoop: Complete Tutorial
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. It simplifies writing complex MapReduce jobs by providing a scripting language, Pig Latin, that abstracts away the details of the underlying MapReduce tasks.
Prerequisites
- Installed and configured Apache Hadoop cluster (local or distributed) with Hadoop services running.
- Java installed on your system (Java 8 or later recommended).
- Basic knowledge of Hadoop ecosystem and command line usage.
- Internet connection to download Pig binaries.
Step 1: Check Your Hadoop Installation
Before installing Pig, verify that Hadoop is properly installed and running:
hadoop version
This command should return the version of Hadoop along with other details. Also, make sure the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) are running.
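Beyond hadoop version, the jps tool that ships with the JDK lists running JVM processes, which is a quick way to confirm the daemons above are up. A guarded sketch (the daemon names are what a typical single-node setup shows; your output will differ on a distributed cluster):

```shell
# List Java processes; on a healthy single-node cluster you would expect
# entries such as NameNode, DataNode, ResourceManager, and NodeManager.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found -- is a JDK bin directory on your PATH?"
fi
```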
Step 2: Download Apache Pig
Visit the official Apache Pig website and download the latest stable release of Pig in binary format.
Alternatively, you can use wget on your Hadoop server:
wget https://downloads.apache.org/pig/pig-x.y.z/pig-x.y.z.tar.gz
Replace x.y.z with the latest version number.
Step 3: Extract and Configure Pig
Extract the downloaded tarball:
tar -xvzf pig-x.y.z.tar.gz
Move the extracted folder to a suitable directory, for example:
sudo mv pig-x.y.z /usr/local/pig
Set Environment Variables
Edit your shell profile (e.g., ~/.bashrc or ~/.bash_profile) and add the following lines:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
Apply the changes:
source ~/.bashrc
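The exports above can be written so that re-sourcing your profile does not duplicate PATH entries. A minimal sketch, assuming Pig was moved to /usr/local/pig as in the previous step (adjust the path to your install):

```shell
# Idempotent setup for ~/.bashrc -- safe to source repeatedly.
# /usr/local/pig is an assumed location; change it to where you moved Pig.
export PIG_HOME="${PIG_HOME:-/usr/local/pig}"
case ":$PATH:" in
  *":$PIG_HOME/bin:"*) ;;                  # already present, skip
  *) export PATH="$PATH:$PIG_HOME/bin" ;;  # append exactly once
esac
```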
Step 4: Verify Pig Installation
Run:
pig -version
This should output the installed Pig version.
Step 5: Configure Pig to Use Hadoop
By default, Pig runs on Hadoop in MapReduce mode. Ensure that your HADOOP_HOME is set and that Pig’s environment can access Hadoop libraries.
Check your environment variables for Hadoop:
echo $HADOOP_HOME
If not set, add it in your shell profile similarly to the Pig environment variables.
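If it is not set, a hedged sketch of the lines to add follows; the /usr/local/hadoop path is an assumption, so substitute your actual install directory. Setting PIG_CLASSPATH to Hadoop's configuration directory lets Pig pick up your cluster settings:

```shell
# Assumed install path -- replace /usr/local/hadoop with your own.
export HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
# Point Pig at Hadoop's configuration so it targets your cluster:
export PIG_CLASSPATH="$HADOOP_HOME/etc/hadoop"
```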
Step 6: Running a Pig Script
Create a simple Pig script called wordcount.pig:
-- Load data from HDFS
words = LOAD '/input/textfile.txt' AS (line:chararray);
-- Split lines into words
wordlist = FOREACH words GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group the words
grouped = GROUP wordlist BY word;
-- Count the words
wordcount = FOREACH grouped GENERATE group, COUNT(wordlist);
-- Store the results
STORE wordcount INTO '/output/wordcount';
Run your script on the cluster:
pig wordcount.pig
After execution, check the output directory in HDFS for results.
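Before running on the cluster, it can help to sanity-check what the script computes. The same word count can be reproduced locally with standard Unix tools; the sample file below is purely illustrative:

```shell
# Local illustration of what the Pig script computes -- no cluster needed.
printf 'hello world\nhello pig\n' > /tmp/textfile.txt
# Split on spaces, then group and count, mirroring TOKENIZE + GROUP + COUNT:
tr -s ' ' '\n' < /tmp/textfile.txt | sort | uniq -c | sort -rn
```

Here "hello" appears twice, so it sorts to the top, just as the most frequent word would dominate the Pig output.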
Troubleshooting
- Pig command not found: Confirm that PIG_HOME and PATH are correctly set and your terminal session is refreshed.
- Hadoop connection errors: Ensure Hadoop daemons are running and that HADOOP_HOME is set.
- Permission issues: Check file and directory permissions on HDFS for input and output paths.
- Version compatibility: Use compatible versions of Hadoop and Pig to avoid runtime errors.
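When something fails, a quick diagnostic can narrow it down. The sketch below only checks the environment variables this guide sets and whether pig is on the PATH; nothing in it is Pig-specific API:

```shell
# Report which of the variables this guide relies on are actually set.
check_env() {
  for v in JAVA_HOME HADOOP_HOME PIG_HOME; do
    eval "val=\${$v:-}"
    if [ -n "$val" ]; then echo "$v=$val"; else echo "$v is NOT set"; fi
  done
  if command -v pig >/dev/null 2>&1; then
    echo "pig found on PATH"
  else
    echo "pig NOT on PATH"
  fi
}
check_env
```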
Summary Checklist
- ✔️ Hadoop installed and running
- ✔️ Java installed
- ✔️ Apache Pig downloaded and extracted
- ✔️ Environment variables set for Pig and Hadoop
- ✔️ Pig installation verified
- ✔️ Pig runs scripts successfully on Hadoop cluster
In this tutorial, you installed Pig and ran a script on Hadoop. For more advanced uses and optimizations, explore Pig's official documentation and our article on How to Install Hive on Hadoop to expand your big data toolkit.
