This post is about installing a single-node Hadoop 2.5.1 cluster (the latest stable version at the time of writing) on the Windows 7 operating system. Hadoop was primarily designed for the Linux platform. Hadoop supports Windows from version 2.2 onwards, but we need to prepare the platform binaries ourselves. The official Hadoop website recommends that Windows developers use such a build only for development environments and not in production, since it has not been completely tested on the Windows platform. This post describes the procedure for generating a Hadoop build for the Windows platform.

Generating Hadoop Build For Windows Platform

Step 1: Install Microsoft Windows SDK 7.1
  • In my case, I used the 64-bit version of Windows 7. Download Microsoft Windows SDK 7.1 from the official Microsoft website and install it.
  • While installing the Windows SDK, I faced a problem: the installer complained that the C++ 2010 Redistributable was already installed. This happens only if the installed C++ 2010 Redistributable is of a higher version than the one bundled with the Windows SDK.
  • We can solve this issue either by not installing the C++ 2010 Redistributable (uncheck it in the Windows SDK custom component selection) or by uninstalling it from the Control Panel and letting the Windows SDK reinstall it.
Step 2: Install Oracle Java JDK 1.7
  • I recommend downloading Oracle Java JDK 7 and installing it at C:\Java\ instead of the default path C:\Program Files\Java\, since the default path contains a problematic space character between “Program” and “Files”.
  • Now we need to configure the JAVA_HOME environment variable with the value “C:\Java\jdk1.7.0_51”. If we have already installed Java at its default path (C:\Program Files\Java\), we need to find its 8.3 pathname with the help of the “dir /X” command, run from its parent directory. A sample 8.3 pathname looks like "C:\PROGRA~1\Java\jdk1.7.0_51".
  • Finally, we need to add the Java bin path to the PATH environment variable as “%JAVA_HOME%\bin”, as shown in the sketch below.
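  • For illustration, both variables can be set with the setx command (a minimal sketch assuming the JDK sits at C:\Java\jdk1.7.0_51; editing them through the System Properties dialog works equally well):
    rem setx writes to the user environment and takes effect in new prompts only
    setx JAVA_HOME "C:\Java\jdk1.7.0_51"
    setx PATH "%PATH%;C:\Java\jdk1.7.0_51\bin"
    rem open a new Command Prompt and verify
    java -version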
Step 3: Install Maven 3.2.1
  • Download the latest Apache Maven from its official website and extract it to C:\maven. Configure the M2_HOME environment variable with the Maven home directory path “C:\maven”.
  • Finally, add the Maven bin path to the PATH environment variable as “%M2_HOME%\bin”, as shown below.
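  • A minimal sketch of the same configuration with setx (assuming Maven was extracted to C:\maven):
    setx M2_HOME "C:\maven"
    setx PATH "%PATH%;C:\maven\bin"
    rem open a new Command Prompt and verify
    mvn -version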
Step 4: Install Protocol Buffers 2.5.0
  • Download the binary version of Protocol Buffers from its official website, extract it to the “C:\protobuf” directory, and add this path to the PATH environment variable.
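  • From a new Command Prompt, the following check should report 2.5.0, the version the Hadoop build expects (assuming protoc.exe was extracted directly into C:\protobuf):
    protoc --version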
Step 5: Install Cygwin
  • Download the latest version of Cygwin from its official website and install it at "C:\cygwin64" with the ssh and sh packages.
  • Finally, add the Cygwin bin path to the PATH environment variable, for example as sketched below.
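  • Assuming the default installation directory:
    setx PATH "%PATH%;C:\cygwin64\bin"
    rem open a new Command Prompt and verify that sh is resolvable
    where sh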
Step 6: Install cmake 3.0.2
  • Download the latest cmake from its official website and install it normally, then confirm it is reachable from the command line.
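  • A quick check from a new Command Prompt:
    cmake --version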
Step 7: Configure the “Platform” Environment Variable
  • Add the “Platform” environment variable with the value “x64” or “Win32” for building on a 64-bit or 32-bit system, respectively. The variable name is case-sensitive. See the example below.
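  • For example, on a 64-bit system the variable can be set for the current build session directly in the Windows SDK Command Prompt:
    rem note the exact casing of the variable name
    set Platform=x64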
Step 8: Create Hadoop Build
  • Download the latest stable version of the Hadoop source from its official website and extract it to “C:\hdc”. Now we can generate the Hadoop Windows build by executing the following command from the Windows SDK Command Prompt.
    mvn package -Pdist,native-win -DskipTests -Dtar
  • The above command runs for approximately 30 minutes and outputs the Hadoop Windows build in the “C:\hdc\hadoop-dist\target” directory; the resulting archive can be located as shown below.
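  • On success, the packaged build appears as a tar.gz archive in the target directory (for Hadoop 2.5.1 the file is named hadoop-2.5.1.tar.gz):
    dir C:\hdc\hadoop-dist\target\*.tar.gz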
Configuring Hadoop for a Single-Node (Pseudo-Distributed) Cluster

Step 1: Extract Hadoop
  • Copy the Hadoop Windows build tar.gz file from “C:\hdc\hadoop-dist\target” and extract it at “C:\hadoop”.
Step 2: Configure hadoop-env.cmd
  • Edit the “C:\hadoop\etc\hadoop\hadoop-env.cmd” file and add the following lines at the end of the file. These lines configure the Hadoop and YARN configuration directories and put the Hadoop binaries on the PATH.
    set HADOOP_PREFIX=c:\hadoop
    set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
    set YARN_CONF_DIR=%HADOOP_CONF_DIR%
    set PATH=%PATH%;%HADOOP_PREFIX%\bin
Step 3: Configure core-site.xml
  • Edit the “C:\hadoop\etc\hadoop\core-site.xml” file and configure the following property. A quick way to verify the setting is sketched after the snippet.
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://0.0.0.0:19000</value>
        </property>
    </configuration>
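  • Once hadoop-env.cmd has been executed (Step 7 below), the effective value can be sanity-checked with the getconf tool; fs.default.name is the older spelling of what Hadoop 2.x calls fs.defaultFS, and the two names are mapped to each other:
    %HADOOP_PREFIX%\bin\hdfs getconf -confKey fs.defaultFS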
Step 4: Configure hdfs-site.xml
  • Edit the “C:\hadoop\etc\hadoop\hdfs-site.xml” file and configure the following property. A replication factor of 1 is appropriate here because a single-node cluster has no other DataNodes to hold additional replicas.
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
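  • The same getconf check works for HDFS properties:
    %HADOOP_PREFIX%\bin\hdfs getconf -confKey dfs.replication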
Step 5: Configure mapred-site.xml
  • Edit the “C:\hadoop\etc\hadoop\mapred-site.xml” file and configure the following property.
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:54311</value>
        </property>
    </configuration>
Step 6: Create tmp directory
  • Create a tmp directory at “C:\tmp”; this is the default temporary directory for Hadoop.
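  • For example:
    mkdir C:\tmp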
Step 7: Execute hadoop-env.cmd
  • Execute the “C:\hadoop\etc\hadoop\hadoop-env.cmd” file from the Command Prompt to set the environment variables.
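  • A quick check that the variables are now in effect in the current session:
    echo %HADOOP_PREFIX%
    echo %HADOOP_CONF_DIR%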
Step 8: Format File System
  • Format the file system by executing the following command before first-time use.
    %HADOOP_PREFIX%\bin\hdfs namenode -format
Step 9: Start HDFS
  • Execute the following command to start HDFS.
    %HADOOP_PREFIX%\sbin\start-dfs.cmd
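  • The jps tool that ships with the JDK lists the running Java processes; after a successful start it should show both HDFS daemons:
    rem expect NameNode and DataNode in the output
    jps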
Step 10: Check via Web Browser
  • Open a browser at http://localhost:50070. This page displays the currently running nodes, and we can also browse the HDFS from this portal.
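  • As a final smoke test, copy a file into HDFS and list the root directory (hello.txt is just an illustrative file; any local file will do):
    echo hello > C:\tmp\hello.txt
    %HADOOP_PREFIX%\bin\hdfs dfs -put C:\tmp\hello.txt /
    %HADOOP_PREFIX%\bin\hdfs dfs -ls /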