Home > Hadoop, Oracle > How to Set Up a Hadoop Cluster Using Oracle Solaris

How to Set Up a Hadoop Cluster Using Oracle Solaris

About Hadoop and Oracle Solaris Zones

The Apache Hadoop software is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

The following are benefits of using Oracle Solaris Zones for a Hadoop cluster:

  • Fast provision of new cluster members using the zone cloning feature
  • Very high network throughput between the zones for data node replication
  • Optimized disk I/O utilization for better I/O performance with ZFS built-in compression
  • Secure data at rest using ZFS encryption

To store data, Hadoop uses the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data and is suitable for applications that have large data sets. For more information about Hadoop and HDFS see http://hadoop.apache.org/.

The Hadoop cluster building blocks are as follows:

  • NameNode: The centerpiece of HDFS, which stores file system metadata, directs the slave DataNode daemons to perform the low-level I/O tasks, and also runs the JobTracker process.
  • Secondary NameNode: Performs internal checks of the NameNode transaction log.
  • DataNodes: Nodes that store the data in the HDFS file system, which are also known as slaves and run the TaskTracker process.

In the example presented in this article, all the Hadoop cluster building blocks will be installed using the Oracle Solaris Zones, ZFS, and network virtualization technologies. Figure 1 shows the architecture:

Figure 1

Figure 1. Architecture

Download and Install Hadoop

  1. To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors.For this article, I used the “12 October, 2012 Release 1.0.4” release.
  2. On the global zone, create the /usr/local directory if it doesn’t exist; later, we will share this directory with the zones.
    root@global_zone:~#  mkdir -p /usr/local
  3. Download the Hadoop tarball and copy it into /usr/local:
    root@global_zone:~#  cp /tmp/hadoop-1.0.4.tar.gz  /usr/local
  4. Unpack the tarball:
    root@global_zone:~# cd /usr/local
    root@global_zone:~# gzip -dc /usr/local/hadoop-1.0.4.tar.gz  | tar -xvf - 
  5. Create the Hadoop group:
    root@global_zone:~# groupadd hadoop
  6. Add the Hadoop user and set the user’s Hadoop password:
    root@global_zone:~# useradd -g hadoop hadoop
    root@global_zone:~# passwd hadoop
  7. Create the Hadoop user’s home directory:
    root@global_zone:~# mkdir -p /export/home/hadoop
    root@global_zone:~# chown hadoop:hadoop /export/home/hadoop/
  8. Rename the location of the Hadoop binaries and give ownership to the Hadoop user:
    root@global_zone:~# mv /usr/local/hadoop-1.0.4 /usr/local/hadoop
    root@global_zone:~# chown -R hadoop:hadoop /usr/local/hadoop*
  9. Edit the Hadoop configuration files, which are shown in Table 1:Table 1. Hadoop Configuration Files
    Filename Description
    hadoop-env.sh Specifies environment variable settings used by Hadoop.
    core-site.xml Specifies parameters relevant to all Hadoop daemons and clients.
    hdfs-site.xml Specifies parameters used by the HDFS daemons and clients.
    mapred-site.xml Specifies parameters used by the MapReduce daemons and clients.
    masters Contains a list of machines that run the Secondary NameNode.
    slaves Contains a list of machine names that run the DataNode and TaskTracker pair of daemons.

    To learn more about how the Hadoop framework is controlled by these configuration files, see http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/conf/Configuration.html.

    1. Run the following command to change to the conf directory:
    root@global_zone:~# cd /usr/local/hadoop/conf
  10. Edit the hadoop-env.sh file to uncomment and change the export lines so they look like the following:
    export JAVA_HOME=/usr/java
    export HADOOP_LOG_DIR=/var/log/hadoop
  11. Edit the masters file so that it looks like the following:
  12. Edit the slaves file so that it looks like the following:
  13. Edit the core-site.xml file so it looks like the following:
  14. Edit the hdfs-site.xml file so it looks like Listing 1:

    Listing 1. hdfs-site.xml File

  15. Edit the mapred-site.xml file so it looks like this:

Configure the Network Time Protocol

We should ensure that the system clock on the Hadoop zones is synchronized by using the Network Time Protocol (NTP). In this example, the global zone is configured as an NTP server.

Note: It is best to select an NTP server that can be a dedicated time synchronization source so that other services are not negatively affected if the machine is brought down for planned maintenance.

The following example shows how to configure an NTP server.

  1. Edit the NTP server configuration file, as shown in Listing 2:
    root@global_zone:~# grep -v ^# /etc/inet/ntp.conf
    server prefer
    broadcast ttl 4
    enable auth monitor
    driftfile /var/ntp/ntp.drift
    statsdir /var/ntp/ntpstats/
    filegen peerstats file peerstats type day enable
    filegen loopstats file loopstats type day enable
    filegen clockstats file clockstats type day enable
    keys /etc/inet/ntp.keys
    trustedkey 0
    requestkey 0
    controlkey 0
    root@global_zone:~# touch /var/ntp/ntp.drift

    Listing 2. NTP Server Configuration File

  2. Enable the NTP server service:
    root@global_zone:~# svcadm enable ntp
  3. Verify that the NTP server is online by using the following command:
    root@global_zone:~# svcs -a | grep ntp
    online         16:04:15 svc:/network/ntp:default

Create the Virtual Network Interfaces

Oracle Solaris Zones on the same system can benefit from very high network I/O throughput (up to four times faster) with very low latency compared to systems with, say, 1 Gb physical network connections. For a Hadoop cluster, this means that the DataNodes can replicate the HDFS blocks much faster.

For more information about network virtualization benchmarks, see “How to Control Your Application’s Network Bandwidth.”

Create a series of virtual network interfaces (VNICs) for the different zones:

root@global_zone:~# dladm create-vnic -l net0 name_node1
root@global_zone:~# dladm create-vnic -l net0 secondary_name1
root@global_zone:~# dladm create-vnic -l net0 data_node1
root@global_zone:~# dladm create-vnic -l net0 data_node2
root@global_zone:~# dladm create-vnic -l net0 data_node3

Create the NameNode Zones

  1. If you don’t already have a file system for the NameNode and Secondary NameNode zones, run the following command:
    root@global_zone:~# zfs create -o mountpoint=/zones rpool/zones
  2. Create the name-node zone, as shown in Listing 3:
    root@global_zone:~# zonecfg -z name-node
    Use 'create' to begin configuring a new zone.
    Zonecfg:name-node> create
    create: Using system default template 'SYSdefault'
    zonecfg:name-node> set autoboot=true
    zonecfg:name-node> set limitpriv="default,sys_time"
    zonecfg:name-node> set zonepath=/zones/name-node
    zonecfg:name-node> add fs
    zonecfg:name-node:fs> set dir=/usr/local/hadoop
    zonecfg:name-node:fs> set special=/usr/local/hadoop
    zonecfg:name-node:fs> set type=lofs
    zonecfg:name-node:fs> set options=[ro,nodevices]
    zonecfg:name-node:fs> end
    zonecfg:name-node> add net
    zonecfg:name-node:net> set physical=name_node1
    zonecfg:name-node:net> end
    zonecfg:name-node> verify
    zonecfg:name-node> exit

    Listing 3. Creating the name-node Zone

  3. Create the sec-name-node zone, as shown in Listing 4:
    root@global_zone:~# zonecfg -z sec-name-node
    Use 'create' to begin configuring a new zone.
    Zonecfg:sec-name-node> create
    create: Using system default template 'SYSdefault'
    zonecfg:sec-name-node> set autoboot=true
    zonecfg:sec-name-node> set limitpriv="default,sys_time"
    zonecfg:sec-name-node> set zonepath=/zones/sec-name-node
    zonecfg:sec-name-node> add fs
    zonecfg:sec-name-node:fs> set dir=/usr/local/hadoop
    zonecfg:sec-name-node:fs> set special=/usr/local/hadoop
    zonecfg:sec-name-node:fs> set type=lofs
    zonecfg:sec-name-node:fs> set options=[ro,nodevices]
    zonecfg:sec-name-node:fs> end
    zonecfg:sec-name-node>  add net
    zonecfg:sec-name-node:net> set physical=secondary_name1
    zonecfg:sec-name-node:net> end
    zonecfg:sec-name-node> verify
    zonecfg:sec-name-node> exit

    Listing 4. Creating the sec-name-node Zone

Set Up the DataNode Zones

In this step, we can leverage the integration between Oracle Solaris Zones virtualization technology and the ZFS file system that is built into Oracle Solaris.

Hadoop best practice is to use a separate hard disk for each DataNode. Therefore, every DataNode zone will have its own hard disk in order to provide better I/O distribution, as shown in Figure 2.

Figure 2

Figure 2. Separate Disk for Each DataNode

Table 2 shows a summary of the Hadoop zones configuration we will create:

Table 2. Zone Summary

Function Zone Name ZFS Mount Point VNIC Name IP Address
NameNode name-node /zones/name-node name_node1
Secondary NameNode sec-name-node /zones/sec-name-node secondary_name1
DataNode data-node1 /zones/data-node1 data_node1
DataNode data-node2 /zones/data-node2 data_node2
DataNode data-node3 /zones/data-node3 data_node3
  1. Get the hard disk names using the following command:
    root@global_zone:~# format
    Searching for disks...done
           0. c7t0d0 <LSI-MR9261-8i-2.50-135.97GB>
           1. c7t1d0 <LSI-MR9261-8i-2.50-135.97GB>
           2. c7t2d0 <LSI-MR9261-8i-2.50-135.97GB>
  2. Create a separate ZFS file system for each zone in order to provide better disk I/O performance:
    root@global_zone:~# zpool create -O compression=on data-node1-pool c7t0d0
    root@global_zone:~# zpool create -O compression=on data-node2-pool c7t1d0
    root@global_zone:~# zpool create -O compression=on data-node3-pool c7t2d0
    root@global_zone:~# zfs create -o mountpoint=/zones/data-node1 data-node1-pool/data-node1
    root@global_zone:~# zfs create -o mountpoint=/zones/data-node2 data-node2-pool/data-node2
    root@global_zone:~# zfs create -o mountpoint=/zones/data-node3 data-node3-pool/data-node3
  3. Change the directories’ permissions in order to avoid warring messages during zone creation:
    root@global_zone:~# chmod 700 /zones/data-node1
    root@global_zone:~# chmod 700 /zones/data-node2
    root@global_zone:~# chmod 700 /zones/data-node3
  4. Point the zonecfg zonepath property to the ZFS file systems that you created, as shown in Listing 5.
    root@global_zone:~# zonecfg -z data-node1
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node1> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node1> set autoboot=true
    zonecfg:data-node1> set limitpriv="default,sys_time"
    zonecfg:data-node1> set zonepath=/zones/data-node1 
    zonecfg:data-node1> add fs
    zonecfg:data-node1:fs> set dir=/usr/local/hadoop
    zonecfg:data-node1:fs> set special=/usr/local/hadoop
    zonecfg:data-node1:fs> set type=lofs
    zonecfg:data-node1:fs> set options=[ro,nodevices]
    zonecfg:data-node1:fs> end
    zonecfg:data-node1> add net
    zonecfg:data-node1:net> set physical=data_node1
    zonecfg:data-node1:net> end
    zonecfg:data-node1> verify
    zonecfg:data-node1> commit
    zonecfg:data-node1> exit
    root@global_zone:~# zonecfg -z data-node2
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node2> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node2> set autoboot=true
    zonecfg:data-node2> set limitpriv="default,sys_time"
    zonecfg:data-node2> set zonepath=/zones/data-node2  
    zonecfg:data-node2> add fs
    zonecfg:data-node2:fs> set dir=/usr/local/hadoop
    zonecfg:data-node2:fs> set special=/usr/local/hadoop
    zonecfg:data-node2:fs> set type=lofs
    zonecfg:data-node2:fs> set options=[ro,nodevices]
    zonecfg:data-node2:fs> end
    zonecfg:data-node2> add net
    zonecfg:data-node2:net> set physical=data_node2
    zonecfg:data-node2:net> end
    zonecfg:data-node2> verify
    zonecfg:data-node2> commit
    zonecfg:data-node2> exit
    root@global_zone:~# zonecfg -z data-node3
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node3> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node3> set autoboot=true
    zonecfg:data-node3> set limitpriv="default,sys_time"
    zonecfg:data-node3> set zonepath=/zones/data-node3
    zonecfg:data-node3> add fs
    zonecfg:data-node3:fs> set dir=/usr/local/hadoop
    zonecfg:data-node3:fs> set special=/usr/local/hadoop
    zonecfg:data-node3:fs> set type=lofs
    zonecfg:data-node3:fs> set options=[ro,nodevices]
    zonecfg:data-node3:fs> end
    zonecfg:data-node3> add net
    zonecfg:data-node3:net> set physical=data_node3
    zonecfg:data-node3:net> end
    zonecfg:data-node3> verify
    zonecfg:data-node3> commit
    zonecfg:data-node3> exit

    Listing 5. Setting the zonecfg zonepath Property

  5. Now, install the name-node zone; later we will clone it in order to accelerate zone creation time.
    root@global_zone:~# zoneadm -z name-node install
    The following ZFS file system(s) have been created:
    Progress being logged to /var/log/zones/zoneadm.20130106T134835Z.name-node.install
           Image: Preparing at /zones/name-node/root.
  6. Boot the name-node zone and check the status of the zones we’ve created, as shown in Listing 6:
    root@global_zone:~# zoneadm -z name-node boot
    root@global_zone:~# zoneadm list -cv
      ID NAME             STATUS     PATH                          BRAND    IP
       0 global           running    /                             solaris shared
       1 name-node        running    /zones/name-node              solaris  excl
       - sec-name-node    configured /zones/sec-name-node          solaris  excl
       - data-node1       configured /zones/data-node1             solaris  excl
       - data-node2       configured /zones/data-node2             solaris  excl
       - data-node3       configured /zones/data-node3             solaris  excl
    root@global_zone:~# zlogin -C name-node

    Listing 6. Booting the name-node Zone

  7. Provide the zone host information by using the following configuration for the name-node zone:
    1. For the host name, use name-node.
    2. Ensure the network interface name_node1 has an IP address of
    3. Ensure the name service is based on your network configuration.In this article, we will use /etc/hosts for name resolution, so we won’t set up DNS for host name resolution.
  8. After finishing the zone setup, log in to the zone.
  9. Developing for Hadoop requires a Java programming environment. You can install Java Development Kit (JDK) 6 using the following command:
    root@name-node:~# pkg install  jdk-6
  10. Verify the Java installation:
    root@name-node:~# which java
    root@name-node:~# java -version
    java version "1.6.0_35"
    Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
    Java HotSpot(TM) Server VM (build 20.10-b01, mixed mode)
  11. Create a Hadoop user inside the name-node zone:
    root@name-node:~# groupadd hadoop
    root@name-node:~# useradd -g hadoop hadoop
    root@name-node:~# passwd hadoop
    root@name-node:~# mkdir -p /export/home/hadoop
    root@name-node:~# chown hadoop:hadoop /export/home/hadoop
  12. Configure an NTP client, as shown in the following example:
    1. Install the NTP package:
      root@name-node:~# pkg install ntp
    2. Edit the NTP client configuration file:
      root@name-node:~# grep -v ^# /etc/inet/ntp.conf
      server global_zone  prefer
      slewalways yes
      disable pll
    3. Enable the NTP service:
      root@name-node:~# svcadm enable ntp
  13. Add the Hadoop cluster members’ host names and IP addresses to /etc/hosts:
    root@name-node:~# cat /etc/hosts
    ::1             localhost       localhost loghost name-node sec-name-node data-node1 data-node2 data-node3

    Note: If you are using the global zone as an NTP server, you must also add its host name and IP address to /etc/hosts.

Set Up SSH

  1. Set up SSH key-based authentication for the Hadoop user on the name_node zone in order to enable password-less login to the Secondary DataNode and the DataNodes:
    root@name-node  # su - hadoop
    hadoop@name-node $ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
    hadoop@name-node $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  2. Update $HOME/.profile:
    hadoop@name-node $ cat $HOME/.profile
    # Set JAVA_HOME 
    export JAVA_HOME=/usr/java
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:/usr/local/hadoop/bin
  3. Check that Hadoop runs by typing the following command:
    hadoop@name-node $ hadoop version
    Hadoop 1.0.4
    Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290
    Compiled by hortonfo on Wed Oct  3 05:13:58 UTC 2012
    From source with checksum fe2baea87c4c81a2c505767f3f9b71f4
  4. Create a directory for the Hadoop log files:
    root@name-node:~# mkdir /var/log/hadoop
    root@name-node:~# chown hadoop:hadoop /var/log/hadoop
  5. From the global zone, run the following command to create the sec-name-node zone as a clone of name-node:
    root@global_zone:~# zoneadm -z name-node shutdown 
    root@global_zone:~# zoneadm -z sec-name-node clone name-node
  6. Boot the sec-name-node zone:
    root@global_zone:~# zoneadm -z sec-name-node boot
    root@global_zone:~# zlogin -C sec-name-node
  7. As we experienced previously, the system configuration tool is launched, so do the final configuration for sec-name-node zone:
    1. For the host name, use sec-name-node.
    2. For the network interface, use secondary_name1.
    3. Use an IP address of
    4. Ensure the name service is set to none.
  8. Perform similar steps for data-node1, data-node2, and data-node3:
    1. Do the following for data-node1:
      root@global_zone:~# zoneadm -z data-node1 clone name-node
      root@global_zone:~# zoneadm -z data-node1 boot
      root@global_zone:~# zlogin -C data-node1 
      • For the host name, use data-node1.
      • For the network interface, use data_node1.
      • Use an IP address of
      • Ensure the name service is set to none.
    2. Do the following for data-node2:
      root@global_zone:~# zoneadm -z data-node2 clone name-node
      root@global_zone:~# zoneadm -z data-node2 boot
      root@global_zone:~# zlogin -C data-node2
      • For the host name, use data-node2.
      • For the network interface, use data_node2.
      • Use an IP address of
      • Ensure the name service is set to none.
    3. Do the following for data-node3:
      root@global_zone:~# zoneadm -z data-node3  clone name-node
      root@global_zone:~# zoneadm -z data-node3 boot
      root@global_zone:~# zlogin -C data-node3
      • For the host name, use data-node3.
      • For the network interface, use data_node3.
      • Use an IP address of
      • Ensure the name service is set to none.
  9. Boot the name_node zone:
    root@global_zone:~#  zoneadm -z  name-node  boot
  10. Verify that all the zones are up and running:
    root@global_zone:~# zoneadm list -cv
      ID NAME             STATUS     PATH                          BRAND    IP
       0 global           running    /                             solaris shared
      10 sec-name-node    running    /zones/sec-name-node          solaris  excl
      12 data-node1       running    /zones/data-node1             solaris  excl
      14 data-node2       running    /zones/data-node2             solaris  excl
      16 data-node3       running    /zones/data-node3             solaris  excl
      17 name-node        running    /zones/name-node              solaris  excl

Verify the SSH Setup

To verify that you SSH access without using a password for the Hadoop user, do the following.

  1. From the name_node, log in via SSH into the name-node (that is, to itself):
    hadoop@name-node $ ssh name-node 
    The authenticity of host 'name-node (' can't be established.
    RSA key fingerprint is 04:93:a9:e0:b7:8c:d7:8b:51:b8:42:d7:9f:e1:80:ca.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'name-node,' (RSA) to the list of known hosts.
  2. Now, try to log in to the sec-name-node and the DataNodes. When you try to log in with SSH the second time, you shouldn’t get any prompt to add the host to the known keys list.

Verify Name Resolution

Verify that all the Hadoop zones have the following host entries in /etc/hosts:

# cat /etc/hosts

::1             localhost       localhost loghost name-node sec-name-node data-node1 data-node2 data-node3

Note: If you are using the global zone as an NTP server, you must also add its host name and IP address to /etc/hosts.

Format the HDFS File System from the NameNode

  1. Run the commands shown in Listing 7:
    root@name-node:~# mkdir -p /hdfs/name
    root@name-node:~# chown -R hadoop:hadoop /hdfs/
    root@name-node:~# su - hadoop
    hadoop@name-node:$ /usr/local/hadoop/bin/hadoop namenode -format
    13/01/06 20:00:32 INFO namenode.NameNode: STARTUP_MSG:
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = name-node/
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 1.0.4
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; 
    compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012

    Listing 7. Formatting the HDFS File System

  2. On every DataNode (data-node1, data-node2, and data-node3), create a Hadoop data directory to store the HDFS blocks:
    root@data-node1:~# mkdir -p /hdfs/data
    root@data-node1:~# chown -R hadoop:hadoop /hdfs/
    root@data-node2:~# mkdir -p /hdfs/data
    root@data-node2:~# chown -R hadoop:hadoop /hdfs/
    root@data-node3:~# mkdir -p /hdfs/data
    root@data-node3:~# chown -R hadoop:hadoop /hdfs/

Start the Hadoop Cluster

Table 3 describes the startup scripts.

Table 3. Startup Scripts

Filename Description
start-dfs.sh Starts the Hadoop DFS daemons, the NameNode, and the DataNodes. Use this before start-mapred.sh.
stop-dfs.sh Stops the Hadoop DFS daemons.
start-mapred.sh Starts the Hadoop MapReduce daemons, the JobTracker, and TaskTrackers.
stop-mapred.sh Stops the Hadoop MapReduce daemons.
  1. From the name-node zone, start the Hadoop DFS daemons, the NameNode, and the DataNodes using the following command:
    hadoop@name-node:$ start-dfs.sh
    starting namenode, logging to /var/log/hadoop/hadoop--namenode-name-node.out
    data-node2: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node2.out
    data-node1: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node1.out
    data-node3: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node3.out
    sec-name-node: starting secondarynamenode, logging to 
  2. Start the Hadoop Map/Reduce daemons, the JobTracker, and TaskTrackers using the following command:
    hadoop@name-node:$ start-mapred.sh
    starting jobtracker, logging to /var/log/hadoop/hadoop--jobtracker-name-node.out
    data-node1: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node1.out
    data-node3: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node3.out
    data-node2: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node2.out
  3. To view a comprehensive status report, execute the command shown in Listing 8 to check the cluster status. The command will output basic statistics about the cluster health, such as NameNode details, the status of each DataNode, and disk capacity amounts.
    hadoop@name-node:$ hadoop dfsadmin -report
    Configured Capacity: 171455269888 (159.68 GB)
    Present Capacity: 169711053357 (158.06 GB)
    DFS Remaining: 169711028736 (158.06 GB)
    DFS Used: 24621 (24.04 KB)
    DFS Used%: 0%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Datanodes available: 3 (3 total, 0 dead)

    Listing 8. Checking the Cluster Status

    You can find the same information on the NameNode Web status page at http://<namenode IP address>:50070/dfshealth.jsp.

    Figure 3

    Figure 3. Cluster Summary

Run a MapReduce Job

MapReduce is a framework for processing parallelizable problems across huge data sets using a cluster of computers. For more information about MapReduce, see http://en.wikipedia.org/wiki/MapReduce.

We will use the WordCount example, which reads text files and counts how often words occur. The input and output consist of text files, each line of which contains a word and the number of times the word occurred, separated by a tab. For more information about WordCount, see http://wiki.apache.org/hadoop/WordCount.

  1. For the input file, download the following eBook from Project Gutenberg as a plain-text file with UTF-8 encoding, and store the file in a temporary directory of choice, for example /tmp/data: The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson.
  2. Copy the file to HDFS using the following command:
    hadoop@name-node:$ hadoop dfs -copyFromLocal /tmp/data/ /hdfs/data
  3. Verify that the file is located on HDFS:
    hadoop@name-node:$ hadoop dfs -ls /hdfs/data
    Found 1 items
    -rw-r--r--   3 hadoop supergroup     661664 2013-01-07 19:45 /hdfs/data/20417-8.txt
  4. Start the MapReduce job using the command shown in Listing 9:
    hadoop@name-node:$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /hdfs/data /hdfs/data/data-output
    13/01/07 15:20:21 INFO input.FileInputFormat: Total input paths to process : 1
    13/01/07 15:20:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library 
    for your platform... using builtin-java classes where applicable
    13/01/07 15:20:21 WARN snappy.LoadSnappy: Snappy native library not loaded
    13/01/07 15:20:22 INFO mapred.JobClient: Running job: job_201301071307_0006
    13/01/07 15:20:23 INFO mapred.JobClient:  map 0% reduce 0%
    13/01/07 15:20:38 INFO mapred.JobClient:  map 100% reduce 0%
    13/01/07 15:20:50 INFO mapred.JobClient:  map 100% reduce 100%
    13/01/07 15:20:55 INFO mapred.JobClient: Job complete: job_201301071307_0006
    13/01/07 15:20:55 INFO mapred.JobClient: Counters: 26
    13/01/07 15:20:55 INFO mapred.JobClient:   Job Counters

    Listing 9. Starting the MapReduce Job

  5. Verify the output data:
    hadoop@name-node:$ hadoop dfs -ls /hdfs/data/data-output
    Found 3 items
    -rw-r--r--   3 hadoop supergroup          0 2013-01-07 15:20 /hdfs/data/data-output/_SUCCESS
    drwxr-xr-x   - hadoop supergroup          0 2013-01-07 15:20 /hdfs/data/data-output/_logs
    -rw-r--r--   3 hadoop supergroup     196288 2013-01-07 15:20 /hdfs/data/data-output/part-r-00000
  6. The output data can contain sensitive information, so use ZFS encryption to protect the output data:
    1. Create the encrypted ZFS data set:
      root@name-node:~# zfs create -o encryption=on rpool/export/output
      Enter passphrase for 'rpool/export/output':
      Enter again:
    2. Change the ownership:
      root@name-node:~# chown hadoop:hadoop /export/output/
    3. Copy the output file from the Hadoop HDFS into ZFS:
      root@name-node:~# su - hadoop
      Oracle Corporation      SunOS 5.11      11.1    September 2012
      hadoop@name-node:$ hadoop dfs -getmerge /hdfs/data/data-output /export/output/
  7. Analyze the output text file. Each line contains a word and the number of times the word occurred, separated by a tab.
    hadoop@name-node:$ head /export/output/data-output
    "A      2
    "Alpha  1
    "Alpha,"        1
    "An     2
    "And    1
    "BOILING"       2
    "Batesian"      1
    "Beta   2
  8. Protect the output text file by unmounting the ZFS data set, and then unload the wrapping key for an encrypted data set using the following command:
    root@name-node:~# zfs key -u rpool/export/output

    If the command is successful, the data set is not accessible and it is unmounted.

    If you want to mount this ZFS file system, you need to provide the passphrase:

    root@name-node:~# zfs mount rpool/export/output
    Enter passphrase for 'rpool/export/output':

    By using a passphrase, you ensure that only those who know the passphrase can observe the output file. For more information about ZFS encryption, see “How to Manage ZFS Data Encryption.”

  1. April 27, 2014 at 8:53 am

    This is really interesting, You’re a very skilled blogger.
    I have joined your rss feed and look forward to seeking more of your wonderful post.
    Also, I’ve shared your site in my social networks!

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: