Java: Passing Array to Oracle Stored Procedure

This tutorial shows how to pass an array from Java to an Oracle stored procedure, and how to retrieve an array returned by the procedure back in Java.

Not every PL/SQL collection can be used from Java. The array type needs to be created as a TYPE at SCHEMA level in the database; only then can it be used with ArrayDescriptor in Java, because the oracle.sql.ArrayDescriptor class cannot access a type declared at package level.

Database Code

First, create the array types at SCHEMA level. Examples are shown below:

CREATE TYPE array_table AS TABLE OF VARCHAR2 (50); -- Array of String
 
CREATE TYPE array_int AS TABLE OF NUMBER;          -- Array of integers

Next, create a procedure that takes an array as an IN parameter and returns an array through an OUT parameter.

An example of such a procedure is shown below. It has three parameters:

  1. an array of strings as its IN parameter – p_array
  2. a number as an OUT parameter – len (the element count of p_array)
  3. an array of integers as an OUT parameter – p_arr_int

CREATE OR REPLACE PROCEDURE SchemaName.proc1 (p_array     IN     array_table,
                                              len            OUT NUMBER,
                                              p_arr_int      OUT array_int)
AS
   v_count   NUMBER;
BEGIN
   p_arr_int := array_int ();   -- initialize the OUT nested table
   p_arr_int.EXTEND (10);       -- allocate 10 slots (must be at least p_array.COUNT)
   len := p_array.COUNT;
   v_count := 0;
 
   FOR i IN 1 .. p_array.COUNT
   LOOP
      DBMS_OUTPUT.put_line (p_array (i));
      p_arr_int (i) := v_count;
      v_count := v_count + 1;
   END LOOP;
END;
/

Finally, grant EXECUTE permission on the procedure to the database user that will call it from Java:

GRANT EXECUTE ON SchemaName.proc1 TO UserName;

Java Code

Create a Java class that calls the procedure proc1 created above.

Below is an example that shows the whole flow: creating a connection to the database, calling the stored procedure, passing an array to it, retrieving the OUT array from the procedure, and displaying the result.

import java.math.BigDecimal;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Types;
import oracle.jdbc.OracleCallableStatement;
import oracle.jdbc.OracleTypes;
import oracle.sql.ARRAY;
import oracle.sql.ArrayDescriptor;
 
public class TestDatabase {
     
    public static void passArray()
    {
        try{
         
            Class.forName("oracle.jdbc.OracleDriver");         
            // Replace the URL, user name, and password with your own connection details.
            Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//host:port/service", "UserName", "Password");
             
            String array[] = {"one", "two", "three","four"};
             
            ArrayDescriptor des = ArrayDescriptor.createDescriptor("SchemaName.ARRAY_TABLE", con);
            ARRAY array_to_pass = new ARRAY(des,con,array);
             
            CallableStatement st = con.prepareCall("{call SchemaName.proc1(?,?,?)}");
 
            // Passing an array to the procedure -
            st.setArray(1, array_to_pass);
 
            st.registerOutParameter(2, Types.INTEGER);
            st.registerOutParameter(3,OracleTypes.ARRAY,"SchemaName.ARRAY_INT");
            st.execute();
             
            System.out.println("size : "+st.getInt(2));
 
            // Retrieving array from the resultset of the procedure after execution -
            ARRAY arr = ((OracleCallableStatement) st).getARRAY(3);
            BigDecimal[] receivedArray = (BigDecimal[]) arr.getArray();

            for (int i = 0; i < receivedArray.length; i++)
                System.out.println("element " + i + ": " + receivedArray[i]);

            st.close();
            con.close();
             
        } catch(Exception e) {
            System.out.println(e);
        }
    }
 
    public static void main(String args[]){
        passArray();
    }
}

Brief Explanations:

  1. Class.forName() – Returns the Class object associated with the class or interface with the given string name.
  2. DriverManager.getConnection() – Attempts to establish a connection to the given database URL.
  3. oracle.sql.ArrayDescriptor – Describes an Oracle array type (deprecated in recent JDBC drivers; see the sketch after this list).
  4. ArrayDescriptor.createDescriptor() – Descriptor factory; looks up the type name in the database and determines the characteristics of the array.
  5. oracle.sql.ARRAY – An Oracle implementation for generic JDBC Array interface.
  6. CallableStatement – The interface used to execute SQL stored procedures.
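
The oracle.sql.ARRAY and oracle.sql.ArrayDescriptor classes are deprecated in recent Oracle JDBC drivers (12c and later) in favor of the standard java.sql.Array interface. Below is a minimal sketch of the same call using that newer API; it assumes the same SchemaName.proc1 procedure and type names as above, and a 12c-or-later driver that provides OracleConnection.createOracleArray(). Treat it as an illustration rather than a drop-in replacement.

import java.math.BigDecimal;
import java.sql.Array;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Types;
import oracle.jdbc.OracleConnection;
import oracle.jdbc.OracleTypes;

public class TestDatabaseModern {

    public static void passArray() throws Exception {
        // Replace the URL, user name, and password with your own connection details.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:port/service", "UserName", "Password")) {

            String[] values = {"one", "two", "three", "four"};

            // Build a java.sql.Array for the schema-level collection type.
            Array arrayToPass = con.unwrap(OracleConnection.class)
                                   .createOracleArray("SchemaName.ARRAY_TABLE", values);

            try (CallableStatement st = con.prepareCall("{call SchemaName.proc1(?,?,?)}")) {
                st.setArray(1, arrayToPass);
                st.registerOutParameter(2, Types.INTEGER);
                st.registerOutParameter(3, OracleTypes.ARRAY, "SchemaName.ARRAY_INT");
                st.execute();

                System.out.println("size : " + st.getInt(2));

                // The OUT collection of NUMBER comes back as BigDecimal elements.
                BigDecimal[] received = (BigDecimal[]) st.getArray(3).getArray();
                for (int i = 0; i < received.length; i++)
                    System.out.println("element " + i + ": " + received[i]);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        passArray();
    }
}

Functionally this mirrors the ArrayDescriptor version: the array is still bound by its schema-level type name, and the OUT nested table of NUMBER still comes back as BigDecimal elements.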

Oracle 12c database creation using DBCA

Multitenant Architecture for Database Consolidation

Container databases are part of the 12c new feature called Multitenant Architecture.

The concept is that you have a Container Database (CDB) at the root or top level, which in turn comprises a number of Pluggable Databases (PDBs).

Simplistically put, the CDB holds the Oracle-supplied metadata, such as the data dictionary, which is common to all the PDBs in that container database, while each individual PDB contains the application data. There is the concept of common users, which exist in all PDBs in the CDB, and each PDB can also have its own distinct local users.

The rationale behind the multitenant architecture introduced in 12c is this: instead of having hundreds of servers hosting hundreds of databases, usually on different platforms and in most cases with the resources on each server not fully utilized, you consolidate databases using the Container Database topology.

This hardware consolidation, together with the sharing of database background processes, memory, and files, reduces the cost of hardware, storage, and availability, as well as the administrative labor cost.

Administrative tasks such as patch application and database upgrades become significantly less costly to the business and less time consuming for the DBAs, because the patch or upgrade happens at the container level rather than at the individual database level.

Enterprise Manager Database Express

Enterprise Manager Database Control is no longer available in Oracle Database 12c. It has been replaced by Oracle Enterprise Manager Database Express.

It is a scaled-down version of Database Control and is an extremely lightweight web-based tool for performing basic administration tasks and performance management. Several of the performance screens seen in the previous Database Control have now been consolidated into what is referred to as the database Performance Hub.

If we select Advanced Mode in the Creation Mode screen, we are presented with a series of additional configuration screens.


12c Release 1 database software installation on Linux

Here is a quick look at the installation of the 12c database software on an Oracle virtual machine running Oracle Enterprise Linux 5.8.

Most of the screens in the installation are very similar to 11g R2, and I have pointed out some of the new things we will see when we do the 12c install.

If we instead select the option to create and configure a database, we are presented with a new screen: the System Class screen, which gives us the option of the Desktop or Server class.

The Desktop class is appropriate if we are doing a test install on a PC desktop or laptop; with this option, a number of deployment choices such as ASM, RAC, backup and recovery configuration, and integration with OEM Cloud Control are not available.

Some new groups have been added to the standard OSDBA and OSOPER groups that we had in earlier versions.

For example, the OSDGDBA group is for having a separate group of OS users who administer or monitor databases in an Oracle Data Guard configuration, and the OSKMDBA group relates to privileges around encryption key management, such as Oracle Wallet Manager.

Oracle Multithreaded: is it worth a try?

There are still some applications (such as those written in PHP, or your old home-grown applications) that don’t use application-server connection pooling. Those applications can be very session-consuming, which in our case means process-consuming.

As you know, the more processes there are, the more context switching occurs. This has a performance cost that you should always try to avoid.

There are a few solutions to your problem (in historical order):

Solution: Shared Servers
Description: You set up dispatchers that pool sessions on processes at the database level.
Pros: Works since 8i. Works great for long-running transactions.
Cons: Not efficient with short database activity sessions.

Solution: Database Resident Connection Pooling (DRCP)
Description: DRCP pools database server processes and sessions, which are shared across connections from multiple application processes (4).
Pros: Works since 11g. Pools can be shared by applications (1). Very efficient.
Cons: Doesn’t pool background processes (more than 40 per instance in 12c). Not supported by all connection drivers (2). You have to configure the client’s connection side.

Solution: Multithreaded configuration
Description: You configure your database to group Oracle processes into system processes using threads (3).
Pros: No application change. No limitations. Pools all the processes (background and foreground).
Cons: Not available before 12c.

(1)    If you have 10 application servers, each handling 20 connections in a pool, but only 25 active sessions at any time at the database level, you are better off setting up a DRCP pool of 30 processes that services all these connections.
(2)    Oracle Database should be 11g or above. The PHP OCI driver should be OCI8 1.3 or higher.
(3)    The Windows platform works in a multithreaded mode in all Oracle Database versions; the real change is for Unix/Linux platforms, starting with Database 12c.
(4)    Extract from http://www.oracle.com/technetwork/articles/oracledrcp11g-1-133381.pdf
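
To illustrate the “you have to configure the client’s connection side” point for DRCP, here is a minimal Java sketch of a client connection. The host, port, service name, and credentials are placeholders, and it assumes an Oracle JDBC driver recent enough to use DRCP; the essential piece is the (SERVER=POOLED) clause in the connect descriptor, which asks for a pooled server instead of a dedicated process. The pool itself must also have been started on the database side (DBMS_CONNECTION_POOL.START_POOL).

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class DrcpConnectionExample {

    public static void main(String[] args) throws Exception {
        // Connect descriptor with (SERVER=POOLED): requests a DRCP pooled server.
        String url = "jdbc:oracle:thin:@(DESCRIPTION="
                   + "(ADDRESS=(PROTOCOL=tcp)(HOST=dbhost)(PORT=1521))"
                   + "(CONNECT_DATA=(SERVICE_NAME=myservice)(SERVER=POOLED)))";

        Properties props = new Properties();
        props.setProperty("user", "scott");       // placeholder credentials
        props.setProperty("password", "tiger");

        try (Connection con = DriverManager.getConnection(url, props)) {
            System.out.println("Connected via a DRCP pooled server: " + !con.isClosed());
        }
    }
}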

You may find that DRCP improves performance more than the multithreaded model does. But that’s not the point: first, there are situations where you won’t be able to use DRCP, and moreover you can use both of them together (even if I doubt the multithreaded mode makes much of a difference in that last case; I haven’t tested it).

How do you set it up?

This is the easiest part of the job: there is only one parameter to change at the instance level and one line to add at the listener level.

First let’s have a look at background processes on an “out of the box” DB12c environment (Oracle Enterprise Linux 6 x64):

[oracle@oel6-db12C-tbo ~]$ ps -ef|grep MYCDB
oracle    2736     1  0 11:18 ?        00:00:00 ora_pmon_MYCDB
oracle    2738     1  0 11:18 ?        00:00:00 ora_psp0_MYCDB
oracle    2740     1 27 11:18 ?        00:00:40 ora_vktm_MYCDB
oracle    2744     1  0 11:18 ?        00:00:00 ora_gen0_MYCDB
oracle    2746     1  0 11:18 ?        00:00:00 ora_mman_MYCDB
oracle    2750     1  0 11:18 ?        00:00:00 ora_diag_MYCDB
oracle    2752     1  0 11:18 ?        00:00:00 ora_dbrm_MYCDB
oracle    2754     1  0 11:18 ?        00:00:00 ora_dia0_MYCDB
oracle    2756     1  0 11:18 ?        00:00:00 ora_dbw0_MYCDB
oracle    2758     1  0 11:18 ?        00:00:00 ora_lgwr_MYCDB
oracle    2760     1  0 11:18 ?        00:00:00 ora_ckpt_MYCDB
oracle    2762     1  0 11:18 ?        00:00:00 ora_smon_MYCDB
oracle    2764     1  0 11:18 ?        00:00:00 ora_reco_MYCDB
oracle    2766     1  0 11:18 ?        00:00:00 ora_lreg_MYCDB
oracle    2768     1  1 11:18 ?        00:00:02 ora_mmon_MYCDB
oracle    2770     1  0 11:18 ?        00:00:00 ora_mmnl_MYCDB
oracle    2772     1  0 11:18 ?        00:00:00 ora_d000_MYCDB
oracle    2774     1  0 11:18 ?        00:00:00 ora_s000_MYCDB
oracle    2776     1  0 11:18 ?        00:00:00 ora_n000_MYCDB
oracle    2788     1  0 11:18 ?        00:00:00 ora_tmon_MYCDB
oracle    2790     1  0 11:18 ?        00:00:00 ora_tt00_MYCDB
oracle    2792     1  0 11:18 ?        00:00:00 ora_smco_MYCDB
oracle    2794     1  0 11:18 ?        00:00:00 ora_aqpc_MYCDB
oracle    2798     1  3 11:18 ?        00:00:03 ora_p000_MYCDB
oracle    2800     1  4 11:18 ?        00:00:06 ora_p001_MYCDB
oracle    2802     1  0 11:18 ?        00:00:00 ora_p002_MYCDB
oracle    2804     1  0 11:18 ?        00:00:00 ora_w000_MYCDB
oracle    2806     1  0 11:18 ?        00:00:00 ora_p003_MYCDB
oracle    2808     1  1 11:18 ?        00:00:01 ora_cjq0_MYCDB
oracle    2830     1  0 11:18 ?        00:00:00 ora_qm02_MYCDB
oracle    2834     1  0 11:18 ?        00:00:00 ora_q002_MYCDB
oracle    2836     1  0 11:18 ?        00:00:00 ora_q003_MYCDB
oracle    2848     1  0 11:18 ?        00:00:00 ora_p004_MYCDB
oracle    2850     1  0 11:18 ?        00:00:00 ora_p005_MYCDB
oracle    2852     1  0 11:18 ?        00:00:00 ora_p006_MYCDB
oracle    2854     1  0 11:18 ?        00:00:00 ora_p007_MYCDB

There are 33 processes (without any user connection).

Let’s switch to “multithreaded” mode:

First we modify the “threaded_execution” parameter:

SQL> show parameter threaded_execution
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
threaded_execution                   boolean     FALSE
SQL> alter system set threaded_execution=TRUE scope=spfile;

A restart of the instance is mandatory.

Next we configure the listener (one of its jobs is to spawn processes when a connection is requested), adding the last line to the configuration file (listener.ora):

# listener.ora Network Configuration File: /u01/app/oracle/product/12.1.0/dbhome_1/network/admin/listener.ora

# Generated by Oracle configuration tools.

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = oel6-db12C-tbo)(PORT = 1521))
    )
  )
DEDICATED_THROUGH_BROKER_LISTENER=ON

A restart of the listener is mandatory.

That’s all.

Now let’s see how it impacts the operating system:

[oracle@oel6-db12C-tbo admin]$ ps -ef|grep MYCDB
oracle    5225     1  0 11:45 ?        00:00:00 ora_pmon_MYCDB
oracle    5227     1  0 11:45 ?        00:00:00 ora_psp0_MYCDB
oracle    5229     1 30 11:45 ?        00:00:10 ora_vktm_MYCDB
oracle    5233     1  2 11:45 ?        00:00:00 ora_u004_MYCDB
oracle    5238     1 53 11:45 ?        00:00:18 ora_u005_MYCDB
oracle    5245     1  0 11:45 ?        00:00:00 ora_dbw0_MYCDB

There are only 6 processes (instead of 33 before the change).

 

A few things to know about “multithreaded configurations”

You can’t use OS authentication anymore:

[oracle@oel6-db12C-tbo admin]$ sqlplus / as sysdba
SQL*Plus: Release 12.1.0.1.0 Production on Fri Sep 13 11:48:05 2013
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
ERROR:
ORA-01017: invalid username/password; logon denied
 [oracle@oel6-db12C-tbo admin]$
[oracle@oel6-db12C-tbo admin]$ sqlplus sys as sysdba
SQL*Plus: Release 12.1.0.1.0 Production on Fri Sep 13 11:48:13 2013
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Enter password:
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL>

This is the expected behavior, not a bug.

Do not kill sessions by using the Unix “kill” command on OS processes: sessions are grouped into processes, and internal and user sessions are mixed within the same OS process:

SQL> select s.username, sid as session_id, spid as process_id, stid as thread_id,
     pname as process_name
     from v$process p, v$session s
     where s.paddr = p.addr
     order by process_id, thread_id;

(The results below are filtered on spid 10959.)
username     sid process_id   thread_id     process_name          
------------ --- ---------------- ----------------- ----          
            1018 10959            10959             SCMN          
             765 10959            10961             DIAG          
               2 10959            10963             DIA0          
             257 10959            10971             RECO          
             767 10959            10973             MMON          
            1020 10959            10974             MMNL          
             514 10959            10984             TMON          
             768 10959            10985             TT00          
            1021 10959            10986             SMCO          
            1275 10959            10987             FBDA          
               4 10959            10988             AQPC          
             258 10959            10989             W000          
             515 10959            11015             CJQ0          
             771 10959            11030             QM02          
            1277 10959            11032             Q002          
               6 10959            11033             Q003          
SYS          510 10959            11093                           
            1029 10959            11129             W001          
             268 10959            11182             W002          
            1023 10959            11193             W003          
SOE          521 10959            11251                           
SOE          775 10959            11252                           
SOE         1279 10959            11253                           
SOE            9 10959            11254                           
SOE          262 10959            11255                           
SOE          516 10959            11256                           
SOE          770 10959            11257                           
SOE         1028 10959            11258                           
SOE         1283 10959            11259                           
SOE            8 10959            11260

How does it impact performance?

In order to compare both configurations (with and without multithreaded enabled) we used the well-known “Swingbench” tool (which does not support DRCP by the way).

We ran several tests, which can be summarized in the following charts (the X-axis is the number of user sessions):

[Chart: number of processes vs. number of user sessions]

The number of processes stays low in multithreaded mode: no more than 17 for 400 user sessions.

You don’t see any data for “normal mode” beyond 300 users; this is because the system hung after that point. We had the same problem in multithreaded mode, but only above 400 sessions, which means we are about 25% more scalable.

[Chart: transactions per second vs. number of user sessions]

In multithreaded mode we can sustain more transactions per second, without any saturation effect; again, we are more scalable.

How to Set Up a Hadoop Cluster Using Oracle Solaris

About Hadoop and Oracle Solaris Zones

The Apache Hadoop software is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

The following are benefits of using Oracle Solaris Zones for a Hadoop cluster:

  • Fast provision of new cluster members using the zone cloning feature
  • Very high network throughput between the zones for data node replication
  • Optimized disk I/O utilization for better I/O performance with ZFS built-in compression
  • Secure data at rest using ZFS encryption

To store data, Hadoop uses the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data and is suitable for applications that have large data sets. For more information about Hadoop and HDFS see http://hadoop.apache.org/.

The Hadoop cluster building blocks are as follows:

  • NameNode: The centerpiece of HDFS, which stores file system metadata, directs the slave DataNode daemons to perform the low-level I/O tasks, and also runs the JobTracker process.
  • Secondary NameNode: Performs internal checks of the NameNode transaction log.
  • DataNodes: Nodes that store the data in the HDFS file system, which are also known as slaves and run the TaskTracker process.

In the example presented in this article, all the Hadoop cluster building blocks will be installed using the Oracle Solaris Zones, ZFS, and network virtualization technologies. Figure 1 shows the architecture:

Figure 1. Architecture

Download and Install Hadoop

  1. To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors. For this article, I used the “12 October, 2012” release, Hadoop 1.0.4.
  2. On the global zone, create the /usr/local directory if it doesn’t exist; later, we will share this directory with the zones.
    root@global_zone:~#  mkdir -p /usr/local
  3. Download the Hadoop tarball and copy it into /usr/local:
    root@global_zone:~#  cp /tmp/hadoop-1.0.4.tar.gz  /usr/local
  4. Unpack the tarball:
    root@global_zone:~# cd /usr/local
    root@global_zone:~# gzip -dc /usr/local/hadoop-1.0.4.tar.gz  | tar -xvf - 
  5. Create the Hadoop group:
    root@global_zone:~# groupadd hadoop
  6. Add the Hadoop user and set the user’s Hadoop password:
    root@global_zone:~# useradd -g hadoop hadoop
    root@global_zone:~# passwd hadoop
  7. Create the Hadoop user’s home directory:
    root@global_zone:~# mkdir -p /export/home/hadoop
    root@global_zone:~# chown hadoop:hadoop /export/home/hadoop/
  8. Rename the location of the Hadoop binaries and give ownership to the Hadoop user:
    root@global_zone:~# mv /usr/local/hadoop-1.0.4 /usr/local/hadoop
    root@global_zone:~# chown -R hadoop:hadoop /usr/local/hadoop*
  9. Edit the Hadoop configuration files, which are shown in Table 1.

    Table 1. Hadoop Configuration Files

    Filename          Description
    hadoop-env.sh     Specifies environment variable settings used by Hadoop.
    core-site.xml     Specifies parameters relevant to all Hadoop daemons and clients.
    hdfs-site.xml     Specifies parameters used by the HDFS daemons and clients.
    mapred-site.xml   Specifies parameters used by the MapReduce daemons and clients.
    masters           Contains a list of machines that run the Secondary NameNode.
    slaves            Contains a list of machine names that run the DataNode and TaskTracker pair of daemons.

    To learn more about how the Hadoop framework is controlled by these configuration files, see http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/conf/Configuration.html.

    1. Run the following command to change to the conf directory:
    root@global_zone:~# cd /usr/local/hadoop/conf
  10. Edit the hadoop-env.sh file to uncomment and change the export lines so they look like the following:
    export JAVA_HOME=/usr/java
    export HADOOP_LOG_DIR=/var/log/hadoop
  11. Edit the masters file so that it looks like the following:
    sec-name-node
  12. Edit the slaves file so that it looks like the following:
    data-node1
    data-node2
    data-node3
  13. Edit the core-site.xml file so it looks like the following:
    <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://name-node</value>
         </property>
    </configuration>
  14. Edit the hdfs-site.xml file so it looks like Listing 1:
    <configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>/hdfs/data/</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/hdfs/name/</value>
    </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>

    Listing 1. hdfs-site.xml File

  15. Edit the mapred-site.xml file so it looks like this:
    <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>name-node:8021</value>
         </property>
    </configuration>

Configure the Network Time Protocol

We should ensure that the system clock on the Hadoop zones is synchronized by using the Network Time Protocol (NTP). In this example, the global zone is configured as an NTP server.

Note: It is best to select an NTP server that can be a dedicated time synchronization source so that other services are not negatively affected if the machine is brought down for planned maintenance.

The following example shows how to configure an NTP server.

  1. Edit the NTP server configuration file, as shown in Listing 2:
    root@global_zone:~# grep -v ^# /etc/inet/ntp.conf
    
    server 127.127.1.0 prefer
    broadcast 224.0.1.1 ttl 4
    enable auth monitor
    driftfile /var/ntp/ntp.drift
    statsdir /var/ntp/ntpstats/
    filegen peerstats file peerstats type day enable
    filegen loopstats file loopstats type day enable
    filegen clockstats file clockstats type day enable
    keys /etc/inet/ntp.keys
    trustedkey 0
    requestkey 0
    controlkey 0
    
    root@global_zone:~# touch /var/ntp/ntp.drift

    Listing 2. NTP Server Configuration File

  2. Enable the NTP server service:
    root@global_zone:~# svcadm enable ntp
  3. Verify that the NTP server is online by using the following command:
    root@global_zone:~# svcs -a | grep ntp
    online         16:04:15 svc:/network/ntp:default

Create the Virtual Network Interfaces

Oracle Solaris Zones on the same system can benefit from very high network I/O throughput (up to four times faster) with very low latency compared to systems with, say, 1 Gb physical network connections. For a Hadoop cluster, this means that the DataNodes can replicate the HDFS blocks much faster.

For more information about network virtualization benchmarks, see “How to Control Your Application’s Network Bandwidth.”

Create a series of virtual network interfaces (VNICs) for the different zones:

root@global_zone:~# dladm create-vnic -l net0 name_node1
root@global_zone:~# dladm create-vnic -l net0 secondary_name1
root@global_zone:~# dladm create-vnic -l net0 data_node1
root@global_zone:~# dladm create-vnic -l net0 data_node2
root@global_zone:~# dladm create-vnic -l net0 data_node3

Create the NameNode Zones

  1. If you don’t already have a file system for the NameNode and Secondary NameNode zones, run the following command:
    root@global_zone:~# zfs create -o mountpoint=/zones rpool/zones
  2. Create the name-node zone, as shown in Listing 3:
    root@global_zone:~# zonecfg -z name-node
    Use 'create' to begin configuring a new zone.
    Zonecfg:name-node> create
    create: Using system default template 'SYSdefault'
    zonecfg:name-node> set autoboot=true
    zonecfg:name-node> set limitpriv="default,sys_time"
    zonecfg:name-node> set zonepath=/zones/name-node
    zonecfg:name-node> add fs
    zonecfg:name-node:fs> set dir=/usr/local/hadoop
    zonecfg:name-node:fs> set special=/usr/local/hadoop
    zonecfg:name-node:fs> set type=lofs
    zonecfg:name-node:fs> set options=[ro,nodevices]
    zonecfg:name-node:fs> end
    zonecfg:name-node> add net
    zonecfg:name-node:net> set physical=name_node1
    zonecfg:name-node:net> end
    zonecfg:name-node> verify
    zonecfg:name-node> exit

    Listing 3. Creating the name-node Zone

  3. Create the sec-name-node zone, as shown in Listing 4:
    root@global_zone:~# zonecfg -z sec-name-node
    Use 'create' to begin configuring a new zone.
    Zonecfg:sec-name-node> create
    create: Using system default template 'SYSdefault'
    zonecfg:sec-name-node> set autoboot=true
    zonecfg:sec-name-node> set limitpriv="default,sys_time"
    zonecfg:sec-name-node> set zonepath=/zones/sec-name-node
    zonecfg:sec-name-node> add fs
    zonecfg:sec-name-node:fs> set dir=/usr/local/hadoop
    zonecfg:sec-name-node:fs> set special=/usr/local/hadoop
    zonecfg:sec-name-node:fs> set type=lofs
    zonecfg:sec-name-node:fs> set options=[ro,nodevices]
    zonecfg:sec-name-node:fs> end
    zonecfg:sec-name-node>  add net
    zonecfg:sec-name-node:net> set physical=secondary_name1
    zonecfg:sec-name-node:net> end
    zonecfg:sec-name-node> verify
    zonecfg:sec-name-node> exit

    Listing 4. Creating the sec-name-node Zone

Set Up the DataNode Zones

In this step, we can leverage the integration between Oracle Solaris Zones virtualization technology and the ZFS file system that is built into Oracle Solaris.

Hadoop best practice is to use a separate hard disk for each DataNode. Therefore, every DataNode zone will have its own hard disk in order to provide better I/O distribution, as shown in Figure 2.

Figure 2. Separate Disk for Each DataNode

Table 2 shows a summary of the Hadoop zones configuration we will create:

Table 2. Zone Summary

Function             Zone Name        ZFS Mount Point         VNIC Name        IP Address
NameNode             name-node        /zones/name-node        name_node1       192.168.1.1
Secondary NameNode   sec-name-node    /zones/sec-name-node    secondary_name1  192.168.1.2
DataNode             data-node1       /zones/data-node1       data_node1       192.168.1.3
DataNode             data-node2       /zones/data-node2       data_node2       192.168.1.4
DataNode             data-node3       /zones/data-node3       data_node3       192.168.1.5
  1. Get the hard disk names using the following command:
    root@global_zone:~# format
    Searching for disks...done
    AVAILABLE DISK SELECTIONS:
           0. c7t0d0 <LSI-MR9261-8i-2.50-135.97GB>
              /pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@0,0
           1. c7t1d0 <LSI-MR9261-8i-2.50-135.97GB>
              /pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@1,0
           2. c7t2d0 <LSI-MR9261-8i-2.50-135.97GB>
              /pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@2,0
  2. Create a separate ZFS file system for each zone in order to provide better disk I/O performance:
    root@global_zone:~# zpool create -O compression=on data-node1-pool c7t0d0
    root@global_zone:~# zpool create -O compression=on data-node2-pool c7t1d0
    root@global_zone:~# zpool create -O compression=on data-node3-pool c7t2d0
    root@global_zone:~# zfs create -o mountpoint=/zones/data-node1 data-node1-pool/data-node1
    root@global_zone:~# zfs create -o mountpoint=/zones/data-node2 data-node2-pool/data-node2
    root@global_zone:~# zfs create -o mountpoint=/zones/data-node3 data-node3-pool/data-node3
  3. Change the directories’ permissions in order to avoid warning messages during zone creation:
    root@global_zone:~# chmod 700 /zones/data-node1
    root@global_zone:~# chmod 700 /zones/data-node2
    root@global_zone:~# chmod 700 /zones/data-node3
  4. Point the zonecfg zonepath property to the ZFS file systems that you created, as shown in Listing 5.
    root@global_zone:~# zonecfg -z data-node1
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node1> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node1> set autoboot=true
    zonecfg:data-node1> set limitpriv="default,sys_time"
    zonecfg:data-node1> set zonepath=/zones/data-node1 
    zonecfg:data-node1> add fs
    zonecfg:data-node1:fs> set dir=/usr/local/hadoop
    zonecfg:data-node1:fs> set special=/usr/local/hadoop
    zonecfg:data-node1:fs> set type=lofs
    zonecfg:data-node1:fs> set options=[ro,nodevices]
    zonecfg:data-node1:fs> end
    zonecfg:data-node1> add net
    zonecfg:data-node1:net> set physical=data_node1
    zonecfg:data-node1:net> end
    zonecfg:data-node1> verify
    zonecfg:data-node1> commit
    zonecfg:data-node1> exit
    
    root@global_zone:~# zonecfg -z data-node2
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node2> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node2> set autoboot=true
    zonecfg:data-node2> set limitpriv="default,sys_time"
    zonecfg:data-node2> set zonepath=/zones/data-node2  
    zonecfg:data-node2> add fs
    zonecfg:data-node2:fs> set dir=/usr/local/hadoop
    zonecfg:data-node2:fs> set special=/usr/local/hadoop
    zonecfg:data-node2:fs> set type=lofs
    zonecfg:data-node2:fs> set options=[ro,nodevices]
    zonecfg:data-node2:fs> end
    zonecfg:data-node2> add net
    zonecfg:data-node2:net> set physical=data_node2
    zonecfg:data-node2:net> end
    zonecfg:data-node2> verify
    zonecfg:data-node2> commit
    zonecfg:data-node2> exit
    
    root@global_zone:~# zonecfg -z data-node3
    Use 'create' to begin configuring a new zone.
    zonecfg:data-node3> create
    create: Using system default template 'SYSdefault'
    zonecfg:data-node3> set autoboot=true
    zonecfg:data-node3> set limitpriv="default,sys_time"
    zonecfg:data-node3> set zonepath=/zones/data-node3
    zonecfg:data-node3> add fs
    zonecfg:data-node3:fs> set dir=/usr/local/hadoop
    zonecfg:data-node3:fs> set special=/usr/local/hadoop
    zonecfg:data-node3:fs> set type=lofs
    zonecfg:data-node3:fs> set options=[ro,nodevices]
    zonecfg:data-node3:fs> end
    zonecfg:data-node3> add net
    zonecfg:data-node3:net> set physical=data_node3
    zonecfg:data-node3:net> end
    zonecfg:data-node3> verify
    zonecfg:data-node3> commit
    zonecfg:data-node3> exit

    Listing 5. Setting the zonecfg zonepath Property

  5. Now, install the name-node zone; later we will clone it in order to accelerate zone creation time.
    root@global_zone:~# zoneadm -z name-node install
    The following ZFS file system(s) have been created:
        rpool/zones/name-node
    Progress being logged to /var/log/zones/zoneadm.20130106T134835Z.name-node.install
           Image: Preparing at /zones/name-node/root.
  6. Boot the name-node zone and check the status of the zones we’ve created, as shown in Listing 6:
    root@global_zone:~# zoneadm -z name-node boot
    root@global_zone:~# zoneadm list -cv
      ID NAME             STATUS     PATH                          BRAND    IP
       0 global           running    /                             solaris shared
       1 name-node        running    /zones/name-node              solaris  excl
       - sec-name-node    configured /zones/sec-name-node          solaris  excl
       - data-node1       configured /zones/data-node1             solaris  excl
       - data-node2       configured /zones/data-node2             solaris  excl
       - data-node3       configured /zones/data-node3             solaris  excl
    
    root@global_zone:~# zlogin -C name-node

    Listing 6. Booting the name-node Zone

  7. Provide the zone host information by using the following configuration for the name-node zone:
    1. For the host name, use name-node.
    2. Ensure the network interface name_node1 has an IP address of 192.168.1.1/24.
    3. Ensure the name service is based on your network configuration. In this article, we will use /etc/hosts for name resolution, so we won’t set up DNS for host name resolution.
  8. After finishing the zone setup, log in to the zone.
  9. Developing for Hadoop requires a Java programming environment. You can install Java Development Kit (JDK) 6 using the following command:
    root@name-node:~# pkg install  jdk-6
  10. Verify the Java installation:
    root@name-node:~# which java
    /usr/bin/java
    root@name-node:~# java -version
    java version "1.6.0_35"
    Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
    Java HotSpot(TM) Server VM (build 20.10-b01, mixed mode)
  11. Create a Hadoop user inside the name-node zone:
    root@name-node:~# groupadd hadoop
    root@name-node:~# useradd -g hadoop hadoop
    root@name-node:~# passwd hadoop
    root@name-node:~# mkdir -p /export/home/hadoop
    root@name-node:~# chown hadoop:hadoop /export/home/hadoop
  12. Configure an NTP client, as shown in the following example:
    1. Install the NTP package:
      root@name-node:~# pkg install ntp
    2. Edit the NTP client configuration file:
      root@name-node:~# grep -v ^# /etc/inet/ntp.conf
      
      server global_zone  prefer
      slewalways yes
      disable pll
    3. Enable the NTP service:
      root@name-node:~# svcadm enable ntp
  13. Add the Hadoop cluster members’ host names and IP addresses to /etc/hosts:
    root@name-node:~# cat /etc/hosts
    
    ::1             localhost
    127.0.0.1       localhost loghost
    192.168.1.1 name-node
    192.168.1.2 sec-name-node
    192.168.1.3 data-node1
    192.168.1.4 data-node2
    192.168.1.5 data-node3

    Note: If you are using the global zone as an NTP server, you must also add its host name and IP address to /etc/hosts.

Set Up SSH

  1. Set up SSH key-based authentication for the Hadoop user on the name-node zone in order to enable password-less login to the Secondary NameNode and the DataNodes:
    root@name-node  # su - hadoop
    
    hadoop@name-node $ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
    
    hadoop@name-node $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  2. Update $HOME/.profile:
    hadoop@name-node $ cat $HOME/.profile
    
    # Set JAVA_HOME 
    export JAVA_HOME=/usr/java
    
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:/usr/local/hadoop/bin
  3. Check that Hadoop runs by typing the following command:
    hadoop@name-node $ hadoop version
    Hadoop 1.0.4
    Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290
    Compiled by hortonfo on Wed Oct  3 05:13:58 UTC 2012
    From source with checksum fe2baea87c4c81a2c505767f3f9b71f4
  4. Create a directory for the Hadoop log files:
    root@name-node:~# mkdir /var/log/hadoop
    root@name-node:~# chown hadoop:hadoop /var/log/hadoop
  5. From the global zone, run the following command to create the sec-name-node zone as a clone of name-node:
    root@global_zone:~# zoneadm -z name-node shutdown 
    root@global_zone:~# zoneadm -z sec-name-node clone name-node
  6. Boot the sec-name-node zone:
    root@global_zone:~# zoneadm -z sec-name-node boot
    root@global_zone:~# zlogin -C sec-name-node
  7. As we experienced previously, the system configuration tool is launched, so do the final configuration for sec-name-node zone:
    1. For the host name, use sec-name-node.
    2. For the network interface, use secondary_name1.
    3. Use an IP address of 192.168.1.2/24.
    4. Ensure the name service is set to none.
  8. Perform similar steps for data-node1, data-node2, and data-node3:
    1. Do the following for data-node1:
      root@global_zone:~# zoneadm -z data-node1 clone name-node
      root@global_zone:~# zoneadm -z data-node1 boot
      root@global_zone:~# zlogin -C data-node1 
      • For the host name, use data-node1.
      • For the network interface, use data_node1.
      • Use an IP address of 192.168.1.3/24.
      • Ensure the name service is set to none.
    2. Do the following for data-node2:
      root@global_zone:~# zoneadm -z data-node2 clone name-node
      root@global_zone:~# zoneadm -z data-node2 boot
      root@global_zone:~# zlogin -C data-node2
      • For the host name, use data-node2.
      • For the network interface, use data_node2.
      • Use an IP address of 192.168.1.4/24.
      • Ensure the name service is set to none.
    3. Do the following for data-node3:
      root@global_zone:~# zoneadm -z data-node3  clone name-node
      root@global_zone:~# zoneadm -z data-node3 boot
      root@global_zone:~# zlogin -C data-node3
      • For the host name, use data-node3.
      • For the network interface, use data_node3.
      • Use an IP address of 192.168.1.5/24.
      • Ensure the name service is set to none.
  9. Boot the name-node zone:
    root@global_zone:~#  zoneadm -z  name-node  boot
  10. Verify that all the zones are up and running:
    root@global_zone:~# zoneadm list -cv
      ID NAME             STATUS     PATH                          BRAND    IP
       0 global           running    /                             solaris shared
      10 sec-name-node    running    /zones/sec-name-node          solaris  excl
      12 data-node1       running    /zones/data-node1             solaris  excl
      14 data-node2       running    /zones/data-node2             solaris  excl
      16 data-node3       running    /zones/data-node3             solaris  excl
      17 name-node        running    /zones/name-node              solaris  excl

Verify the SSH Setup

To verify that you can use SSH without a password as the Hadoop user, do the following.

  1. From the name-node zone, log in via SSH into the name-node (that is, to itself):
    hadoop@name-node $ ssh name-node 
    The authenticity of host 'name-node (192.168.1.1)' can't be established.
    RSA key fingerprint is 04:93:a9:e0:b7:8c:d7:8b:51:b8:42:d7:9f:e1:80:ca.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'name-node,192.168.1.1' (RSA) to the list of known hosts.
  2. Now, try to log in to the sec-name-node and the DataNodes. When you log in with SSH a second time, you shouldn’t get any prompt to add the host to the known hosts list.

Verify Name Resolution

Verify that all the Hadoop zones have the following host entries in /etc/hosts:

# cat /etc/hosts

::1             localhost
127.0.0.1       localhost loghost
192.168.1.1 name-node
192.168.1.2 sec-name-node
192.168.1.3 data-node1
192.168.1.4 data-node2
192.168.1.5 data-node3

Note: If you are using the global zone as an NTP server, you must also add its host name and IP address to /etc/hosts.

Format the HDFS File System from the NameNode

  1. Run the commands shown in Listing 7:
    root@name-node:~# mkdir -p /hdfs/name
    root@name-node:~# chown -R hadoop:hadoop /hdfs/
    root@name-node:~# su - hadoop
    hadoop@name-node:$ /usr/local/hadoop/bin/hadoop namenode -format
    13/01/06 20:00:32 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = name-node/192.168.1.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 1.0.4
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; 
    compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012
    ************************************************************/

    Listing 7. Formatting the HDFS File System

  2. On every DataNode (data-node1, data-node2, and data-node3), create a Hadoop data directory to store the HDFS blocks:
    root@data-node1:~# mkdir -p /hdfs/data
    root@data-node1:~# chown -R hadoop:hadoop /hdfs/
    root@data-node2:~# mkdir -p /hdfs/data
    root@data-node2:~# chown -R hadoop:hadoop /hdfs/
    root@data-node3:~# mkdir -p /hdfs/data
    root@data-node3:~# chown -R hadoop:hadoop /hdfs/

Start the Hadoop Cluster

Table 3 describes the startup scripts.

Table 3. Startup Scripts

Filename          Description
start-dfs.sh      Starts the Hadoop DFS daemons, the NameNode, and the DataNodes. Use this before start-mapred.sh.
stop-dfs.sh       Stops the Hadoop DFS daemons.
start-mapred.sh   Starts the Hadoop MapReduce daemons, the JobTracker, and the TaskTrackers.
stop-mapred.sh    Stops the Hadoop MapReduce daemons.
  1. From the name-node zone, start the Hadoop DFS daemons, the NameNode, and the DataNodes using the following command:
    hadoop@name-node:$ start-dfs.sh
    starting namenode, logging to /var/log/hadoop/hadoop--namenode-name-node.out
    data-node2: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node2.out
    data-node1: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node1.out
    data-node3: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-data-node3.out
    sec-name-node: starting secondarynamenode, logging to 
    /var/log/hadoop/hadoop-hadoop-secondarynamenode-sec-name-node.out
  2. Start the Hadoop Map/Reduce daemons, the JobTracker, and TaskTrackers using the following command:
    hadoop@name-node:$ start-mapred.sh
    starting jobtracker, logging to /var/log/hadoop/hadoop--jobtracker-name-node.out
    data-node1: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node1.out
    data-node3: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node3.out
    data-node2: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-data-node2.out
  3. To view a comprehensive status report, execute the command shown in Listing 8 to check the cluster status. The command will output basic statistics about the cluster health, such as NameNode details, the status of each DataNode, and disk capacity amounts.
    hadoop@name-node:$ hadoop dfsadmin -report
    Configured Capacity: 171455269888 (159.68 GB)
    Present Capacity: 169711053357 (158.06 GB)
    DFS Remaining: 169711028736 (158.06 GB)
    DFS Used: 24621 (24.04 KB)
    DFS Used%: 0%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    -------------------------------------------------
    Datanodes available: 3 (3 total, 0 dead)
    ...

    Listing 8. Checking the Cluster Status

    You can find the same information on the NameNode Web status page at http://<namenode IP address>:50070/dfshealth.jsp.

    Figure 3. Cluster Summary

Run a MapReduce Job

MapReduce is a framework for processing parallelizable problems across huge data sets using a cluster of computers. For more information about MapReduce, see http://en.wikipedia.org/wiki/MapReduce.

We will use the WordCount example, which reads text files and counts how often words occur. The input consists of plain-text files, and the output is a text file in which each line contains a word and the number of times the word occurred, separated by a tab. For more information about WordCount, see http://wiki.apache.org/hadoop/WordCount.
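
For reference, the sketch below shows roughly what the WordCount job we are about to run looks like in Java (using the Hadoop 1.x mapreduce API). You do not need to compile it for this article, because the pre-built hadoop-examples-1.0.4.jar already contains an equivalent class; it is shown only to make the map and reduce steps concrete.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // Job(conf, name) is the Hadoop 1.x constructor
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}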

  1. For the input file, download the following eBook from Project Gutenberg as a plain-text file with UTF-8 encoding, and store the file in a temporary directory of choice, for example /tmp/data: The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson.
  2. Copy the file to HDFS using the following command:
    hadoop@name-node:$ hadoop dfs -copyFromLocal /tmp/data/ /hdfs/data
  3. Verify that the file is located on HDFS:
    hadoop@name-node:$ hadoop dfs -ls /hdfs/data
    Found 1 items
    -rw-r--r--   3 hadoop supergroup     661664 2013-01-07 19:45 /hdfs/data/20417-8.txt
  4. Start the MapReduce job using the command shown in Listing 9:
    hadoop@name-node:$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /hdfs/data /hdfs/data/data-output
    13/01/07 15:20:21 INFO input.FileInputFormat: Total input paths to process : 1
    13/01/07 15:20:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library 
    for your platform... using builtin-java classes where applicable
    13/01/07 15:20:21 WARN snappy.LoadSnappy: Snappy native library not loaded
    13/01/07 15:20:22 INFO mapred.JobClient: Running job: job_201301071307_0006
    13/01/07 15:20:23 INFO mapred.JobClient:  map 0% reduce 0%
    13/01/07 15:20:38 INFO mapred.JobClient:  map 100% reduce 0%
    13/01/07 15:20:50 INFO mapred.JobClient:  map 100% reduce 100%
    13/01/07 15:20:55 INFO mapred.JobClient: Job complete: job_201301071307_0006
    13/01/07 15:20:55 INFO mapred.JobClient: Counters: 26
    13/01/07 15:20:55 INFO mapred.JobClient:   Job Counters

    Listing 9. Starting the MapReduce Job

  5. Verify the output data:
    hadoop@name-node:$ hadoop dfs -ls /hdfs/data/data-output
    Found 3 items
    -rw-r--r--   3 hadoop supergroup          0 2013-01-07 15:20 /hdfs/data/data-output/_SUCCESS
    drwxr-xr-x   - hadoop supergroup          0 2013-01-07 15:20 /hdfs/data/data-output/_logs
    -rw-r--r--   3 hadoop supergroup     196288 2013-01-07 15:20 /hdfs/data/data-output/part-r-00000
  6. The output data can contain sensitive information, so use ZFS encryption to protect the output data:
    1. Create the encrypted ZFS data set:
      root@name-node:~# zfs create -o encryption=on rpool/export/output
      Enter passphrase for 'rpool/export/output':
      Enter again:
    2. Change the ownership:
      root@name-node:~# chown hadoop:hadoop /export/output/
    3. Copy the output file from the Hadoop HDFS into ZFS:
      root@name-node:~# su - hadoop
      Oracle Corporation      SunOS 5.11      11.1    September 2012
      
      hadoop@name-node:$ hadoop dfs -getmerge /hdfs/data/data-output /export/output/
  7. Analyze the output text file. Each line contains a word and the number of times the word occurred, separated by a tab.
    hadoop@name-node:$ head /export/output/data-output
    "A      2
    "Alpha  1
    "Alpha,"        1
    "An     2
    "And    1
    "BOILING"       2
    "Batesian"      1
    "Beta   2
  8. Protect the output text file by unmounting the ZFS data set, and then unload the wrapping key for an encrypted data set using the following command:
    root@name-node:~# zfs key -u rpool/export/output

    If the command is successful, the data set is not accessible and it is unmounted.

    If you want to mount this ZFS file system, you need to provide the passphrase:

    root@name-node:~# zfs mount rpool/export/output
    Enter passphrase for 'rpool/export/output':

    By using a passphrase, you ensure that only those who know the passphrase can observe the output file. For more information about ZFS encryption, see “How to Manage ZFS Data Encryption.”