Using LDAP for Shared Authentication (an NFS v4 Requirement not Present in NFS v3)

NFS v3 is a stateless protocol, so to achieve file locking it relies on the Network Lock Manager (NLM) to maintain the consistency of any file. The NLM protocol in turn relies on the client IP address to keep track of the owner of a particular lock. Because an Exalogic is a multi-homed machine (many IP addresses), it is possible for a lock request to come from one IP address and the release to come from another; the NLM then does not honour the request and the lock is never released. This causes problems for WebLogic, which cannot obtain the lock and therefore fails to start up. Using NFS v4 resolves this problem because NFS v4 is a stateful protocol: the client is given a client ID during the initial negotiation phase and re-uses the same client ID for all subsequent requests. The client IP address is therefore not used to track lock ownership, the multiple IP addresses on the client no longer matter, and Exalogic operates as desired over NFS v4.
NFS v4 is a significant improvement over the NFS v3 protocol, particularly in the area of security: a client accessing a share also presents the user's security principal, which is used to secure the file. This means that NFS v4 requires some form of shared repository of users that it can check in order to grant appropriate access to any particular file or directory. Two options are widely used in the industry for this: historically the Network Information Service (NIS), also known as Yellow Pages (YP), and the other is to use a Directory Server via the Lightweight Directory Access Protocol (LDAP).
This posting explains how to install and get a very basic LDAP server operational, and how to configure the Linux compute nodes and ZFS storage appliance of an Exalogic to use LDAP to look up users, passwords, hosts etc. from the directory.

Directory Server Architecture

To begin with, a very simple non-HA approach is taken: one compute node of the Exalogic machine is nominated as the host for the directory and all other nodes use its services, giving centralised authentication and management. Future postings will cover setting up the directory in multi-master or master-slave configurations so that the system becomes fault tolerant.
We will also consider security, both at the transport layer, where we will change the directory to use LDAPS so that traffic cannot be picked up off the network, and within the directory structure, so that users can be given access to various combinations of compute nodes.

Prerequisites

First off create a group and user on your machine that we can use for testing the directory.
For example:-
# groupadd -g 500 donald
# useradd -g donald -d /home/donald -m -s /bin/bash -u 500 donald
Then log in as this user to check that all is fine.
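For example, a quick check of the new account (a minimal sanity test; the exact output will vary) could be:
# su - donald
$ id
$ exit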

Installing openldap

The first step towards getting the shared repository set up is to configure the directory server. In this example we are using openldap, which is available as a package in the Oracle public Yum repository; alternatives include Oracle Internet Directory, Oracle Directory Server Enterprise Edition, OpenDS etc. For openldap a number of packages must be installed onto the compute nodes. For the nodes that will host a directory server instance (master or slave) the following packages are required:-
  • openldap
  • openldap-clients
  • openldap-devel
  • nss_ldap
  • openldap-servers
and for the machines that will operate only as clients of the directory service the following packages should be installed:-
  • openldap
  • openldap-clients
  • openldap-devel
  • nss_ldap
To do this, either set up your machine to connect to the Oracle public Yum repository or download the necessary packages.
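If the node is not yet connected to the public Yum repository, one approach (a sketch only; the repository URL and file name depend on your Oracle Linux release, so check the current instructions for your version) is to download the repo definition and then confirm it is visible:
# cd /etc/yum.repos.d
# wget http://public-yum.oracle.com/public-yum-el5.repo
# yum repolist
You may also need to edit the downloaded file to set enabled=1 for the base channel of your release.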
For installing using yum:-
# yum install openldap-clients
Loaded plugins: security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package openldap-clients.x86_64 0:2.3.43-12.el5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
openldap-clients x86_64 2.3.43-12.el5 el5_u5_base 221 k

Transaction Summary
================================================================================
Install 1 Package(s)
Upgrade 0 Package(s)

Total download size: 221 k

Is this ok [y/N]: y
Downloading Packages:
openldap-clients-2.3.43-12.el5.x86_64.rpm | 221 kB 00:00
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : openldap-clients 1/1

Installed:
openldap-clients.x86_64 0:2.3.43-12.el5

Complete!
# yum install openldap-devel

.
.
# yum install openldap-servers
.
.
(On my test systems openldap and nss_ldap packages had already been installed.)

Configuring the Directory Server

The main configuration files for openldap are all held in /etc/openldap. Open slapd.conf and set the values of suffix and rootdn to something appropriate for the environment. This file also names the directory on the file system that will host the directory instance; by default this is set to /var/lib/ldap, so change it to match your directory name.
You can also set the password for the Manager user in the slapd.conf file. This can be set unencrypted, but for security it should really be a hashed value. Openldap includes the utility slappasswd, which hashes a password you enter and writes the hashed value to standard output. Use this to hash your password and then put the value into the configuration file.
For example:-
[root@vbelcn01 openldap]# slappasswd
New password:
Re-enter new password:
{SSHA}Y0iT4bFZGz1QQ34NTjqwmc4N7BFpURsN
And copy the hashed value into the slapd.conf file.
#######################################################################
# ldbm and/or bdb database definitions
#######################################################################

database bdb

suffix "dc=el01,dc=com"
rootdn "cn=Manager,dc=el01,dc=com"
rootpw {SSHA}Y0iT4bFZGz1QQ34NTjqwmc4N7BFpURsN
directory /var/lib/ldap/el01.com
Now create the directory that will be used to host the Directory Server instance and set the ownership to the ldap user. (The ldap user is created when the packages are installed.)
# mkdir /var/lib/ldap/el01.com
# chown ldap:ldap /var/lib/ldap/el01.com

Create the directory and populate with existing values

The directory is to be used to manage users, groups, hosts etc., and openldap comes with a number of scripts that can be used to copy the existing values into the directory. These scripts are installed into /usr/share/openldap/migration. Start by editing the migrate_common.ph script and replace all instances of the domain padl.com with the domain that you are using for your directory.
For example, in our case we are using the domain el01.com, so substitute all string occurrences of "padl" with "el01".
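One way to do this in a single step (a simple sketch; keep a backup of the original file) is with sed:
# cd /usr/share/openldap/migration
# cp migrate_common.ph migrate_common.ph.orig
# sed -i 's/padl/el01/g' migrate_common.ph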
Run updatedb and locate DB_CONFIG to find the example file, then copy it into the database directory to create the configuration file for the database.
[root@vbelcn01 el01.com]# updatedb
[root@vbelcn01 el01.com]# locate DB_CONFIG
/etc/openldap/DB_CONFIG.example
[root@vbelcn01 el01.com]# cp /etc/openldap/DB_CONFIG.example /var/lib/ldap/el01.com/DB_CONFIG
Run the migrate_all_offline.sh script to populate the directory with the values from your various files.
[root@vbelcn01 tmp]# cd /usr/share/openldap/migration
[root@vbelcn01 migration]# ./migrate_all_offline.sh 
Creating naming context entries…
Migrating groups…
Migrating hosts…
Migrating networks…
Migrating users…
Migrating protocols…
Migrating rpcs…
Migrating services…
Migrating netgroups…
Importing into LDAP…
Migrating netgroups (by user)…
Migrating netgroups (by host)…
Preparing LDAP database…
=> bdb_tool_entry_put: id2entry_add failed: DB_KEYEXIST: Key/data pair already exists (-30996)
=> bdb_tool_entry_put: txn_aborted! DB_KEYEXIST: Key/data pair already exists (-30996)
slapadd: could not add entry dn=”dc=vbel,dc=com” (line=5): txn_aborted! DB_KEYEXIST: Key/data pair already exists (-30996)
Migration failed: saving failed LDIF to /tmp/nis.ldif.z17552

The error reported relates to the services defined on the server, and as we are not planning to use these via LDAP lookups it does not appear to affect the behaviour of the directory. Be warned: if you are doing this in an environment that has been in use for a while, make sure you have no duplicate entries in your hosts file. If you do, the migrate script fails to update the directory because of the duplicate entry and stops. The error reported will be similar to:-

bdb_tool_entry_put: id2entry_add failed: DB_KEYEXIST: Key/data pair already exists (-30996)
bdb_tool_entry_put: txn_aborted! DB_KEYEXIST: Key/data pair already exists (-30996)
slapadd: could not add entry dn=”cn=exatest2.oracle.com,ou=Hosts,dc=el01,dc=com” (line=469): txn_aborted! DB_KEYEXIST: Key/data pair already exists (-30996)
Migration failed: saving failed LDIF to /tmp/nis.ldif.DA3373

Now we want to start up the directory. Prior to issuing the command to start the daemon we must change all the files in the directory to be owned by ldap:ldap.

[root@vbelcn01 migration]# cd /var/lib/ldap/el01.com/
[root@vbelcn01 el01.com]# ls -l
total 3312
-rw-r--r-- 1 root root 2048 Jan 4 11:00 alock
-rw------- 1 root root 122880 Jan 4 10:59 cn.bdb
-rw------- 1 root root 24576 Jan 4 11:00 __db.001
-rw------- 1 root root 368640 Jan 4 11:00 __db.002
-rw------- 1 root root 270336 Jan 4 11:00 __db.003
-rw------- 1 root root 98304 Jan 4 11:00 __db.004
-rw------- 1 root root 557056 Jan 4 11:00 __db.005
-rw------- 1 root root 24576 Jan 4 11:00 __db.006
-rw-r----- 1 root root 921 Jan 4 10:54 DB_CONFIG
-rw------- 1 root root 86016 Jan 4 10:59 dn2id.bdb
-rw------- 1 root root 12288 Jan 4 10:59 gidNumber.bdb
-rw------- 1 root root 245760 Jan 4 10:59 id2entry.bdb
-rw------- 1 root root 10485760 Jan 4 11:00 log.0000000001
-rw------- 1 root root 8192 Jan 4 10:59 loginShell.bdb
-rw------- 1 root root 8192 Jan 4 10:59 memberUid.bdb
-rw------- 1 root root 8192 Jan 4 10:59 nisMapName.bdb
-rw------- 1 root root 36864 Jan 4 10:59 objectClass.bdb
-rw------- 1 root root 8192 Jan 4 10:59 ou.bdb
-rw------- 1 root root 20480 Jan 4 10:59 uid.bdb
-rw------- 1 root root 8192 Jan 4 10:59 uidNumber.bdb
[root@vbelcn01 el01.com]# chown -R ldap:ldap *
Now start the DB and ensure it starts on boot.
[root@vbelcn01 vbel.com]# service ldap start
Checking configuration files for slapd: config file testing succeeded
[ OK ]
Starting slapd: [ OK ]
[root@vbelcn01 vbel.com]# chkconfig ldap on
[root@vbelcn01 vbel.com]# chkconfig --list ldap
ldap 0:off 1:off 2:on 3:on 4:on 5:on 6:off
Then check that it is accessible and contains the values expected by issuing an ldapsearch.
[root@vbelcn01 vbel.com]# ldapsearch -x -b "ou=People,dc=vbel,dc=com" '(uidNumber=500)'
# extended LDIF
#
# LDAPv3
# base <ou=People,dc=vbel,dc=com> with scope subtree
# filter: (uidNumber=500)
# requesting: ALL
#

# donald, People, vbel.com
dn: uid=donald,ou=People,dc=vbel,dc=com
uid: donald
cn: Donald Forbes
objectClass: account
objectClass: posixAccount
objectClass: top
objectClass: shadowAccount
userPassword:: e2NyeXB0fSQxJHNtMHNQRVpLJG9XUm5VdGkzNnJJc0NkOEVMNTU5Mi4=
shadowLastChange: 15342
shadowMin: 0
shadowMax: 99999
shadowWarning: 7
loginShell: /bin/bash
uidNumber: 500
gidNumber: 500
homeDirectory: /home/donald
gecos: Donald Forbes

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1
[root@vbelcn01 vbel.com]#

Note: in this case I narrowed the search criteria to match my own user record, identified by the UID of 500. The criteria can be widened to return all the hosts, users, services etc.
[root@vbelcn01 vbel.com]# ldapsearch -x -b "dc=vbel,dc=com" '(objectclass=*)'

Configuring a client to use LDAP

Two approaches are possible: using the authconfig command, which automatically sets the values in multiple files, or manually editing the files.

The authconfig command will be similar to the one shown below:-

 authconfig --enableldap --enableldapauth --ldapserver=<LDAP server IP>:389 --ldapbasedn="dc=el01,dc=com" --update

It is also important to edit the /etc/idmapd.conf file to add the domain, as described in the section below on editing the individual files.

If you want to do this by hand then:-
First edit the /etc/ldap.conf file to specify the directory server host IP address and the base DN used.

host 192.168.23.105
base dc=el01,dc=com
Edit the /etc/nsswitch.conf file to specify all the entries that you would like resolved via LDAP, along with the resolution order. For example:-
passwd: files ldap
shadow: files ldap
group: files ldap
Set up the identity mapping needed for NFS v4 users. This is managed by the idmapd daemon in Linux. Edit the /etc/idmapd.conf file and specify the domain name you wish to use. For example, setting it to el01 would result in an idmapd.conf file that looks like:-
[root@vbelcn01 etc]# cat /etc/idmapd.conf
[General]

Verbosity = 0
Pipefs-Directory = /var/lib/nfs/rpc_pipefs
Domain = el01

[Mapping]

Nobody-User = nobody
Nobody-Group = nobody

[Translation]
Method = nsswitch

Now restart the idmapd service.
# service rpcidmapd stop
# service rpcidmapd start
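Before testing an interactive login it is worth confirming that name resolution is actually going through LDAP. A quick check, using the test account created earlier, is:
# getent passwd donald
# getent group donald
If the entries exist only in the directory (and not in the local files), seeing them returned here confirms that nsswitch is consulting LDAP.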

Test your users

First check that you can log in as the user you created at the beginning of this exercise. If all is OK then we can continue, create a new user in LDAP and see whether we can authenticate as that user.
There are various command line tools (ldapadd, ldapsearch etc.) that can be used to add users, modify them and so on. However, for a test system you may well have access to GUI tools; there are many available, but for flexibility I have found Apache Directory Studio (http://directory.apache.org/studio/) to be very useful. Using one of these tools, create a user similar to the one you created with the unix useradd command. (I exported my user to an LDIF file, edited the file to change the dn, cn, uidNumber, uid and password, then imported the changed LDIF back into the directory. This approach means all the objectClass requirements are met.)
Having got the new entry into the directory, attempt to log in as that user, change its password, log out and then re-authenticate. (Note: if you have a problem authenticating, try setting the password in the directory to use crypt with a new salt.)
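As a sketch of the pure command-line route (the user details below are purely illustrative; the important point is that the entry carries the same objectClasses as the migrated ones), a new user could be added with ldapadd and a small LDIF file:
# cat > /tmp/newuser.ldif <<EOF
dn: uid=daisy,ou=People,dc=el01,dc=com
uid: daisy
cn: Daisy Duck
objectClass: account
objectClass: posixAccount
objectClass: top
objectClass: shadowAccount
userPassword: {SSHA}<paste the output of slappasswd here>
loginShell: /bin/bash
uidNumber: 501
gidNumber: 500
homeDirectory: /home/daisy
EOF
# ldapadd -x -D "cn=Manager,dc=el01,dc=com" -W -f /tmp/newuser.ldif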

Setup the Shared Storage to use LDAP Authentication

To configure the ZFS storage appliance to use LDAP, select the LDAP service and specify the base DN to search from, namely dc=el01,dc=com, assuming this was the suffix you used in your directory server. Set the Search Scope to "sub-tree", which is a recursive search covering all entries below the base DN. Then click the + sign next to Servers to add the IP address and port of the server running the directory, and enable the LDAP service.
For NFS v4 identity mapping, ensure that the NIS service is turned off and that the NFS service has the option to use the DNS domain unselected. Then set the custom identity domain field to the same domain value used in the /etc/idmapd.conf file on each client compute node. The default name for this is localdomain, but it is recommended that you change it to reflect your environment, e.g. el01.

Testing that the Shared Storage is Configured Correctly

Having made the configuration changes to the shared storage, the simplest way to check that it is configured correctly is to create a project and a share, giving the share "Root Directory Access" as one of the users and groups that you have set up in the directory server. The software complains if the user name or group is invalid. If this is OK you can double-check by dropping to the command line and listing the directory you have just created for the share; the file permissions should match the user and group you have in LDAP.
For example, the configuration details for a test share on /export/ldaptest/test are shown below:-
vbzfs:> shares
vbzfs:shares> select ldaptest
vbzfs:shares ldaptest> select test
vbzfs:shares ldaptest/test> ls
Properties:
aclinherit = restricted (inherited)
atime = true (inherited)
casesensitivity = mixed
checksum = fletcher4 (inherited)
compression = off (inherited)
dedup = false (inherited)
compressratio = 100
copies = 1 (inherited)
creation = Wed Jan 11 2012 12:19:52 GMT+0000 (UTC)
logbias = latency (inherited)
mountpoint = /export/ldaptest/test (inherited)
normalization = none
quota = 0
quota_snap = true
readonly = false (inherited)
recordsize = 128K (inherited)
reservation = 0
reservation_snap = true
secondarycache = all (inherited)
shadow = none
nbmand = false (inherited)
sharesmb = off (inherited)
sharenfs = sec=sys,anon=ldaptest,rw=@192.168.23.0/24,root=@192.168.23.0/24 (inherited)
snapdir = hidden (inherited)
utf8only = true
vscan = false (inherited)
sharedav = off (inherited)
shareftp = off (inherited)
sharesftp = off (inherited)
sharetftp = (inherited)
pool = exalogic
canonical_name = exalogic/local/ldaptest/test
exported = true (inherited)
nodestroy = false
space_data = 31K
space_unused_res = 0
space_snapshots = 0
space_available = 69.9G
space_total = 31K
root_group = donald
root_permissions = 755
root_user = ldaptest
origin =

Children:
snapshots => Manage snapshots
replication => Manage remote replication
users => View per-user usage and manage user quotas
groups => View per-group usage and manage group quotas
shadow => Manage shadow data migration

vbzfs:shares ldaptest/test> 

Note: this is simply to test that the shared storage has the correct visibility of the directory server and is authenticating correctly. Under normal circumstances the root user and group should be set to nobody:nobody for security reasons; only change them when there is a specific reason to do so.

Testing the NFS version 4 mounts

Now we are ready to mount the NFS v4 shares. To do this persistently, first create the directory where the share will be mounted.
# mkdir -p /u01/ldaptest/test
And then add an entry to the /etc/fstab file to specify the mount point:
vbzfs:/export/ldaptest/test /u01/ldaptest/test nfs4 rw,bg,hard,nointr,rsize=131072,wsize=131072 0 0
This can be mounted with the command
# mount -a
and unmounted
# umount -a -t nfs4 (to unmount all mounted nfs4 file systems.)
or
# umount /u01/ldaptest/test (to unmount the one specific partition.)
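Once mounted, it is worth confirming both the NFS version in use and that idmapd is mapping owners as expected. For example:
# nfsstat -m
# ls -ld /u01/ldaptest/test
The mount should be reported as version 4, and the directory ownership should come back as the user and group set on the share (ldaptest and donald in the example above) rather than as numeric IDs or nobody:nobody.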

Need to Remount File Systems After System Reboots When NFSv4 Is Used

When you use NFSv4, you are required to re-mount all file systems by running the mount -a command on Oracle Linux each time you reboot the compute node.

To automatically re-mount the file systems after rebooting a compute node running Oracle Linux, complete the following steps:

1. Log in to the compute node as the root user.

2. Create a script named nfs4_automount_on_init in the /etc/init.d/ directory:

# vi /etc/init.d/nfs4_automount_on_init

3. Add the following to the script:

#!/bin/bash
#chkconfig: 345 80 05
#description: NFSv4 Automount
case "$1" in
  "start")
    mount -a
    ;;
  "stop")
    ;;
esac
Note:

In chkconfig: 345 80 05, the first number, 345, indicates the run levels. The second number, 80, is the start priority, and the third number, 05, is the stop priority. The start and stop priorities should be unique among all registered services, so you may need to adjust these values for your environment if you have installed other services.

4. Make the script executable:

# chmod +x /etc/init.d/nfs4_automount_on_init

5. Run the following command to register the script:

# chkconfig --add nfs4_automount_on_init

6. Optionally, run the following command to verify the registration of the script:

# chkconfig --list

Are you getting the most out of your Exadata? (by Tanel Poder)

Checking Whether Smart Scanning Is Used

Let's first see how to identify whether your queries are taking full advantage of the Exadata "secret sauce", Smart Scans. Here's a simple execution plan, which has some Exadata-specific elements in it:

[Screenshot: execution plan output showing TABLE ACCESS STORAGE FULL row sources and the storage() predicates]

All the TABLE ACCESS FULL row sources do have the STORAGE keyword in them. It is important to know that this does not mean that Exadata smart scans are used. The STORAGE keyword in a row source means that this execution plan is using an Exadata storage-aware, smart-scan-capable row source, but it does not tell us whether a smart scan actually happened for a given execution. Also, at the bottom you see two storage() predicates in addition to the usual filter() and access() ones. This shows that if a smart scan were chosen for scanning some segments during a given SQL execution, then Oracle would attempt to offload the predicate conditions into the cells. If the smart scan was not used, then predicate offloading did not happen either (as filter predicate offloading is part of the smart scan).

Important:

The key thing to remember is that an Exadata smart scan is not only a property of an execution plan: it is actually a runtime decision made separately for each table/index/partition segment accessed with a full scan. It is not possible to determine whether a smart scan happened just by looking at the execution plan; you should measure execution metrics from V$SQL or V$SESSION to be sure.

It is also possible that a smart scan is attempted against a segment, but during the smart scan execution the cells have to fall back to regular block IO mode for some blocks and ship those blocks back to the database for processing, instead of extracting rows from them inside the cell. For example, this happens for blocks where the consistent reads require access to undo data, or where a block contains chained rows. There are more reasons, and we explain these in detail in the upcoming Expert Oracle Exadata book.

This should also explain why there is a filter predicate (executed in the database layer) in addition to every storage predicate in the plan: sometimes the filtering cannot be entirely offloaded to the storage cells and the database layer has to perform the final filtering.

Serial Execution Without Smart Scan

Let's look at an example query now, executed in serial mode at first. It returns all customers who have ever made orders where the total order amount exceeds their customer-specific credit limit. All the tables are partitioned and their total size is around 11GB. Note that this query is written against a regular normalized OLTP-style schema, not the star or snowflake schema you are likely using in your DW databases, but for the purpose of this demo it should be enough.

[Screenshot: the example query joining the customer and order tables]

When executed on an otherwise idle quarter rack Exadata V2 installation, it took 151 seconds to run, which seems too long, given that smart scans on even a quarter rack can scan data on hard disks at multiple gigabytes per second (and even faster from flash cache).

So, I will identify the SQL ID, child cursor number and execution ID of this current ongoing SQL execution first:

[Screenshot: query against V$SESSION returning the SQL ID, child cursor number and SQL execution ID]
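A query along the following lines returns those identifiers (a sketch only; adjust the WHERE clause to pick out your own session):
select sid, username, sql_id, sql_child_number, sql_exec_id
  from v$session
 where status = 'ACTIVE'
   and sql_id is not null;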

Now, let’s look into V$SQL first, using the SQL ID and child cursor number:

[Screenshot: query against V$SQL showing physical_read_bytes, io_interconnect_bytes and io_cell_offload_eligible_bytes]

The physical_read_bytes metric (phyrd_mb) shows that the Oracle database layer has issued 10833 MB worth of IO calls for this SQL. The io_interconnect_bytes metric (ret_mb) shows that this query has also used 10833 MB worth of IO interconnect traffic (between the database host and the storage cells). So the smart scans were not able to reduce the cell-to-database IO traffic at all. In fact, looking at io_cell_offload_eligible_bytes (elig_mb), it is zero. This means that the database has not even tried to do smart scan offloading for this statement. If 10GB worth of segments had been read via smart scans, then the "eligible bytes for offload" would also show 10GB. So the io_cell_offload_eligible_bytes metric is key to determining whether any offloading has been attempted for a query. Note that V$SQL accumulates statistics over multiple executions of the same query, so if this cursor has already been executed before (and is still in cache) you should not look at the absolute values in the V$SQL columns, but rather at how much they increase (calculate deltas from before- and after-test values).
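Expressed as a query, the check looks roughly like this (a sketch matching the column aliases used above; substitute your own SQL ID and child cursor number):
select sql_id,
       physical_read_bytes/1024/1024            phyrd_mb,
       io_interconnect_bytes/1024/1024          ret_mb,
       io_cell_offload_eligible_bytes/1024/1024 elig_mb
  from v$sql
 where sql_id = 'your_sql_id_here'
   and child_number = 0;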

Another option is to look at what the session(s) executing this SQL are doing and waiting for. You can use SQL trace or ASH for this, or any other tool that is capable of extracting the wait information from Oracle. Here's a wait profile example from the problem session, taken with Oracle Session Snapper:

[Screenshot: Session Snapper wait profile dominated by the cell multiblock physical read wait event]

Indeed we are not seeing the cell smart table scan (or cell smart index scan) wait events, but a cell multiblock physical read wait event at the top, which shows that half of the query time is spent doing regular multiblock IOs from storage to database. Note that there is still a chance that smart scans are happening for some tables involved in the query (but do not show up in the top waits thanks to their fast, asynchronous nature), while the regular multiblock reads show up thanks to the rest of the tables not using smart scans. Luckily it is possible to check exactly which execution plan row sources use the smart scan and how much they benefit from it.

We can get this information from the Real Time SQL Monitoring views, either by manually querying V$SQL_PLAN_MONITOR or via the Grid Control/Database Control UI. I like to always know where the data is coming from, so let's start with a manual query. Note that the sql_id and sql_exec_id values are taken from my previous query against V$SESSION above:

[Screenshot: query against V$SQL_PLAN_MONITOR showing per-row-source physical_read_bytes and io_interconnect_bytes]

V$SQL_PLAN_MONITOR does not have the io_offload_eligible_bytes column showing how many bytes worth of segments were scanned using smart scans, but the io_interconnect_bytes column tells us how many bytes of IO traffic between the database host and the cells were done for the given access path. In the above example, exactly the same amount of data was returned by the cells (RET_MB) as was requested by the database (PHYRD_MB), so there was no interconnect traffic saving at all. When these numbers match exactly, it is a good indication that no offloading was performed (all the IO requested by the database was returned to it as untouched blocks, and no data reduction due to filtering and projection offloading happened in the cells). So this is additional confirmation that none of the tables were scanned with a smart scan.
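A per-row-source query of this kind (again only a sketch; plug in the sql_id and sql_exec_id captured earlier) produces the figures discussed above:
select plan_line_id, plan_operation, plan_options, plan_object_name,
       physical_read_bytes/1024/1024   phyrd_mb,
       io_interconnect_bytes/1024/1024 ret_mb
  from v$sql_plan_monitor
 where sql_id = 'your_sql_id_here'
   and sql_exec_id = 16777216  -- replace with your own sql_exec_id
 order by plan_line_id;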

I caused the above problem deliberately: I set the undocumented parameter _small_table_threshold to a big value (1000000 blocks) in my test session. This made the Oracle table scanning function treat all the scanned partitions as "small" segments, which should be scanned using regular buffered reads as opposed to direct path reads, which are a prerequisite for smart scans. Note that the above query was executed in serial mode, not parallel. Starting from Oracle 11g, Oracle can decide to use direct path reads (and as a result smart scans) even for serial sessions. The direct path read decision is made at runtime, separately for each table/index/partition segment accessed, and depends on how many blocks the segment has under its HWM, the buffer cache size, and so on. So you might encounter Oracle deciding not to use direct path reads (and thus smart scans) in real life too, and as it is an automatic decision, it may change unexpectedly.
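For reference, the session-level change described above (an undocumented parameter, so something to try only on a test system) would be along the lines of:
SQL> alter session set "_small_table_threshold" = 1000000;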

Serial Execution With Smart Scan

Let's run the same query without my trick for disabling smart scans; I am using default session parameters now. The query, still executed in serial, took 26 seconds (as opposed to 151 seconds previously). The wait profile looks different: the cell smart table scan wait is present and there are no regular block IO related waits. The CPU usage percentage is higher, but remember that this query completed over 5 times faster than previously. Higher CPU usage with a lower response time is a good thing, as it means your query spent less time waiting and more time working, and was not throttled by IO and other issues.

[Screenshot: wait profile showing the cell smart table scan wait event and higher CPU usage]

For simple enough queries you might actually see that only a couple of percent of response time is spent waiting for smart scans and all the rest is CPU usage. This happens thanks to the asynchronous nature of smart scans, where cells work for the database sessions independently and may be able to constantly have some data ready for the DB session to consume.

Now, let’s look into V$SQL entries of that cursor (a new child cursor 1 was parsed thanks to me changing the _small_table_threshold parameter value back):

[Screenshot: V$SQL metrics for the new child cursor, with ELIG_MB equal to PHYRD_MB]

Apparently all the physical IO requested was issued using the smart scanning method, as ELIG_MB is the same as PHYRD_MB, and some filtering and projection was done in the cells, as the interconnect traffic (RET_MB) of the statement is about 69.2% less than the scanned segment sizes on disk.

Let's drill down into the individual execution plan lines, which allows us to look at the efficiency of accessing individual tables. In real-life DWs you are more likely to be dealing with 10-20 table joins, not 3-table joins. Note that the sql_exec_id has increased as I have re-executed the SQL statement:

[Screenshot: per-plan-line V$SQL_PLAN_MONITOR output for the re-executed statement]

The above output clearly shows that the ORDERS and CUSTOMERS tables benefit from smart scan the most.

Note that the execution plan join order has changed too; this is thanks to the automatic cursor re-optimization using cardinality feedback from the previous child cursor's execution statistics.

Summary

In order to get the most out of your Exadata performance for your DW application, you will have to use smart scans. Otherwise you will not be able to take advantage of the cells' computing power and IO reduction features. Full segment scans (such as full table/partition scans and fast full index scans) are a prerequisite for smart scans. Additionally, direct path reads have to be used in order for smart scans to kick in. For serial sessions, the direct path read decision is based on the scanned segment size, the buffer cache size and some other factors. For parallel execution, direct path access is always used for full scans, unless you use the new parallel_degree_policy = AUTO feature, in which case the decision again becomes dynamic.