Making Hadoop Accessible to your Employees with LDAP

August 29, 2015 Leave a comment

Hue easily integrates with your corporation’s existing identity management systems and provides authentication mechanisms for SSO providers. By changing a few configuration parameters, your employees can start doing big data analysis in their browser by leveraging an existing security policy.

This blog post details the various features and capabilities available in Hue for LDAP:

  1. Authentication
  2. Search bind
  3. Direct bind

Importing users

Importing groups

Synchronizing users and groups

  1. Attributes synchronized 
  2. Useradmin interface
  3. Command line interface

LDAP search

Case sensitivity

LDAPS/StartTLS support

Debugging

Notes

Summary

1.    Authentication

The typical authentication scheme for Hue takes of the form of the following image:

Ekran Resmi 2015-08-29 17.17.32

Passwords are saved into the Hue databases.

With the Hue LDAP integration, users can use their LDAP credentials to authenticate and inherit their existing groups transparently. There is no need to save or duplicate any employee password in Hue:

Ekran Resmi 2015-08-29 17.17.45

There are several other ways to authenticate with Hue: PAM, SPNEGO, OpenID, OAuth, SAML2, etc. This section details how Hue can authenticate against an LDAP directory server.

When authenticating via LDAP, Hue validates login credentials against a directory service if configured with this authentication backend:

1

2

3

[desktop]

 [[auth]]

 backend=desktop.auth.backend.LdapBackend

The LDAP authentication backend will automatically create users that don’t exist in Hue by default. Hue needs to import users in order to properly perform the authentication. The password is never imported when importing users. The following configuration can be used to disable automatic import:

1

2

3

[desktop]

  [[ldap]]

  create_users_on_login=false

The purpose of disabling the automatic import is to only allow to login a predefined list of manually imported users.

The case sensitivity of the authentication process is defined in the “Case sensitivity” section below.

Note

If a user is logging in as A before enabling LDAP auth and then after enabling LDAP auth logs in as B,  all workflows, queries etc will be associated with the user A and be unavailable. The old workflows would need to have their owner fields changed to B: this can be done in the Hue shell.

There are two different ways to authenticate with a directory service through Hue:

  1. Search bind
  2. Direct bind

1.1.    Search bind

The search bind mechanism for authenticating will perform an ldapsearch against the directory service and bind using the found distinguished name (DN) and password provided. This is, by default, used when authenticating with LDAP. The configurations that affect this mechanism are outlined in “LDAP search”.

1.2.    Direct bind

The direct bind mechanism for authenticating will bind to the ldap server using the username and password provided at login. There are two options that can be used to choose how Hue binds:

  1. nt_domain – Domain component for User Principal Names (UPN) in active directory. This active directory specific idiom allows Hue to authenticate with active directory without having to follow LDAP references to other partitions. This typically maps to the email address of the user or the users ID in conjunction with the domain.
  2. ldap_username_pattern – Provides a template for the DN that will ultimately be sent to the directory service when authenticating.

If ‘nt_domain’ is provided, then Hue will use a UPN to bind to the LDAP service:

1

2

3

[desktop]

  [[ldap]]

  nt_domain=example.com

Otherwise, the ‘ldap_username_pattern’ configuration is used (the <username> parameter will be replaced with the username provided at login):

1

2

3

[desktop]

    [[ldap]]

    ldap_username_pattern=”uid=&lt;username&gt;,ou=People,DC=hue-search,DC=ent,DC=cloudera,DC=com”

Typical attributes to search for include:

  1. uid
  2. sAMAccountName

To enable direct bind authentication, the ‘search_bind_authentication’ configuration must be set to false:

1

2

3

[desktop]

    [[ldap]]

    search_bind_authentication=false

2.    Importing users

If an LDAP user needs to be part of a certain group and have a particular set of permissions, then this user can be imported via the Useradmin interface:

Ekran Resmi 2015-08-29 17.17.54

As you can see, there are two options available when importing:

  1. Distinguished name
  2. Create home directory

If ‘Create home directory’ is checked, when the user is imported their home directory in HDFS will automatically be created, if it doesn’t already exist.

If ‘Distinguished name’ is checked, then the username provided must be a full distinguished name (eg: uid=hue,ou=People,dc=gethue,dc=com). Otherwise, the Username provided should be a fragment of a Relative Distinguished Name (rDN) (e.g., the username “hue” maps to the rDN “uid=hue”). Hue will perform an LDAP search using the same methods and configurations as defined in the “LDAP search” section. Essentially, Hue will take the provided username and create a search filter using the ‘user_filter’ and ‘user_name_attr’ configurations. For more information on how Hue performs LDAP searches, see the “LDAP Search” section.

The case sensitivity of the search and import processes are defined in the “Case sensitivity” section.

3.    Importing groups

Groups are importable via the Useradmin interface. Then, users can be added to this group, which would provide a set of permissions (e.g. accessing the Impala application). This function works almost the exact same way as user importing, but has a couple of extra features.

Ekran Resmi 2015-08-29 17.18.06

As the above image portrays, not only can groups be discovered via DN and rDN search, but users that are members of the group and members of the group’s subordinate groups can be imported as well. Posix groups and members are automatically imported if the group found has the object class ”posixGroup”.

4.    Synchronizing users and groups

Users and groups can be synchronized with the directory service via the Useradmin interface or via a command line utility. The images from the previous sections use the words “Sync” to indicate that when a name of a user or group that exists in Hue is being added, it will in fact be synchronized instead. In the case of importing users for a particular group, new users will be imported and existing users will be synchronized. Note: Users that have been deleted from the directory service will not be deleted from Hue. Those users can be manually deactivated from Hue via the Useradmin interface.

The groups of a user can be synced when he logs in (to keep its permission in sync):

1

2

3

4

[desktop]

  [[ldap]]

  # Synchronize a users groups when they login

  ## sync_groups_on_login=false

4.1.    Attributes synchronized

Currently, only the first name, last name, and email address are synchronized. Hue looks for the LDAP attributes ‘givenName’, ‘sn’, and ‘mail’ when synchronizing.  Also, the ‘user_name_attr’ config is used to appropriately choose the username in Hue. For instance, if ‘user_name_attr’ is set to “uid”, then the “uid” returned by the directory service will be used as the username of the user in Hue.

4.2.    Useradmin interface

The “Sync LDAP users/groups” button in the Useradmin interface will  automatically synchronize all users and groups.

Ekran Resmi 2015-08-29 17.18.16

4.3.    Command line interface

Here’s a quick example of how to use the command line interface to synchronize users and groups:

<hue root>/build/env/bin/hue sync_ldap_users_and_groups

5.    LDAP search

There are two configurations for restricting the search process:

  1. user_filter – General LDAP filter to restrict the search.
  2. user_name_attr – Which attribute will be considered the username to search against.

Here is an example configuration:

1

2

3

4

5

6

7

8

[desktop]

    [[ldap]]

    [[[users]]]

    user_filter=”objectClass=*”

    user_name_attr=uid

    # Whether or not to follow referrals

    ## follow_referrals=false

With the above configuration, the LDAP search filter will take on the form:

(&(objectClass=*)(uid=<user entered usename>))

6.    Case sensitivity

Hue can be configured to ignore the case of usernames as well as force usernames to lower case via the ‘ignore_username_case’ and ‘force_username_lowercase’ configurations. These two configurations are recommended to be used in conjunction with each other. This is useful when integrating with a directory service containing usernames in capital letters and unix usernames in lowercase letters (which is a Hadoop requirement). Here is an example of configuring them:

[desktop]

1

2

3

4

[desktop]

    [[ldap]]

    ignore_username_case=true

    force_username_lowercase=true

7.    LDAPS/StartTLS support

Secure communication with LDAP is provided via the SSL/TLS and StartTLS protocols. It allows Hue to validate the directory service it’s going to converse with. Practically speaking, if a Certificate Authority Certificate file is provided, Hue will communicate via LDAPS:

1

2

3

[desktop]

    [[ldap]]

    ldap_cert=/etc/hue/ca.crt

The StartTLS protocol can be used as well (step up to SSL/TLS):

1

2

3

[desktop]

    [[ldap]]

    use_start_tls=true

8.    Debugging

Get more information when querying LDAP and use the ldapsearch tool:

1

2

3

4

5

6

7

8

9

10

[desktop]

    [[ldap]]

    debug=true

    # Sets the debug level within the underlying LDAP C lib.

    ## debug_level=255

    # Possible values for trace_level are 0 for no logging, 1 for only logging the method calls with arguments,

    # 2 for logging the method calls with arguments and the complete results and 9 for also logging the traceback of method calls.

    trace_level=0

Note

Make sure to add to the Hue server environment:

1

2

DESKTOP_DEBUG=true

DEBUG=true

9.    Notes

  1. Setting “search_bind_authentication=true” in the hue.ini will tell Hue to perform an LDAP search using the bind credentials specified in the hue.ini (bind_dn, bind_password). Hue will then search using the base DN specified in “base_dn” for an entry with the attribute, defined in “user_name_attr”, with the value of the short name provided in the login page. The search filter, defined in “user_filter” will also be used to limit the search. Hue will search the entire subtree starting from the base DN.
  2. Setting  ”search_bind_authentication=false” in the hue.ini will tell Hue to perform a direct bind to LDAP using the credentials provided (not bind_dn and bind_password specified in the hue.ini). There are two effective modes here:
    1. nt_domain is specified in the hue.ini: This is used to connect to an Active Directory directory service. In this case, the UPN (User Principal Name) is used to perform a direct bind. Hue forms the UPN by concatenating the short name provided at login and the nt_domain like so: “<short name>@<nt_domain>”. The ‘ldap_username_pattern’ config is completely ignore.
    2. nt_domain is NOT specified in the hue.ini: This is used to connect to all other directory services (can even handle Active Directory, but nt_domain is the preferred way for AD). In this case, ‘ldap_username_pattern’ is used and it should take on the form “cn=<username>,dc=example,dc=com” where <username> will be replaced with whatever is provided at the login page.
  3. The UserAdmin app will always perform an LDAP search when manage LDAP entries and will then always use the “bind_dn”, “bind_password”, “base_dn”, etc. as defined in the hue.ini.
  4. At this point in time, there is no other bind semantics supported other than SIMPLE_AUTH. For instance, we do not yet support MD5-DIGEST, NEGOTIATE, etc. Though, we definitely want to hear from folks what they use so we can prioritize these things accordingly!

10.    Summary

The Hue team is working hard on improving security. Upcoming LDAP features include: Import nested LDAP groups and multidomain support for Active Directory. We hope this brief overview of LDAP in Hue will help you make your system more secure, more compliant with current security standards, and open up big data analysis to many more users!

Categories: big data, hue Tags: , ,

Group Synchronization Backends in Hue

August 29, 2015 Leave a comment

Hueis the turn-key solution for Apache Hadoop. It hides the complexity of the ecosystem including HDFS, Oozie, MapReduce, etc. Hue provides authentication and integrates with SAMLLDAP, and other systems. A new feature added in Hue is the ability to synchronize groups with a third party authority provider. In this blog post, we’ll be covering the basics of creating a Group Synchronization Backend.

The Design

The purpose of the group synchronization backends are to keep Hue’s internal group lists fresh. The design was separated into two functional parts:

  1. A way to synchronize on every request.
  2. A definition of how and what to synchronize.

Ekran Resmi 2015-08-29 14.00.11

Image 1: Request cycle in Hue with a synchronization backend.

The first function is a Django middleware that is called on every request. It is intended to be immutable, but configurable. The second function is a backend that can be customized. This gives developers the ability to choose how their groups and user-group memberships can be synchronized. The middleware can be configured to use a particular synchronization backend and will call it on every request. If no backend is configured, then the middleware is disabled.

Creating Your Own Backend

A synchronization backend can be created by extending a class and providing your own logic. Here is an example backend that comes packaged with Hue:

class LdapSynchronizationBackend(DesktopSynchronizationBackendBase): USER_CACHE_NAME = ‘ldap_use_group_sync_cache’ def sync(self, request): user = request.user if not user or not user.is_authenticated(): return if not User.objects.filter(username=user.username, userprofile__creation_method=str(UserProfile.CreationMethod.EXTERNAL)).exists(): LOG.warn(“User %s is not an Ldap user” % user.username) return # Cache should be cleared when user logs out. if self.USER_CACHE_NAME not in request.session: request.session[self.USER_CACHE_NAME] = import_ldap_users(user.username, sync_groups=True, import_by_dn=False) request.session.modified = True

In the above code snippet, the synchronization backend is defined by extending “DesktopSynchronizationBackendBase”. Then, the method “sync(self, request)” is overridden and provides the syncing logic.

Configuration

The synchronization middleware can be configured to use a backend by changing “desktop -> auth -> user_group_membership_synchronization_backend” to the full import path of your class. For example, setting this config to “desktop.auth.backend.LdapSynchronizationBackend” configures Hue to synchronize with the configured LDAP authority.

Design Intelligently

Backends in Hue are extremely powerful and can affect the performance of the server. So, they should be designed in such a fashion that they do not do any operations that block for long periods of time. Also, they should manage the following appropriately:

  1. Throttling requests to whatever service contains the group information.
  2. Ensuring users are authenticated.
  3. Caching if appropriate.

Summary

Hue is enterprise grade software ready to integrate with LDAP, SAML, etc. The newest feature, Group Synchronization, ensures corporate authority is fresh in Hue. It’s easy to configure and create backends and Hue comes with an LDAP backend.

Hue is undergoing heavy development and are welcoming external contributions! Have any suggestions? Feel free to tell us what you think through hue-user or @gethue.

Categories: big data, hue Tags: , ,

How-to: Manage Permissions in Hue

August 29, 2015 Leave a comment

Hue is a web interface for Apache Hadoop that makes common Hadoop tasks such as running MapReduce jobs, browsing HDFS, and creating Apache Oozie workflows, easier. (To learn more about the integration of Oozie and Hue, see this blog post.) In this post, we’re going to focus on how one of the fundamental components in Hue, Useradmin, has matured.

New User and Permission Features

User and permission management in Hue has changed drastically over the past year. Oozie workflows, Apache Hive queries, and MapReduce jobs can be shared with other users or kept private. Permissions exist at the app level. Access to particular apps can be restricted, as well as certain sections of the apps. For instance, access to the shell app can be restricted, as well as access to the Apache HBaseApache Pig, and Apache Flume shells themselves. Access privileges are defined for groups and users can be members of one or more groups.

Changes to Users, Groups, and Permissions

Hue now supports authentication against PAM, Spnego, and an LDAP server. Users and groups can be imported from LDAP and be treated like their non-external counterparts. The import is manual and is on a per user/group basis. Users can authenticate using different backends such as LDAP. Using the LDAP authentication backend will allow users to login using their LDAP password. This can be configured in /etc/hue/hue.ini by changing the ‘desktop.auth.backend’ setting to ‘desktop.auth.backend.LdapBackend’. The LDAP server to authenticate against can be configured through the settings under ‘desktop.ldap’.

Here’s an example:

A company would like to use the following LDAP users and groups in Hue:

  1. John Smith belonging to team A
  2. Helen Taylor belonging to team B

Assuming the following access requirements:

  1. Team A should be able to use Beeswax, but nothing else.
  2. Team B should only be able to see the Oozie dashboard with readonly permissions.

In Hue 1 the scenarios cannot be realistically addressed given the lack of groups.

In Hue 2 the scenarios can be addressed more appropriately. Users can be imported from LDAP by clicking “Add/Sync LDAP user” in Useradmin > Users:

Similarly, groups can be imported from LDAP by clicking “Add/Sync LDAP group” in Useradmin > Groups.

If a previously imported user’s information was updated recently, the information in Hue will need to be resynchronized. This can be achieved through the LDAP sync feature:

Part A of the example can be addressed by explicitly allowing access Beeswax for Team A. This is managed in the “Groups” tab of the Useradmin app:

The Team A group can be edited by clicking on its name, where access privileges for the group are selectable. Here, the “beeswax.access” permission would be selected and the others would be unselected:

Part B of the example can be handled by explicitly defining access for Team B. This can be accomplished by following the same steps in part A, except for Team B. Every permission would be unselected except “oozie.dashboard_jobs_access”:

By explicitly setting the app level permissions, the apps that these users will be able to see will change. For instance, Helen, who is a member of Team B, will only see the Oozie app available:

Summary

User management has been revamped, groups were added, and various backends are exposed. One such backend, LDAP, facilitates synchronization of users and groups. App-level permissions allow administrators to control who can access certain apps and what documents can be shared.

Hue is maturing quickly and many more features are on their way. Hue will soon have document-level permissions (workflows, queries, and so on), trash functionality, and improvements to the existing editors.

Have any suggestions? Feel free to tell us what you think through hue-user.

Categories: bilg data, hue Tags: , ,

High Availability of Hue

August 29, 2015 Leave a comment

Note: as of January 2015 in Hue master or CDH5.4, this post is deprecated by Automatic High Availability with Hue and Cloudera Manager.

Very few projects within the Hadoop umbrella have as much end user visibility as Hue. Thus, it is useful to add a degree of fault tolerance to deployments. This blog post describes how to achieve a higher level of availability (HA) by placing several Hue instances behind a load balancer.

Tutorial

This tutorial demonstrates how to setup high availability by:

  1. Installing Hue 2.3 on two nodes in a three-node RedHat 5 cluster.
  2. Managing all Hue instances via Cloudera Manager 4.7.
  3. Load balancing using HA Proxy 1.4. In reality, any load balancer with sticky sessions should work.

Here is a video summary of the new features:

Installing Hue

Hue should be installed on two of the three nodes. To have Cloudera Manager automatically install Hue, follow the “Parcel Install via Cloudera Manager” section. To install manually, follow the “Package Install” section.

Parcel Install via Cloudera Manager

For more information on Parcels, see Managing Parcels.

  1. From Cloudera Manager, click on “Hosts” in the menu. Then, go to the “Parcels” section.
  2. Find the latest CDH parcel, click “Download”.
  3. Once the parcel has finished downloading, click “Distribute”.
  4. Once the parcel has finished distributing, click “Activate”.

Package Install

  1. Download the yum repository RPM.
  2. Install the yum repository using “sudo yum —nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm”. For more information, see Installing CDH4.
  3. Install Hue on each node using the command “sudo yum install hue” via the command line interface. For more information on installing Hue, see CDH documentation.

Managing Hue through Cloudera Manager

Cloudera Manager provides management of the Hue servers on each node. Add two Hue services using the directions below. For more information on managing services, see the Cloudera Manager documentation.

  1. Go to “Services -> All Services” in the menu.
  2. Click “Actions -> Add a Service”.
  3. Select “Hue” and follow the steps on the screen. NOTE: For each Hue service we choose a unique host.
  4. Ensure that the “Jobsub Examples and Templates Directory” configuration points to different directories in HDFS for each Hue service. It can be changed by going to Services -> <hue service>. In the menu, go to Configuration -> View and Edit. Then, click on “Hue Server”. “Jobsub Examples and Templates Directory” should be at the bottom of the page.

Ekran Resmi 2015-08-29 13.52.04

Image 1: Cloudera Manager handling two Hue services.

HA Proxy Installation/Configuration

  1. Download and unzip the binary distribution of HA Proxy 1.4 on the node that doesn’t have Hue installed.
  2. Add the following HA Proxy configuration to /tmp/hahue.conf:

global daemon nbproc 1 maxconn 100000 log 127.0.0.1 local6 debug defaults option http-server-close mode http timeout http-request 5s timeout connect 5s timeout server 10s timeout client 10s listen Hue 0.0.0.0:80 log global mode http stats enable balance source server hue1 servera.cloudera.com:8888 cookie ServerA check inter 2000 fall 3 server hue2 serverb.cloudera.com:8888 cookie ServerB check inter 2000 fall 3

  1. Start HA Proxy:

haproxy -f /tmp/hahue.conf

The key configuration options are balance and server in the listen section. When the balance parameter is set to source, a client is guaranteed to communicate with the same server every time it makes a request. If the server the client is communicating with goes down, the request will automatically be sent to another active server. This is necessary because Hue stores session information in process memory. The server parameters define which servers will be used for load balancing and takes on the form:

server [:port] [settings …]

In the configuration above, the server “hue1” is available at “servera.cloudera.com:8888” and “hue2” is available at “serverb.cloudera.com:8888”. Both servers have health checks every two seconds and are declared down after three failed health checks. In this example, HAProxy is configured to bind to “0.0.0.0:80”. Thus, Hue should now be available at “http://serverc.cloudera.com”.

Conclusion

Hue can be load balanced easily as long as the server a client is directed to is constant (i.e.: sticky sessions). It can improve performance, but the primary goal is high availability. Also, multiple Hue instances can be easily managed through Cloudera Manager. For true High Availability, Hue needs to be configured to use HA MySQLPostGreSQL, or Oracle.

Coming up, there will be a blog post on JobTracker HA with Hue. Have any suggestions? Feel free to tell us what you think throughhue-user.

Categories: big data, hue Tags: ,

Exploring Apache Spark on the New BigDataLite 3.0 VM “http://www.rittmanmead.com/2014/05/exploring-apache-spark-on-the-new-bigdatalite-3-0-vm/”

August 18, 2015 Leave a comment

The latest version of Oracle’s BigDataLite VirtualBox VM went up on OTN last week, and amongst other things it includes the latest CDH5.0 version of Cloudera Distribution including Hadoop, as featured on Oracle Big Data Appliance. This new version comes with an update to MapReduce, moving it to MapReduce 2.0 (with 1.0 still there for backwards-compatibility) and with YARN as the replacement for the Hadoop JobTracker. If you’re developing on CDH or the BigDataLite VM you shouldn’t notice any differences with the move to YARN, but it’s a more forward-looking, modular way of tracking and allocating resources on these types of compute clusters that also opens them up to processing models other than MapReduce.

The other new Hadoop feature that you’ll notice with BigDataLite VM and CDH5 is an updated version of Hue, the web-based development environment you use for creating Hive queries, uploading files and so on. As the version of Hue shipped is now Hue 3.5, there’s proper support for quoted CSV files in the Hue / Hive uploader (hooray) along with support for stripping the first, title, line, and an updated Pig editor that prompts you for command syntax (like the Hortonworks Pig editor).

NewImage

This new version of BigDataLite also seems to have had Cloudera Manager removed (or at least, it’s not available as usual at http://bigdatalite:7180), with instead a utility provided on the desktop that allows you to stop and start the various VM services, including the Oracle Database and Oracle NoSQL database that also come with the VM. Strictly-speaking its actually easier to use than Cloudera Manager, but it’s a shame it’s gone as there’s lots of monitoring and configuration tools in the product that I’ve found useful in the past.

NewImage

CDH5 also comes with Apache Spark, a cluster processing framework that’s being positioned as the long-term replacement for MapReduce. Spark is technically a programming model that allows developers to create scripts, or programs, that bring together operators such as filters, aggregators, joiners and group-bys using languages such as Scala and Python, but crucially this can all happen in-memory – making Spark potentially much faster than MapReduce for doing both batch, and ad-hoc analysis.

This article on the Cloudera blog goes into more detail on what Apache Spark is and how it improves over MapReduce, and this article takes a look at the Spark architecture and how it’s approach avoids the multi-stage execution model that MapReduce uses (something you’ll see if you ever do a join in Pig or Hive). But what does some basic Spark code look like, using the default Scala language most people associate Spark with? Let’s take a look at some sample code, using the same Rittman Mead Webserver log files I used in the previous Pig and Hive/Impala articles.

You can start up Spark in interactive mode, like you do with Pig and Grunt, by opening a Terminal session and typing in “spark-shell”:

Spark has the concept of RDDs, “Resilient Distributed Datasets” that can be thought of as similar to relations in Pig, and tables in Hive, but which crucially can be cached in RAM for improved performance when you need to access their dataset repeatedly. Like Pig, Spark features “lazy execution”, only processing the various Spark commands when you actually need to (for example, when outputting the results of a data-flow”, so let’s run two more commands to load in one of the log files on HDFS, and then count the log file lines within it.

So the file contains 154563 records. Running the logfile.count() command again, though, brings back the count immediately as the RDD has been cached; we can explicitly cache RDDs directly in these commands if we like, by using the “.cache” method:

So let’s try some filtering, retrieving just those log entries where the user is requesting our BI Apps 11g homepage (“/biapps11g/“):

Or I can create a dataset containing just those records that have a 404 error code:

Spark, using Scala as the language, has routines for filtering, joining, splitting and otherwise transforming data, but something that’s quite common in Spark is to create Java JAR files, typically from compiled Scala code, to encapsulate certain common data transformations, such as this Apache CombinedFormat Log File parser available on Github from @alvinalexander, author of the Scala Cookbook. Once you’ve compiled this into a JAR file and added it to your SPARK_CLASSPATH (see his blog post for full details, and from where the Spark examples below were taken from), you can start to work with the individual log file elements, like we did in the Hive and Pig examples where we parsed the log file using Regexes.

Then I can access the HTTP Status Code using its own property, like this:

Then we can use a similar method to retrieve all of the “request” entries in the log file where the user got a 404 error, starting off by defining two methods that will help with the request parsing – note the use of the :paste command which allows you to block-paste a set of commands into the scala-shell:

Now we can run the code to output the URIs that have been generating 404 errors:

So Spark is commonly-considered the successor to MapReduce, and you can start playing around with it on the new BigDataLite VM. Unless you’re a Java (or Scala, or Python) programmer, Spark isn’t quite as easy as Pig or Hive to get into, but the potential benefits over MapReduce are impressive and it’d be worth taking a look. Hopefully we’ll have more on Spark on the blog over the next few months, as we get to grips with it properly.

Categories: Uncategorized
Follow

Get every new post delivered to your Inbox.

Join 534 other followers