Hadoop


Hadoop Clusters

HortonWorks Sandbox

To get started, install the HortonWorks HDP Sandbox and run Hadoop on a local VM (e.g. with VirtualBox). Be sure to lower the VM's memory if your host machine is not as beefy as a full-blown server.
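
For example, with VirtualBox the memory can be lowered from the command line before starting the VM (the VM name and the 4096 MB are just examples; use the name the sandbox import created):

VBoxManage modifyvm "Hortonworks Sandbox" --memory 4096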

Your own cluster

With Amazon or any other cloud provider it is possible to run your own Hadoop cluster.

Hathi

Hathi is the Hadoop cluster at SURFsara, which is the one I use most.

See https://github.com/sara-nl/hathi-client for the client tools, and https://userinfo.surfsara.nl/systems/hadoop/usage for details on how to use it.

Hathi with Kerberos

Kerberos provides strong authentication; the Hathi Hadoop cluster uses it to authenticate users.

I had to take the following three steps before I could properly authenticate.

Set Password

Error: kinit: Password incorrect while getting initial credentials

Resolution: I had to reset my password at https://portal.surfsara.nl before I could run kinit.
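
After resetting the password, obtaining and verifying a ticket looks like this (the username is an example; the realm is the one from the hathi-client krb5.conf, so check that file if yours differs):

kinit myusername@CUA.SURFSARA.NL
klist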

Install Java Cryptography Extension

Error:

WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over head10.hathi.surfsara.nl/145.100.41.120:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "surfsara-freek.local/192.168.56.1"; destination host is: "head10.hathi.surfsara.nl":8020;

Resolution: I had to download and install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files, and copy the policy JARs into the JRE's security directory:

cp ~/Downloads/UnlimitedJCEPolicy/*.jar /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/
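
To check whether the policy files are picked up, a small sketch like the one below (the class name is mine) prints the maximum allowed AES key length: 2147483647 with the unlimited policy installed, 128 with the default restricted policy.

   import javax.crypto.Cipher;
   
   public class CheckJCE {
       public static void main(String[] args) throws Exception {
           // Prints Integer.MAX_VALUE once the unlimited-strength policy files are installed.
           System.out.println(Cipher.getMaxAllowedKeyLength("AES"));
       }
   }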

Configure Kerberos

In Firefox, go to about:config, look up the key network.negotiate-auth.trusted-uris, and add the value hathi.surfsara.nl. If there is already a value, separate the different hostnames with a comma.

Error: When browsing to http://head05.hathi.surfsara.nl:8088/cluster, my browser (Firefox or Safari) hangs for a minute, after which it returns the error:

HTTP ERROR 401
Problem accessing /node. Reason: Authentication required

The reason is that Kerberos is not configured.

Resolution: I had to copy the content of https://github.com/sara-nl/hathi-client/blob/master/conf/krb5.conf to ~/Library/Preferences/edu.mit.Kerberos (on macOS) or /etc/krb5.conf.

Note: When I set up a VPN connection, Kerberos did work. It turns out that Kerberos can discover the key distribution center (KDC) via DNS.
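
That DNS-based discovery is controlled in krb5.conf by the dns_lookup_kdc option in the [libdefaults] section. A minimal sketch (the realm name is the one from the hathi-client configuration and may differ):

   [libdefaults]
       default_realm = CUA.SURFSARA.NL
       dns_lookup_kdc = true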

Hadoop Basics

Hadoop File System (HDFS)

hdfs dfs -ls /

(this is similar to hadoop fs -ls /, but that seems deprecated)
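
Other day-to-day HDFS operations follow the same pattern (the paths are examples):

hdfs dfs -mkdir -p /user/myname/input
hdfs dfs -put localfile.txt /user/myname/input/
hdfs dfs -cat /user/myname/input/localfile.txt
hdfs dfs -get /user/myname/input/localfile.txt copy.txt
hdfs dfs -rm -r /user/myname/input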

Submitting jobs

yarn jar TitleCount.jar TitleCount "-D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles" /mp2/TitleCount-output

Old way:

hadoop jar TitleCount.jar TitleCount "-D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles" /mp2/TitleCount-output

Examining the Log of a Job

Retrieve the stdout, stderr and syslog with yarn: yarn logs -applicationId application_1383601692319_0008
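
If the application id is not at hand, it can be listed first (the state filter is optional):

yarn application -list -appStates FINISHED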

Or examine the logs online: go to http://headnode.hadoop.example.org:8088/cluster, browse to the job, and open its log. For some reason, this usually contains less information than what I can retrieve with yarn.

MapReduce Programming

Setting input & output types

The general pipeline is:

   (InputKey, InputValue)
     ⬇︎
   Mapper
     ⬇︎
   (IntermediateKey, IntermediateValue)
     ⬇︎
   Reducer
     ⬇︎
   (OutputKey, OutputValue)

In the job definition:

Set the output type of the reducer (and, unless overridden below, of the mapper as well):

   job.setOutputKeyClass(OutputKey.class);
   job.setOutputValueClass(OutputValue.class);

Set the output type of the mapper (and thus the input of the reducer). Useful if the output types of map and reduce differ.

   job.setMapOutputKeyClass(IntermediateKey.class);
   job.setMapOutputValueClass(IntermediateValue.class);

The input and output formats define how records are read from and written to files.

   job.setInputFormatClass(TextInputFormat.class); // Default: <Object, Text>
   job.setInputFormatClass(KeyValueTextInputFormat.class); // Alternate: <Text, Text>
   job.setOutputFormatClass(TextOutputFormat.class); // Default

Also declare the types in the Mapper and Reducer classes and their methods:

   public static class TopTitlesMap extends Mapper<InputKey, InputValue, IntermediateKey, IntermediateValue> {
       public void map(InputKey key, InputValue value, Context context) throws IOException, InterruptedException {
       }
   }
   
   public static class TopTitlesReduce extends Reducer<IntermediateKey, IntermediateValue, OutputKey, OutputValue> {
       public void reduce(IntermediateKey key, Iterable<IntermediateValue> values, Context context) throws IOException, InterruptedException {
       }
   }
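
Putting the pieces together, a minimal driver could look like the sketch below. The names (TitleCount, TopTitlesMap, TopTitlesReduce) match the examples above; Text and IntWritable are assumed key/value types, and the handling of the -D options from the submit command is omitted.

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
   import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
   
   public class TitleCount {
       // The TopTitlesMap and TopTitlesReduce classes defined above go here.
   
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           Job job = Job.getInstance(conf, "Title Count");
           job.setJarByClass(TitleCount.class);
   
           job.setMapperClass(TopTitlesMap.class);
           job.setReducerClass(TopTitlesReduce.class);
   
           // Output types of the mapper (the intermediate types).
           job.setMapOutputKeyClass(Text.class);
           job.setMapOutputValueClass(IntWritable.class);
   
           // Output types of the reducer (and of the job as a whole).
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
   
           // How records are read from and written to HDFS.
           job.setInputFormatClass(TextInputFormat.class);
           job.setOutputFormatClass(TextOutputFormat.class);
   
           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));
   
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }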