Hadoop

HortonWorks Sandbox
To get started, install the HortonWorks HDP Sandbox and run Hadoop on a local VM (e.g. with VirtualBox). Be sure to lower the VM's memory if your host machine is not as beefy as a full-blown server.
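With VirtualBox the memory can be lowered from the command line; a sketch, assuming the VM is named "Hortonworks Sandbox" (check VBoxManage list vms for the actual name, and power the VM off first):

VBoxManage list vms
VBoxManage modifyvm "Hortonworks Sandbox" --memory 4096   # give it 4 GB instead of the default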

Your own cluster
With Amazon or any other cloud provider it is possible to run your own Hadoop cluster.

Hathi
Hathi is the Hadoop cluster at SURFsara; it is the one I use most.

See https://github.com/sara-nl/hathi-client for the client tools, and https://userinfo.surfsara.nl/systems/hadoop/usage for details on how to use it.

Hathi with Kerberos
Kerberos provides strong authentication, and is what the Hathi Hadoop cluster uses to authenticate users.

I had to take the following three steps before I could properly authenticate.

Set Password
Error:

Resolution: I had to reset my password at https://portal.surfsara.nl before I could run kinit.
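After resetting the password, obtaining a ticket is a single command; a minimal sketch, where the username is a placeholder and the realm is an assumption (take the real one from the hathi-client krb5.conf):

kinit myusername@CUA.SURFSARA.NL   # realm name is an assumption; use the realm from krb5.conf
klist                              # verify that a ticket was granted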

Install Java Cryptography Extension
Error:

Resolution: I had to download and install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files. This needs to be done like this:
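A sketch of the usual procedure for Java 8, assuming the JCE zip has been downloaded from Oracle and JAVA_HOME points at the JDK in use (file and directory names are from the Java 8 download and may differ for other versions):

unzip jce_policy-8.zip
# replace the restricted policy files with the unlimited-strength ones
sudo cp UnlimitedJCEPolicyJDK8/local_policy.jar      $JAVA_HOME/jre/lib/security/
sudo cp UnlimitedJCEPolicyJDK8/US_export_policy.jar  $JAVA_HOME/jre/lib/security/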

Configure Kerberos
In Firefox, go to about:config, look up the key network.negotiate-auth.trusted-uris, and add the hostname of the cluster (e.g. .hathi.surfsara.nl). If there is already a value, separate the different hostnames with a comma.

Error: When browsing to http://head05.hathi.surfsara.nl:8088/cluster, my browser (Firefox or Safari) hangs for a minute, after which it returns the error:

The reason is that Kerberos is not configured.

Resolution: I had to copy the content of https://github.com/sara-nl/hathi-client/blob/master/conf/krb5.conf to /etc/krb5.conf (or the equivalent Kerberos configuration location on your OS).
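One way to do this is to fetch the file directly from the raw GitHub URL; a sketch, noting that the destination may differ per OS:

sudo curl -o /etc/krb5.conf https://raw.githubusercontent.com/sara-nl/hathi-client/master/conf/krb5.conf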

Note: When I set up a VPN connection, Kerberos did work: it turns out that Kerberos can discover the key distribution center (KDC) through DNS (SRV records).
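This DNS-based discovery corresponds to a krb5.conf option; a minimal sketch of the relevant setting (realm definitions omitted):

[libdefaults]
    dns_lookup_kdc = true   # discover KDCs via DNS SRV records instead of listing them explicitly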

Hadoop File System (HDFS)
(the hdfs dfs command is similar to hadoop dfs, but the latter seems deprecated)
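Some everyday HDFS commands, with paths as placeholders:

hdfs dfs -ls /user/myusername                     # list a directory
hdfs dfs -put localfile.txt /user/myusername/     # copy from local disk to HDFS
hdfs dfs -cat /user/myusername/part-r-00000       # print a file
hdfs dfs -get /user/myusername/output ./output    # copy from HDFS to local disk
hdfs dfs -rm -r /user/myusername/output           # remove a directory recursively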

Submitting jobs
Old way:
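Presumably this refers to hadoop jar; the current equivalent is yarn jar. A sketch, with the jar name, main class, and paths as placeholders:

yarn jar topwords.jar org.example.TopWords /user/myusername/input /user/myusername/output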

Examining the Log of a Job
Retrieve the stdout, stderr and syslog with yarn:
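A sketch, with the application id as a placeholder (it is printed when the job is submitted, and also shown in the web UI):

yarn application -list                                    # find the application id
yarn logs -applicationId application_1234567890123_0001  # dump stdout, stderr and syslog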

Or examine the logs online: go to http://headnode.hadoop.example.org:8088/cluster, browse to the job, and open its log. For some reason, this log usually contains less information than what I can retrieve with yarn.

Setting input & output types
The general pipeline is:

(InputKey, InputValue)
  ⬇︎ Mapper
(IntermediateKey, IntermediateValue)
  ⬇︎ Reducer
(OutputKey, OutputValue)

In Job definition:

Set Output type of reduce (and mapper)

job.setOutputKeyClass(OutputKey.class);
job.setOutputValueClass(OutputValue.class);

Set the output type of the mapper (and thus the input of the reducer). Useful if the output types of map and reduce differ.

job.setMapOutputKeyClass(IntermediateKey.class);
job.setMapOutputValueClass(IntermediateValue.class);

The input and output formats define how text is read from and written to files.

job.setInputFormatClass(TextInputFormat.class);          // default
job.setInputFormatClass(KeyValueTextInputFormat.class);  // alternative
job.setOutputFormatClass(TextOutputFormat.class);        // default

Also define the types in the Mapper and Reducer classes and their methods:

public static class TopTitlesMap extends Mapper<InputKey, InputValue, IntermediateKey, IntermediateValue> {
    @Override
    public void map(InputKey key, InputValue value, Context context)
            throws IOException, InterruptedException {
        // ...
    }
}

public static class TopTitlesReduce extends Reducer<IntermediateKey, IntermediateValue, OutputKey, OutputValue> {
    @Override
    public void reduce(IntermediateKey key, Iterable<IntermediateValue> values, Context context)
            throws IOException, InterruptedException {
        // ...
    }
}
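Putting it together, a sketch of a complete job definition. The class and path names are assumptions, and concrete Writable types (Text, IntWritable) stand in for the placeholder key/value classes above; the mapper and reducer would use the same types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TopTitles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Top Titles");
        job.setJarByClass(TopTitles.class);

        // mapper output (= reducer input) types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // reducer (= job) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(TopTitlesMap.class);
        job.setReducerClass(TopTitlesReduce.class);

        // how records are read from and written to files
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}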