Hadoop
Hadoop Clusters
HortonWorks Sandbox
To get started, install the HortonWorks HDP Sandbox and run Hadoop on a local VM (e.g. with VirtualBox). Be sure to lower the VM's memory if your host machine is not as beefy as a full-blown server.
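The memory allocation can be lowered from the command line before booting the VM. A minimal sketch, assuming VirtualBox and that the imported appliance is called "Hortonworks Sandbox" (check the actual name with the first command):

VBoxManage list vms                                       # find the exact VM name
VBoxManage modifyvm "Hortonworks Sandbox" --memory 4096   # lower the RAM to 4 GB (VM must be powered off)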
Your own cluster
With Amazon or any other cloud provider it is possible to run your own Hadoop cluster.
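As a sketch, a small cluster can be spun up with Amazon EMR through the AWS CLI; the release label, instance type, instance count and key pair below are placeholders to adapt:

# run `aws emr create-default-roles` once beforehand if the default IAM roles do not exist yet
aws emr create-cluster \
  --name "my-hadoop-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair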
Hathi
Hathi is the Hadoop cluster at SURFsara, which is the one I use most.
See https://github.com/sara-nl/hathi-client for the client tools, and https://userinfo.surfsara.nl/systems/hadoop/usage for details on how to use it.
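Getting set up roughly comes down to the following (a sketch; the exact setup steps are in the repository's README, and <username> is a placeholder for the SURFsara account name):

git clone https://github.com/sara-nl/hathi-client.git
cd hathi-client
# follow the README to configure the Hadoop and Kerberos clients, then:
kinit <username>                  # obtain a Kerberos ticket
hdfs dfs -ls /user/<username>     # list your home directory on Hathi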
Hathi with Kerberos
Kerberos provides strong authentication, and is used at the Hathi Hadoop cluster for authentication.
I had to take the following three steps before I could properly authenticate.
Set Password
Error: kinit: Password incorrect while getting initial credentials
Resolution: I had to reset my password at https://portal.surfsara.nl before I could run kinit.
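After resetting the password, getting and verifying a ticket looks like this (<username> is a placeholder):

kinit <username>    # prompts for the newly set password
klist               # confirms that a valid ticket-granting ticket is present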
Install Java Cryptography Extension
Error:
- WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
- INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over head10.hathi.surfsara.nl/145.100.41.120:8020 after 1 fail over attempts. Trying to fail over immediately.
- java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "surfsara-freek.local/192.168.56.1"; destination host is: "head10.hathi.surfsara.nl":8020;
Resolution: I had to download the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files and install them by copying the policy jars into the JRE's security directory:
- cp ~/Downloads/UnlimitedJCEPolicy/*.jar /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/
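A quick way to verify that the unlimited policy is in effect is to query the maximum allowed AES key length through the JDK's jrunscript (this assumes a JDK 8 jrunscript on the PATH; it prints 2147483647 with the unlimited policy and 128 without):

jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'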
Configure Kerberos
In Firefox, go to about:config, look up the key network.negotiate-auth.trusted-uris, and add the value hathi.surfsara.nl. If there is already a value, separate the different hostnames with a comma.
Error: When browsing to http://head05.hathi.surfsara.nl:8088/cluster, my browser (Firefox or Safari) hangs for a minute, after which it returns the error:
- HTTP ERROR 401
- Problem accessing /node. Reason: Authentication required
The reason is that Kerberos is not configured.
Resolution: I had to copy the content of https://github.com/sara-nl/hathi-client/blob/master/conf/krb5.conf to ~/Library/Preferences/edu.mit.Kerberos or /etc/krb5.conf.
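On the command line that boils down to fetching the raw file (the raw.githubusercontent.com URL below is the raw counterpart of the blob link above):

# macOS per-user location:
curl -o ~/Library/Preferences/edu.mit.Kerberos https://raw.githubusercontent.com/sara-nl/hathi-client/master/conf/krb5.conf
# or the system-wide location (needs root):
sudo curl -o /etc/krb5.conf https://raw.githubusercontent.com/sara-nl/hathi-client/master/conf/krb5.conf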
Hadoop Basics
Hadoop File System (HDFS)
hdfs dfs -ls /
(this is similar to hadoop fs -ls /, but that form seems to be deprecated)
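A few more everyday HDFS commands for reference (the paths and <username> are placeholders):

hdfs dfs -mkdir -p /user/<username>/input              # create a directory
hdfs dfs -put localfile.txt /user/<username>/input/    # upload a local file
hdfs dfs -cat /user/<username>/input/localfile.txt     # print its contents
hdfs dfs -get /user/<username>/input/localfile.txt .   # download it again
hdfs dfs -rm -r /user/<username>/input                 # remove the directory recursively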
Submitting jobs
yarn jar TitleCount.jar TitleCount "-D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles" /mp2/TitleCount-output
Old way:
hadoop jar TitleCount.jar TitleCount "-D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles" /mp2/TitleCount-output
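Once submitted, the job and its output can be inspected as follows (the application id is a placeholder; the output path matches the command above):

yarn application -list                        # list running applications and their ids
yarn logs -applicationId application_<id>     # fetch the aggregated logs after completion
hdfs dfs -cat /mp2/TitleCount-output/part-*   # look at the reducer output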