To get started, install the HortonWorks HDP Sandbox and run Hadoop ona local VM (e.g. with VirtualBox.). Be sure to lower the memory if your host OS is not as juicy as a full-blown server.
Your own cluster
With Amazon or any other cloud provider it is possible to run your own Hadoop cluster.
Hathi is the Hadoop cluster at SURFsara, which is the one I use most.
Hathi with Kerberos
Kerberos provides strong authentication, and is used at the Hathi Hadoop cluster for authentication.
I had to take the following tree steps before I could properly authenticate.
kinit: Password incorrect while getting initial credentials
Resolution: I had to reset my password at https://portal.surfsara.nl, before I could run kinit.
Install Java Cryptography Extension
WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over head10.hathi.surfsara.nl/188.8.131.52:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "surfsara-freek.local/192.168.56.1"; destination host is: "head10.hathi.surfsara.nl":8020;
Resolution: I had to download and install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files. This needs to be done like this:
cp ~/Downloads/UnlimitedJCEPolicy/*.jar /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/
In Firefox, go to about:config, and lookup the key
network.negotiate-auth.trusted-uris and add the value
hathi.surfsara.nl. If there is already a value, separate the different hostnames with a comma.
Error: When browsing to http://head05.hathi.surfsara.nl:8088/cluster, my browser (Firefox or Safari) hangs for a minute, after which it returns the error:
HTTP ERROR 401
Problem accessing /node. Reason: Authentication required
The reason is that Kerberos is not configured.
Resolution: I had to copy the content of https://github.com/sara-nl/hathi-client/blob/master/conf/krb5.conf to
Hadoop File System (HDFS)
hdfs dfs -ls /
(this is similar to
hadoop fs -ls /, but that seem deprecated)
yarn jar TitleCount.jar TitleCount "-D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles" /mp2/TitleCount-output
hadoop jar TitleCount.jar TitleCount "-D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles" /mp2/TitleCount-output