Hadoop

From Exterior Memory
Revision as of 19:36, 11 September 2015 by MacFreek (Talk | contribs)


Hadoop Clusters

HortonWorks Sandbox

To get started, install the HortonWorks HDP Sandbox and run Hadoop on a local VM (e.g. with VirtualBox). Be sure to lower the VM's memory allocation if your host machine has less RAM than a full-blown server.
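With VirtualBox, the memory allocation can be lowered from the command line before booting the VM (the VM name below is an assumption; check yours with `VBoxManage list vms`):

```shell
# Show registered VMs to find the exact sandbox name
VBoxManage list vms

# Lower the sandbox's RAM to 4 GB (name is illustrative)
VBoxManage modifyvm "Hortonworks Sandbox" --memory 4096
```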

Your own cluster

With Amazon or any other cloud provider, it is possible to run your own Hadoop cluster.

Hathi

Hathi is the Hadoop cluster at SURFsara, which is the one I use most.

See https://github.com/sara-nl/hathi-client for the client tools, and https://userinfo.surfsara.nl/systems/hadoop/usage for details on how to use it.

Hathi with Kerberos

Kerberos provides strong authentication, and is what the Hathi Hadoop cluster uses.

I had to take the following three steps before I could properly authenticate.

Set Password

Error: kinit: Password incorrect while getting initial credentials

Resolution: I had to reset my password at https://portal.surfsara.nl, before I could run kinit.
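After resetting the password, authentication looks something like this (replace `username`; the realm `CUA.SURFSARA.NL` is an assumption based on the hathi-client krb5.conf):

```shell
# Obtain a ticket-granting ticket (realm is an assumption, check krb5.conf)
kinit username@CUA.SURFSARA.NL

# Verify that a valid ticket was issued
klist
```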

Install Java Cryptography Extension

Error:

WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over head10.hathi.surfsara.nl/145.100.41.120:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "surfsara-freek.local/192.168.56.1"; destination host is: "head10.hathi.surfsara.nl":8020;

Resolution: I had to download and install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files, copying the policy files into the JRE's security directory:

cp ~/Downloads/UnlimitedJCEPolicy/*.jar /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/
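To check whether the policy files took effect, you can query the maximum allowed AES key length (this uses `jrunscript`, which ships with JDK 8; a sketch, not part of the original instructions):

```shell
# Prints 2147483647 (Integer.MAX_VALUE) with the unlimited policy installed,
# or 128 with the default restricted policy.
jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'
```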

Configure Kerberos

In Firefox, go to about:config, look up the key network.negotiate-auth.trusted-uris, and add the value hathi.surfsara.nl. If there is already a value, separate the different hostnames with a comma.

Error: When browsing to http://head05.hathi.surfsara.nl:8088/cluster, my browser (Firefox or Safari) hangs for a minute, after which it returns the error:

HTTP ERROR 401
Problem accessing /node. Reason: Authentication required

The reason is that Kerberos is not configured.

Resolution: I had to copy the content of https://github.com/sara-nl/hathi-client/blob/master/conf/krb5.conf to ~/Library/Preferences/edu.mit.Kerberos or /etc/krb5.conf.
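The same can be done from the command line (the raw-file URL below follows GitHub's usual pattern for the file linked above; adjust the destination path for your OS):

```shell
# Download the Hathi krb5.conf and install it system-wide.
# On macOS, ~/Library/Preferences/edu.mit.Kerberos works as well.
sudo curl -o /etc/krb5.conf \
    https://raw.githubusercontent.com/sara-nl/hathi-client/master/conf/krb5.conf
```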


Hadoop Basics

Hadoop File System (HDFS)

hdfs dfs -ls /

(this is similar to hadoop fs -ls /, but that seems deprecated)
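A few other common HDFS operations (the paths are illustrative):

```shell
hdfs dfs -mkdir -p /user/myname/input             # create a directory
hdfs dfs -put localfile.txt /user/myname/input/   # upload a local file
hdfs dfs -cat /user/myname/input/localfile.txt    # print a file's content
hdfs dfs -rm -r /user/myname/input                # remove recursively
```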

Submitting jobs

yarn jar TitleCount.jar TitleCount -D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles /mp2/TitleCount-output

Old way:

hadoop jar TitleCount.jar TitleCount -D stopwords=/mp2/misc/stopwords.txt -D delimiters=/mp2/misc/delimiters.txt /mp2/titles /mp2/TitleCount-output
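Once submitted, the job can be monitored and its output inspected (the application id below is a made-up example; `part-r-00000` is the usual name of the first reducer's output file):

```shell
yarn application -list                                     # list running applications
yarn logs -applicationId application_1441234567890_0001    # fetch logs (example id)
hdfs dfs -cat /mp2/TitleCount-output/part-r-00000          # inspect the result
```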