Radoop on Amazon EMR fails to initialize

User: "Andrew"
New Altair Community Member
Updated by Jocelyn

I'm very close to being able to get Radoop working with an Amazon EMR cluster. My setup involves RapidMiner Studio and Radoop on a Windows laptop, which has full, unfettered firewall access to the EMR machines. I am not using SOCKS (although I started with this). I am using the absolute latest Spark, Hive and Hadoop components that Amazon makes available.

 

The full connection test fails at the point where components are being uploaded to the /tmp/radoop/_shared/db_default/ HDFS location. I can see that the data nodes are being contacted on port 50010, and it looks like this fails from my laptop because the IP addresses are not known. I have tried the dfs.client.use.datanode.hostname true/false workaround and I can see that it changes the name it attempts to use: in one setting the node is <name>/<ipaddress>:50010 (which is odd), while in the other it is <ipaddress>:50010 (which is believable but doesn't resolve).
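
For reference, this is roughly how I set that property (with one of the two values I tried) as an advanced Hadoop parameter on the Radoop connection:

dfs.client.use.datanode.hostname  true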

 

I don't have the luxury of installing RapidMiner components on the EMR cluster, so my question is: what is the best way to get the data nodes exposed to the PC running RapidMiner Studio and Radoop?

    User: "Andrew"
    New Altair Community Member
    OP
    Accepted Answer

    Hello Peter,

    I'm happy to say the Spark suggestion worked and now I can get Radoop connections working completely.

     

    As promised here is the list of things to do to get to this happy place.

     

    Create an EMR cluster and use the advanced options to select Hadoop, Pig, Spark, Hive and Mahout.

     

    Log on to the master node and determine the internal IP address of the eth0 interface using the command line. 

    ifconfig
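
    If you only want the relevant line from the output, something like this works (assuming the interface really is eth0, as it was on my cluster):

    ifconfig eth0 | grep 'inet '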

     

    While logged in, there are some configuration steps needed to make the environment work. These are described in the Radoop documentation here. I observed that Java did not need any special configuration; EMR is up to date. The commands to create various staging locations in HDFS are required. I've repeated them below:

    hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history
    hadoop fs -chmod -R 777 /tmp/hadoop-yarn
    hadoop fs -mkdir /user
    hadoop fs -chmod 777 /user

    An earlier version of Spark needs to be installed, since the connection uses Spark 1.6 rather than the newer Spark that ships with the latest EMR. Here are the steps.

     

    wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
    cd /home/hadoop
    tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
    hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/
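
    As an optional sanity check, you can confirm the assembly jar landed where Radoop will look for it:

    hadoop fs -ls /tmp/spark-assembly-1.6.3-hadoop2.6.0.jar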

     

    Continue to follow the instructions to set up the network connection. Use the IP address found above as the NameNode address, Resource Manager address and JobHistory Server address. Don't be tempted to use any other name or IP address, since it will not work.
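
    To make that concrete, all three fields get the same value (shown here as a placeholder, in the same <> convention as the commands below):

    NameNode address:           <eth0 internal ip>
    Resource Manager address:   <eth0 internal ip>
    JobHistory Server address:  <eth0 internal ip>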

     

    Set the Hive Server address to localhost.

     

    Set the Hive port to 1235. (This is the local end of the SSH tunnel created below, which forwards to HiveServer2 on port 10000 on the master.)

     

    Set the Spark version to Spark 1.6 and set the assembly jar location to

    hdfs:///tmp/spark-assembly-1.6.3-hadoop2.6.0.jar

     

    Set the advanced Hadoop parameters as follows

    dfs.client.use.legacy.blockreader  true
    hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.SocksSocketFactory
    hadoop.socks.server localhost:1234

    Now create the SOCKS connection. On Linux the command is like this.

    ssh -i <yourkey>.pem -N -D 1234 -L localhost:1235:<ifconfig ip address>:10000  hadoop@<nameofmaster>

    In the command above, the values between <> need to be replaced with information from your environment.

     

    On Windows, use PuTTY to create the SOCKS connection. The Radoop documentation gives a nice picture here. Make sure you replace hive-internal-address with the IP address determined using the ifconfig command.
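
    If you prefer the command line on Windows instead of the PuTTY GUI, plink (PuTTY's command-line client) accepts the same forwarding options; a rough equivalent of the Linux command above, assuming your key has been converted to .ppk format, is:

    plink -i <yourkey>.ppk -N -D 1234 -L 1235:<ifconfig ip address>:10000 hadoop@<nameofmaster>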

     

    Now you can run the Radoop connection tests and, with luck, all will be well...

     

    yay!

     

    Andrew