Radoop on Amazon EMR fails to initialize

User: "Andrew"
New Altair Community Member
Updated by Jocelyn

I'm very close to being able to get Radoop working with an Amazon EMR cluster. My setup involves RapidMiner Studio and Radoop on a Windows laptop, which has full, unfettered firewall access to the EMR machines. I am not using SOCKS (although I started with this). I am using the absolute latest Spark, Hive and Hadoop components that Amazon makes available.

 

The full connection test fails at the point where components are being uploaded to the /tmp/radoop/_shared/db_default/ HDFS location. I can see that the data nodes are being contacted on port 50010, and it looks like this fails from my laptop because the IP addresses are not known. I have tried the dfs.client.use.datanode.hostname true/false workaround and I can see that it changes the name it attempts to use: in one setting the node is <name>/<ipaddress>:50010 (which is odd), while in the other it is <ipaddress>:50010 (which is believable but doesn't resolve).
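
For reference, this is roughly how I set that property (with one of the two values I tried) as an advanced Hadoop parameter on the Radoop connection:

dfs.client.use.datanode.hostname  true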

 

I don't have the luxury of installing RapidMiner components on the EMR cluster, so my question is: what is the best way to get the data nodes exposed to the PC running RapidMiner Studio and Radoop?

    User: "Andrew"
    New Altair Community Member
    OP
    Accepted Answer

    Hello Peter,

    I'm happy to say the Spark suggestion worked and now I can get Radoop connections working completely.

     

    As promised here is the list of things to do to get to this happy place.

     

    Create an EMR cluster and use the advanced options to select Hadoop, Pig, Spark, Hive and Mahout.

     

    Log on to the master node and determine the internal IP address of the eth0 interface using the command line. 

    ifconfig
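
    If you only want the relevant line from the output, something like this works (assuming the interface really is eth0, as it was on my cluster):

    ifconfig eth0 | grep 'inet '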

     

    While logged in, there are some configuration steps needed to make the environment work. These are described in the Radoop documentation here. I observed that Java did not need any special configuration; EMR is up to date. The commands to create various staging locations in HDFS are required. I've repeated them below:

    hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history
    hadoop fs -chmod -R 777 /tmp/hadoop-yarn
    hadoop fs -mkdir /user
    hadoop fs -chmod 777 /user

    An earlier version of Spark needs to be installed, since the connection uses Spark 1.6 rather than the newer Spark that ships with the latest EMR. Here are the steps.

     

    wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
    cd /home/hadoop
    tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
    hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/
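
    As an optional sanity check, you can confirm the assembly jar landed where Radoop will look for it:

    hadoop fs -ls /tmp/spark-assembly-1.6.3-hadoop2.6.0.jar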

     

    Continue to follow the instructions to set up the network connection. Use the IP address found above as the NameNode address, Resource Manager address and JobHistory Server address. Don't be tempted to use any other name or IP address, since it will not work.
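
    To make that concrete, all three fields get the same value (shown here as a placeholder, in the same <> convention as the commands below):

    NameNode address:           <eth0 internal ip>
    Resource Manager address:   <eth0 internal ip>
    JobHistory Server address:  <eth0 internal ip>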

     

    Set the Hive Server address to localhost.

     

    Set the Hive port to 1235. (This is the local end of the SSH tunnel created below, which forwards to HiveServer2 on port 10000 on the master.)

     

    Set the Spark version to Spark 1.6 and set the assembly jar location to

    hdfs:///tmp/spark-assembly-1.6.3-hadoop2.6.0.jar

     

    Set the advanced Hadoop parameters as follows

    dfs.client.use.legacy.blockreader  true
    hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.SocksSocketFactory
    hadoop.socks.server localhost:1234

    Now create the SOCKS connection. On Linux the command is like this.

    ssh -i <yourkey>.pem -N -D 1234 -L localhost:1235:<ifconfig ip address>:10000  hadoop@<nameofmaster>

    In the command above, the values between <> need to be replaced with information from your environment.

     

    On Windows, use PuTTY to create the SOCKS connection. The Radoop documentation gives a nice picture here. Make sure you replace hive-internal-address with the IP address determined using the ifconfig command.
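
    If you prefer the command line on Windows instead of the PuTTY GUI, plink (PuTTY's command-line client) accepts the same forwarding options; a rough equivalent of the Linux command above, assuming your key has been converted to .ppk format, is:

    plink -i <yourkey>.ppk -N -D 1234 -L 1235:<ifconfig ip address>:10000 hadoop@<nameofmaster>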

     

    Now you can run the Radoop connection tests and, with luck, all will be well...

     

    yay!

     

    Andrew