I am new to Radoop and am trying to set up a development environment. My setup is:
- Virtual machine (Ubuntu) running in VirtualBox (I am not using the HDP image)
- 5 GB of RAM assigned to the VM
- Spark 2.0.0
- Hadoop 2.8.5
- Hive 2.3.3
My quick tests all pass. When I run the full tests, I get the following error:
[Nov 4, 2018 7:50:46 PM]: Running test 17/25: Hive load data
[Nov 4, 2018 7:50:52 PM]: Test succeeded: Hive load data (6.356s)
[Nov 4, 2018 7:50:52 PM]: Running test 18/25: Import job
[Nov 4, 2018 7:51:07 PM] SEVERE: Test failed: Import job
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Import job
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Hive load data
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Radoop jar upload
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: HDFS upload
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Create permanent UDFs
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: UDF jar upload
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Spark assembly jar existence
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Spark staging directory
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: MapReduce staging directory
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Radoop temporary directory
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: MapReduce
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: HDFS
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: YARN services networking
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: DataNode networking
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: NameNode networking
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Java version
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Fetch dynamic settings
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Hive connection
[Nov 4, 2018 7:51:07 PM]: Total time: 22.634s
[Nov 4, 2018 7:51:07 PM]: java.lang.Exception: Import job failed, see the job logs on the cluster for details.
at eu.radoop.connections.service.test.integration.TestHdfsImport.call(TestHdfsImport.java:95)
at eu.radoop.connections.service.test.integration.TestHdfsImport.call(TestHdfsImport.java:40)
at eu.radoop.connections.service.test.RadoopTestContext.lambda$runTest$1(RadoopTestContext.java:279)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Nov 4, 2018 7:51:07 PM] SEVERE: java.lang.Exception: Import job failed, see the job logs on the cluster for details.
[Nov 4, 2018 7:51:07 PM] SEVERE: Test data import from the distributed file system to Hive server 2 failed. Please check the logs of the MapReduce job on the ResourceManager web interface at http://${yarn.resourcemanager.hostname}:8088.
[Nov 4, 2018 7:51:07 PM] SEVERE: Test failed: Import job
[Nov 4, 2018 7:51:07 PM] SEVERE: Integration test for 'VirtualBoxVM' failed.
In the YARN container logs, I see the following error:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
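From what I have read, this MRAppMaster error usually means the MapReduce framework jars are not on the container classpath. For reference, this is the kind of entry I understand mapred-site.xml needs; the paths are my assumption based on a default tarball install and may not match every setup:

```xml
<!-- mapred-site.xml: make the MapReduce framework jars visible to YARN
     containers. $HADOOP_MAPRED_HOME is assumed to point at the Hadoop
     install directory on every node (a single node in my VM's case). -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
```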
Furthermore, if I run just the Spark tests, I get the failure below.
My Spark settings in Radoop:
- Spark version: Spark 2.0
- Assembly path: hdfs:///spark/jars/*
- Resource allocation policy: Static, Default Configuration
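To rule out a typo or a permission problem with that assembly path, this is the sanity check I run from inside the VM (the path is just my setting from above):

```shell
# List the Spark jars directory in HDFS to confirm it exists and is
# readable. The Radoop setting uses a trailing /* glob, but the directory
# itself is what must exist and be readable by the submitting user.
hdfs dfs -ls hdfs:///spark/jars/ | head -n 5
```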
Logs:
[Nov 4, 2018 7:55:44 PM]: Running test 3/4: HDFS upload
[Nov 4, 2018 7:55:44 PM]: Uploaded test data file size: 5642
[Nov 4, 2018 7:55:44 PM]: Test succeeded: HDFS upload (0.075s)
[Nov 4, 2018 7:55:44 PM]: Running test 4/4: Spark job
[Nov 4, 2018 7:55:44 PM]: Assuming Spark version Spark 2.0.
[Nov 4, 2018 7:56:38 PM]: Assuming Spark version Spark 1.4 or below.
[Nov 4, 2018 7:56:38 PM] SEVERE: Test failed: Spark job
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: Spark job
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: HDFS upload
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: Spark staging directory
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: Fetch dynamic settings
[Nov 4, 2018 7:56:38 PM]: Total time: 53.783s
[Nov 4, 2018 7:56:38 PM] SEVERE: com.rapidminer.operator.UserError: The specified Spark assembly jar, archive or lib directory does not exist or cannot be read.
[Nov 4, 2018 7:56:38 PM] SEVERE: The Spark test failed. Please verify your Hadoop and Spark version and check if your assembly jar location is correct. If the job failed, check the logs on the ResourceManager web interface at http://${yarn.resourcemanager.hostname}:8088.
[Nov 4, 2018 7:56:38 PM] SEVERE: Test failed: Spark job
[Nov 4, 2018 7:56:38 PM] SEVERE: Integration test for 'VirtualBoxVM' failed.
ResourceManager logs (full logs attached to the post):
User class threw exception: org.apache.spark.SparkException: Spark test failed: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/radoop/training-vm/tmp_1541357744748_x0migqc
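One thing I notice is the file:/ scheme in that temporary path, which makes me wonder whether the job is resolving paths against the local filesystem instead of HDFS. My understanding is that core-site.xml should pin the default filesystem with an entry like the following; the hostname and port here are placeholders for my single-node VM, not verified values:

```xml
<!-- core-site.xml: default filesystem for all relative/unqualified paths.
     Hostname and port are placeholders; they must match the NameNode. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```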
Apart from this, I have also attached my yarn-site.xml and mapred-site.xml.
Any help would be much appreciated.