[Solved] Out of Memory with Big Data; adding RAM and sampling didn't help.
BradJCox
New Altair Community Member
I'm trying to determine whether RapidMiner might be easier to work with than Google BigQuery and GraphChi for a big data project. The full test case is the ASA project at http://stat-computing.org/dataexpo/2009/the-data.html, but the problem already surfaces with just the 2008 flight data from that same page, which is about 632.2 MB when cleaned.
The data is a CSV file that I have also imported into MySQL. I get similar problems reading from either the CSV file or MySQL.
I've edited RapidMinerGUI like this to give it 2 GB of RAM on an 8 GB machine. It made no noticeable difference.
MAX_JAVA_MEMORY=2000
A confusing factor is that I keep getting a Sampler error to the effect of (from memory) "SampleSet contains too few records, 1000 is required". I think this means it hasn't tried to determine the actual record count yet and is working from flaky metadata.
I've invested a weekend just getting this far and am close to giving up. Can someone help me get out of the weeds? Thanks!
Here's the latest process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="190" width="413">
      <operator activated="true" class="stream_database" compatibility="5.2.008" expanded="true" height="60" name="Stream Database" width="90" x="45" y="30">
        <parameter key="connection" value="mysql"/>
        <parameter key="table_name" value="flights"/>
        <parameter key="label_attribute" value="ArrDelay"/>
        <parameter key="id_attribute" value="TailNum"/>
      </operator>
      <operator activated="true" class="sample" compatibility="5.2.008" expanded="true" height="76" name="Sample" width="90" x="179" y="30">
        <parameter key="sample_size" value="1000"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
        <parameter key="name" value="ArrDelay"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="ArrDelay" value="label"/>
        </list>
      </operator>
      <connect from_op="Stream Database" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
INFO: Executing query: 'SELECT *
FROM `flights`'
Exception in thread "RemoteProcess-Updater" Exception in thread "ProgressThread" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1649)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1426)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2924)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2619)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1788)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2209)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2619)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2569)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1521)
at com.rapidminer.tools.jdbc.DatabaseHandler.executeStatement(DatabaseHandler.java:1258)
at com.rapidminer.operator.io.DatabaseDataReader.getResultSet(DatabaseDataReader.java:116)
at com.rapidminer.operator.io.DatabaseDataReader.createExampleSet(DatabaseDataReader.java:124)
at com.rapidminer.gui.tools.dialogs.wizards.dataimport.DataImportWizard$1.run(DataImportWizard.java:73)
at com.rapidminer.gui.tools.ProgressThread$2.run(ProgressThread.java:189)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedList.<init>(LinkedList.java:78)
at com.rapidminer.repository.remote.RemoteRepository.getAll(RemoteRepository.java:482)
at com.rapidminer.repository.gui.process.RemoteProcessesTreeModel$UpdateTask.run(RemoteProcessesTreeModel.java:129)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Answers
Hi,
2 GB of RAM is not that much; please try a larger value.
You can safely ignore the "error" on the Sample operator. It is simply a metadata error, which occurs because the Stream Database operator does not report how many data rows it will return before it has been executed. During execution everything should be fine, since the metadata is only used *before* executing the process to detect *potential* problems.
From the stack trace it seems that you used some kind of wizard. Which one is it? Did you try to import the complete database table into the repository? In that case, the wizard does indeed try to read the complete table.
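To illustrate the point about reading the complete table: when only a sample is needed, one option is to put the row limit into the query itself, so that the database never sends more than that many rows. The following is a minimal sketch of that idea, not RapidMiner's own mechanism. It uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the MySQL `flights` table; the helper name `sample_rows` and the generated rows are assumptions made purely for illustration.

```python
import sqlite3


def sample_rows(conn, table, n):
    """Return at most n rows, letting the database apply the limit."""
    # Table names cannot be bound as query parameters, so check the name
    # before interpolating it into the SQL text.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cur = conn.execute(f"SELECT * FROM {table} LIMIT ?", (n,))
    return cur.fetchall()


# Stand-in data: an in-memory table with two of the columns named in the
# process above (TailNum, ArrDelay). The values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (TailNum TEXT, ArrDelay INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    ((f"N{i}", i % 60) for i in range(10000)),
)

rows = sample_rows(conn, "flights", 1000)
print(len(rows))
```

Note that a plain LIMIT returns the first rows the database happens to produce, not a random sample, so it is mainly useful for checking that a process runs at all before pointing it at the full table.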
Best,
Marius
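For the CSV route, a random sample of fixed size can in principle be drawn while streaming the file, so that memory use depends on the sample size rather than on the file size. Below is a minimal sketch of reservoir sampling (Algorithm R) in Python. It is an illustration outside RapidMiner, not what the Sample operator does internally; the function name `reservoir_sample` and the generated stand-in data are assumptions, and only the column names TailNum and ArrDelay come from the process above.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Draw up to k rows uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep row i with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Stand-in for the real CSV file: 10,000 generated rows held in a string.
header = "TailNum,ArrDelay\n"
body = "".join(f"N{i},{i % 60}\n" for i in range(10000))
reader = csv.reader(io.StringIO(header + body))
next(reader)  # skip the header row

sample = reservoir_sample(reader, k=1000, seed=42)
print(len(sample))
```

The sampler only learns how many rows exist by consuming the stream, which mirrors why a row count is not available before the process has actually run. If the source holds fewer than k rows, the function simply returns all of them.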