I'm trying to determine whether RapidMiner might be easier to work with than Google BigQuery and GraphChi for a big data project. The full test case is the ASA project at
http://stat-computing.org/dataexpo/2009/the-data.html, but the problem surfaces with just the 2008 flight data test case at
http://stat-computing.org/dataexpo/2009/the-data.html, which is about 632.2 MB when cleaned.
The data is a CSV file that I have also imported into MySQL. I hit similar problems reading from either the CSV or MySQL.
I've edited RapidMinerGUI to give it 2 GB of RAM on an 8 GB machine. It made no noticeable difference.
As near as I can tell, RapidMiner is trying to load the whole table regardless of the Sample step in the process, which specifies 1000 rows. This happens via both MySQL and CSV, although MySQL generally fails with an "attempting to reuse connection after closed" error, presumably secondary to running out of RAM.
A confusing factor is that I keep getting a Sample error to the effect of (from memory) "SampleSet contains too few records, 1000 is required". I think this means it hasn't determined the actual record count yet and is working from flaky metadata.
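For comparison, here is what I expected the sampling step to be able to do in principle: pick 1000 rows from a stream without ever holding the whole table. This is only a minimal sketch of reservoir sampling in plain Java. The class and method names are mine and hypothetical, nothing here is RapidMiner's API, and I don't know how the Sample operator is actually implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

// Hypothetical helper, not part of RapidMiner: draws a uniform random sample
// of up to k items from a stream using reservoir sampling (Algorithm R).
class ReservoirSampler {

    // Memory use is O(k) no matter how many rows the iterator yields.
    static <T> List<T> sample(Iterator<T> rows, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (rows.hasNext()) {
            T row = rows.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k rows.
                reservoir.add(row);
            } else {
                // Pick a slot in [0, seen); the row survives only if the slot
                // falls inside the reservoir, i.e. with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, row);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Stand-in for a large table: one million integer "rows", generated lazily.
        Iterator<Integer> fakeRows = IntStream.range(0, 1_000_000).boxed().iterator();
        List<Integer> picked = sample(fakeRows, 1000, new Random(1992L));
        System.out.println("sampled rows: " + picked.size());
    }
}
```

If the stream has fewer than k rows, this sketch simply returns all of them rather than raising an error.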
I've invested a weekend just getting this far and am close to giving up. Can someone help me get out of the weeds? Thanks!
Here's the latest process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="190" width="413">
      <operator activated="true" class="stream_database" compatibility="5.2.008" expanded="true" height="60" name="Stream Database" width="90" x="45" y="30">
        <parameter key="connection" value="mysql"/>
        <parameter key="table_name" value="flights"/>
        <parameter key="label_attribute" value="ArrDelay"/>
        <parameter key="id_attribute" value="TailNum"/>
      </operator>
      <operator activated="true" class="sample" compatibility="5.2.008" expanded="true" height="76" name="Sample" width="90" x="179" y="30">
        <parameter key="sample_size" value="1000"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
        <parameter key="name" value="ArrDelay"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="ArrDelay" value="label"/>
        </list>
      </operator>
      <connect from_op="Stream Database" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
And the stacktrace, which is what makes me think the sample operator is ignored:
Aug 27, 2012 4:37:17 PM com.rapidminer.tools.jdbc.DatabaseHandler executeStatement
INFO: Executing query: 'SELECT *
FROM `flights`'
Exception in thread "RemoteProcess-Updater" Exception in thread "ProgressThread" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1649)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1426)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2924)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2619)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1788)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2209)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2619)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2569)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1521)
at com.rapidminer.tools.jdbc.DatabaseHandler.executeStatement(DatabaseHandler.java:1258)
at com.rapidminer.operator.io.DatabaseDataReader.getResultSet(DatabaseDataReader.java:116)
at com.rapidminer.operator.io.DatabaseDataReader.createExampleSet(DatabaseDataReader.java:124)
at com.rapidminer.gui.tools.dialogs.wizards.dataimport.DataImportWizard$1.run(DataImportWizard.java:73)
at com.rapidminer.gui.tools.ProgressThread$2.run(ProgressThread.java:189)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedList.<init>(LinkedList.java:78)
at com.rapidminer.repository.remote.RemoteRepository.getAll(RemoteRepository.java:482)
at com.rapidminer.repository.gui.process.RemoteProcessesTreeModel$UpdateTask.run(RemoteProcessesTreeModel.java:129)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
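The first trace runs out of memory inside com.mysql.jdbc.MysqlIO.readAllResults, which makes me suspect the MySQL driver is buffering the entire result of SELECT * before any row is handed on. As far as I understand the Connector/J documentation, the driver only streams rows one at a time when the statement is forward-only, read-only, and has a fetch size of Integer.MIN_VALUE. Below is a minimal standalone sketch I could use to check that the table is readable in constant memory outside RapidMiner. The class name, method names, and command-line arguments are hypothetical placeholders, the MySQL driver jar has to be on the classpath, and I don't know whether RapidMiner's operators can be made to use this mode.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical standalone check, independent of RapidMiner.
class StreamFlights {

    // Builds the same kind of query that appears in the log above.
    static String selectAll(String table) {
        return "SELECT * FROM `" + table + "`";
    }

    // Counts rows while asking MySQL Connector/J to stream them
    // instead of buffering the whole result set in the heap.
    static long countStreamed(Connection conn, String table) throws SQLException {
        try (Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Connector/J treats this fetch size as the signal to stream row by row.
            stmt.setFetchSize(Integer.MIN_VALUE);
            long n = 0;
            try (ResultSet rs = stmt.executeQuery(selectAll(table))) {
                while (rs.next()) {
                    n++;
                }
            }
            return n;
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            // The JDBC URL, user, and password are placeholders supplied by the caller.
            System.out.println("usage: StreamFlights <jdbc-url> <user> <password>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            System.out.println("rows: " + countStreamed(conn, "flights"));
        }
    }
}
```

If this runs to completion with a small heap while the RapidMiner process still fails, that would point at how the query is issued rather than at the size of the table itself.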