[Solved] RapidMiner memory problem (process crashes even with 74 GB of memory)
Hi everybody,
I have a very large dataset and I am trying to do some joins and loops.
My process loops over the values of an attribute in one table. In each iteration it recalls a remembered table (stored before the loop starts and never changed inside it), joins it with another table, computes a Gini ranking on the join result, and appends the result to a file.
So the same fixed work happens in every iteration, but the memory RapidMiner uses keeps growing (and I guess it's the same with ALL RapidMiner processes). It never goes down, even when I run without the GUI on a remote Linux server, and even when I use the Free Memory operator.
I don't need the intermediate results (the output example set of each operator); all I want is the final result. As I said, I tried the Free Memory operator at the end of each iteration, but it doesn't have any effect.
The loop has to iterate 12000 times, but it stops after 500 iterations, which already takes 2 hours and is far too long. The first 100 iterations take 3 minutes, but then it slows down exponentially. The remote Linux server has 74 GB of main memory, and the process still crashes for lack of memory.
Please help me with this. I run into it every time I use RapidMiner. How can I make the process more memory-efficient? (I already use the Xmx options and tried the Free Memory operator.)
Thanks,
Arian
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="836" width="3227">
<operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" height="60" name="Read Feature" width="90" x="246" y="75">
<parameter key="csv_file" value="%{feature_table_file}"/>
<parameter key="column_separators" value=","/>
<parameter key="use_quotes" value="false"/>
<parameter key="parse_numbers" value="false"/>
<parameter key="date_format" value="yyyy-MM-dd"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" height="60" name="Read Main" width="90" x="246" y="210">
<parameter key="csv_file" value="%{main_table_file}"/>
<parameter key="column_separators" value=","/>
<parameter key="use_quotes" value="false"/>
<parameter key="parse_numbers" value="false"/>
<parameter key="date_format" value="yyyy-MM-dd"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="subprocess" compatibility="5.2.008" expanded="true" height="94" name="Gini_ranking" width="90" x="715" y="75">
<process expanded="true" height="796" width="1036">
<operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="76" name="Multiply" width="90" x="112" y="120"/>
<operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="246" y="120">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|id|discharge|event"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="remember" compatibility="5.2.008" expanded="true" height="60" name="Remember (3)" width="90" x="380" y="120">
<parameter key="name" value="main_table"/>
<parameter key="io_object" value="IOObject"/>
</operator>
<operator activated="true" class="loop_values" compatibility="5.2.008" expanded="true" height="60" name="Loop Values" width="90" x="112" y="30">
<parameter key="attribute" value="feature"/>
<parameter key="iteration_macro" value="feature_value"/>
<process expanded="true" height="712" width="2459">
<operator activated="true" class="recall" compatibility="5.2.008" expanded="true" height="60" name="Recall" width="90" x="313" y="210">
<parameter key="name" value="main_table"/>
<parameter key="io_object" value="IOObject"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples" width="90" x="45" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="feature=%{feature_value}"/>
</operator>
<operator activated="true" class="replace" compatibility="5.2.008" expanded="true" height="76" name="Replace (2)" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="feature"/>
<parameter key="replace_what" value="\w+"/>
<parameter key="replace_by" value="1"/>
</operator>
<operator activated="true" class="parse_numbers" compatibility="5.2.008" expanded="true" height="76" name="Parse Numbers" width="90" x="313" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="feature"/>
</operator>
<operator activated="true" class="join" compatibility="5.2.008" expanded="true" height="76" name="Join" width="90" x="447" y="30">
<parameter key="join_type" value="right"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="id" value="id"/>
<parameter key="discharge" value="discharge"/>
</list>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.2.008" expanded="true" height="94" name="Replace Missing Values" width="90" x="581" y="30">
<parameter key="attribute" value="feature"/>
<parameter key="default" value="zero"/>
<list key="columns"/>
<parameter key="replenishment_value" value="0"/>
</operator>
<operator activated="true" class="rename" compatibility="5.2.008" expanded="true" height="76" name="Rename" width="90" x="715" y="30">
<parameter key="old_name" value="feature"/>
<parameter key="new_name" value="%{feature_value}"/>
<list key="rename_additional_attributes"/>
</operator>
<operator activated="true" class="weight_by_gini_index" compatibility="5.2.008" expanded="true" height="76" name="Weight by Gini Index" width="90" x="849" y="30">
<parameter key="normalize_weights" value="false"/>
<parameter key="sort_weights" value="false"/>
</operator>
<operator activated="true" class="weights_to_data" compatibility="5.2.008" expanded="true" height="60" name="Weights to Data" width="90" x="983" y="30"/>
<operator activated="true" class="write_csv" compatibility="5.2.008" expanded="true" height="76" name="Write CSV" width="90" x="1117" y="30">
<parameter key="csv_file" value="%{result_file}"/>
<parameter key="column_separator" value=","/>
<parameter key="write_attribute_names" value="false"/>
<parameter key="quote_nominal_values" value="false"/>
<parameter key="format_date_attributes" value="false"/>
<parameter key="append_to_file" value="true"/>
</operator>
<operator activated="true" class="free_memory" compatibility="5.2.008" expanded="true" height="76" name="Free Memory" width="90" x="1251" y="30"/>
<connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Recall" from_port="result" to_op="Join" to_port="right"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Replace (2)" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
<connect from_op="Parse Numbers" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Weight by Gini Index" to_port="example set"/>
<connect from_op="Weight by Gini Index" from_port="weights" to_op="Weights to Data" to_port="attribute weights"/>
<connect from_op="Weights to Data" from_port="example set" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="through" to_op="Free Memory" to_port="through 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
</process>
</operator>
<connect from_port="in 1" to_op="Loop Values" to_port="example set"/>
<connect from_port="in 2" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Remember (3)" to_port="store"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="source_in 3" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
</process>
</operator>
<connect from_op="Read Feature" from_port="output" to_op="Gini_ranking" to_port="in 1"/>
<connect from_op="Read Main" from_port="output" to_op="Gini_ranking" to_port="in 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
Answers
OK, I found the problem: the process was working with too many nominal attributes. I converted almost everything to numerical type (even the polynominal features and the dates) and mapped the values back to their originals after all the work was done.
It's still weird that it used this much memory, though: 74 GB!
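For anyone who lands here later, here is a minimal sketch of the "convert to numbers, map back at the end" idea in plain Python. It is not RapidMiner code and not the exact process I ran; the function names `encode_nominal` and `decode_nominal` and the sample values are made up for illustration.

```python
from typing import Dict, List, Tuple


def encode_nominal(values: List[str]) -> Tuple[List[int], Dict[int, str]]:
    """Replace each distinct string with a small integer code.

    Returns the encoded column and a reverse lookup (code -> original value).
    Hypothetical helper, for illustration only.
    """
    code_of: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        # The first time a value is seen, it gets the next free code.
        code = code_of.setdefault(value, len(code_of))
        encoded.append(code)
    reverse = {code: value for value, code in code_of.items()}
    return encoded, reverse


def decode_nominal(encoded: List[int], reverse: Dict[int, str]) -> List[str]:
    """Map integer codes back to the original strings."""
    return [reverse[code] for code in encoded]


if __name__ == "__main__":
    # Made-up sample column mixing date strings and a category label.
    column = ["2011-03-01", "2011-03-02", "2011-03-01", "ward_b"]
    codes, lookup = encode_nominal(column)
    print(codes)
    # The heavy work (joins, ranking) would run on the integer codes here.
    print(decode_nominal(codes, lookup) == column)
```

The point is only that the loop handles plain numbers, while the single lookup table is kept aside and applied once at the very end.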