What is the maximum number of instances handled by RapidMiner?

Hi all,

I have been using RapidMiner on Windows for 2-3 months now and have been very happy with the features and analysis tools it provides. Until now I have been using the feature selection operator on a data set with 300 instances and ~300-400 variables, and it gives me good results. I recently increased my data set to 1000 instances, but since then I have been getting "Out of memory" errors and the process stops. I am on the last leg of my analysis, so it's kind of a dampener to get these errors now. :'(

I even tried increasing the memory for Java using the -Xmx option, but with no success. If anyone has ideas or suggestions to solve my problem, please let me know.

Thanks,
Emma

land

Hi Emma,
I have already worked with RapidMiner on data with over 8000 attributes and 25000 examples without running out of memory. I have to admit that it needed 8 GB of RAM, but it worked flawlessly under XP 64 using an x64 Java. So I'm a little bit surprised by your problem.
Did the memory monitor reflect the changes to the -Xmx parameter? Did you have more memory available before the exception?
Do you use any other memory-consuming operators within your process, such as SVMs or PCA?
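
To put the two data sets side by side, here is a rough back-of-the-envelope estimate in Python. It assumes dense storage at 8 bytes per value (one double per cell), which is an assumption about the in-memory layout and not something confirmed in this thread; real usage will be higher because of JVM and object overhead. The function names are made up for illustration.

```python
def dense_footprint_bytes(n_examples, n_attributes, bytes_per_value=8):
    """Raw size of a dense numeric table, ignoring JVM and object overhead.

    bytes_per_value=8 is an assumption (one double per cell).
    """
    return n_examples * n_attributes * bytes_per_value


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / (1024 ** 2)


# Emma's enlarged data set: 1000 instances, up to ~400 variables.
emma = dense_footprint_bytes(1000, 400)

# The larger data set from this reply: 25000 examples, 8000 attributes.
sebastian = dense_footprint_bytes(25000, 8000)

print(f"Emma:      {to_mib(emma):.1f} MiB")       # 3.1 MiB
print(f"Sebastian: {to_mib(sebastian):.1f} MiB")  # 1525.9 MiB
```

Under these assumptions the raw table for 1000 x 400 is only a few MiB, which suggests (but does not prove) that the heap is being consumed by something else in the process rather than by the data set itself.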

Greetings,
Sebastian

IngoRM

Hi,

There is no upper bound for the number of instances, at least not in principle, i.e. if the data storage is done appropriately. We often work with databases having hundreds of millions of tuples without any problem (but of course this will not work for all processes; feature selection might be a problem here).

Could you please post your process (from the XML tab) here? I could probably give some suggestions on how to tune your feature selection process so that it works.

Cheers,
Ingo

emma

I sincerely appreciate all the replies.

The Windows computer in my lab has 1 GB RAM and I used the -Xmx512m and -Xmx1024 options. In both cases the memory monitor reflected the changes. Also, the error messages included comments like "exceeded maximum heap size".

As for the feature selection process I am using, the final goal is a binary classification based on a decision tree. The process used is as follows:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\Program Files\Rapid-I\RapidMiner-4.2\info_gain"/>
    </operator>
    <operator name="FeatureSelection" class="FeatureSelection" expanded="yes">
        <operator name="FSChain" class="OperatorChain" expanded="yes">
            <operator name="XValidation" class="XValidation" breakpoints="after" expanded="yes">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="create_complete_model" value="true"/>
                <operator name="DecisionTree" class="DecisionTree">
                </operator>
                <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                    <operator name="Applier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Evaluator" class="Performance">
                    </operator>
                </operator>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <parameter key="filename" value="C:\Documents and Settings\emma\My Documents\rm_workspace\error.log"/>
                <list key="log">
                    <parameter key="generation" value="operator.FS.value.generation"/>
                    <parameter key="performance" value="operator.FS.value.performance"/>
                </list>
            </operator>
        </operator>
    </operator>
</operator>

Thanks again,
Emma

Legacy User

I am starting a project on churn. I have around 35 million entries with about 120 attributes. I was going to work with Clementine, but I would like to give RapidMiner a try on this database. Do you think I will have problems? What would be the minimum computer configuration to use? I usually sample the data, do preliminary work on the models, then run the trial models on the whole data and modify and validate the models. Would that be a workable strategy with RapidMiner?

Thanks for your comments.

Ignacio

steffen

Hello

@Emma => One idea from my side: as you are using DecisionTree with InformationGain as the splitting criterion, I suggest using the operator "InfoGainWeighting". It calculates the weight of each feature according to information gain, as if the feature were the first one used for splitting.

Then you can either use...

WeightGuidedFeatureSelection instead of FeatureSelection
AttributeWeightsSelection if you want to preselect some attributes. In this case I recommend keeping the attributes at the upper end by using top k.
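
The idea of weighting each attribute by information gain and keeping the top k can be sketched in plain Python as follows. This is only an illustration of the technique, not RapidMiner's implementation: it handles nominal attributes only, and the function names and toy data are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting on one nominal attribute,
    i.e. as if it were the first attribute used for splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_attributes(columns, labels, k):
    """Weight every attribute by information gain and keep the k best."""
    weights = {name: information_gain(vals, labels)
               for name, vals in columns.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k], weights


# Toy data (hypothetical): three nominal attributes, binary label.
columns = {
    "a": ["x", "x", "y", "y"],  # separates the classes perfectly
    "b": ["p", "q", "p", "q"],  # carries no information about the label
    "c": ["m", "m", "m", "n"],  # partially informative
}
labels = ["yes", "yes", "no", "no"]

selected, weights = top_k_attributes(columns, labels, k=2)
print(selected)  # ['a', 'c']
```

Because each attribute is scored once and independently, the cost grows linearly with the number of attributes, whereas a wrapper-style search trains and validates a model for every candidate attribute subset.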

Hope this helps.

Steffen

Legacy User

Hi all,

In fact I will be working with around 38 million instances and 40 attributes. Any comments on using RapidMiner on such a huge database are welcome.

Ignacio

IngoRM

Hi,

@Ignacio:

As I stated before, we have already successfully worked with much larger data sets (several hundred million tuples) with RapidMiner. The important thing is that not every operator or process can be applied to such large data sets. But if you know what you are doing, or if you can live with some trial and error, this is certainly possible. Although 40 million instances with 40 attributes might still fit in memory (at least on a 16 GB machine), it is probably better to work on the database as long as possible. The trick here is to use the CachedDatabaseExampleSource operator, keep only the results of aggregations, samples, filtered sets, one-pass models etc. in memory, and leave the original data in the database.
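
The general pattern (leave the raw rows in the database and pull only aggregates and samples into memory) can be illustrated outside RapidMiner with a small Python sketch. It uses an in-memory SQLite database as a stand-in for the real one; the table, columns, and data are hypothetical, and this does not show how CachedDatabaseExampleSource itself works.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real (large) database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, segment TEXT, spend REAL, churned INTEGER)"
)

# Hypothetical toy rows; in practice these would already be in the database.
rows = [
    (i, "A" if i % 2 == 0 else "B", float(i % 10), 1 if i % 5 == 0 else 0)
    for i in range(1000)
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Aggregation runs inside the database; only one row per segment comes back.
summary = conn.execute(
    "SELECT segment, COUNT(*), AVG(spend), AVG(churned) "
    "FROM customers GROUP BY segment ORDER BY segment"
).fetchall()

# A simple systematic sample (every 10th id) for in-memory modelling.
sample = conn.execute(
    "SELECT id, segment, spend, churned FROM customers WHERE id % 10 = 0"
).fetchall()

print(summary)
print(len(sample), "sampled rows out of", len(rows))
conn.close()
```

Only the two-row summary and the 100-row sample ever reach the client; the full table stays in the database, so the same queries scale to tables far larger than the available heap.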

Cheers,
Ingo

Legacy User

Hi Ingo,

It would be a huge help if the documentation indicated whether each operator is memory-bound or not, single-pass, etc., as well as having a number of examples of working with arbitrarily large N & M data sets.

Great software though! I was very impressed with the responsiveness when we discussed multivariate series to windows way back when.

Jay

IngoRM

Hi Jay,

Nice to hear from you again. I have added extending the documentation with this type of information to our TODO list.

Cheers,
Ingo