Is 16GB of RAM the only way to go?
Hello everybody, I'm new here and new in general to Data Mining...
I've been reading some threads here and also been through the WEKA mailing list, and I have come to the sad conclusion that the only way to process large and complex streams of data is to have as much RAM as possible - using RapidMiner or WEKA, of course. I add this because I have been shown some pretty crazy flows in Clementine being run successfully on modest systems with 4, 2, or even 1 GB of RAM. I've been told that Clementine creates a lot of temporary files on the hard disk and has somewhat optimized stream execution code - optimized at least to the point that you can put in anything you wish and let your HDD space handle it, without having to worry about RAM or heap sizes and such. Is this correct?
The first thing I tried to do with RapidMiner (on a 4 GB RAM rig) was to convert a 100-field x 3,000,000-record SPSS file into ARFF, and it wouldn't get past the Read SPSS node - out of memory error! I ran into the same thing with WEKA, which contrasts sharply with Clementine handling quite a lot more data with only 1 GB of RAM.
Regarding having to use 16 GB of RAM as a rule... am I sadly right? Is it not possible, for example, to make RapidMiner use Windows' virtual memory? Set it to any crazy amount and let RapidMiner use it - that would be a charm. It probably isn't very efficient at all, but hey, it's definitely better than not being able to get the job done at all.
On the other hand, do the enterprise versions of RapidMiner have optimized stream execution code? If I buy the software, how would you cope with my need for huge data flows?
I'm no programmer and I couldn't help you with any of this, but come on, guys! If SPSS can manage huge amounts of data and flows, then you should be able to do so as well! Remember, they also use Java... it's not like there is a language limitation, right?
Thank you for your great program, and for your kind attention.
Cheers.
Answers
I am at 8 GB of RAM now and could still use more. I think the problem comes down to people thinking data mining is a one-shot approach where all the data should be thrown in at once. Try some feature selection methods prior to analysis and your memory usage will be much smaller.
-Gagi
Hello Gagi, thank you for your answer.
I understand what you are saying - I have read about it in books and my partner told me about it as well. Data MODELLING shouldn't be about throwing in huge chunks of input and expecting to get something out of it. But in order to clean that huge chunk in the first place, I think the data mining tool should be able to manage it. The huge streams I saw at my partner's Clementine workstation didn't include a single bit of modelling; they were all data exploration, comprehension and preparation streams.
Anyway, consider the situation where many features have predictive value. It would be a pity to prune them down just for memory's sake. I still believe that, beyond all the data mining advice, normal practices and common scenarios, the tools should be prepared for any kind of stream. I mean, all the books say it on the first page... data mining could be the evolution of statistics to cope with the massive amount of data available today... but it seems the memory issue - what should be a "little catch" - puts great open source applications such as this one back against the wall when compared to licensed software such as Clementine. I still ask, though, whether the licensed versions of RapidMiner have optimized stream execution.
I'm sorry if I'm talking nonsense; please remember I'm just starting out and my education comes only from my partner, a bunch of books and playing around a little bit with RapidMiner, Clementine and WEKA.
Thanks!
Hi,
don't make this an open vs. closed source discussion: it is simply not true. If Clementine has built-in streaming: fine. So has RapidMiner, but it is simply not the default (for a bunch of reasons). In order to perform preprocessing (not modelling) on data sets of arbitrary size, you will have to use a combination of
- a database as data input
- the Stream Database operator configured for your database, or the default one together with an appropriately configured database
- the option "create view" for all preprocessing operators where possible
(a rough sketch of what this streamed access amounts to follows right below)
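Just to illustrate what "streamed" database access boils down to underneath - this is a rough JDBC sketch, not RapidMiner's actual implementation; the connection URL, credentials and table name are placeholders, and a matching JDBC driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamedScan {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/miningdb", "user", "password");
            Statement stmt = conn.createStatement();
            // Ask the driver to fetch rows in small batches instead of
            // materializing the whole result set in client memory
            // (the exact mechanics are driver-specific).
            stmt.setFetchSize(1000);
            ResultSet rs = stmt.executeQuery("SELECT * FROM big_table");
            long rows = 0;
            while (rs.next()) {
                // Each row is visited exactly once; it can be transformed
                // and written out without ever holding the full table.
                rows++;
            }
            System.out.println("Scanned " + rows + " rows");
            rs.close();
            stmt.close();
            conn.close();
        }
    }

The point is that memory usage stays bounded no matter how many rows the table has, which is what streamed database access inside a process gives you.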
By the way: on a 64-bit system it should indeed be possible to use more memory than is physically available and let the OS and Java do the temp-file approach similar to what you described for Clementine. It's probably sufficient to adapt the amount of memory in one of our start scripts and start RapidMiner with that script. But calculations will become ridiculously slow then, and I would recommend designing better processes and keeping control of what is happening instead of using this shotgun approach.
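For the memory adjustment itself: the JVM's maximum heap is controlled by the standard -Xmx option, which is what the value in the start script ultimately ends up as (the exact variable name differs between versions). A tiny, purely illustrative Java check of what your JVM was actually started with:

    // Prints the maximum heap the running JVM is allowed to use.
    // The limit itself is set when Java is launched, e.g. via the
    // standard JVM option -Xmx, such as -Xmx6144m for roughly 6 GB.
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }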
We are able to do this - you just haven't found the right buttons yet.
(quoting your post:) "I'm no programmer and I couldn't help you with anything, but, come on guys! If SPSS can manage huge amounts of data and flows then you should be able to do so as well!"
Cheers,
Ingo
Hello Ingo, thank you very much for your answer! And sorry for the open vs. closed source implication - it was not intended. I was just using something as a reference for comparison.
Indeed, it would seem I still need to explore RapidMiner better; the truth is I only tried the conversion I mentioned above and got the error.
Thanks a lot!
No problem at all, it's just that we hear arguments like this quite a lot, and I expect that one of the major reasons is: "**** it, could it really be that this software with its 0 Euro license cost is actually better than my proprietary solution with its kazillion bucks of license fees?" Short answer: yes, it really can be. Welcome to RapidMiner.
Just let me add: RapidMiner itself is free, but our support and expert knowledge are not. And making things more scalable without losing too much performance / accuracy is definitely part of this expert knowledge, as I'm sure everybody understands.
Cheers,
Ingo
Hello again, I saw something in RapidMiner and I was wondering if you could help me a little bit...
Description of the APPEND operator:
"This operator merges two or more given example sets by adding all examples in one example table containing all data rows. Please note that the new example table is built in memory and this operator might therefore not be applicable for merging huge data set tables from a database. In that case other preprocessing tools should be used which aggregates, joins, and merges tables into one table which is then used by RapidMiner."
What other - free - preprocessing tools would you recommend that are data-miner friendly?
Thanks again.
Hi,
anything that builds these joins directly in the database. I'm always using RapidMiner myself, so I don't know of any particular one.
Greetings,
Sebastian
If there isn't any well-known free preprocessing software, could you help me out with how to do this?

Ingo Mierswa wrote:
By the way: on a 64-bit system it should indeed be possible to use more memory than is physically available and let the OS and Java do the temp-file approach similar to what you described for Clementine. It's probably sufficient to adapt the amount of memory in one of our start scripts and start RapidMiner with that script. But calculations will become ridiculously slow then, and I would recommend designing better processes and keeping control of what is happening instead of using this shotgun approach.
We are able to do this - you just haven't found the right buttons yet.
Thanks!
PS: Ingo, I haven't been contacted by sales yet.
Hi,
well, for the append step itself you have two options within RapidMiner: use streamed data access and write the result out, or write the data into a database and append it there via SQL execution. This is basically the same thing an open-source ETL tool would do, and since the same is possible directly within RapidMiner, you would not really have to change tools.
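For the database-side append, the core of it is a single SQL statement executed inside the database. A minimal illustrative sketch via JDBC - the connection details and table names are made up, and both tables are assumed to have identical column layouts:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AppendInDatabase {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/miningdb", "user", "password");
            Statement stmt = conn.createStatement();
            // Append all rows of part_b to part_a inside the database;
            // the client never has to hold either table in memory.
            stmt.executeUpdate("INSERT INTO part_a SELECT * FROM part_b");
            stmt.close();
            conn.close();
        }
    }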
However, you could also try Talend as an ETL tool for this. They are a partner company of Rapid-I, and maybe you will prefer their solution for this ETL step over performing it within RapidMiner.
Cheers,
Ingo
Is there any difference between the Community and Enterprise versions with respect to this feature?
I am testing such a process, but it failed due to running out of Java heap space:
1. Stream Database - the table has about 5 million rows and 42 columns
2. Select Attributes - select a subset of the fields
3. Set Role - set one attribute as the label
4. Linear Regression
Then I ran the process. After 10 minutes, it failed with a Java heap space error.
Jun 25, 2010 3:15:53 PM SEVERE: Process failed: Java heap space
Jun 25, 2010 3:15:53 PM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Stream Database[1] (Stream Database)
+- Select Attributes[1] (Select Attributes)
+- Set Role[1] (Set Role)
==> +- Linear Regression[1] (Linear Regression)

Ingo Mierswa wrote:
Hi,
don't make this an open vs. closed source discussion: it is simply not true. If Clementine has built-in streaming: fine. So has RapidMiner, but it is simply not the default (for a bunch of reasons). In order to perform preprocessing (not modelling) on data sets of arbitrary size, you will have to use a combination of
- a database as data input
- the Stream Database operator configured for your database, or the default one together with an appropriately configured database
- the option "create view" for all preprocessing operators where possible
By the way: on a 64-bit system it should indeed be possible to use more memory than is physically available and let the OS and Java do the temp-file approach similar to what you described for Clementine. It's probably sufficient to adapt the amount of memory in one of our start scripts and start RapidMiner with that script. But calculations will become ridiculously slow then, and I would recommend designing better processes and keeping control of what is happening instead of using this shotgun approach.
We are able to do this - you just haven't found the right buttons yet.
Cheers,
Ingo
Hi,
the problem with this setting is that Linear Regression has to copy all the data into a numerical matrix in order to invert it. This numerical matrix must be stored in main memory, and that is what causes the memory problem.
For large data sets I would suggest using linear scanning algorithms like Naive Bayes or the Perceptron.
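To make the difference concrete, here is a simplified, purely illustrative Java sketch (not RapidMiner's actual code) of the kind of statistics a Gaussian Naive Bayes learner accumulates: memory grows with the number of classes and attributes only, never with the number of rows, so a single streamed pass over the data is enough.

    import java.util.HashMap;
    import java.util.Map;

    // Running per-class statistics for Gaussian Naive Bayes.
    public class StreamingNaiveBayesStats {
        private final int numAttributes;
        private final Map<String, Long> counts = new HashMap<>();
        private final Map<String, double[]> sums = new HashMap<>();
        private final Map<String, double[]> sumsOfSquares = new HashMap<>();

        public StreamingNaiveBayesStats(int numAttributes) {
            this.numAttributes = numAttributes;
        }

        // Called once per example; the example can be discarded afterwards.
        public void update(String label, double[] attributes) {
            counts.merge(label, 1L, Long::sum);
            double[] s = sums.computeIfAbsent(label, k -> new double[numAttributes]);
            double[] sq = sumsOfSquares.computeIfAbsent(label, k -> new double[numAttributes]);
            for (int i = 0; i < numAttributes; i++) {
                s[i] += attributes[i];
                sq[i] += attributes[i] * attributes[i];
            }
        }

        // Per-class mean and variance fall out of the accumulated sums.
        public double mean(String label, int attribute) {
            return sums.get(label)[attribute] / counts.get(label);
        }

        public double variance(String label, int attribute) {
            double m = mean(label, attribute);
            return sumsOfSquares.get(label)[attribute] / counts.get(label) - m * m;
        }
    }

Linear Regression, by contrast, needs the examples collected into one matrix before it can invert it, which is why its memory footprint grows with the size of the data set.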
Greetings,
Sebastian
Thanks for the explanation.
I looked into the RapidMiner code and found that only a few modeling operators support streaming data processing - it is mainly meant for data preprocessing, right?

Sebastian Land wrote:
Hi,
the problem with this setting is that Linear Regression has to copy all the data into a numerical matrix in order to invert it. This numerical matrix must be stored in main memory, and that is what causes the memory problem.
For large data sets I would suggest using linear scanning algorithms like Naive Bayes or the Perceptron.
Greetings,
Sebastian
Hi,
well I think that's correct, but how exactly is your criterion "supporting streaming data processing" defined?
Greetings,
Sebastian
What I meant by "supporting streaming data processing" is that the operator ONLY reads data through the Stream Database operator, with no in-memory copy, so that it can handle extremely large data sets even on a PC with little memory.
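In code terms, the contract I have in mind looks roughly like this - purely illustrative, this is not RapidMiner's real operator API, just a way to state the criterion:

    // An operator "supports streaming" in the sense above if it can be
    // driven through an interface like this: it sees one example at a
    // time and keeps only constant-size state, never the full table.
    public interface StreamingCapableLearner {
        void startExampleSet(int numAttributes);
        void processExample(double[] attributes, double label);
        Object finishAndBuildModel();
    }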
Sebastian Land wrote:
Hi,
well I think that's correct, but how exactly is your criterion "supporting streaming data processing" defined?
Greetings,
Sebastian