Working with large data sets

tatihz
tatihz New Altair Community Member
I have to work with a large data set, but I am not sure whether RapidMiner supports very large amounts of data (let's say over a few hundred thousand records). Is there any place I can find some reference about performance and maximum number of records that the tool can work with?

Also, is there any data input format that is more recommended for dealing with large data sets?

Thanks.

Answers

  • haddock
    haddock New Altair Community Member
    Hi there,

    Try playing around with the example generator operators, like the massive data generator. It is probably also worth pointing out that other things affect performance, like the OS, memory, and of course what you plan to do with the data; this also means that finding "some reference about performance and maximum number of records that the tool can work with" may be impossible. After all, how long is the longest piece of string?
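    If you want a quick feel for how your own machine copes, you can also create a comparable file outside of RapidMiner and import it. The sketch below is only an illustration - the file name, row count, and attribute ranges are arbitrary choices, not anything prescribed by the tool:

        # Generate a large synthetic CSV to stress-test import and
        # preprocessing on your own hardware; sizes and names are arbitrary.
        import csv
        import random

        ROWS = 500_000        # "a few hundred thousand records"
        COLS = 20             # number of numeric attributes

        with open("synthetic_stress_test.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([f"att{i}" for i in range(COLS)] + ["label"])
            for _ in range(ROWS):
                row = [round(random.gauss(0.0, 1.0), 4) for _ in range(COLS)]
                row.append(random.choice(["positive", "negative"]))
                writer.writerow(row)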
  • IngoRM
    IngoRM New Altair Community Member
    Hello,

    haddock is absolutely right. There is no general answer to the maximum amount of data RapidMiner can handle. Since many calculations of data mining models are done in memory, the amount of main memory is one of the most important factors restricting the amount of data available for modeling. However, for certain model types as well as for most preprocessing tasks, the data can be processed in batches, and then you will hardly face any limitation at all if the data is stored in a database. Just to give you an idea, I am currently working on a project with about 700,000 items, and we perform a lot of preprocessing and the modeling also runs smoothly. The largest database I remember working on in a customer project contained more than 30 million records - and everything worked well in that project, if you know what you are doing, at least  ;)

    Also, is there any data input format that is more recommended for dealing with large data sets?
    This can easily be answered: read your data from databases (real relational ones, not Access etc.) and store the results there as well. In this case you will also be able to work in batches that fit into memory during preprocessing, and for certain models even during modeling. This is, by the way, always possible for scoring, i.e. applying a prediction model to large amounts of data.
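    A minimal sketch of that batch idea in plain Python (SQLite stands in for a real relational database here, and the table, column names, batch size, and the trivial "preprocessing" step are all made-up placeholders):

        import sqlite3

        BATCH = 50_000

        src = sqlite3.connect("warehouse.db")
        dst = sqlite3.connect("results.db")
        dst.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER, value REAL)")

        cur = src.execute("SELECT id, value FROM transactions")
        while True:
            rows = cur.fetchmany(BATCH)          # only one batch in memory at a time
            if not rows:
                break
            processed = [(rid, val * 2.0) for rid, val in rows]   # placeholder step
            dst.executemany("INSERT INTO results VALUES (?, ?)", processed)
            dst.commit()

        src.close()
        dst.close()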

    Cheers,
    Ingo
  • tatihz
    tatihz New Altair Community Member
    Thanks for the quick responses, this is exactly what I needed to know!
    Since I had no idea how the tool processes the data, I was just looking for a general idea of whether the robustness of the tool could be a limiting factor.
    But you both had some good points there and you were very helpful. Thanks a lot!
  • IngoRM
    IngoRM New Altair Community Member
    We just had a discussion about this here and I realized that I should elaborate on it a bit more. I actually intended to write some blog entries about RapidMiner and scalability anyway, so this could serve as a starting point  ;)

    ...customer project contained more than 30 million records - and everything worked well in that project, if you know what you are doing, at least
    The important point is the latter one: it can be really frustrating to get one out-of-memory message after another, and most users tend to conclude that the reason lies within the tool. I have to admit that this is often enough the case. But many times the main reason is simply that it is not wise at all to apply specific algorithms to data sets of specific sizes. This becomes easy to see once you know exactly what is happening inside the algorithm.
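    To make that concrete with a back-of-envelope calculation (purely illustrative, not tied to any particular RapidMiner operator): any algorithm that materializes a full pairwise distance matrix of doubles needs roughly n * n * 8 bytes, which is hopeless at the sizes mentioned above.

        n = 700_000                       # examples
        bytes_needed = n * n * 8          # full double-precision distance matrix
        print(f"{bytes_needed / 1e12:.1f} TB")   # ~3.9 TB - far beyond main memory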

    So the conclusion could be to remove those algorithms from RapidMiner and keep only those that are able to work on massive amounts of data, just to make things more robust. We do not like that - simply because not every data set has an enormous size, and why should we restrict ourselves to a stripped-down set of the data mining algorithms we already have? So we decided on an everything-goes policy and moved the decision about the correct and robust analysis process from the tool to the user, which in my opinion is the only place it belongs. The only thing we can do is support users in making these decisions, for which the quick fixes in RapidMiner 5 are a first step that will certainly be extended in future versions.

    Another interesting side note: especially in classification settings, the amount of unlabeled data is most often very large, while the amount of labeled data used for modeling is much smaller. Thanks to the preprocessing models and the looping operators of RapidMiner, the preprocessing and model application (scoring) of the unlabeled data can be done in batches, and then there is no problem at all. And even if the amount of labeled data is large, it is most often not the wisest thing to use all of it. At least on sufficiently large compute servers with a decent amount of memory, it is the running time that starts to restrict applicability, no longer the memory or robustness problems. And again I would always argue that such problems are rooted in the design of the analysis process and are not a problem of the selected tool (independent of RapidMiner).
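    The same "fit on the small labeled set, score the big unlabeled set in batches" pattern looks roughly like this outside of RapidMiner (the library choice, file names, and chunk size are my own assumptions; inside RapidMiner the idea is expressed with preprocessing models and looping operators):

        import pandas as pd
        from sklearn.preprocessing import StandardScaler
        from sklearn.linear_model import LogisticRegression

        labeled = pd.read_csv("labeled.csv")              # small enough for memory
        X, y = labeled.drop(columns="label"), labeled["label"]

        scaler = StandardScaler().fit(X)                  # the "preprocessing model"
        model = LogisticRegression().fit(scaler.transform(X), y)

        # Score the large unlabeled file chunk by chunk; only one chunk is
        # ever held in memory.
        for chunk in pd.read_csv("unlabeled.csv", chunksize=100_000):
            preds = model.predict(scaler.transform(chunk))
            pd.DataFrame({"prediction": preds}).to_csv(
                "scored.csv", mode="a", header=False, index=False)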


    Just some additional thoughts some of you might find interesting.

    Cheers,
    Ingo