"[SOLVED] Amount of expected memory usage with Read CSV"

DaveG
DaveG New Altair Community Member
edited November 5 in Community Q&A
Folks,

 I'm running RapidMiner 5.0 on a 64-bit Windows machine.  All I'm doing is reading in a CSV file of ~500 attributes and 10,000 samples, all doubles; it's approximately 65 MB on disk.  When I run the process (without connecting the output to anything) it takes up 3.6 GB of memory.  This seems excessive for the small amount of data I'm reading in.  Is there something I'm missing?  I searched the forum and found a couple of similar questions, but no answers.
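
 For scale: 500 attributes × 10,000 rows × 8 bytes per double is only about 40 MB of raw numeric data, so 3.6 GB is almost two orders of magnitude more than the payload itself.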

Any help is appreciated!

Thanks,

  Dave

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi Dave,

    I can't give you exact numbers, but yes, RapidMiner can be quite memory hungry in some cases. However, it does not always actually use all the memory it has claimed from the system; that's a peculiarity of Java-based programs in general. In practice this means that if you load additional data, memory consumption may not grow much, because RapidMiner can reuse memory it has already claimed.
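
    If you want to see the difference between claimed and used memory yourself, here is a small plain-Java sketch (nothing RapidMiner-specific):

        public class MemCheck {
            public static void main(String[] args) {
                Runtime rt = Runtime.getRuntime();
                long total = rt.totalMemory(); // heap currently claimed from the OS
                long free  = rt.freeMemory();  // claimed but not in use
                long max   = rt.maxMemory();   // upper bound (the -Xmx setting)
                System.out.printf("claimed: %d MB, used: %d MB, limit: %d MB%n",
                        total / (1024 * 1024), (total - free) / (1024 * 1024),
                        max / (1024 * 1024));
            }
        }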

    Btw, RapidMiner 5.0 is at least two years old. You can download the current version, 5.2.9, from our website.

    Best, Marius
  • DaveG
    DaveG New Altair Community Member
    Unfortunately, the memory use scales with the input size.  When I load a 400 MB file in the same format, the RapidMiner process grows to 35 GB.  I'll look into upgrading to see if that makes a difference.

    Thanks,

      Dave
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    I just did some testing here.

    I created two .csv files, both with 500 attributes; one had 6,000 examples, the second 30,000. (I also tried 60,000 examples, but after I had filled the file, Notepad++ refused to open it because it was too big.) I then had RapidMiner open both: the first file (55 MB) needed about 150 MB of memory, the second (275 MB) about 750 MB. Both were opened by the latest RapidMiner development version without any problems (I have 8 GB of RAM on this machine). Note that these .csv files contained only double values.
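
    For anyone who wants to reproduce this, a comparable file can be generated with plain Java; this is only a sketch, and the file name, the semicolon separator, and the random values are placeholders:

        // Sketch: generate a test CSV with 500 double attributes and 30000 rows.
        // File name "test_doubles.csv" and the semicolon separator are placeholders.
        import java.io.IOException;
        import java.io.PrintWriter;
        import java.util.Random;

        public class MakeCsv {
            public static void main(String[] args) throws IOException {
                int attributes = 500, examples = 30000;
                Random rnd = new Random(42); // fixed seed for reproducibility
                try (PrintWriter out = new PrintWriter("test_doubles.csv")) {
                    for (int a = 0; a < attributes; a++) {       // header row
                        out.print((a > 0 ? ";" : "") + "att" + (a + 1));
                    }
                    out.println();
                    for (int e = 0; e < examples; e++) {         // data rows
                        for (int a = 0; a < attributes; a++) {
                            if (a > 0) out.print(';');
                            out.print(rnd.nextDouble());
                        }
                        out.println();
                    }
                }
            }
        }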

    Now for the .csv files with strings:
    500 attributes, 6,000 examples, each string 26 chars: 77 MB file, RM needed ~1 GB to load the data.
    500 attributes, 30,000 examples, each string 26 chars: 386 MB file, RM needed ~3.5 GB to load the data.
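
    That blow-up is expected: a Java String stores its characters as UTF-16 (2 bytes per char) and adds object and array overhead on top, so a 26-char cell costs several times its on-disk size. A back-of-envelope sketch, assuming roughly 40 bytes of per-String overhead:

        public class StringCost {
            public static void main(String[] args) {
                // Assumption: 2 bytes per char (UTF-16) plus ~40 bytes of
                // object + char[] overhead per String on the JVMs of this era.
                long attributes = 500, examples = 30000, chars = 26;
                long perString  = 2 * chars + 40;  // ~92 bytes in memory vs. 26 on disk
                long totalBytes = attributes * examples * perString;
                // Prints ~1316 MB -- the right order of magnitude for the
                // ~3.5 GB observed once RapidMiner's own table overhead is added.
                System.out.println(totalBytes / (1024 * 1024) + " MB");
            }
        }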


    This leads me to two suggestions:
    1) Please upgrade RapidMiner to the latest version.
    2) If you still run into this kind of problem, please consider a more appropriate way of storing large amounts of data, e.g. a database; or, if you can't switch away from .csv, try splitting the data across multiple files (see the sketch below). A 500 MB .csv file is not the most efficient way of doing things ;) - I couldn't even open it in Notepad++ on my machine.
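
    For the multiple-files route, a minimal splitter sketch in plain Java; the file names and the chunk size are placeholders:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.FileWriter;
        import java.io.IOException;
        import java.io.PrintWriter;

        public class SplitCsv {
            public static void main(String[] args) throws IOException {
                int maxRows = 10000; // rows per output chunk (placeholder)
                try (BufferedReader in = new BufferedReader(new FileReader("big.csv"))) {
                    String header = in.readLine();   // repeat the header in every chunk
                    PrintWriter out = null;
                    String line;
                    int row = 0, part = 0;
                    while ((line = in.readLine()) != null) {
                        if (row % maxRows == 0) {    // start a new chunk
                            if (out != null) out.close();
                            out = new PrintWriter(new FileWriter("part_" + (++part) + ".csv"));
                            out.println(header);
                        }
                        out.println(line);
                        row++;
                    }
                    if (out != null) out.close();
                }
            }
        }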

    Regards,
    Marco
  • DaveG
    DaveG New Altair Community Member
    Marco Boeck wrote: (full post quoted above)
    I appreciate you looking into this.  I did some more testing on my side and have concluded that it's a Java "issue".  The machine that was consuming ~35 GB of memory has 512 GB available.  When I ran the same process on my local machine with 16 GB, I saw memory numbers close to yours, ~4 GB.  This leads me to believe that the JVM grabbed 35 GB because, on a machine with 512 GB of main memory, holding onto more heap was cheaper than running garbage collection to keep the allocation down.
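
    You can see the effect in miniature: after dropping references and hinting a garbage collection, used memory falls while the heap the JVM has claimed can stay large. A small sketch (System.gc() is only a hint, and the ceiling can be capped with the JVM's -Xmx option):

        public class GcDemo {
            public static void main(String[] args) {
                byte[][] junk = new byte[50][];
                for (int i = 0; i < 50; i++) junk[i] = new byte[1024 * 1024]; // claim ~50 MB
                junk = null;   // make the allocations collectable
                System.gc();   // a hint only; the JVM may ignore it
                Runtime rt = Runtime.getRuntime();
                System.out.println("claimed: " + rt.totalMemory() / (1024 * 1024) + " MB, used: "
                        + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " MB");
            }
        }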

    I did upgrade just to stay up to date and saw the same general pattern.

    Thanks again!

    Dave

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi Dave,

    We're happy that you found the cause of your problem. Though I have to say: the very first reply to your post already pointed to the Java garbage collection mechanism  :D

    Anyway: happy mining!

    -Marius