Very large dataset file size despite only 2000 examples

Ray_C (New Altair Community Member)
edited November 5 in Community Q&A
Hi, I am a newbie, so apologies in advance if I'm missing something obvious.

I am working on a binary classifier for a large synthetic credit card fraud dataset, which I have split and sampled into a training and a testing dataset, both with balanced classes (1000 of each). However, something seems to be going wrong somewhere along the line. The full dataset, with 6.3M examples, occupies 538MB, yet my training and test datasets are taking up 95.3MB when they should only be a tiny fraction of that size. They also behave like 100MB files, taking ages to open etc., and the training dataset caused Auto Model to crash. Can somebody tell me where I am going wrong, please? TIA, Ray.
 

Answers

  • Ray_C (New Altair Community Member)
    Process XML attached, if that helps.
  • MartinLiebig (Altair Employee)
    Hi,
    Can you use a "Remove Unused Values" operator right before the Store? That could do it.
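
    The reason this can help: after a Sample or Split, nominal attributes may still carry the complete value dictionary from the original 6.3M rows, and that dictionary gets written out with the stored data. As a rough analogue outside RM (a pandas sketch, assuming categorical columns; this is not how RM implements it internally):

        import pandas as pd

        # A categorical column remembers every value it has ever seen,
        # even after most rows have been filtered or sampled away.
        df = pd.DataFrame({"customer": pd.Categorical([f"C{i}" for i in range(1_000_000)])})
        sample = df.sample(n=2000, random_state=1).copy()

        print(len(sample["customer"].cat.categories))  # still all 1,000,000 categories

        # Analogue of Remove Unused Values: drop categories that no longer occur.
        sample["customer"] = sample["customer"].cat.remove_unused_categories()
        print(len(sample["customer"].cat.categories))  # 2000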

    Best,
    Martin
  • Ray_C (New Altair Community Member)
    Martin, many thanks for the suggestion.

    I have added that operator before the Stores, but it appears just to have moved the bottleneck from Auto Model's (non-)handling of the pseudo-100MB dataset back to the data prep process itself, where, as I type and for the past five minutes, one of the "Remove Unused Values" operators remains in progress and, I suspect, may not in fact complete.

    I think I could do with doing some research on the handling of very large datasets.

    Just in: RM has crashed with an out-of-memory exception. I've got a Core i7 with 16GB RAM, so I need to change the methodology for sure.
  • kayman (New Altair Community Member)
    What if you move the Remove Unused Values operator further upstream? I typically use it right after a filter or, as in your case, a split operator.
    Try adding one on both split outputs, as the 'hidden' information will otherwise travel through. Also try ticking the 'include special attributes' option: since you are assigning a role, the remove operator might have limited impact if those attributes are unique identifiers, as they will all be special attributes.
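    Outside RM, the same idea boils down to cleaning up each partition right after the split, id-like "special" columns included. A rough pandas sketch, assuming categorical columns:

        import pandas as pd

        def trim_categories(df: pd.DataFrame) -> pd.DataFrame:
            # Drop category values that no longer occur in this partition,
            # id-like ("special") columns included.
            out = df.copy()
            for col in out.select_dtypes(include="category").columns:
                out[col] = out[col].cat.remove_unused_categories()
            return out

        # Toy data with an id-like categorical column.
        df = pd.DataFrame({
            "customer_id": pd.Categorical([f"C{i}" for i in range(100_000)]),
            "amount": range(100_000),
        })

        # 70:30 split, then clean BOTH outputs.
        train_idx = df.sample(frac=0.7, random_state=1).index
        train = trim_categories(df.loc[train_idx])
        test = trim_categories(df.drop(index=train_idx))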
  • Ray_C (New Altair Community Member)
    Thanks Kayman, I have moved the operator as suggested. The process now completes (albeit laboriously), but the resulting test and training datasets are still far too large at 31.2MB and 68.1MB respectively (sizes which, incidentally, mirror the 70:30 ratio in the Split Data operator).

    I am not sure what's going on TBH, specifically what the Remove Unused Values operator is doing or is supposed to do. This is a synthetic dataset with no missing values etc., so what would it be removing after the split?

    Also, I find it unusual that there is no out-of-the-box answer for this issue (no disrespect intended). I am assuming that many, many people have worked on this dataset before (kaggle.com/ntnu-testimon/paysim1), and many will have split the data into test and training sets using the Split Data operator within RM.

    Yet I can't seem to find any references online to anybody else experiencing this kind of issue. I am not trying to achieve anything complex; I am barely off first base, with the only operation being the assignment of a label, which is required in order to obtain a balanced dataset. I just don't understand why the split datasets do not appear to be amenable to the sampling process.

    I keep asking myself whether there is something fundamentally wrong with my approach, but the responses to date (much appreciated) do not suggest that there is.
  • Ray_C (New Altair Community Member)
    Hi Martin, thanks for taking the time to set me right. You were dead right: there were two features of polynominal type with millions of unique values in there (rendering them useless as predictors for binary classification anyway). I stripped those two out of the dataset as a first step, the succeeding processes completed quickly, and the resulting training and test datasets are 117KB each. Thanks again to yourself and Kayman.
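
    In case it helps anyone else who lands here: the check I ended up doing amounts to flagging nominal columns where nearly every row has its own value. Roughly, in pandas (just a sketch; the toy column names only mimic the PaySim schema):

        import pandas as pd

        def high_cardinality_columns(df: pd.DataFrame, ratio: float = 0.5) -> list[str]:
            # Flag nominal/text columns where most rows carry a unique value -
            # id-like fields that bloat stored files and are useless as predictors.
            flagged = []
            for col in df.select_dtypes(include=["object", "category"]).columns:
                if df[col].nunique() / len(df) > ratio:
                    flagged.append(col)
            return flagged

        toy = pd.DataFrame({
            "nameOrig": [f"C{i}" for i in range(1000)],  # id-like: every value unique
            "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"] * 200,
        })
        print(high_cardinality_columns(toy))  # ['nameOrig']
        # On the real data: df = df.drop(columns=high_cardinality_columns(df))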
  • Telcontar120 (New Altair Community Member)
    Another operator you may be interested in checking out is the Replace Rare Values operator.
    This is helpful when you have nominal attributes with many different unique values, some of which might occur frequently enough to be useful, but most of which occur infrequently and are thus not useful. It lets you keep the most frequent values and remap all the others into a generic "Other" category much more easily than the normal Map operator (which would require you to list them all out individually).
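
    If you ever need the same thing outside RM, the equivalent in pandas boils down to keeping the top N values and mapping everything else to "Other" (rough sketch):

        import pandas as pd

        def replace_rare_values(s: pd.Series, top_n: int = 10, other: str = "Other") -> pd.Series:
            # Keep the top_n most frequent values; remap everything else to "Other".
            keep = s.value_counts().nlargest(top_n).index
            return s.where(s.isin(keep), other)

        s = pd.Series(["A"] * 50 + ["B"] * 30 + list("CDEFGHIJ"))
        print(replace_rare_values(s, top_n=2).value_counts())
        # -> A: 50, B: 30, Other: 8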

  • Ray_C (New Altair Community Member)
    @Telcontar120, belated thanks for the tip - yes that sounds like a very useful operator indeed. Cheers.