
Best input data format for large data sets?

User: "harri678"
New Altair Community Member
Updated by Jocelyn
Hi,

I wanted to ask what's the recommended import format for large datasets?

My dataset has the following specs:
- 36000 samples altogether, split into 5 groups of 7200 samples each
- timestamp = id, integer label
- theoretical maximum of 1,200,000 integer attributes (for now a subset of about 5000 has been chosen, but more would be better)

Currently I am using an "import" process which does:
- CSV import (one CSV file for 7200 samples)
- define roles
- some normalization
- "write binary"

The binary files are re-read in the classification process, because that is faster than parsing all the CSVs every time. My problem is that if I increase the number of attributes in the CSV, the "import" process eats up all the memory (7 GB) and dies. I also experimented with "Free Memory", but it didn't help.

My question now is: is there a better format than CSV for large datasets that can still be processed directly at a decent speed, so that I can maybe drop this import step? What would you recommend?

Thanks,
Harald
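
The memory problem described above can be estimated with a rough back-of-the-envelope calculation. The sketch below assumes every value is held densely as one 8-byte number; that per-value size is an assumption for illustration, not a documented figure for the tool, and the function name `dense_size_bytes` is hypothetical. The sample and attribute counts come from the question.

```python
# Rough estimate of dense in-memory size for one imported CSV file.
# ASSUMPTION: each cell costs 8 bytes (one double-precision value);
# real overhead depends on the tool and the JVM and is likely higher.
BYTES_PER_VALUE = 8


def dense_size_bytes(n_samples: int, n_attributes: int,
                     bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes needed to hold every cell of an n_samples x n_attributes table."""
    return n_samples * n_attributes * bytes_per_value


SAMPLES_PER_FILE = 7_200  # one CSV file, as described in the question

for n_attributes in (5_000, 100_000, 1_200_000):
    size_gb = dense_size_bytes(SAMPLES_PER_FILE, n_attributes) / 1e9
    print(f"{n_attributes:>9} attributes: {size_gb:8.2f} GB per file")
```

Under this assumption, a single 7200-sample file already needs about 5.76 GB at 100,000 attributes, which is close to the 7 GB limit mentioned above, and far more at the full 1,200,000 attributes.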

    User: "RalfKlinkenberg"
    New Altair Community Member
    Hi Harri,

    If your data is sparse (many zero values and comparatively few non-zero attribute values), you may want to try the sparse file and data formats. They store only the non-zero values and are therefore the preferred representation for sparse data sets such as large text collections.
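
The idea of storing only the non-zero values can be illustrated with a minimal sketch. This is a generic index-to-value mapping written for illustration only; it is not the actual sparse file format or data structure used by the tool, and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
from typing import Dict, List


def to_sparse(row: List[int]) -> Dict[int, int]:
    """Keep only the non-zero values, keyed by attribute index."""
    return {index: value for index, value in enumerate(row) if value != 0}


def to_dense(sparse_row: Dict[int, int], n_attributes: int) -> List[int]:
    """Rebuild the full row; every index not stored is implicitly zero."""
    row = [0] * n_attributes
    for index, value in sparse_row.items():
        row[index] = value
    return row


# One sample with 10 attributes, only 3 of them non-zero.
dense_row = [0, 0, 3, 0, 0, 0, 7, 0, 0, 1]
sparse_row = to_sparse(dense_row)

print(sparse_row)  # {2: 3, 6: 7, 9: 1}
print(len(sparse_row), "stored values instead of", len(dense_row))

# The round trip loses no information.
assert to_dense(sparse_row, len(dense_row)) == dense_row
```

The saving grows with the share of zeros: a row with 1,200,000 attributes but only a few hundred non-zero entries stores a few hundred pairs instead of 1,200,000 values.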

    Best regards,
    Ralf
    User: "harri678"
    New Altair Community Member
    OP
    Hi,

    I got it working with the Read AML operator and sparse storage. Thanks!

    Greetings, Harald
    User: "wessel"
    New Altair Community Member
    Reading data seems to be much faster in version 5 than in version 4, so I would recommend downloading version 5.