best input data format for large data sets?
harri678
New Altair Community Member
Hi,
I wanted to ask what's the recommended import format for large datasets?
My dataset has the following specs:
- 36,000 samples in total, split into 5 groups of 7,200 samples each
- timestamp = id, integer label
- a theoretical maximum of 1,200,000 integer attributes (for now a subset of about 5,000 has been chosen, but more would be better)
Currently I am using an "import" process which does:
- CSV import (one CSV file for 7200 samples)
- define roles
- some normalization
- "write binary"
The binary files are re-read in the classification process, because that is faster than parsing all the CSVs every time. My problem is that if I increase the number of attributes in the CSV, the "import" process eats up all the memory (7 GB) and dies. I also experimented with "Free Memory", but it didn't help.
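A rough back-of-envelope estimate shows why the dense representation runs out of memory so quickly. This is just a sketch, assuming each attribute value is held as an 8-byte double in memory (an assumption about the internal representation, not a documented figure):

```python
# Estimate the in-memory size of a dense example set,
# assuming 8 bytes per cell (a guessed per-value cost).
def dense_size_gb(rows, cols, bytes_per_cell=8):
    return rows * cols * bytes_per_cell / 1024**3

print(round(dense_size_gb(7200, 5000), 2))       # current subset per group: ~0.27 GB
print(round(dense_size_gb(7200, 1_200_000), 1))  # theoretical maximum: ~64.4 GB
```

So even one group of 7,200 samples at the full attribute count would far exceed 7 GB if stored densely, regardless of the input file format.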
My question is now: is there a better format than CSV for large datasets that can still be processed directly at decent speed, so that I can perhaps drop this import step? What would you recommend?
Thanks,
Harald
Answers
Hi Harri,
if your data is sparse (many zero values and significantly fewer non-zero attribute values), you may want to try the sparse file and data formats. They store only the non-zero values and are hence the preferred representation for sparse datasets such as large text collections.
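The idea behind sparse storage can be sketched in a few lines. This is a generic illustration of the index:value row encoding, not RapidMiner's exact on-disk syntax: only non-zero cells are written out, so rows that are mostly zeros shrink dramatically.

```python
# Encode one dense row in a sparse "index:value" style
# (generic illustration of the sparse-storage idea).
def to_sparse(row):
    return " ".join(f"{i}:{v}" for i, v in enumerate(row) if v != 0)

print(to_sparse([0, 0, 7, 0, 3]))  # -> "2:7 4:3"
```

With 1,200,000 attributes of which only a few thousand are non-zero per sample, this kind of encoding reduces both file size and memory use by orders of magnitude.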
Best regards,
Ralf
Hi,
I managed it with the Read AML operator and sparse storage. Thanks!
Greetings, Harald
There seems to be a big improvement in version 5 compared to version 4 when reading data. Version 5 is much faster, so download version 5.0.