best input data format for large data sets?
harri678
New Altair Community Member
Hi,
I wanted to ask what's the recommended import format for large datasets?
My dataset has the following specs:
- 36,000 samples in total, split into 5 groups of 7,200 samples each
- timestamp = id, integer label
- a theoretical maximum of 1,200,000 integer attributes (for now a subset of about 5,000 has been chosen, but more would be better)
Currently I am using an "import" process which does:
- CSV import (one CSV file for 7200 samples)
- define roles
- some normalization
- "write binary"
The binary files are re-read in the classification process, because that is faster than parsing all the CSVs every time. My problem is that if I increase the number of attributes in the CSV, the "import" process eats up all the memory (7 GB) and dies. I also experimented with "Free Memory", but it didn't help.
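A rough back-of-envelope estimate shows why the dense representation runs out of memory so quickly. This is just a sketch, assuming each attribute value is held as an 8-byte double in memory (an assumption about the internal representation, not a documented figure):

```python
# Estimate the in-memory size of a dense example set,
# assuming 8 bytes per cell (a guessed per-value cost).
def dense_size_gb(rows, cols, bytes_per_cell=8):
    return rows * cols * bytes_per_cell / 1024**3

print(round(dense_size_gb(7200, 5000), 2))       # current subset per group: ~0.27 GB
print(round(dense_size_gb(7200, 1_200_000), 1))  # theoretical maximum: ~64.4 GB
```

So even one group of 7,200 samples at the full attribute count would far exceed 7 GB if stored densely, regardless of the input file format.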
My question is now: is there a better format than CSV for large datasets that can still be processed directly at decent speed, so that I can perhaps drop this import step? What would you recommend?
Thanks,
Harald
Answers
Hi Harri,
if your data is sparse (many zero values and significantly fewer non-zero attribute values), you may want to try the sparse file and data formats. They store only the non-zero values and are hence the preferred representation for sparse datasets such as large text collections.
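The idea behind sparse storage can be sketched in a few lines. This is a generic illustration of the index:value row encoding, not RapidMiner's exact on-disk syntax: only non-zero cells are written out, so rows that are mostly zeros shrink dramatically.

```python
# Encode one dense row in a sparse "index:value" style
# (generic illustration of the sparse-storage idea).
def to_sparse(row):
    return " ".join(f"{i}:{v}" for i, v in enumerate(row) if v != 0)

print(to_sparse([0, 0, 7, 0, 3]))  # -> "2:7 4:3"
```

With 1,200,000 attributes of which only a few thousand are non-zero per sample, this kind of encoding reduces both file size and memory use by orders of magnitude.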
Best regards,
Ralf
Hi,
I managed it with the Read AML operator and sparse storage. Thanks!
Greetings, Harald
There seems to be a big improvement in version 5 compared to version 4 when reading data. Version 5 is much faster, so download version 5.0.