Working with large datasets
Dear all,
I am working with a really large dataset (>2.6 million examples, ~25 attributes, 1 polynominal ID).
After renaming some attributes and generating a basic mathematical calculation from another attribute, I wanted to apply a model to predict on this large set. Unfortunately, the process always crashes because the memory limit is exceeded. This happens even when I split the data into subsets of 1 million examples.
So my questions:
- Is there a smarter way to store the data (e.g. as a short array or some other option)?
- Would it be better to convert the ID into integer values? (See the rough size estimate after this list.)
- Interestingly, the workflow also crashes when I use Materialize Data and/or Free Memory.
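
For scale, here is a rough back-of-the-envelope estimate of the raw data size. It is only a sketch: it assumes the numeric attributes are stored as 8-byte doubles and the polynominal ID as a string costing roughly 60 bytes per value (both assumptions, not measurements), and it ignores any intermediate copies the process may create.

# Rough memory estimate for the dataset described above.
# Assumed sizes: 8-byte doubles for numeric attributes, ~60 bytes per
# string ID value (characters plus object overhead), 8 bytes if the ID
# were mapped to an integer index. Actual memory use will differ.

ROWS = 2_600_000
NUMERIC_ATTRS = 25
BYTES_PER_DOUBLE = 8
BYTES_PER_STRING_ID = 60   # assumption, not measured
BYTES_PER_INT_ID = 8       # assumption, not measured

numeric_bytes = ROWS * NUMERIC_ATTRS * BYTES_PER_DOUBLE
id_bytes_string = ROWS * BYTES_PER_STRING_ID
id_bytes_int = ROWS * BYTES_PER_INT_ID

def mb(n_bytes):
    """Convert a byte count to megabytes for readable output."""
    return n_bytes / (1024 ** 2)

print(f"numeric attributes: ~{mb(numeric_bytes):.0f} MB")
print(f"string ID column:   ~{mb(id_bytes_string):.0f} MB")
print(f"integer ID column:  ~{mb(id_bytes_int):.0f} MB")

Under these assumptions the numeric part alone is on the order of 500 MB, and switching the ID from strings to integers would save roughly 100 MB, so the raw data plus a few intermediate copies can plausibly exceed a default memory limit.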
Could you give me some tips for working with larger datasets?
Cheers,
Markus