Transform data into table with every attribute representation
moritz_moeller
Hey there,
since my data set is too big to analyze with a clustering algorithm (and I don't want to wait as long as that would take), I want to reduce it to a smaller set.
My question is whether it is possible to transform it into a data set in which every attribute's values appear in representative proportions. For example: I have a data set with 3 columns, each with 5 possible values (i.e. 1-5), and 10 million rows. I want a data set that contains all 3 columns and all of their values, but with only 100k rows, so that I can analyze it. Is there an option to do that automatically in RM? If not, I think I will have to do it manually somehow.
Thanks and Greetings,
Moritz
Accepted answers
SGolbert
Hi Moritz,
I haven't found dimension reduction techniques for polynominal variables in RM. Maybe it is possible to use feature selection.
Regarding the rows, these are the examples you use for training and testing, and it is up to you how many of them to use. There is no need to use all the rows, at least as long as you are not deploying the final model. Of course it also depends on the kind of data: if it is a time series, the approach should be different.
Regards,
Sebastian
All comments
Telcontar120
You can use the Sample or Sample (Stratified) operator to get a smaller set for your initial testing. Stratified sampling preserves the relative distribution of a label. In your case you don't necessarily have a single label, but you could designate any one of your 3 columns as the label (use Set Role) and then, after sampling, check the distribution of all 3 columns against the original complete dataset. As long as the sample is large enough and none of your values is extremely rare, you should get a representative mix of all your possible values.
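For anyone who wants to see the idea outside RM, here is a minimal Python sketch of the same approach: stratify on one column, then compare the value shares of every column between the full set and the sample. It is not how the RM operators are implemented. The data is synthetic, and the names `stratified_sample` and `distribution` are made up for this illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, label_index, fraction, seed=0):
    """Draw about `fraction` of the rows, keeping the label column's proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Group the rows by the value in the column chosen as the label.
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_index]].append(row)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        # Take the same fraction from every group, but at least one row.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def distribution(rows, column_index):
    """Return the share of each value in one column."""
    counts = Counter(row[column_index] for row in rows)
    total = len(rows)
    return {value: counts[value] / total for value in sorted(counts)}


# Synthetic stand-in for the real data: 3 columns with values 1-5.
data_rng = random.Random(42)
full = [tuple(data_rng.randint(1, 5) for _ in range(3)) for _ in range(100_000)]

# Use column 0 as the label and keep roughly 1% of the rows.
sample = stratified_sample(full, label_index=0, fraction=0.01)

# Compare the value shares of all 3 columns, not only the label.
for col in range(3):
    full_dist = distribution(full, col)
    sample_dist = distribution(sample, col)
    max_gap = max(abs(full_dist[v] - sample_dist.get(v, 0.0)) for v in full_dist)
    print(f"column {col}: largest share difference = {max_gap:.4f}")
```

Only the label column is matched by construction. The other two columns are close to the original only because the sample is large, so the check at the end is the part that tells you whether the sample is good enough.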