Why does RapidMiner delete datarows when automatic feature selection is applied?
SanderMEs
New Altair Community Member
Maybe a very stupid question, but my input consists of 15577 data rows and my output only consists of 4500 data rows when I apply auto feature selection in data preparation.
In addition to that, can I reliably compare the confusion matrices of the baseline model (with 15577 rows) and the RapidMiner model (with +/- 4500 rows) when sizes differ but data is the same?
Best Answer
Hi @SanderMEs,
No, it's not a stupid question.
AutoModel splits your dataset into 2 parts:
- 60% of the data is used to train the model
- 40% of the data is used to test the model (it is a hold out set).
Then AutoModel removes 2/7 of the data from your test set.
Your output data are the predictions and the associated confusion matrix, and both are based on this final test set. That's why your output files should contain roughly 4500 rows (15577 x 40% x 5/7 ≈ 4450 rows).
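The arithmetic can be checked with a short Python sketch. It only reproduces the 40% hold-out and the 5/7 factor described above. The function name is made up for illustration, and the exact count AutoModel reports may differ by a few rows depending on how it rounds and samples.

```python
def expected_output_rows(total_rows, test_fraction=0.4, kept_fraction=5 / 7):
    """Estimate how many rows end up in the scored output.

    Assumes a 40% hold-out set of which 5/7 is kept, as described
    in this answer; the rounding used here is an assumption.
    """
    test_rows = total_rows * test_fraction   # size of the hold-out set
    return round(test_rows * kept_fraction)  # rows left after dropping 2/7


total = 15577
print(expected_output_rows(total))  # about 4450, close to the ~4500 rows observed
```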
Regards,
Lionel1