Set Role messes with outlier detection, but setting role later ruins predictions
HaukeV
New Altair Community Member
Hi,
I've run into an interesting issue. If I assign the label role to my target before standardization and outlier detection, the detected outliers are wrong, although in general the model performs OK. However, I have a few very significant outliers in the dataset that I would like to detect and remove, and these are only found when every column is left as a regular attribute. If I then assign the role afterwards and run the process, my accuracy drops from 66% to 30%...
Nothing else changes in the model, same selected attributes, same type of model,...
Any help?
Answers
There are two main objectives of outlier detection: (1) to identify and eliminate outliers in unlabelled data, e.g. for the purpose of clustering; (2) to identify outliers in labelled data to improve the performance of a predictive model. Clearly you are doing the second, in which case I would not use the label value (the unknown) for outlier detection among the known values. In other words, your outlier detection should depend only on the predictors, not on the value of the label.
If you do use all attributes in the outlier elimination step before building the model, be very careful that knowledge of outliers in the validation data does not leak into model building. I suspect this is what happened in your case, and it would explain the discrepancy in your results. Also take care not to confuse outliers with normal variance: if you remove too many "outliers" you reduce the variance of the training data, and your validation performance may be very poor!
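To make the advice above concrete, here is a minimal sketch in plain Python rather than the operators from the original process. It assumes a simple z-score rule as the outlier criterion, which is a stand-in and not necessarily the detection method used in the thread; the function names and the threshold are hypothetical. The point it illustrates is that the statistics are computed from the training predictors only, the label never enters the detection, and the validation rows are scaled with the training statistics but never filtered.

```python
from statistics import mean, pstdev

def fit_stats(X_train):
    """Per-column mean and standard deviation, from training predictors only."""
    columns = list(zip(*X_train))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(X, means, stds):
    """Scale rows with previously fitted statistics (reused for validation data)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]

def remove_training_outliers(X_train, y_train, threshold=3.0):
    """Drop training rows where any predictor has |z| above the threshold.

    The label is carried along but never inspected, so the decision
    depends on the predictors alone.
    """
    means, stds = fit_stats(X_train)
    Z = standardize(X_train, means, stds)
    keep = [all(abs(z) <= threshold for z in row) for row in Z]
    X_clean = [x for x, k in zip(X_train, keep) if k]
    y_clean = [y for y, k in zip(y_train, keep) if k]
    return X_clean, y_clean

# Hypothetical toy data: two predictors, one extreme row in the first column.
X_train = [[1, 10], [2, 11], [3, 12], [4, 13], [5, 14],
           [6, 15], [7, 16], [8, 17], [9, 18], [100, 19]]
y_train = ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"]

# With only ten rows a |z| > 3 cut-off can never trigger, so a lower one is used here.
X_clean, y_clean = remove_training_outliers(X_train, y_train, threshold=2.0)

# Refit on the cleaned training rows, then apply the same statistics to validation rows.
means, stds = fit_stats(X_clean)
X_valid_scaled = standardize([[5, 14], [50, 12]], means, stds)
```

In a cross-validation setting, the same steps would be repeated inside each fold, so that the rows held out for validation never influence which training rows are treated as outliers.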