Handling Outliers
Hi All,
I have a question about outlier handling in Rapid Miner. Consider the following values of an attribute:
2
3
3
2
4
17
5
3
28
4
5
5
2
5
26
1
3
3
4
2
I have shortened the actual list of attribute values simply for the purposes of this question. As you can see, most of the observations are between 1 and 5. With a mean of just over 6, and a standard deviation of about 7.5, three values stand out as outliers, with two of them being more than two standard deviations away from the mean. With a much larger number of observations, these outliers become even more pronounced.
I've analyzed this variable using the Outlier operators in RapidMiner, and I'm confident that values above 8 are outliers, so my question is not about how to detect the outliers. My question is about *handling* them. I can use a filter to remove the entire observation associated with each outlier from my data set, but these observations have other attributes with accurate data that I want to keep and use in my model.--what I really want to do is set the outliers in this attribute to missing, and then use a missing value handler to set them to a normed value -- something like the mean, median or mode. This way I can keep the good attributes, "fix" the bad ones, and model without losing any observations. (I'm aware of the risks/issues associated with replacing data using descriptive statistics, and I'm choosing to accept those in this particular instance.)
I've tried doing this six ways from Sunday, but I'm not coming up with good ideas. I can add a series of "Declare Missing Values" operators in rapid succession and use each one to set each distinct outlier value to missing, but this is a real pain, especially if I have ten million observations with 25 distinct outlier values. I end up with 25 DMV operators in a row, cluttering my model and making it difficult to tell if I've treated all the outliers or not. I tried using a DMV numeric_value_filter with an expression of >8, but this parameter must not work the way I think it's supposed to, because I always get an 'attribute name undefined' error when I try that.
So, to boil it down: I would like to be able to set a range or expression on a numeric attribute that would set all values to missing in that attribute if they fall outside the range, or meet the expression parameters, and I'd like to accomplish it with a single operator. Any suggestions or solutions would be most appreciated.
Matt