Removing Univariate Outliers (IQR)
dragoljub
Hi Everyone,
I would like to quickly and easily remove univariate outliers using the interquartile range (IQR). I have looked for an easy way to do this, but I seem to be stuck with the available RM outlier operators. I know RM computes the IQR for the box plots, but is there an operator that can simply do this and drop everything outside, say, 1.5*IQR?
Also, removing these outliers is essential to avoid trouble with z-transform normalization, since the standard deviation can be significantly skewed by a gross outlier. Things you start encountering with real data...
Maybe there is an easy way to do this with the R extension (coming soon)?
Any suggestions?
-Gagi
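For readers who want to see the rule being asked for, here is a minimal Python sketch (not a RapidMiner operator) of the usual 1.5*IQR fence filter. The function names `iqr_fences` and `drop_univariate_outliers`, the sample data, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile conventions give slightly different fences.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: "inclusive" quartiles; other conventions shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_univariate_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical attribute column with one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 55.0]
cleaned = drop_univariate_outliers(data)
print(cleaned)
```

The multiplier `k` is left as a parameter because 1.5 is only a convention; a larger value such as 3.0 flags only the more extreme points.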
land
Hi Gagi,
well, what about generating an attribute that defines whether a value is within 1.5 IQR? You can extract the mean and standard deviation with the Extract Macro operator and then use these values inside the Generate Attributes operator.
If you are going to make this more usable by implementing an operator, it would be very kind if you would contribute it.
Greetings,
Sebastian
dragoljub
Hi Sebastian,
The problem with the mean and standard deviation is that they are not robust. For example, if I have a 10 sigma outlier in one of my attribute columns, the mean of that column is severely skewed and the variance is inflated as well. This can be a significant problem when trying to z-transform data for processing.
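The robustness point can be illustrated with a small Python sketch that compares the summary statistics of a column before and after a single gross outlier is appended. The data and the helper name `summary` are made up for illustration, and the quartiles use the "inclusive" method as an assumption.

```python
import statistics

# Hypothetical, well-behaved column and the same column with one gross outlier.
clean = [9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3]
dirty = clean + [55.0]


def summary(values):
    """Return the non-robust (mean, stdev) and robust (median, IQR) statistics."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": median,
        "iqr": q3 - q1,
    }


before = summary(clean)
after = summary(dirty)
for key in before:
    print(f"{key:>6}: {before[key]:8.3f} -> {after[key]:8.3f}")
```

In this toy column the mean and standard deviation move a lot once the outlier is added, while the median and IQR barely change, which is the behavior described above.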
IQR is based on the median. I know you can extract the median for a column, but then you need the upper and lower quartiles. See below:
I know this can easily be done in R (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/IQR.html). So I might just wait until your R extension is out.
In any case, having the option to normalize data based on standard deviation and zero-mean centering is great; however, it is essential to also have median centering and normalizing by IQR / 1.349. See below:
For normally N(m,1) distributed X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.3490, i.e.,
for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349.
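As a rough sketch of the normalization described here (median centering, scaling by IQR / 1.349), the following Python snippet may help. It is not the RapidMiner implementation; the name `robust_z_transform`, the sample data, and the "inclusive" quartile method are assumptions for illustration.

```python
import statistics

# Expected IQR of a standard normal: 2 * qnorm(3/4), roughly 1.349.
NORMAL_IQR = 2 * statistics.NormalDist().inv_cdf(0.75)


def robust_z_transform(values):
    """Center on the median and scale by IQR / 1.349 instead of mean / stdev."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    scale = (q3 - q1) / NORMAL_IQR  # normal-consistent estimate of the stdev
    if scale == 0:
        raise ValueError("IQR is zero; cannot scale this column")
    return [(v - median) / scale for v in values]


# Hypothetical column with one gross outlier.
data = [9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 55.0]
scaled = robust_z_transform(data)
print([round(v, 2) for v in scaled])
```

Because the center and scale come from the quartiles, the bulk of the column ends up on a sensible scale and the outlier stands out as a very large value instead of compressing everything else.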
This would be a great addition to RM.
-Gagi ;D
land
Hi,
well, if you have a piece of code for this that would fit into the com.rapidminer.operator.preprocessing.normalization.Normalization operator, I would just include this option in the next release. Unfortunately we are currently too busy to add it ourselves, with too many things in progress at once...
Anyway, I find this a good idea, and if you don't send code, please send in a feature request that is as detailed as possible.
Greetings,
Sebastian
dragoljub
Hi Sebastian,
I will try to get an operator made once R is integrated. Once I get RM building from source I will take a look at modifying the code.
Thanks,
-Gagi
dragoljub
Just realized IQR made it into the normalization operator! ;D Thanks for integrating this, guys!
You should keep a checklist of the things that were added so we can truly appreciate the good work you do!
-Gagi
land
Hi,
actually we forgot to mention this, so much has been added...
And actually you have to thank Brendon, who contributed this!
Greetings,
Sebastian
dragoljub
Yeah, I asked Brendon to implement it since he had more experience building RM from source. Thanks again for taking the time to include it in the latest RM release.
-Gagi
Jeroen8
@land any update on this? I am interested in an operator to remove univariate outliers using the interquartile range (IQR) as well.
MartinLiebig
Hi @Jeroen8,
that's an old thread. The operator Detect Outliers (Univariate) in the Operator Toolbox extension allows you to do this.
Best,
Martin