Strange Results With Local Outlier Factor
Mickey
New Altair Community Member
I am getting strange results with the LOF operator: most of the "outlier" values are around 0.15 instead of around 1.0.
However, for most points the LOF should be around 1.0, for the following reasons:
1) The LOF paper proves that LOF is around 1.0 for most points inside clusters.
2) It makes intuitive sense: from the way LOF works, you'd expect a value around 1 for most points in clusters anyway!
3) My own implementation of a simpler variant of LOF (just the average of k-dist) does give values around 1 for most points; a rough sketch of what I mean follows below.
I tried this on both my own data and data generated with RapidMiner, but the LOF from RapidMiner is around 0.15 in both cases.
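To make point 3 concrete, here is a minimal Java sketch of the kind of simplified score I mean. This is only an illustration of one plausible "average of k-dist" variant (the class and helper names are mine), not RapidMiner's code:

import java.util.Arrays;

// Simplified LOF-style score: each point's average distance to its k
// nearest neighbors, divided by the mean of that same quantity over those
// neighbors. Inside a uniform cluster both averages are similar, so the
// score hovers around 1.0, while isolated points score well above 1.0.
public class SimpleLofSketch {

    static double[] scores(double[][] data, int k) {
        int n = data.length;
        int[][] knn = new int[n][k];      // indices of the k nearest neighbors
        double[] avgDist = new double[n]; // mean distance to those neighbors

        for (int i = 0; i < n; i++) {
            final double[] p = data[i];
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            // Brute-force neighbor search: sort all points by distance to p.
            Arrays.sort(order, (a, b) -> Double.compare(dist(p, data[a]), dist(p, data[b])));
            double sum = 0;
            for (int j = 0; j < k; j++) {
                knn[i][j] = order[j + 1]; // order[0] is the point itself
                sum += dist(p, data[knn[i][j]]);
            }
            avgDist[i] = sum / k;
        }

        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            double neighborAvg = 0;
            for (int o : knn[i]) neighborAvg += avgDist[o];
            score[i] = avgDist[i] / (neighborAvg / k); // ~1.0 inside clusters
        }
        return score;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }
}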
Here is the process XML to recreate the synthetic test:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="476" width="681">
<operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="165">
<parameter key="target_function" value="gaussian mixture clusters"/>
<parameter key="number_examples" value="1000"/>
<parameter key="number_of_attributes" value="2"/>
</operator>
<operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data (2)" width="90" x="45" y="255">
<parameter key="number_examples" value="20"/>
<parameter key="number_of_attributes" value="2"/>
</operator>
<operator activated="true" class="discretize_by_bins" expanded="true" height="94" name="Discretize" width="90" x="179" y="255">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="attributes" value="label"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="255">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="447" y="165"/>
<operator activated="true" class="detect_outlier_lof" expanded="true" height="76" name="Detect Outlier (LOF)" width="90" x="514" y="30"/>
<connect from_op="Generate Data" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data (2)" from_port="output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Detect Outlier (LOF)" to_port="example set input"/>
<connect from_op="Detect Outlier (LOF)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
-
Hi,
unfortunately I'm not familiar with this algorithm, and a quick glance into the source code didn't make me any wiser. You seem to be at least halfway an expert on this; could you manage to take a look? If you find the problem, simply file a bug.
If you don't have the time, you could file a bug anyway, but I doubt we will manage to find the problem any time soon.
Greetings,
Sebastian

-
Unfortunately there's no way I could do a code review, for many reasons (in fact, for almost every reason you can imagine!). Sorry.
In any case I'm hardly an expert!
Far from it: I am a beginner, and I could be wrong in my expectation (i.e., the bug could just as well be mine and not RapidMiner's). I posted the question hoping that someone who is an expert could give an opinion.
-
Hi,
I see the problem. Hm. Do you have a link to the original paper? This operator was contributed by the community, so nobody here is familiar with the implementation. We will have to dive deeper into this matter, but that will take some time.
Greetings,
Sebastian

-
It is definitely a bug. I just computed the local outlier factors for a small file with both RapidMiner (version 4.6) and R (library dprep) and got very different results. The factors from R are in the vicinity of 1, as expected; those from RapidMiner come out very small.
RapidMiner's:
0.10705703 0.05623235 0.13564975 0.09714966 0.10411321 0.05615648 0.13563153 0.16206154 0.05677983 0.17250688 0.09351030 0.17039931 14.70213398 0.03649292 0.08855556 0.62346659 0.05777326 0.41748211 0.35321167 0.62346724 1.02022896 0.37671896 0.15250039 0.62346824 0.17060555 0.15409052 0.17671467 0.35942272 0.08493053 0.54318228 0.09604710 0.12895404 0.05779714 3.51261825 0.17676736 0.40118616 0.62368668 0.05617499 0.09426575 0.40116545
R (dprep):
1.0593654 0.9767560 1.1121496 1.0199023 1.0593438 0.9767494 1.1121422 1.1121542 0.9556079 0.9757669 1.2527428 1.1488689 5.9182867 0.9885184 1.2827731 1.3887066 0.9582217 1.1607783 1.4223003 1.3887070 1.8413799 1.2956872 1.0276760 1.3887041 1.1488966 1.0235487 1.0242877 1.1174309 1.2828083 1.7161344 1.0182297 1.0177936 0.9581262 3.0147269 1.0243710 1.5825578 1.3887215 0.9767419 0.9739126 1.5825385
-
Hi,
thanks for the comparison values. I have added this as a bug report to the tracker.
Greetings,
Sebastian

-
Here is the original paper for LOF: "LOF: Identifying Density-Based Local Outliers". It can be found here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.8948&rep=rep1&type=pdf

Sebastian Land wrote:
Hi,
I see the problem. Hm. Do you have a link to the original paper? This operator was contributed by the community, so nobody here is familiar with the implementation. We will have to dive deeper into this matter, but that will take some time.
Greetings,
Sebastian
I hope that helps.
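In case it is useful for whoever digs into the implementation, here is a compact Java sketch of the paper's definitions (my own illustration written from the paper, not from RapidMiner's source; it assumes the k-neighborhood has exactly k members, i.e. no distance ties):

import java.util.Arrays;

// LOF as defined by Breunig et al.:
//   reach-dist_k(p,o) = max(k-distance(o), d(p,o))
//   lrd_k(p)          = 1 / (mean reach-dist from p to its k neighbors)
//   LOF_k(p)          = mean over neighbors o of lrd_k(o) / lrd_k(p)
// For a point as dense as its neighbors the ratio is ~1, which is why
// values around 1.0 are expected deep inside clusters.
public class LofPaperSketch {

    static double[] lof(double[][] data, int k) {
        int n = data.length;
        int[][] knn = new int[n][k];
        double[] kDist = new double[n]; // distance to the k-th nearest neighbor

        for (int i = 0; i < n; i++) {
            final double[] p = data[i];
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            Arrays.sort(order, (a, b) -> Double.compare(dist(p, data[a]), dist(p, data[b])));
            for (int j = 0; j < k; j++) knn[i][j] = order[j + 1]; // skip self
            kDist[i] = dist(p, data[knn[i][k - 1]]);
        }

        double[] lrd = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (int o : knn[i]) sum += Math.max(kDist[o], dist(data[i], data[o]));
            lrd[i] = k / sum; // local reachability density
        }

        double[] lof = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (int o : knn[i]) sum += lrd[o];
            lof[i] = sum / (k * lrd[i]); // mean density ratio against neighbors
        }
        return lof;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }
}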
-
I hope so.
Thank you very much.

-
Hey guys,
Any update on this operator? If not, is there any way to extract the outlier measure from KNN outlier detection, i.e., can we somehow get a ranked list of outliers instead of just the top n?
Thanks,
-Gagi

-
Hi Gagi,
sorry about that, but no progress at all.
It would take me at least a day to get deep enough into the matter to fix it, and compared to that effort there are many other things that must be done first. I hope you understand that we have to give priority to the issues of our enterprise customers. In the end it comes down to this: the community version of this open source product offers you the possibility to fix it yourself and send in a patch, or otherwise to become an enterprise customer.
Sorry that I have to repeat this so often, but the more enterprise customers we have, the more developers are available to fix bugs and fulfill feature requests...
Anyway, you can use the COF- or LOF-based outlier detection to get an 'outlierness' attribute. You can of course sort the examples by this attribute and extract only the first n.
Greetings,
Sebastian

-
Thanks for the reply. I will have to look into programming for RM. I know you guys are doing your best; I am just concerned about operators that may output unreliable results. Is there any standard for testing these operators? I know we in the community are sort of beta-testers, but, for example, most people don't question the PCA or KNN results because those methods are so well known, whereas some of the more obscure methods may be difficult to trust.
-Gagi

-
Hi,
I usually trust whatever yields good results for my task. I think it will be difficult to establish some sort of gold standard for different implementations of the same algorithm. Just think of the three different SVMs in RM: they all deliver different results on the same task, although they follow the same algorithm. The problem is that there are so many numerical issues, each addressed differently by the various implementations... But of course you can compare implementations on a large set of data sets with different properties to get an impression of which works better.
Greetings,
Sebastian

-
I have found the bug. It is located in the file com.rapidminer.operator.preprocessing.outlier.SearchSpace.java, line 725.
You just have to change this:
for (int j = 1; j <= k; j++) {
to this:
for (int j = 1; j <= kMax; j++) {
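To illustrate why a wrong loop bound produces exactly this kind of symptom, here is a tiny runnable demonstration. The names and the normalization below are hypothetical stand-ins, not the actual SearchSpace.java code; it only shows that truncating a sum at k instead of kMax scales the normalized result down by roughly k/kMax:

// Hypothetical illustration only -- not RapidMiner's actual code.
// If a sum that should run over the whole range [1, kMax] stops at k,
// any value normalized by kMax comes out too small by a factor k/kMax.
public class LoopBoundDemo {
    public static void main(String[] args) {
        int k = 3, kMax = 20;
        double[] contribution = new double[kMax + 1];
        java.util.Arrays.fill(contribution, 1.0); // each term worth ~1

        double buggy = 0, fixed = 0;
        for (int j = 1; j <= k; j++)    buggy += contribution[j]; // the reported bug
        for (int j = 1; j <= kMax; j++) fixed += contribution[j]; // the fix

        System.out.println("buggy: " + buggy / kMax); // 0.15 -- far below 1
        System.out.println("fixed: " + fixed / kMax); // 1.0
    }
}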
-
Hi Pengie,
I have changed the source code, trusting your judgment. I hope that you are right about that!
Greetings,
Sebastian