"Correlation, weird behavior"
wessel
New Altair Community Member
[begin edit] Dear All, [end edit]
The following data has correlation: 0.999
#  sum    prediction(sum)      a1     a2   a3
1  6.0    11.06672979160903    1.0    2.0  3.0
2  9.0    11.066728936515114   2.0    3.0  4.0
3  15.0   11.066735677516975   9.0    2.0  4.0
4  11.0   11.066728936098524   4.0    5.0  2.0
5  16.0   11.06672900369881    6.0    1.0  9.0
6  5.0    11.066728942691093   0.0    3.0  2.0
7  4.0    11.066728979026438   0.0    3.0  1.0
8  9.0    11.066728936099063   3.0    5.0  1.0
9  359.0  349.5374686083969    344.0  8.0  7.0
Is this how correlation is supposed to work?
I never knew that correlation was so strongly affected by outliers.
Best regards,
Wessel
Answers
Hello Wessel,
what about saying hello before bursting out some statement?
Regarding your question: Yes, it is. Correlation is built on the covariance, which is the average of the products of each value's deviation from its attribute's mean. A single extreme point therefore produces a very large product and can dominate the result.
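To make the point about deviation products concrete, here is a minimal Python sketch using the "sum" and "prediction(sum)" columns from the table in the first post. The helper names `pearson` and `cross_product_shares` are made up for this illustration and are not part of RapidMiner or WEKA; the sketch only assumes the standard Pearson formula.

```python
import math

# Columns "sum" and "prediction(sum)" copied from the table in the first post.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]


def pearson(xs, ys):
    """Pearson correlation: summed deviation products over both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def cross_product_shares(xs, ys):
    """Fraction of the summed deviation products contributed by each row."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    total = sum(products)
    return [p / total for p in products]


r_all = pearson(actual, predicted)
# Hypothetical experiment: drop row 9, the only row with a large value.
r_without_row9 = pearson(actual[:-1], predicted[:-1])
shares = cross_product_shares(actual, predicted)

print(f"r, all 9 rows:    {r_all:.4f}")
print(f"r, without row 9: {r_without_row9:.4f}")
print(f"share of row 9:   {shares[-1]:.1%}")
```

On this data, row 9 alone accounts for most of the summed deviation products, and leaving it out lowers the coefficient considerably, since the remaining predictions are almost constant.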
Or do you suggest that we have an error in the calculation routine? Then please specify the process you used and give some comparable results from other software.
Greetings,
Sebastian
No, I'm not suggesting an error in calculation.
Just to be sure, I ran the same experiment in both WEKA and RapidMiner.
Both give the same results.
So no, the calculation is fine.
(Chances of RapidMiner being wrong are small :P,
chances of both WEKA and RapidMiner being wrong are really small.)
It seems undesirable that a performance measure depends so heavily on trivial things, such as outliers in the data.
PerformanceVector:
correlation: 0.999
absolute_error: 22.853 +/- 5.105
root_mean_squared_error: 23.416 +/- 0.000
normalized_absolute_error: 0.331
root_relative_squared_error: 0.213
=== Summary ===
Correlation coefficient 0.9994
Mean absolute error 22.853
Root mean squared error 23.4161
Relative absolute error 33.0906 %
Root relative squared error 21.2978 %
Total Number of Instances 9
So when using correlation as a performance measure, it is very important to keep this behavior in mind.
I'm thinking about a modified correlation measure that is more robust with respect to outliers.
Simply rescaling won't do the job, because correlation is invariant under rescaling.
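A small sketch of both points, in Python with made-up helper names. First, linearly rescaling one variable leaves the Pearson coefficient unchanged. Second, one existing rank-based alternative, Spearman's rank correlation, is far less dominated by the single large row. Spearman is swapped in here purely as an example of a more outlier-robust measure and is not necessarily the modification meant above. The data are the "sum" and "prediction(sum)" columns from the first post.

```python
import math

# Columns "sum" and "prediction(sum)" copied from the table in the first post.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]


def pearson(xs, ys):
    """Standard Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def average_ranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


r_original = pearson(actual, predicted)
# Arbitrary positive linear rescaling of one variable (values chosen for illustration).
r_rescaled = pearson([x / 1000.0 + 5.0 for x in actual], predicted)
rho = spearman(actual, predicted)

print(f"Pearson, original: {r_original:.4f}")
print(f"Pearson, rescaled: {r_rescaled:.4f}")
print(f"Spearman:          {rho:.4f}")
```

The rank-based value comes out much lower here because, among the eight ordinary rows, the predictions barely track the actual sums. Whether that is the behavior one wants from a performance measure is a separate question.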
Hi Wessel,
do you know any literature about that? It seems very likely to me that someone else has already stumbled over this issue.
And you are right: one has to keep that in mind. But if you think about a plot of your values, anyone would assume that there is a linear dependency.
Greetings,
Sebastian