hi,
I was trying out k-NN on my dataset (4500 example rows, 25 numeric attributes). The odd thing is that although one is supposed to normalize the attributes, performance drops sharply whenever I apply any of the normalization methods I have tried.
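To make it concrete, here is a rough scikit-learn sketch of the comparison I mean (not my actual process; the stand-in data, the neighbor count and the scaler choices are only assumptions):

    # Sketch only: placeholder data and parameters, not my real setup.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Stand-in for my data: 4500 examples, 25 numeric attributes.
    X, y = make_classification(n_samples=4500, n_features=25,
                               n_informative=6, random_state=0)

    candidates = {
        "no normalization": KNeighborsClassifier(n_neighbors=5),
        "z-score":  make_pipeline(StandardScaler(),
                                  KNeighborsClassifier(n_neighbors=5)),
        "min-max":  make_pipeline(MinMaxScaler(),
                                  KNeighborsClassifier(n_neighbors=5)),
    }
    for name, model in candidates.items():
        # 10-fold cross-validated accuracy for each variant
        print(name, cross_val_score(model, X, y, cv=10).mean())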
Second, when I use feature selection and keep only the 6 best attributes, I get higher performance; so far, so good.
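Sketched the same way, the feature-selection step would look roughly like this (the univariate filter here is just an assumption about how the 6 attributes get picked):

    # Sketch only: X, y and the selection method are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=4500, n_features=25,
                               n_informative=6, random_state=0)

    # Keep only the 6 highest-scoring attributes before k-NN.
    knn_top6 = make_pipeline(StandardScaler(),
                             SelectKBest(f_classif, k=6),
                             KNeighborsClassifier(n_neighbors=5))
    print("top-6 attributes:", cross_val_score(knn_top6, X, y, cv=10).mean())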
But what puzzles me now is that when I put a sampling (Bootstrapping) operator before k-NN, I get much higher performance, about 90% as opposed to 80% before... Is this because the bootstrapping puts more weight on some instances and leaves out others, so that the classification accuracy ends up much higher?
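This is roughly the order of operations I am asking about, again only as a sketch (sample size, seed and neighbor count are placeholders; my real process does this with an operator, not code):

    # Sketch only: bootstrap-resample the whole example set (with replacement)
    # BEFORE the learner is trained and evaluated.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import resample

    X, y = make_classification(n_samples=4500, n_features=25,
                               n_informative=6, random_state=0)

    # Sampling with replacement: some rows appear several times, others not at all.
    X_boot, y_boot = resample(X, y, replace=True, n_samples=len(y), random_state=42)

    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    print("without bootstrapping:", cross_val_score(knn, X, y, cv=10).mean())
    print("with bootstrapping:   ", cross_val_score(knn, X_boot, y_boot, cv=10).mean())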