I am using the k-Nearest-Neighbour operator to get a model for my example set. However, from the operator description alone I am not totally clear about how the algorithm is implemented. I checked the source code of the operator as well but it's difficult to understand.
1st question: My example set mixes numerical and nominal data. With numerical data, the meaning of the term "nearest neighbour" is clear enough. It's different with nominal data, however: for instance, there is an attribute with, let's say, 3 possible nominal values, or possibly a missing value: costs = low/medium/high/? (Btw: Is this called a 'polynominal attribute'?)
How does RapidMiner's KNN operator treat this when learning the model? Does it:
- skip such data? (Just ignoring it. The algorithm does not use this attribute for training.)
- use some kind of "binary matching decision" like: "If the current attribute value x_i is exactly the same as the target value x_j, then the neighbour is said to be 'near', whereas if they are different, the neighbour is said to be 'far'."?
- use any other algorithm?
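To make the "binary matching" option concrete, here is a minimal Python sketch of what I have in mind. It is my own illustration, not RapidMiner's code: the function name mixed_distance and the `nominal` index set are hypothetical, and I am only assuming a Euclidean-style combination of the per-attribute differences.

```python
import math

def mixed_distance(a, b, nominal):
    """Euclidean-style distance over mixed attributes (illustrative only).

    a, b:    sequences of attribute values of equal length
    nominal: set of indices that hold nominal attributes
    Nominal attributes contribute 0 if the values match and 1 otherwise;
    numerical attributes contribute their squared difference.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in nominal:
            total += 0.0 if x == y else 1.0   # binary matching decision
        else:
            total += (x - y) ** 2
    return math.sqrt(total)

# Attribute 0 is numerical, attribute 1 is the nominal "costs" attribute.
d_same = mixed_distance([1.0, "low"], [1.0, "low"], {1})
d_diff = mixed_distance([1.0, "low"], [1.0, "high"], {1})
```

With this reading, two examples that differ only in "costs" are always exactly 1 apart, no matter whether the values are low/medium or low/high.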
2nd question: Furthermore, how are missing values treated (numerical and/or nominal)?
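The two behaviours I could imagine for a missing value are sketched below in Python. Again, this is only my guess at the possibilities, not what the operator actually does: the function attribute_difference and its `missing` switch are hypothetical, None stands for "?", and I assume numerical attributes are normalized to [0, 1] so that 1.0 is the largest possible difference.

```python
def attribute_difference(x, y, is_nominal, missing="max"):
    """Squared difference for a single attribute (illustrative only).

    None marks a missing value.
    missing="skip": the attribute is ignored (contributes 0).
    missing="max":  the attribute counts as maximally different (contributes 1).
    """
    if x is None or y is None:
        return 0.0 if missing == "skip" else 1.0
    if is_nominal:
        return 0.0 if x == y else 1.0
    return (x - y) ** 2

penalised = attribute_difference(None, "low", is_nominal=True)
ignored = attribute_difference(None, "low", is_nominal=True, missing="skip")
```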
3rd question: How exactly is the weight implemented that can be applied? In the context of KNN, as far as I know, the distance between x_i and x_j is multiplied by a weight a / b, where
- a is the correlation coefficient and
- b is the standard deviation.
Is this what is meant by the "weighted_vote" parameter?
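The other reading I can think of is that the weight is not applied inside the distance at all, but to each neighbour's vote, so that closer neighbours count more. A minimal Python sketch of that interpretation follows. It is an assumption on my part, not the operator's confirmed behaviour: the function vote, the inverse-distance formula, and the small constant eps (to avoid division by zero) are all hypothetical.

```python
from collections import defaultdict

def vote(neighbours, by_distance=False, eps=1e-9):
    """Predict a label from the k nearest neighbours (illustrative only).

    neighbours:  list of (distance, label) pairs
    by_distance: if False, every neighbour counts 1;
                 if True, every neighbour counts 1 / (distance + eps)
    """
    scores = defaultdict(float)
    for dist, label in neighbours:
        scores[label] += 1.0 / (dist + eps) if by_distance else 1.0
    return max(scores, key=scores.get)

neighbours = [(0.1, "high"), (2.0, "low"), (2.5, "low")]
plain = vote(neighbours)                       # majority vote
weighted = vote(neighbours, by_distance=True)  # inverse-distance vote
```

In this example the plain majority vote picks "low" (two votes against one), while the distance-weighted vote picks "high", because the single "high" neighbour is much closer than the two "low" ones.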
Thanks for the clarification.