Hello, everyone,
I have two questions about data normalization and outlier analysis. Could someone kindly help me?
1. My data come from a production company and are all nominal. How can I optimize the parameters of the outlier analysis, such as the number of neighbors and the number of outliers? I have read an article in which the author says this can't be done, because we usually do not have a test set in unsupervised anomaly detection.
2. Can I normalize nominal data? Many people on the internet say that it makes no sense to normalize nominal data. I did find one answer online, but I couldn't implement it in RapidMiner.
Here is the answer I found online about normalizing nominal data:
To compare two nominal variables that may be measured using different scales, you would like to "normalize" the values so you can see how well they correspond to each other. There is no simple normalization technique to do this, but it can be done.
One approach: Construct a contingency table to cross-classify observations of one variable against the other. Then, if all observations falling into a given category of the row variable fall into just one category of the column variable, you can establish a 1-1 mapping of one measurement to the other, i.e., a "2" in variable A corresponds to a "92" in variable B.
A more sophisticated approach: use a technique called correspondence analysis (in particular a variant called optimal scaling) to work out a set of scores that maximizes the correspondence.
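To show how I understand the contingency-table part of this answer, here is a minimal sketch in Python/pandas (outside RapidMiner). The column names att_A and att_B and the example values are just placeholders from my side, not my real data:

```python
import pandas as pd

# Toy example: two nominal attributes recorded on different scales
# (att_A and att_B are placeholder names, not my real columns).
df = pd.DataFrame({
    "att_A": ["1", "2", "2", "3", "1", "3"],
    "att_B": ["90", "92", "92", "95", "90", "95"],
})

# Cross-classify one variable against the other.
ct = pd.crosstab(df["att_A"], df["att_B"])
print(ct)

# If every row of the contingency table has exactly one non-zero cell,
# each category of att_A maps to exactly one category of att_B,
# i.e. a 1-1 correspondence such as "2" in att_A <-> "92" in att_B.
mapping_exists = (ct.gt(0).sum(axis=1) == 1).all()
if mapping_exists:
    mapping = ct.idxmax(axis=1).to_dict()  # category of A -> category of B
    print("1-1 mapping:", mapping)
else:
    print("No clean 1-1 mapping; something like correspondence analysis would be needed.")
```

Is this the right idea, and if so, how would I reproduce it in RapidMiner?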
Those are my two questions. Thanks in advance for your help!
lutherli