Normalizing nominal data and optimizing the parameters of outlier analysis

yanjun_li New Altair Community Member
edited November 5 in Community Q&A

Hello everyone,

I have two questions about data normalization and outlier analysis. Could someone kindly help me?
1. My data all come from a production company and are all nominal. How can I optimize the parameters of the outlier analysis, such as the number of neighbors and the number of outliers? I have read an article whose author says that this can't be done, because unsupervised anomaly detection usually has no test set.

2. Can I normalize nominal data? Many people on the internet say that normalizing nominal data makes no sense. I found an answer online, but I couldn't implement it in RapidMiner.

Here is the answer I found online about normalizing nominal data:

To compare two nominal variables that may be measured using different scales, you would like to "normalize" the values so you can see how well they correspond to each other. There is no simple normalization technique to do this, but it can be done.
One approach: Construct a contingency table to cross-classify observations of one variable against the other. Then, if all observations falling into a given category of the row variable fall into just one category of the column variable, you can establish a 1-1 mapping of one measurement to the other. For example, a "2" in variable A corresponds to a "92" in variable B.
More sophisticated approach: Use a technique called correspondence analysis (particularly a variant called optimal scaling) to work out a set of scores that maximize correspondence.
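
The contingency-table approach in the quoted answer can be sketched outside RapidMiner as follows. This is only an illustration in Python with pandas, not something from the quoted answer: the function name `one_to_one_mapping` and the sample values are made up, and the extra check that no two row categories share a column category is my own assumption about what "1-1" requires.

```python
import pandas as pd

def one_to_one_mapping(a, b):
    """Return {category of a: category of b} if the contingency table
    supports a 1-1 mapping, otherwise None. (Hypothetical helper.)"""
    table = pd.crosstab(pd.Series(a, name="A"), pd.Series(b, name="B"))
    # Each row category must fall into exactly one column category.
    nonzero_per_row = (table > 0).sum(axis=1)
    if (nonzero_per_row != 1).any():
        return None
    mapping = table.idxmax(axis=1).to_dict()
    # Assumption: for a true 1-1 mapping, no two row categories
    # may map to the same column category.
    if len(set(mapping.values())) != len(mapping):
        return None
    return mapping

# Made-up example: variable A uses codes 1..3, variable B uses 91..93.
a = ["1", "2", "2", "3", "1"]
b = ["91", "92", "92", "93", "91"]
mapping = one_to_one_mapping(a, b)
```

If the function returns None, the two variables do not correspond cleanly, and the correspondence-analysis route mentioned in the quoted answer would be the next thing to look at.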

Those are my two questions. Thanks in advance for your help!

lutherli

Answers

  • yanjun_li
    yanjun_li New Altair Community Member

    Hello everyone,

    Here are the answers from Dr. Ingo Mierswa. First, I would like to thank him for his kind help :)

    Re 1) Indeed, you can't really optimize outlier detection in a strict sense, since it is an unsupervised problem. You can turn it into a somewhat supervised problem if you happen to know for sure that some data points are outliers and some are not.
    Then you can tune the parameters so that the known outliers are detected but none of the clear non-outliers are. This is often more of a manual tuning approach, though. Another idea is to choose the settings so that a certain percentage of the data is marked as outliers. This simple approach often works well if you already know roughly what percentage of the data are outliers. And last but not least, you can remove outliers based on the performance of a subsequent supervised machine learning method, i.e., you remove outliers so that the accuracy of a classifier improves. This can even be automated in RapidMiner.
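
As a rough sketch of the first two ideas in Re 1 (not part of Ingo's reply, and not RapidMiner's own implementation): the Python/NumPy code below uses the distance to the k-th nearest neighbor as a stand-in outlier score, marks a fixed fraction of points as outliers, and picks the k that best separates a few points whose status is assumed to be known. All function names, the scoring rule, and the toy data are assumptions made for illustration. Nominal attributes would first need a numeric encoding.

```python
import numpy as np

def knn_outlier_scores(X, k):
    """Distance to the k-th nearest neighbor (larger = more outlying).
    Requires k < number of rows."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting, column 0 is the point itself (distance 0).
    return np.sort(dist, axis=1)[:, k]

def flag_top_fraction(scores, fraction):
    """Mark the given fraction of points with the highest scores."""
    n_outliers = max(1, int(round(fraction * len(scores))))
    flags = np.zeros(len(scores), dtype=bool)
    flags[np.argsort(scores)[-n_outliers:]] = True
    return flags

def tune_k(X, known_outliers, known_inliers, fraction, candidate_ks):
    """Pick the k that flags the most known outliers and the fewest
    known inliers. Ties keep the first candidate."""
    best_k, best_value = None, None
    for k in candidate_ks:
        flags = flag_top_fraction(knn_outlier_scores(X, k), fraction)
        value = int(flags[known_outliers].sum()) - int(flags[known_inliers].sum())
        if best_value is None or value > best_value:
            best_k, best_value = k, value
    return best_k

# Toy data: five clustered points and one point far away (index 5).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10]]
flags = flag_top_fraction(knn_outlier_scores(X, k=2), fraction=0.2)
best_k = tune_k(X, known_outliers=[5], known_inliers=[0, 4],
                fraction=0.2, candidate_ks=[1, 2, 3])
```

The third idea, removing outliers to improve a downstream classifier, would replace the score in `tune_k` with a cross-validated accuracy. That is left out here to keep the sketch short.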

    Re 2) Something similar can be done with a combination of two operators: first "Nominal to Numerical" with Dummy Encoding, then "Normalize" set to "z-Transformation".

    Best,
    Ingo
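
For readers who want to see what the two-operator combination from Re 2 does to the data, here is a minimal pandas sketch of the same idea: dummy-encode the nominal columns, then apply a z-transformation to each resulting column. It is not RapidMiner code. The column names and values are invented, and using the population standard deviation (`ddof=0`) is an assumption; RapidMiner's operator may compute it differently.

```python
import pandas as pd

def dummy_encode_and_standardize(df):
    """Dummy-encode all nominal columns, then z-transform each column.
    (Hypothetical helper for illustration.)"""
    dummies = pd.get_dummies(df, dtype=float)
    std = dummies.std(ddof=0)  # assumption: population standard deviation
    # Constant columns have std 0; divide by 1 instead to avoid NaN.
    return (dummies - dummies.mean()) / std.replace(0, 1)

# Invented production data with two nominal attributes.
df = pd.DataFrame({
    "machine": ["A", "B", "A", "C"],
    "shift": ["day", "night", "day", "day"],
})
z = dummy_encode_and_standardize(df)
```

After this step every dummy column has mean 0, so rare categories receive larger absolute values than common ones. That is what makes them stand out in a distance-based outlier analysis.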