Anomaly detection in RapidMiner with a single Yes/No label column

Question

Hello All,

I have a dataset with 1000 rows in which one column contains Yes/No to mark each row as an anomaly. I want to use this dataset to train a model. Which supervised model should I use, and how can I design a process that takes two inputs: a training set with the label and a second set without the label?

Any sample process will be very helpful.

Thanks,
Indhumathi
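
The two-input design asked about here is the usual supervised pattern: fit a model on the labeled set, then apply it to the unlabeled set. Below is a minimal sketch of that shape outside RapidMiner, in plain Python. The k-nearest-neighbour classifier is only a stand-in for whichever learner is eventually chosen, and the function name `knn_predict` and the toy rows are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_rows, train_labels, new_rows, k=3):
    """Label each unlabeled row by majority vote of its k nearest labeled rows."""
    predictions = []
    for row in new_rows:
        # Rank the labeled rows by Euclidean distance to the unlabeled row.
        nearest = sorted(range(len(train_rows)),
                         key=lambda i: dist(row, train_rows[i]))[:k]
        votes = Counter(train_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Input 1: training set with the Yes/No anomaly label (hypothetical values).
train_rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
              (8.0, 8.5), (8.2, 7.9), (7.9, 8.1)]
train_labels = ["No", "No", "No", "Yes", "Yes", "Yes"]

# Input 2: new rows with the same attributes but no label.
unlabeled_rows = [(1.1, 1.0), (8.1, 8.0)]

predicted = knn_predict(train_rows, train_labels, unlabeled_rows)
print(predicted)
```

The important constraint is that both inputs carry the same attribute columns; only the training set carries the label.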

Indhumathi · Answer

Hello,

Thank you for your suggestions.

Yes, I have analysed the step 2 output, i.e. the Decision Tree predictions/Simulator, and I can see the set of attributes affecting the score. When I fed the same LOF output into a Random Forest model, I saw a different set of attributes affecting the score. Neither the Decision Tree nor the Random Forest predictions are very close to the original LOF outlier score. So which method should I prefer?

1) How can I compare which method is predicting correctly?

2) I mean, if the anomaly is based on a particular set of attributes (A, B), then I need to provide a solution such as "attributes A and B should be configured properly in the system". If it is based on C and D, then the correct threshold should be set to avoid overbooking.
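
One way to approach question 1 is to score each model's predictions against the original LOF outlier score on rows that were held out from training, using an error measure and a correlation. The sketch below does this in plain Python; the numbers are hypothetical placeholders rather than results from this dataset, and the names `rmse` and `pearson` are illustrative helpers.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: lower means predictions sit closer to the target."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson(x, y):
    """Pearson correlation: closer to 1 means predictions rise and fall with the target.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical hold-out rows: the LOF score and each model's prediction of it.
lof_score = [1.0, 1.1, 0.9, 3.5, 1.2]
predictions = {
    "tree": [1.0, 1.0, 1.0, 2.0, 1.0],
    "forest": [1.05, 1.1, 0.95, 3.0, 1.15],
}

results = {
    name: {"rmse": rmse(lof_score, pred), "corr": pearson(lof_score, pred)}
    for name, pred in predictions.items()
}
best = min(results, key=lambda name: results[name]["rmse"])

for name, metrics in results.items():
    print(name, round(metrics["rmse"], 3), round(metrics["corr"], 3))
print("closest to the LOF score:", best)
```

Whichever model tracks the LOF score more closely on held-out rows is the more trustworthy one to read attribute effects from. If neither tracks it well, the attribute effects from both should be treated with caution.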

Telcontar120 · Answer

What do you mean by "provide a solution for pattern in the anomaly"?  If you are talking about describing the relationship between individual attributes and the outcome, take a look at the operators "Explain Predictions" and "Model Simulator".  These allow you to look at how changes in independent variables affect your predictions based on the selected model, even when it is very complex.

Indhumathi · Answer

lionelderkrikor, I used Auto Model with Random Forest to train the model and then used Apply Model to score the test set. It is now working fine. I created the Anomaly flag column manually, based on the values of two columns, as shown below:

A       B    Anomaly
1000    0    0 (No)
50      0    1 (Yes)
40      1    0 (No)
23      1    0 (No)
        0    0 (No)

Now I want to know whether any other columns affect the Anomaly flag. That is, instead of me telling the model that the flag is based on only two columns, the system should tell me that other columns such as C, D, and E also affect the Anomaly flag and are possible causes as well.
To achieve this, I have tried the two methods below:

1) Built an unsupervised LOF model. However, I don't know which columns it bases the outlier score on.
2) Fed the LOF output column "outlier score" as the label into a Decision Tree in Auto Model to check which attributes contribute to the score. In the Predictions tab I can see various shades of red (contradicts) and green (supports). But I am sure that the green-highlighted columns should not cause an anomaly. How can I change that?
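
On point 1: LOF (Local Outlier Factor) compares the local density of each row with that of its k nearest neighbours, and the distances behind it are computed over all numeric attributes together, so the score is not tied to one column by construction. The sketch below is a simplified plain-Python LOF (exactly k neighbours, ties broken by row order, no handling of many duplicate rows), followed by a leave-one-attribute-out check that recomputes the score with each column removed. That check is an illustrative heuristic, not something the RapidMiner operator does, and the data and names are hypothetical.

```python
from math import dist

def lof_scores(points, k=2):
    """Simplified Local Outlier Factor; values well above 1 suggest an outlier."""
    n = len(points)
    neighbours, k_distance = [], []
    for i in range(n):
        # Rank all other rows by Euclidean distance over every attribute.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        neighbours.append(ranked[:k])
        k_distance.append(dist(points[i], points[ranked[k - 1]]))

    def reach_dist(i, j):
        # Reachability distance of row i from row j.
        return max(k_distance[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]
    # LOF: mean neighbour density divided by the row's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

def drop_column(points, col):
    """Return the rows with one attribute removed."""
    return [tuple(v for c, v in enumerate(p) if c != col) for p in points]

# Hypothetical rows; the last one is unusual in column 0 only.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (10, 1.5)]
suspect = 5

full_score = lof_scores(points)[suspect]
score_drop = {}
for col in range(len(points[0])):
    reduced_score = lof_scores(drop_column(points, col))[suspect]
    # A large drop means the row stops looking unusual without this column.
    score_drop[col] = full_score - reduced_score

driver = max(score_drop, key=score_drop.get)
print("LOF of suspect row:", round(full_score, 2))
print("column whose removal lowers the score most:", driver)
```

Because the score is distance-based, attributes on very different scales (for example A in the hundreds and B as 0/1) would normally need normalising first, otherwise the large-valued column dominates the distance.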

Also, I want to provide a solution for pattern in the anomaly. How can I achieve that with models?

Thanks,
Indhumathi