What model to use?

aoneil · March 2016

Backstory: I just started using RapidMiner and I'm working with a system where a node will get pinged randomly throughout the day. From this I'm given a timestamp and I've also managed to split that timestamp up to give me month, day, hour, day of the week, and frequency per hour (not really sure if any of these features are actually significant). I'm trying to use RapidMiner to predict when a node goes 'missing'.

I want RapidMiner to take in all of this info and then spit out how confident it is that a node is missing/not missing based on how long it's been since the last ping vs. the frequency that the node has gotten in similar situations (ex. same day of week, same hour in previous days, etc). I'd be very thankful if anyone could point out some viable data models for me. If it changes anything, I also have a pretty large amount of data (been running my app for over 3 months).

juanm_encinas · March 2016

Without having seen your data I'd go for a classification algorithm. Maybe a decision tree if you just want to know what is the most likely outcome (missing/not missing) given a set of inputs (time variables, etc). Or a logistic regression if you want to know what is the probability that a node goes missing.
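To make the label-versus-probability distinction concrete, here is a minimal Python sketch using only the standard library. It is not RapidMiner's Logistic Regression operator. The single feature (hours since the last ping) and the toy data are assumptions made up for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(missing) = sigmoid(w * x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical toy data: hours since last ping, and whether the node was missing (1) or not (0).
gaps = [0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
missing = [0, 0, 0, 0, 1, 0, 1, 1]

w, b = fit_logistic(gaps, missing)

p_short = sigmoid(w * 0.5 + b)  # probability of "missing" after a short gap
p_long = sigmoid(w * 6.0 + b)   # probability of "missing" after a long gap
label_long = "missing" if p_long >= 0.5 else "not missing"
```

A decision tree would hand back only something like `label_long`; the logistic model also gives the probability behind it, which is closer to the "how confident" output asked for in the original question.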

aoneil · March 2016

Thanks! Would I be better off using some type of anomaly detection since the vast majority of my data points would be classified as 'not missing'? I've just had issues because the anomaly detection processes I tried seemed to be treating my date/classification variables (day/hour/month) as if they were numerical. So it'd give me unreasonably high anomaly scores for the date/times with small numbers (ex. 1 am January 1st) even though it's meant to be just a date.

juanm_encinas · March 2016

I think you can solve that by defining your time attributes as nominal rather than numerical, and then converting nominal to numerical only where the operator requires it, e.g. for logistic regression.
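A rough illustration of why the nominal treatment helps, sketched in plain Python rather than with RapidMiner operators. The helper names `one_hot` and `squared_distance` are made up for this example. One common way to turn a nominal attribute into numbers is one-hot (dummy) coding, which puts every pair of distinct hours at the same distance from each other.

```python
def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector with one slot per category."""
    return [1 if value == c else 0 for c in categories]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

HOURS = list(range(24))  # hour of day treated as 24 unordered categories

# Treated as plain numbers, 1 am looks far from 11 pm and close to 2 am.
raw_1_vs_23 = (1 - 23) ** 2
raw_1_vs_2 = (1 - 2) ** 2

# One-hot coded, every pair of different hours is equally far apart.
coded_1_vs_23 = squared_distance(one_hot(1, HOURS), one_hot(23, HOURS))
coded_1_vs_2 = squared_distance(one_hot(1, HOURS), one_hot(2, HOURS))
```

With the raw numbers, a distance-based method sees small values such as hour 1 or day 1 as sitting at the edge of the data, which matches the inflated anomaly scores described above.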

An advantage of decision trees is that you can work straight with nominal attributes. You have the same advantage with an anomaly detection operator such as k-NN Global Anomaly Score. You can go either way.

MartinLiebig · March 2016

Hi,

Internally, dates are stored as a number counted from 1970. Some algorithms from the anomaly extensions do indeed treat them as that raw number. My personal tip would be to use Date to Numerical first and translate the date into something useful, e.g. weeks since 1970.
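Outside RapidMiner, the "weeks since 1970" idea can be sketched in a few lines of Python. The function name `weeks_since_epoch` and the sample timestamp are assumptions for illustration. This is not what the Date to Numerical operator does internally.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def weeks_since_epoch(ts):
    """Whole weeks elapsed between 1970-01-01 UTC and the given timestamp."""
    return (ts - EPOCH).days // 7

# Hypothetical ping timestamp.
ping = datetime(2016, 3, 1, 13, 45, tzinfo=timezone.utc)
week_index = weeks_since_epoch(ping)  # 2408
```

A week index like this grows steadily over time, so it can serve as a trend feature, while hour and day of week stay as separate nominal attributes.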

Another point: you seem to have a very imbalanced problem, meaning you have far more 'not missing' points than 'missing' ones. You should consider either downsampling (Sample operator) or using weights (Generate Weights (Stratification) operator). Also be sure to use an appropriate performance measure; with this kind of imbalance, plain accuracy can look good even if no missing node is ever detected.
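Both points can be sketched in plain Python. The weighting formula below is one common balancing scheme and is an assumption, not necessarily what the RapidMiner weighting operator computes. The 95/5 class split is made-up toy data.

```python
from collections import Counter

def balanced_weights(labels):
    """Give each example a weight so that every class has the same total weight."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return [total / (n_classes * counts[y]) for y in labels]

def accuracy(actual, predicted):
    """Fraction of examples predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(actual, predicted, positive):
    """Fraction of truly positive examples that were predicted positive."""
    hits = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    positives = sum(a == positive for a in actual)
    return hits / positives if positives else 0.0

# Hypothetical imbalanced data: 95 'not missing' examples, 5 'missing'.
labels = ["not missing"] * 95 + ["missing"] * 5
weights = balanced_weights(labels)
missing_total = sum(w for w, y in zip(weights, labels) if y == "missing")
present_total = sum(w for w, y in zip(weights, labels) if y == "not missing")

# A useless model that never predicts 'missing'.
always_present = ["not missing"] * len(labels)
acc = accuracy(labels, always_present)
rec = recall(labels, always_present, "missing")
```

Here the do-nothing model reaches 95% accuracy while its recall on 'missing' is zero, so recall, precision, or a similar class-specific measure on the 'missing' class tells you more than accuracy.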

While I generally agree that decision trees are a fine way to start, I would recommend trying a Random Forest as a second step. It usually performs better than a single decision tree.

And a point on the k-NN Global Anomaly Score: consider using LOF (Local Outlier Factor) instead. It is a bit stronger in my eyes.
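For readers unfamiliar with LOF, here is a simplified Python sketch of the idea. It is not the implementation from the RapidMiner anomaly extension. It assumes no duplicate points and ignores ties at the k-th neighbour distance, and the 2-D points are made-up toy data. It needs Python 3.8+ for `math.dist`.

```python
import math

def lof_scores(points, k):
    """Simplified Local Outlier Factor: scores well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical toy data: a tight cluster plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = lof_scores(points, k=3)
most_anomalous = scores.index(max(scores))
```

Unlike a global k-NN distance score, LOF compares each point's density with that of its own neighbours, so it can cope with regions of different density, for example busy hours versus quiet hours.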

~Martin

aoneil · March 2016

Forgot to thank you, Martin. I did what you said and it worked out great!
