What model to use?
aoneil
Backstory: I just started using RapidMiner and I'm working with a system where a node gets pinged at random times throughout the day. Each ping gives me a timestamp, which I've split into month, day, hour, day of the week, and frequency per hour (I'm not sure whether any of these features are actually significant). I'm trying to use RapidMiner to predict when a node goes 'missing'.
I want RapidMiner to take in all of this information and output how confident it is that a node is missing or not missing, based on how long it has been since the last ping compared with how frequently the node was pinged in similar situations (e.g. same day of the week, same hour on previous days). I'd be very thankful if anyone could point out some viable models for me. If it changes anything, I also have a fairly large amount of data (my app has been running for over 3 months).
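To make the idea in the question concrete, here is a minimal Python sketch (outside RapidMiner) of comparing the current silence against historical gaps from the same day of the week and hour. The function names `build_gap_history` and `missing_score`, the bucketing by `(weekday, hour)`, and the sample timestamps are all assumptions made for illustration, not something taken from the original data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_gap_history(pings):
    """Group gaps between consecutive pings (in seconds) by the
    (weekday, hour) of the earlier ping. Hypothetical helper."""
    history = defaultdict(list)
    ordered = sorted(pings)
    for earlier, later in zip(ordered, ordered[1:]):
        key = (earlier.weekday(), earlier.hour)
        history[key].append((later - earlier).total_seconds())
    return history

def missing_score(history, last_ping, now):
    """Fraction of historical gaps in the same bucket that were shorter
    than the current silence. Returns None without comparable history."""
    gaps = history.get((last_ping.weekday(), last_ping.hour), [])
    if not gaps:
        return None
    elapsed = (now - last_ping).total_seconds()
    return sum(g < elapsed for g in gaps) / len(gaps)

# Made-up data: a ping every 10 minutes between 09:00 and 09:50
# on two consecutive Mondays (2024-01-01 and 2024-01-08).
pings = [datetime(2024, 1, d, 9, m) for d in (1, 8) for m in range(0, 60, 10)]
history = build_gap_history(pings)

last = datetime(2024, 1, 15, 9, 0)  # last ping seen, also a Monday
score_5_min = missing_score(history, last, last + timedelta(minutes=5))
score_30_min = missing_score(history, last, last + timedelta(minutes=30))
print(score_5_min, score_30_min)
```

A score near 0 means the node has often been silent this long in similar situations; a score near 1 means the silence is unusually long. This is an empirical percentile, not a calibrated probability, and a classifier or anomaly detector as discussed in the replies may do better.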
juanm_encinas
Without having seen your data, I'd go for a classification algorithm: maybe a decision tree if you just want to know the most likely outcome (missing/not missing) given a set of inputs (time variables, etc.), or a logistic regression if you want to know the probability that a node goes missing.
aoneil
Thanks! Would I be better off using some type of anomaly detection, since the vast majority of my data points would be classified as 'not missing'? I've had issues because the anomaly detection processes I tried seemed to treat my date/classification variables (day/hour/month) as if they were numerical. So they'd give me unreasonably high anomaly scores for dates and times with small numbers (e.g. 1 am on January 1st), even though those values are just labels for a date.
juanm_encinas
I think that issue can be solved by defining your time attributes as nominal rather than numerical, and converting nominal to numerical only if the operator requires it, e.g. for logistic regression.
An advantage of decision trees is that you can work straight with nominal attributes. You have the same advantage with an anomaly detection operator such as k-NN Global Anomaly Score. You can go either way.
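As a rough illustration of why treating time attributes as nominal helps, the Python sketch below one-hot (dummy) encodes a day-of-week attribute. The helper `one_hot` and the sample values are hypothetical, and this shows only the general idea of dummy coding, not the exact behaviour of any RapidMiner operator.

```python
import math

def one_hot(values, categories):
    """Return one 0/1 indicator row per value, one column per category."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows

weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
mon, sun, wed = one_hot(["Mon", "Sun", "Wed"], weekdays)

# As raw numbers (Mon=1 ... Sun=7), Monday looks far from Sunday
# but close to Tuesday. After encoding, every pair of distinct days
# is the same distance apart, so no day is "extreme" by construction.
dist_mon_sun = math.dist(mon, sun)
dist_mon_wed = math.dist(mon, wed)
print(dist_mon_sun, dist_mon_wed)
```

With this encoding a distance-based anomaly score no longer penalises "small" values such as hour 1 or month 1 just for being at the edge of a numeric range.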
MartinLiebig
Hi,
Internally, dates are stored as integers counting from 1970. Some algorithms from the anomaly detection extensions do indeed treat them as this raw number. My personal tip would be to use Date to Numerical first and translate the date into something useful, e.g. weeks since 1970.
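For readers who want to see what such a conversion amounts to, here is a small Python sketch. It assumes the raw value is an epoch-based count in milliseconds (an assumption on my part, not confirmed in this thread), and the function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def epoch_millis(ts):
    """Raw epoch-style value: whole milliseconds since 1970-01-01 UTC."""
    return (ts - EPOCH) // timedelta(milliseconds=1)

def weeks_since_epoch(ts):
    """Coarser, more interpretable value: whole weeks since 1970-01-01 UTC."""
    return (ts - EPOCH) // timedelta(weeks=1)

ping = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
print(epoch_millis(ping), weeks_since_epoch(ping))
```

The raw millisecond value is huge and grows with every new record, which is why an anomaly score fed with it tends to reflect "how recent" rather than anything about the ping pattern.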
Another point is that you seem to have a very imbalanced problem, meaning you have far more 'not missing' points than 'missing' ones. You should consider either downsampling (Sample operator) or using weights (Generate Weights (Stratification) operator). Also be sure to use a performance measure that is appropriate for imbalanced classes.
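The following Python sketch illustrates both points with made-up labels: weights that give each class the same total weight, and why plain accuracy is misleading on imbalanced data. The weighting formula is a common convention and only an assumption about what a stratified weighting does; it is not taken from the RapidMiner operator.

```python
from collections import Counter

def stratified_weights(labels):
    """Weight each example so that every class has the same total weight."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return [total / (n_classes * counts[y]) for y in labels]

def recall_per_class(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Made-up data: 95 'ok' examples and 5 'missing' examples.
labels = ["ok"] * 95 + ["missing"] * 5
weights = stratified_weights(labels)

# A useless model that always predicts 'ok'.
always_ok = ["ok"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_ok)) / len(labels)
recalls = recall_per_class(labels, always_ok)
balanced_accuracy = sum(recalls.values()) / len(recalls)
print(accuracy, recalls, balanced_accuracy)
```

The always-'ok' model scores high on accuracy while never finding a single missing node, which per-class recall and balanced accuracy make visible.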
While I generally agree that decision trees are a fine way to start, I would recommend trying a Random Forest as a second step. It is usually stronger than a single decision tree.
And a point on k-NN Global Anomaly Score: consider using LOF (Local Outlier Factor) instead. It is a bit stronger in my eyes.
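For reference, here is a compact pure-Python sketch of the textbook LOF computation. It is not the implementation used by the RapidMiner extension, the function names and sample points are invented for the example, and it assumes there are no duplicate points (duplicates would produce a zero reachability distance and a division by zero).

```python
import math

def knn(points, i, k):
    """k nearest neighbours of points[i] as sorted (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    """Local Outlier Factor per point; values well above 1 suggest outliers."""
    neighbours = [knn(points, i, k) for i in range(len(points))]
    k_dist = [nb[-1][0] for nb in neighbours]  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(p, o) = max(k_dist(o), d(p, o)).
    lrd = []
    for nb in neighbours:
        mean_reach = sum(max(k_dist[j], d) for d, j in nb) / len(nb)
        lrd.append(1.0 / mean_reach)

    # LOF: average density of the neighbours relative to the point's own.
    return [
        sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
        for i, nb in enumerate(neighbours)
    ]

# Made-up 2D data: a small cluster plus one distant point at index 5.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Unlike a global k-NN score, which looks only at a point's own distance to its neighbours, LOF compares a point's density with that of its neighbours, so it adapts to regions of differing density.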
~Martin
aoneil
Forgot to thank you, Martin. I did what you said and it worked out great!