750000+ instances
mjb
New Altair Community Member
I have 400 datasets of time series data with a total of 750,000 instances and 40 attributes. There is one class attribute, and I'm trying to find not only the class but also its probability.
Up to now I've used Weka and have found Random Forests to give the best results, both in terms of correct classification and useful probabilities. However, there are a few things that trouble me:
1. If I split the data into 2/3 training and 1/3 testing by random selection, I get an implausibly good fit on the test set. If instead I train on the contents of 266 datasets and test on the contents of the other 134, I get more plausible results, so I conclude that neighbouring instances are strongly correlated. But even then, the model that gives the best fit on the test data still gives a much better fit on the training data, so I conclude I'm still overfitting. Should I be attempting to find a model that performs equally well on training and testing? How would I do this? (See the split sketch after this list.)
2. For many learning schemes I'm limited by available memory. I can get more data without too much trouble, but haven't done so because of the memory limits. Does anyone know how to reduce the memory requirements (and the runtime too)? I speculate that it should be possible to reduce both for existing algorithms by clustering neighbouring instances together and replacing each cluster with a single instance at its centre of gravity, weighted accordingly. Does anyone know whether this idea works? (See the compression sketch at the end of this post.)
3. I'd really like to know the reliability of the learners I'm generating. I can tolerate a learner that gives good results in some areas and poor results elsewhere, provided I know what those areas are. Any ideas?
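For illustration of the dataset-wise split in point 1: a minimal sketch using scikit-learn's GroupShuffleSplit (scikit-learn isn't part of this thread, and the toy data below merely stands in for the real 400 datasets):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins for the real data: 400 groups of correlated instances.
rng = np.random.default_rng(0)
dataset_ids = np.repeat(np.arange(400), 50)        # which dataset each row came from
X = rng.normal(size=(400 * 50, 40)) + (dataset_ids % 2)[:, None]
y = dataset_ids % 2                                 # class correlates with the group

# All rows of one dataset land on the same side of the split, so
# correlated neighbours cannot leak between training and testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=1/3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=dataset_ids))

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[train_idx], y[train_idx])

# A large train/test gap here still signals overfitting.
print("train:", model.score(X[train_idx], y[train_idx]))
print("test: ", model.score(X[test_idx], y[test_idx]))
```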
If this is not a good forum for these questions, could you please refer me to a good one? Thanks.
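And a rough sketch of the centre-of-gravity compression speculated about in point 2, again with scikit-learn as an assumed substitute for Weka; each within-class cluster is replaced by its centroid, weighted by the number of instances it absorbed:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier

def compress(X, y, clusters_per_class=100):
    """Replace each within-class cluster by its centre of gravity,
    weighted by how many instances it absorbed."""
    Xc, yc, wc = [], [], []
    for label in np.unique(y):
        Xl = X[y == label]
        k = min(clusters_per_class, len(Xl))
        km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit(Xl)
        Xc.append(km.cluster_centers_)
        yc.append(np.full(k, label))
        wc.append(np.bincount(km.labels_, minlength=k))
    return np.vstack(Xc), np.concatenate(yc), np.concatenate(wc)

# Toy data; the real gain would come on the 750,000-instance set.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 40))
y = (X[:, 0] > 0).astype(int)

Xs, ys, w = compress(X, y)
# Any learner that honours sample_weight can train on the summary.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(Xs, ys, sample_weight=w)
print("compressed from", len(X), "to", len(Xs), "weighted instances")
```

Whether such a summary preserves enough of the decision boundary is exactly the open question in point 2.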
Answers
Hi,
it seems to me you are already familiar with the basics of data mining, so I will answer your questions directly.
1. If you have time series data and window it, examples from the future may end up in the training set. That can cause the implausibly good results you describe. In RapidMiner we have the SeriesXValidation for that purpose; it ensures that only examples from the past are used for training. (A sketch of the general idea follows this list.)
2. This depends on the type of your time series. You might use the Series Processing Extension to extract single features of the time series, such as frequencies, which can capture the behaviour in a single number instead of one attribute per time point. This not only improves performance but may also improve the quality of the prediction. (See the frequency sketch after this post.)
3. For classification there are confidence attributes, but these are on the level of single examples rather than entire regions. Umpf. You could, however, apply data mining to the results themselves to describe the regions with low confidence; a sketch of that appears at the end of the thread.
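To illustrate the windowing pitfall in point 1 outside RapidMiner: a minimal time-ordered validation sketch using scikit-learn's TimeSeriesSplit, which is my stand-in here, not what SeriesXValidation actually runs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

# Toy rows in time order; real rows would be windowed series features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 40))
y = (X[:, 0] + rng.normal(scale=0.5, size=1_000) > 0).astype(int)

# Each training fold strictly precedes its test fold, so nothing
# from the future leaks into training.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    print("fold test accuracy:", model.score(X[test_idx], y[test_idx]))
```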
Greetings,
Sebastian
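A sketch of the frequency-feature idea from point 2 of the answer above, in plain NumPy rather than the Series Processing Extension; the particular features chosen are illustrative assumptions:

```python
import numpy as np

def frequency_features(window):
    """Condense one windowed series into a few frequency-domain numbers."""
    spectrum = np.abs(np.fft.rfft(window))
    return {
        "dominant_freq_bin": int(np.argmax(spectrum[1:]) + 1),  # skip the DC term
        "dominant_power": float(spectrum[1:].max()),
        "total_power": float(spectrum.sum()),
        "mean": float(np.mean(window)),
        "std": float(np.std(window)),
    }

# A 200-sample window containing ten full sine cycles.
window = np.sin(np.linspace(0, 20 * np.pi, 200))
print(frequency_features(window))  # dominant_freq_bin should be about 10
```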
Thanks. Lots of interesting points.
Sebastian Land wrote:
1. If you have time series data and window it, examples from the future may end up in the training set. That can cause the implausibly good results you describe. In RapidMiner we have the SeriesXValidation for that purpose; it ensures that only examples from the past are used for training.
2. This depends on the type of your time series. You might use the Series Processing Extension to extract single features of the time series, such as frequencies, which can capture the behaviour in a single number instead of one attribute per time point. This not only improves performance but may also improve the quality of the prediction.
3. For classification there are confidence attributes, but these are on the level of single examples rather than entire regions. Umpf. You could, however, apply data mining to the results themselves to describe the regions with low confidence.
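That last suggestion, mining the results to describe the low-confidence regions, could look roughly like this sketch (the names and toy data are assumptions; the shallow tree's printed rules then act as readable descriptions of where the model is unsure):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: the class is clean for f0 <= 0 but noisy for f0 > 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5_000, 5))
noise = (X[:, 0] > 0) & (rng.uniform(size=5_000) < 0.4)
y = ((X[:, 1] > 0) ^ noise).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
confidence = model.predict_proba(X).max(axis=1)   # top-class probability

# A shallow tree fitted to the confidences yields readable rules;
# leaves with low values describe regions where the model is unsure.
region_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
region_tree.fit(X, confidence)
print(export_text(region_tree, feature_names=[f"f{i}" for i in range(5)]))
```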