Improving Test & Out of Sample Perf with Opt Selection and Auto Feature Generation

User: "Noel"
New Altair Community Member
Updated by Jocelyn

Apologies for cc’ing everyone, but I really need some help!

I have a data set which started with 15 attributes plus the two calculations that were needed to create the labels in Excel (the label has three distinct values, but for my purposes, I’m only interested in two of them). The data is a time series with about 1300 periods (5+ years) for training and 250 periods (1 year) for testing. In RapidMiner, I calculate 20-period aggregations and create windows of 10 periods.

Using the full feature set, I trained a GBT that is about 65% accurate in training:



The testing performance is “really bad”, however, less than 50% accuracy:



Note: I think I’ve attached all the relevant files: Data_Labeler_v16_help.xlsm (the labeler), help.xlsx (the data), 103b_create_AM_help.rmp (training process), and 103d_apply_AM_help.rmp (testing process).

I’ve been most focused on improving the testing accuracy and with Ingo’s “Multi-Objective Feature Selection” series, have been working with the Optimize Selection (Evolutionary) and Automatic Feature Engineering operators. AFE has worked “best” so far.

Training the GBT using the AFE feature set achieves just over 65% accuracy:



and ~55% in testing...



Headed in the right direction, but still a ways to go.

I’ve also included the AFE training process, AM_afe_help.rmp, and apply_AM_afe_help.rmp (for testing). The AFE ran for a while so I also included the feature set “features_AM_afe_help” and model “model_AM_process_afe_help”.

My question is this: how can I squeeze some more accuracy from this data (especially testing/out of sample accuracy)? Any suggestions are much appreciated… I’m trying to demonstrate a win for machine learning in my firm’s area of interest, but I only have another week to do it in.

Many thanks,
Noel
    User: "hughesfleming68"
    New Altair Community Member
    Accepted Answer
    Updated by hughesfleming68
    Hi Noel, I have just seen this but I will take a look. I am going to have to reread it as well, but my first thought is that your testing window might be too large and a window size of 10 might be too small for daily data. Are you logging your accuracy over those test periods or just looking at the average? You would expect a model that is actually predicting something to have higher accuracy close in time to your training data, followed by a decay. This is typical for financial time series. A random model might show no pattern. Changes in market regime can be quite significant from year to year. You may have a couple of years where everything is just working and then all of a sudden things fall apart.
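To illustrate the point about logging accuracy over the test periods rather than only looking at the average, here is a minimal Python sketch. It is not part of the attached RapidMiner processes; the function name `accuracy_by_block`, the block size, and the toy labels are assumptions made purely for illustration.

```python
from typing import List, Sequence


def accuracy_by_block(actual: Sequence[str],
                      predicted: Sequence[str],
                      block_size: int) -> List[float]:
    """Return the accuracy of each consecutive block of test periods.

    Both sequences must be in time order, oldest period first.
    The last block may be shorter than block_size.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    scores = []
    for start in range(0, len(actual), block_size):
        a = actual[start:start + block_size]
        p = predicted[start:start + block_size]
        hits = sum(1 for x, y in zip(a, p) if x == y)
        scores.append(hits / len(a))
    return scores


# Toy example (hypothetical labels): perfect early on, wrong later.
actual = ["up", "up", "down", "down", "up", "down"]
predicted = ["up", "up", "down", "up", "down", "up"]
print(accuracy_by_block(actual, predicted, block_size=3))
```

If the per-block accuracy starts high and then falls off, that matches the decay pattern described above; a flat series near chance level would suggest the model is not picking up anything useful.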
    User: "MartinLiebig"
    Altair Employee
    Accepted Answer
    Hi,
    as a general point: if training and testing error diverge, you have most likely over-trained, either because of ID-like attributes or because the model is too complex.

    Cheers,
    Martin
    User: "tftemme"
    New Altair Community Member
    Accepted Answer
    Hi @Noel,

    Thanks for the response.

    For the first part, unfortunately my answer is: this probably depends on your data. Basically, you have to ask how far back you expect your data to influence your forecast horizon. For example, do you expect that changes in your data from two weeks ago still affect your forecast? You could investigate this by checking the autocorrelation of your data. The largest significant lag you find should be the window size of the Windowing operator (so that the GBT sees these influencing factors).
    For the training size of the Sliding Window Validation I would try to use as much data as possible (my impression is that in most cases more training data at least does not harm the test performance). Obviously there has to be enough data in the testing window for a proper evaluation of the testing performance. Honestly, I don't look much at the training performance, as the testing performance is the one I am interested in.
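The autocorrelation check described above could look roughly like the following Python sketch. It is independent of RapidMiner's Auto Correlation operator; the function names, the toy weekly series, and the approximate 95% significance bound of 1.96 / sqrt(n) are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation for lags 1..max_lag (index 0 holds lag 1)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denom
        for lag in range(1, max_lag + 1)
    ]


def largest_significant_lag(series: Sequence[float], max_lag: int) -> int:
    """Largest lag whose autocorrelation exceeds the approximate 95% bound.

    Returns 0 if no lag up to max_lag is significant.
    """
    bound = 1.96 / math.sqrt(len(series))
    acf = autocorrelation(series, max_lag)
    significant = [lag for lag, r in enumerate(acf, start=1) if abs(r) > bound]
    return max(significant) if significant else 0


# Toy series (hypothetical) with a strict 7-period cycle.
series = [float(t % 7) for t in range(140)]
print(largest_significant_lag(series, max_lag=7))
```

The returned lag is only a starting point for the window size; on real financial data the significant lags are usually far less clear-cut than in this toy series.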

    For the second part: thank you for reporting. I could reproduce the issue. It seems that the Windowing operator behaves incorrectly if the indices attribute is also selected as a time series attribute. I'll start on the fix right away ;-)

    Best regards,
    Fabian
    User: "tftemme"
    New Altair Community Member
    Accepted Answer
    Hi @Noel,

    Sorry, I thought I had answered this already. I investigated the second issue. There is a bug (or rather was, because it is fixed in version 9.5.0 of RM Studio, which you can try out in the BETA version) that occurs if the indices attribute is also selected as a time series attribute. In your case you used all attributes (including special ones) as time series attributes and selected the 'date' attribute as the indices attribute. As a result, the data was not used correctly and the label attribute (and the windowed date attribute) did not have the right values at all.

    If you don't want to upgrade to the beta yet, you can also exclude the 'date' attribute from the time series attribute selection (for example use single and invert, or deselect include special).

    For your first issue, I honestly don't look at the training performance very often. All that matters for evaluating the performance is how the model does on the (independent) test set. Unfortunately, most of the time the "needed" window sizes also depend on your data, so your question is hard to answer. But I have some rules of thumb I can share; maybe they help.

    You basically have two different window sizes in your process. The first is the window size of the first Windowing operator (and of the Process Windows operator), which defines how large your "profile building" window is. You should make it as large as the span over which you expect past effects to influence your forecast. So if you expect that the entries from 7 days ago influence your data, your window size should be at least 7. You can investigate this dependency more systematically by using the Auto Correlation operator.
    The second size(s) you have to consider are the window sizes of the Sliding Window Validation. Here the situation is roughly the same as in cases that are not time series related. You want a trade-off between a large training size (to increase the number of examples your GBT is trained on, and therefore normally also the performance), a proper test size (to have a statistically sound number of examples to evaluate) and a reasonable runtime (when your step size is small, you have a large number of validation iterations). In your example you have 1309 examples for the Sliding Window Validation.
    If you want behavior similar to a 10-fold cross-validation (10 iterations, training set is 9 times the size of the test set), you could use the following settings:
    training size: 600
    testing size: 70
    step size: -1 (for the old Sliding Window Validation, which means same as testing size); 70 for the new Sliding Window Validation
    By the way, when you use the new Sliding Window Validation, you can benefit from the parallelization of the operator. The old one is not parallelized.
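As a quick check on the suggested settings, the following Python sketch enumerates the train/test index ranges that a generic sliding window validation would produce. It is a simplified illustration of the technique, not RapidMiner's implementation; the function name `sliding_window_splits` and the rule that incomplete trailing windows are dropped are assumptions.

```python
from typing import List, Tuple


def sliding_window_splits(n_examples: int,
                          training_size: int,
                          testing_size: int,
                          step_size: int) -> List[Tuple[range, range]]:
    """Return (train_indices, test_indices) pairs in time order.

    Each test window directly follows its training window. Windows that
    would run past the end of the data are dropped (an assumption).
    """
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    splits = []
    start = 0
    while start + training_size + testing_size <= n_examples:
        train = range(start, start + training_size)
        test = range(start + training_size,
                     start + training_size + testing_size)
        splits.append((train, test))
        start += step_size
    return splits


# Settings from the post: 1309 examples, training 600, testing 70, step 70.
splits = sliding_window_splits(1309, 600, 70, 70)
print(len(splits))  # 10 iterations
for train, test in splits[:2]:
    print(train.start, train.stop, test.start, test.stop)
```

With these numbers the window starts run from 0 to 630 in steps of 70, which gives the 10 iterations mentioned above; the last 9 examples are not covered by any test window under the assumed drop rule.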

    Hope this helps,
    Best regards,
    Fabian