Improving Test & Out of Sample Perf with Opt Selection and Auto Feature Generation

New Altair Community Member

Sep 27, 2019

@tftemme / Fabian-

Is it possible that the time series aspect of this data set (or the way I structured my process in terns of the GBT and sliding validation) is contributing to the disconnect between training and testing performance?

Thanks,
Noel

( @IngoRM, @yyhuang, @varunm1, @hughesfleming68, @mschmitz, @sgenzer )

New Altair Community Member

One final plea for aid... ( @IngoRM, @yyhuang, @varunm1, @hughesfleming68 )

I took a step back this weekend and tried to enumerate all the moving parts in my analysis:

Label creation (criteria, related calculations, *alignment*)
Matters relating to the TimeSeries aspect of my data (aggregation periods and types, window size, validation methodology)
GBT tuning (both trees in general and boosting specifically: max depth, num trees, num bins, learning rate, min split improvement, etc.)
Feature creation (some overlap with timeseries aggregations) and selection

I read a bunch of posts in the community and came away thinking that its best to configure the GBT (thank you, @mschmitz ) and be sure to have a solid validation approach in place (thank you, @Telcontar120 ) before focusing on feature weighting, creation, selection, etc.

So, I covered much of #2 and #3 (see below). If anyone has any suggestions for other GBT and timeseries tweaks, please let me know.

At this point, is it all about the features? Current results; training on top, testing on bottom (process and data attached):

Image: https://us.v-cdn.net/6030995/uploads/editor/pl/bltcq5jh648t.jpg

Thanks,
Noel
-----

TimeSeries: I went with the basic aggregations to start (mean, median, max, min, stdev) and looked at aggregation periods and window sizes:

Aggregation period: 6, Window size: 5

Image: https://us.v-cdn.net/6030995/uploads/editor/hs/ug4c5iaka4u0.jpg

I looked carefully at the Sliding Window Validation operator. I had been using training and testing windows of 100 with steps sizes of their combined width. I came across @sgenzer 's timeseries challenge and tried the validation settings discussed therein: cumulative training, single period test windows, multiple iterations, and none of it seemed to have any impact:

I also did my best to nail down the GBT parameters:

Num trees vs Depth for three learning rates (0.09, 0.10, 0.11)

help.rmp

help.xlsx

New Altair Community Member

Accepted Answer

Updated Sep 30, 2019 by hughesfleming68

Hi Noel, I have just seen this but I will take a look. I am going to have to reread it again as well but my first thought is that your testing window might be too large and a windowing set to 10 might be too small for daily data. Are you logging your accuracy over those test periods or just looking at the average? You would expect that a model that is actually predicting something would have higher accuracy close in time to your training data and then decay. This is typical for financial time series. A random model might have no pattern. Changes in market regime can be quite significant from year to year. You may have a couple of years where everything is just working and then all of a sudden things fall apart.

New Altair Community Member

Thanks, Alex. Much appreciated!
I'll have a look at that. Great suggestion.

New Altair Community Member

Alex- while I think both of your observations were correct, I’m still not able to bridge the training and testing performance gap. If you have any more thots/suggestions, etc. I’m all ears.

Thanks again.

New Altair Community Member

Updated Oct 1, 2019 by hughesfleming68

Hi Noel, I am checking your data prep. 103b_create_AM.If I break on the first filter examples operator, it is giving me a constant label. Is this correct? It could be way too early in the morning and I need more coffee.

Image: https://us.v-cdn.net/6030995/uploads/editor/ed/91dchfhl2r5j.jpg

Let me know if I am at least reading the right files. Usually if you feel that something really should work better, the problem is most likely some transformation on your attributes that is killing your signal by mistake.

Any kind of feature selection risks over fitting the training data especially when the signal to noise ratio is low. It can certainly make a good base model better but watch out if it is making a really big difference. You may have to shift your data few times to see if there is consistency with regards to which attributes are being thrown out.

What is really jumping out at me is that you are sampling down your training set to 1000 before automatic feature selection. I wouldn't do this. Try and keep the sequences in tact and remove any randomness. Try using the last 1000 samples instead.

Your process is complicated but still a lot easier than digging through code. I see that you are down sampling a couple of times in your other processes and you are not using local random seed. My fear of this maybe unjustified. It might be fine to do this. I don't but that is just me. I am actually curious what other people think. Anyone?

Alex

MartinLiebig

Altair Employee

Accepted Answer

Hi,

as a general point: If training and testing error diverge, you most likely over-trained. Either because of id-ish attributes or because of a too complex model.

Cheers,

Martin

New Altair Community Member

Thanks for looking under the hood, Alex. That first process just uses an exported auto model process as a base. The most recent process I uploaded is much simpler to go through. (I should have reposted the other files so the second post was self contained). The labeler and source file from the first post work with the second process. I’ll repost.

New Altair Community Member

Alex- Here's the second process. No magic. Just calculating aggregations and windowing.

help_ii.rmp

help.xlsx

New Altair Community Member

Updated Oct 1, 2019 by Noel

Martin / @mschmitz - For a timeseries data set, how many periods of daily data is sufficient for training? Thanks, Noel

New Altair Community Member

Updated Oct 1, 2019 by hughesfleming68

Hi Noel, I had to substitute the windowing operator for the older windowing operator from the value series extension for your process to work. I will see what is going on. Could you confirm that you are getting all three classes windowed properly in your help_ii process for your label?

Thanks

New Altair Community Member

Alex- Strange that it didn't run out of the box. When you say properly windowing all three classes, I'm not sure what you mean. I exclude all labels but the horizon from the data so the embedded information about future does not leak through. I meant for only the numeric attributes to be aggregated and windowed.

New Altair Community Member

Updated Oct 1, 2019 by hughesfleming68

Yes. I get the following error.

It comes from the windowing operator. Substituting the value series windowing operator fixes the problem. We can continue this via private mail if you wish. I just want to make sure that I am seeing what you are seeing.

I also had to adjust the filter examples attribute names for the data attribute.

When it runs, I get this. Using GLM is slightly better.

Image: https://us.v-cdn.net/6030995/uploads/editor/jk/m1ymt2sb0xxd.jpg

Check my adjusted version to see.

New Altair Community Member

That sounds good (private email), Alex. I tried to DM you, but I don't think it went through.

New Altair Community Member

I just sent you a PM.

New Altair Community Member

Oct 2, 2019

Thanks to everyone who suffered through my posts to help out! I very much appreciate it!

( @IngoRM, @yyhuang, @varunm1, @hughesfleming68, @mschmitz, @sgenzer @CraigBostonUSA @Pavithra_Rao )

New Altair Community Member

Oct 7, 2019

Hi @Noel

Sorry for not responding earlier. This seems to be solved, right? I just skimmed through the thread. There seemed to be an issue with the Windowing operator and the GBT, I think @hughesfleming68 is reporting about this. Is this still an issue?

Best regards,
Fabian

New Altair Community Member

Oct 7, 2019

@tftemme Hi, Fabian. I have just started to use the new operators. I will try and reproduce the error later today. If I discover something, I will let you know.

Alex

New Altair Community Member

Oct 8, 2019

Hi Fabian / @tftemme

There are two issues. The first has to do with GBTs and time series data. For daily data, is there a "right" amount of training that is sufficient for the task, but avoids overfit and the divergence between model testing and training performance?

The second issue I think has to do with the core windowing operator's behavior in 9.4. It seems to change all the labels to a single value which leads to the GBT complaining about the response being constant during validation (the error @hughesfleming68 reported).

Thanks,
Noel

New Altair Community Member

Accepted Answer

Oct 9, 2019

Hi @Noel,

Thanks for the response.

For the first part, unfortunately my answer is: this probably depends on your data. Basically you have to pose the questions how long you expect your data to influence your horizon. So do you expect that changes in your data for example two weeks ago, are still effecting your forecast. You could investigate this by for example checking the Auto Correlation of your data. The largest significant lag you expect should be your window size of the Windowing operator (to provide the GBT with this influence factors).
For the training size of the Sliding Window Validation I would try to come up with as much data as possible (I got the impression that in most cases more training data at least does not harm the test performance). There has to be obviously enough data in the testing window to have a proper evaluation of the testing performance. Honestly I don't look much on the training performance, as the testing performance is the one I am interested in.

For the second part: Thank you for reporting. I could reproduce the issue, it seems that the Windowing operator behaves wrongly if the indices attribute is selected at the same time as an time series attribute. I directly start with the fix ;-)

Best regards,
Fabian

New Altair Community Member

Accepted Answer

Oct 22, 2019

Hi @Noel,

Sorry I thought I have answered this already. So I investigated the second issue. It seems that there is (or better was, cause this is fixed in the 9.5.0 version of RM Studio (you can try it out in the BETA version)) a bug, if the indices attribute is also selected as the time series attribute. In your case you used all attributes (including special ones) as time series attributes and the 'date' attribute selected as indices attribute. This caused in the end, that the data was not used correctly and the label attribute (and the windowed date attribute) did not have the right values at all.

If you don't want to upgrade to the beta yet, you can also exclude the 'date' attribute from the time series attribute selection (for example use single and invert, or deselect include special).

For your first issue, I honestly don't look so often at the training performance. All what matters for evaluating the performance is the application on the (independent) test set. And unfortunately most of the time also the "needed" window sizes also depends on your data. So hard to answer your question. But I have some rule of thumbs I can share, maybe they help.

You have basically two different window sizes in your process. The window size of the first Windowing Operator (and of the Process Windows operator), which defines how larger your "profile building" window is. You should try to make it as large as you expect that past effects can influence your forecast. So when you expect that the entries 7 days ago, influence your data your window size should be at least 7. You can investigate this dependency a bit more systematically by using the Auto Correlation operator.
The second size(s) you have to consider are the window sizes of the Sliding Window Validation. Here it is roughly the same situation as in non time series related cases. You want to have a trade off between large training size (to increase the number of examples your GBT is trained on, and therefore normally also the performance), proper test size (to have a statistical proper number of examples to evaluate) and a reasonable runtime (when your step size is small you have a large amount of validation iterations). So in your example you have 1309 Examples for the Sliding Window Validation.
If you want to have a similar behavior as a 10-fold crossvalidation (10 iterations, training set is 9 times the size of the test set) you could use the following settings:
training size: 600
testing size: 70
step size: -1 (for the old Sliding Window Validation, which means same as testing size); 70 for the new Sliding Window Validation
By the way, when you use the new Sliding Window Validation, you could profite from the parallelization of the operator. The old one is not parallelized.

Hopes this helps,
Best regards,
Fabian

Sort by:

1 - 4 of 41

New Altair Community Member

Accepted Answer

Updated Sep 30, 2019 by hughesfleming68

View in context

MartinLiebig

Altair Employee

Accepted Answer

Hi,

as a general point: If training and testing error diverge, you most likely over-trained. Either because of id-ish attributes or because of a too complex model.

Cheers,

Martin

View in context

New Altair Community Member

Accepted Answer

Oct 9, 2019

View in context