Should I use a Red Status Highly Correlated Attribute in Auto Model?
Hi there,
If the results of a Linear Model seem too good to be true (I got 0.3% relative error on a Linear Model), can I conclude that this happened because I've included (red status) attributes that are too closely correlated to the Label (Dow Jones closing price)? Should I trust the result?
"High Correlation: a correlation of more than 40% may be an indicator for information you don't have at prediction time. In that case, you should remove this column. Sometimes, however, the prediction problem is simple, and you will get a better model when the column is included."
For example, I included a red-status 2-day moving average that is 99% correlated to the closing price (0.7 weight). If a simple indicator like this (which is effective when day trading spot forex) is a good predictor — which my Explain Predictions Random Forest model also confirms — should I include it? Why is RM Auto Model saying not to use it? I get the concept that RM is looking for patterns and for "underlying reasons" to explain the Label.
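For what it's worth, the near-perfect correlation here is almost mechanical: a short moving average of the close is mostly the close itself. A minimal sketch on synthetic random-walk prices (made-up numbers, not real Dow data) shows this:

```python
import numpy as np

# Synthetic random-walk "closing prices" (illustration only; not real market data)
rng = np.random.default_rng(42)
close = 27000 + np.cumsum(rng.normal(0, 50, 500))

# 2-day moving average: the average of today's and yesterday's close
ma2 = (close[1:] + close[:-1]) / 2.0

# Its correlation with the close is near-perfect by construction
corr = np.corrcoef(ma2, close[1:])[0, 1]
print(round(corr, 4))  # typically > 0.99
```

That is the sense in which a red-status attribute can be "too good": it largely restates the label rather than explaining it.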
Also, the Auto Model help notes state:
"The performance is calculated on a 40% hold-out set which has not been used for any of the performed model optimisations. This hold-out set is then used as input for a multi-hold-out-set validation where we calculate the performance for 7 disjoint subsets. The largest and the smallest performances are removed and the average of the remaining 5 performances is reported here."
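Read literally, that description is a trimmed mean over the 7 subset performances. A tiny sketch with hypothetical relative-error values (made-up numbers, not your actual results):

```python
# Hypothetical relative errors measured on the 7 disjoint hold-out subsets
perfs = [0.31, 0.28, 0.45, 0.30, 0.29, 0.33, 0.22]

# Drop the smallest and the largest, then average the remaining 5
trimmed = sorted(perfs)[1:-1]
reported = sum(trimmed) / len(trimmed)
print(round(reported, 4))  # prints 0.302
```

The trimming makes the reported figure robust to one unusually easy or unusually hard subset.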
Is it because of this disjoint-subset testing that the RM closing prices don't match the closing prices in Column E of the Excel file, and is that also why no dates are shown in the RM Auto Model results? (The closing price in Excel row 5186 is 27686, not 27386 as shown in RM row 2074.)
Lastly, why is the Simulator's predicted price so far from the actual current closing price? The Dow Jones is currently at 27778, so how do I interpret this 14466 result?
Cheers for any insights,
Answers
I do not know the exact nature of your project, so my suggestions may be misplaced. However, my experience of working with stock market data indicates that it is very difficult to get good results (if it were simple, there would be lots of very rich people walking around and the market would readjust).
Getting 0.3% relative error most likely means you are leaking the future into your predictors. For example: if you use the adjusted closing price to predict the closing price; if you include technical features as predictors that are calculated over a period of time which catches some of the future; or if you include total market closing figures when predicting an individual stock; etc. So you need to beware of leaking future, validation, or testing data into your training. This is particularly easy to do when working with time series.
I've noted that you used moving averages as predictors — perhaps they were calculated taking some days from the past and some days from the future? If so, there is the explanation for your tiny 0.3% relative error!
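To make the moving-average point concrete, here is a minimal pandas sketch on synthetic prices (all values made up) contrasting a safe trailing average with a leaky centered one:

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (illustration only; not real market data)
rng = np.random.default_rng(0)
close = pd.Series(27000 + np.cumsum(rng.normal(0, 50, 300)))

# Safe: a trailing 3-day moving average uses only today's and earlier closes
trailing = close.rolling(3).mean()

# Leaky: a centered 3-day moving average also averages in TOMORROW's close
centered = close.rolling(3, center=True).mean()

# The target is tomorrow's close -- what we want to predict from today's features
target = close.shift(-1)

# The centered version correlates better only because it already contains the answer
print(trailing.corr(target), centered.corr(target))
```

Any feature whose window reaches past the prediction date will look spectacular in validation and fail in live trading.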
Jacob
Hi @jacobcybulski,
Thanks for the reply.
In your experience, what would you consider a good relative error rate for Random Forests or ARIMA on time series, and why does ARIMA assess the predictive power of indicators differently from Random Forests?
Moving averages aren’t predictive indicators and only represent past values.
Do you know how I can get dates to show in the Auto Model results column?
Cheers,
@SkyTrader ARIMA can be effective on non-stationary time series, such as stock time series. The problem is that in the process of smoothing and differencing, you work on severely transformed time series, which can give you good results. However, once you apply all the reverse transformations and add back the noise you removed in the process, the errors get magnified. Try validating your model and calculating errors in real units. Most people who work with financial data insist on making all time series strictly stationary before modelling. On the other hand, you can turn to non-parametric models, which make fewer assumptions about the nature of the time series — Random Forests and Gradient Boosted Trees are good examples. There is also some recent work on applying Deep Learning models, such as RNN, LSTM and GRU, to stock data, with very promising results.
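A small sketch of the error-magnification point, using a hypothetical near-perfect model of the differenced series (synthetic prices, all numbers made up):

```python
import numpy as np

# Synthetic random-walk prices (illustration only; not real market data)
rng = np.random.default_rng(1)
close = 27000 + np.cumsum(rng.normal(0, 50, 400))

# Suppose a model predicts each daily CHANGE with only a little noise
diff = np.diff(close)
pred_diff = diff + rng.normal(0, 5, diff.size)  # hypothetical "good" diff model

# Reverse the differencing: cumulate the predicted changes from the first price
pred_close = close[0] + np.cumsum(pred_diff)

# Per-step error on the transformed series stays small...
err_diff = np.abs(diff - pred_diff)

# ...but cumulating it magnifies the error in the reconstructed prices
err_price = np.abs(close[1:] - pred_close)
print(err_diff.mean(), err_price.mean())
```

This is why errors should be validated in real price units: a model that looks excellent on the differenced series can still drift badly once the transformation is reversed.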
Jacob