Why Doesn't ARIMA Predict Future Time Series Closing Prices?

SkyTrader
New Altair Community Member
edited November 5 in Community Q&A

I’m really hoping someone can explain what’s going on when using ARIMA for time series predictions!

When I’m using ARIMA with this setup (please see images):




I used a huge window size so I could see what was happening chart-wise over the last few months. Is there no way to zoom in on the charts?

At first I thought I was training on the window-sized portion of data and then testing on unseen data (I have 20 years of Dow Jones Open/High/Low/Close data plus technical indicators, from 2000 to 2020). The reason is that when I put in a very high window size like 4500 days (approx. 18 years of data), I would only see about 2 years of charting results, from 2018 to present (which I assumed was the test data), whereas with a window size of only 60 days I would see a whole chart from 2000 to 2020.

But... all the relative error figures were very small, like 1 or 2%, which is far too good to be true, right? I assume that is because I am training on one subset of data and testing on the same data (as it rolls along, using my window size and step value settings)?

The questions I have are:

1) How do I make ARIMA test on two different data sets, one seen and one unseen and untrained? With the Cross Validation operator? And if that is the operator I need, how do I ensure ARIMA trains on specific date ranges, so I can make it include calm, low-volatility periods and also highly volatile periods, like during Covid-19?

2) How do I make ARIMA test on data it wasn’t trained on?

3) How do I make the training period unanchored and cover the first 75% of the dataset (2000 to 2015)?

and lastly,

4) How do I get ARIMA to predict the next 2 (or 5, or 10) days ahead of the last date or row of data I have in Excel (which will be 3rd August) when I update my Excel with Yahoo Finance data tomorrow night? I.e., so ARIMA will be predicting the closing prices for the 4th and 5th of August and beyond?

I’ve tried many window size combinations, but the low relative errors must be due to the point I raised about not testing on unseen data.

Even using ARIMA with a small window size of 10 days, it doesn’t make predictions into the future.

I’m hoping those that are interested in Financial Time Series forecasting will understand these issues!

Thanks very much in advance, 

Answers

  • MartinLiebig
    Altair Employee
    Hi,
    I think the operator you are missing is the Apply Forecast operator. It takes an ARIMA model and forecasts n points ahead.

    Also keep in mind that the whole validation is used to determine the performance of the model. You do this to get the correct settings for ARIMA.
    Unlike other ML algorithms you need to "retrain" ARIMA on a new data set if you want to forecast.

    Best,
    Martin
  • SkyTrader
    New Altair Community Member
    Hi @mschmitz, right, thanks for the feedback and for answering question 4). I've used the Apply Forecast operator and connected it, but it thinks it's not connected? Pls see images:







    I'm still unsure about testing, training and validation (the latter of which I didn't know I was even doing) in relation to ARIMA, and I still don't understand the answers to questions 1, 2 and 3.

    "the whole validation" -- what part of my process was "validating"? I thought I was just training and testing on the whole data set; is this why I got amazingly low relative error statistics?

    "you need to "retrain" ARIMA on a new data set if you want to forecast." -- how do I do that please?

    Thanks very much for any advice. I thought this would be a lot simpler, but tbh I find the Help in RM (on the right) very hard to understand for beginners, despite my having 4 years of algorithmic trading experience! It's like the Help is talking to people who already understand everything.

    Cheers,
    Best,
    Sky Trader
  • tftemme
    New Altair Community Member
    Hi @SkyTrader

    Training your model on a training set and testing it on an unseen test set is called validation in general. This is true for "normal" machine learning applications as well as for time series problems. There are some differences; one important one is how the training and test sets are set up. In a "normal" machine learning use case you mostly use Cross Validation; for a time series problem you want to use Sliding Window Validation. The latter is achieved by the Forecast Validation operator.
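    The sliding-window split can be sketched in plain Python (an illustration of the concept only, not RapidMiner's actual implementation; a trivial last-value forecast stands in for ARIMA):

```python
# Sliding-window (walk-forward) validation sketch.
# window_size: training examples per fold; horizon: test examples per fold;
# step: how far the window slides forward between folds.
def sliding_window_folds(n, window_size, horizon, step):
    folds = []
    start = 0
    while start + window_size + horizon <= n:
        train = list(range(start, start + window_size))
        test = list(range(start + window_size, start + window_size + horizon))
        folds.append((train, test))
        start += step
    return folds

series = list(range(100))  # stand-in for 100 daily closing prices
folds = sliding_window_folds(len(series), window_size=60, horizon=5, step=20)
for train_idx, test_idx in folds:
    # naive "model": repeat the last training value across the horizon
    forecast = [series[train_idx[-1]]] * len(test_idx)
    # ...compare forecast against [series[i] for i in test_idx] here
```

    With 100 examples, a window of 60, a horizon of 5 and a step of 20, this yields two folds (train on 0-59 / test on 60-64, then train on 20-79 / test on 80-84), which also shows why a very large window leaves only a short stretch of test results.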

    So your initial setup was already correct for the validation (testing the performance of your model on unseen data). Concerning your questions:

    1) The Forecast Validation operator directly trains a Forecast Model (in your case ARIMA) in the 'Training' subprocess (the left part of the inner process). This Forecast Model is then used to predict the test window in the 'Testing' subprocess and a performance measure is calculated (in your case by the Performance operator). The important output of the Forecast Validation operator is the evaluated performance of your model and the final model, which is trained on the whole input data.

    2) see 1), the Forecast Validation operator does this automatically

    3) Set the window size to 75% of your input data. For now, window sizes can only be configured as a number of examples.

    4) Use the final model of the Forecast Validation operator (the top output port) and connect it to an Apply Forecast operator outside of the Forecast Validation operator. Be aware that you clearly have not connected the Apply Forecast operator in the screenshots of your second post. The operator is only placed on top of the line; the line itself is not connected.
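    The flow in 1) and 4) can be sketched roughly like this (a toy mean "model" stands in for ARIMA; this is only an illustration of the idea, not RapidMiner's code):

```python
from statistics import mean

# After validation, the final forecast model is trained on ALL input data;
# Apply Forecast then asks that model for `horizon` future points.
def fit_mean_model(series):
    # toy stand-in for ARIMA: the "model" is just the series mean
    return mean(series)

def apply_forecast(model, horizon):
    # this toy model predicts the same value for every future step
    return [model] * horizon

closes = [100.0, 102.0, 101.0, 103.0, 104.0]
final_model = fit_mean_model(closes)             # trained on the whole series
future = apply_forecast(final_model, horizon=3)  # next 3 "days"
```

    The key point is the separation: the validation loop only measures performance, while the forecast into the future always comes from a model trained on the full data.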

    Hope this helps,
    Best regards,
    Fabian

    PS: I would recommend going through the in-product tutorials (click on the 'Learn' tab in the Welcome panel) to get familiar with the concepts of RapidMiner and data science. Though it is not exactly directed at time series, it helps you get familiar with the product.
    PPS: When you are trying to figure out how operators work, you can also check out the tutorial processes at the end of each operator's help.
  • tftemme
    New Altair Community Member
    Hi, 

    I'll try to answer the questions briefly:

    - Forecast Validation is normally executed in parallel, so the order of the test results depends on which of the windows is executed first (this happens internally). You can either disable parallel execution, or add a Sort operator after it.
    - The resulting ExampleSet of the Apply Forecast operator either starts at the beginning of the input time series data used to train the Forecast Model (if you have enabled the corresponding parameter), or it only holds the forecasted values. So in your case the model which is provided to the Apply Forecast operator is trained on the data starting in September. You can insert breakpoints before and after operators (right-click on the operators) to see in detail where it behaves differently than you expect.
    - I have added an answer to the other post about the gaps
    - When you use Apply Forecast, it obviously does not have predictions for the training data (missing values), and it does not have real values for the predicted values (in the future). When you use Forecast Validation, the test window contains both the predicted values and the real values of the test window, but this is a different situation from using Apply Forecast to predict unknown values in the future.
    - Most of the screenshots you have show the results of the Apply Forecast operator. The number of values forecasted by this operator is just defined by the corresponding parameter of the Apply Forecast operator.
    - Forecast Models try to predict the future based on past values. It can happen that the best forecast is just a flat line, because there is no proper pattern in the data and the "zig-zag" in your input data is just noise which cannot be predicted.
    You can try an optimization to find the best parameter settings for the ARIMA model to get the "best" prediction (in terms of the performance measure you use in the validation).
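    The missing-values behaviour described above can be sketched as follows (illustrative only; the `(real, predicted)` tuple layout is made up here and is not RapidMiner's actual output schema):

```python
# Sketch of Apply Forecast output with/without the original series prepended.
# With the option on, the output covers the training range (no predictions
# there) plus the forecast range (no real values there) -- hence the
# missing values on each side.
def forecast_output(series, forecast, add_original):
    if add_original:
        return [(v, None) for v in series] + [(None, f) for f in forecast]
    return [(None, f) for f in forecast]

out = forecast_output([1.0, 2.0, 3.0], [3.5, 3.6], add_original=True)
# 5 rows: the first 3 have no prediction, the last 2 have no real value
```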

    Best regards,
    Fabian

  • SkyTrader
    New Altair Community Member
    Thanks Fabian, @tftemme

    My Horizon has been fixed at 20 (to try to replicate the first ARIMA Apply Forecast consecutive daily results in August). Step Size is 1.

    Turning off parallel execution still brought up the 2020 data first, then the 2005 data, in Forecast Validation. I'm using 75% of my 5200 rows of data (2000 to 2020), which is 3900 rows, as the Window Size, with a Step Size of 1. Why would Forecast Validation produce results from 2005 onwards (after it still reproduced the top rows of 2020 data first)? Surely I'm training on 2000 to 2015 (75% of the data) and Forecast Validation should start from 2015?

    I am baffled why, using many of my standard combinations of Window and Step Sizes (but always with Horizon at 20), I cannot get it to reproduce Apply Forecast results consecutively like I did when I first started using ARIMA last week, which produced results that didn't skip every 3 days, even though I would have tested it and run the ARIMA model on the same data set (last date 29th July 2020).

    Changing Step from 1 to 100, I then tried using a Sort operator, and that didn't fix the issue of seeing 2020 data first at the top of the Forecast Validation results. So I deleted Sort, went back to Step Size 1 and ran it again, and now it shows Forecast Validation starting from 2015 (as it should, albeit with the 2020 results still at the top) and not 2005. So I am wondering: why does changing the Step from 1 to 100 and back to 1 now produce the correct results from 2015 onwards?

    " The resulting ExampleSet of the Apply Forecast operators starts either at the beginning of the input time series data (if you have enabled the corresponding parameter) which is used to train the Forecast Model,"

    Which parameter is that please?

    "When you use Forecast Validation, the test window contains the predicted values and the real values of the test window, but this is a different situation than using Apply Forecast to predict unknown values in the future"

    I'm interested in getting those future predictions using Apply Forecast. Why, with Window at 3900, Step at 1 and Horizon at 20 (to try to replicate the first ARIMA Apply Forecast consecutive daily results in August), can I never get Apply Forecast to go beyond 20th July? I want results for a 20-day horizon from the 29th July 2020 onwards. A Step of 1 should accommodate that, no? What am I missing here?

    "Most of the screenshots you have show the results of the Apply Forecast operator. The number of forecasted values by this operator is just defined by the corresponding parameter of the Apply Forecast operator"

    So, in summary, I've set Apply Forecast to a horizon of 20, and I am still unclear why it is not giving those values going forward into August 2020 from the end of my data (29th July 2020). I wish I'd written down the Window and Step Size when I got it to give a perfect daily forecast on consecutive days starting in August last week...

    I looked at auto arima in the Samples/Time Series/templates/Automized Arima on US-Consumption data.

    I added an Optimize Grid operator but can't understand why the Operator parameter field is unresponsive, or how to use the wizard? Pls see image:




    Thanks once again for your input.

    Best,
    Sky Trader.
  • tftemme
    New Altair Community Member
    Hi @SkyTrader

    When you configure the Sort operator correctly (by sorting on the corresponding Date attribute), it cannot happen that the 2020 data ends up on top, so there must be a misconfiguration on your side.

    As I already said, windowing is always based on the number of examples. You can count for yourself. Maybe draw out an example on your own to better understand how windowing works. Once you have a better understanding of windowing, you will also figure out how it works on your data and which combinations of window size and step size have which effect.
    (Breakpoints help to understand specific steps, because you can directly see the data before and after the step)

    Honestly, there are 3 parameters in Apply Forecast. One controls the forecast length; the other two are called "add original time series" and "add combined time series". They are described in the help text.

    As I said, the configuration of the Forecast Validation has no influence on the predicted values for the future. It is just used to evaluate the performance of the Forecast Model.
    The Forecast Model which is used by the Apply Forecast operator uses the whole input data as training data (as described in the help text). So the time difference between the predicted values is just based on the time difference between the last two values in the input time series. The number of predicted examples is based on the forecast length parameter.
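    That spacing rule can be checked with a quick sketch: if the last two input rows are three days apart (e.g. across a weekend in daily data), every forecast step is three days apart as well, which produces exactly the skipped-date pattern (1st, 4th, 7th August) mentioned in this thread:

```python
from datetime import date

# Future timestamps are spaced by the gap between the last two
# input timestamps, as described above.
def forecast_dates(timestamps, horizon):
    gap = timestamps[-1] - timestamps[-2]
    return [timestamps[-1] + gap * (i + 1) for i in range(horizon)]

# last two rows are 3 days apart (e.g. a weekend gap in daily data)
ts = [date(2020, 7, 26), date(2020, 7, 29)]
future = forecast_dates(ts, horizon=3)
# future: 2020-08-01, 2020-08-04, 2020-08-07
```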

    Have you placed anything inside the Optimize Parameter operator? Please have a look into the help text of the operator and the tutorial processes. 

    In general I would recommend going through the in-product tutorials ("Learn" tab in the welcome dialog at the start of RapidMiner).
    Also, please study the help texts and tutorial processes of the operators in more detail.

    Best regards,
    Fabian
  • SkyTrader
    New Altair Community Member
    edited August 2020
    Hi Fabian @tftemme

    Cheers, yes I'm aware of the parameters in Apply Forecast and what they do. I prefer to just tick the first box ("add original time series") and leave the second unticked, for a cleaner results table.

    I'm familiar with windowing in that I've typically used weekly, monthly or quarterly Step Sizes, based on how trading companies like hedge funds and ETFs change their portfolios of stocks, e.g. after quarterly performance is measured.
    The Forecast Horizon seems self-explanatory. I've used Walk Forward optimisations a lot in my algo trading, which use a similar concept and allow for anchored (best) or unanchored optimisation of data.

    "The Forecast Model, which is used by the Apply Forecast operator, uses the whole input data as the training data (as it is described in the help text)."

    "uses the whole input data":
    But isn't Window Size what splits the dataset into training and test sets?
    (It seems sensible to have at least half the data, and certainly best to be able to cover different market regimes: volatile, non-volatile, trending and non-trending periods. How accurate those quarterly time delineations/Step Sizes will be with "missing" dates for weekends is still something I'm figuring out.)

    "So the time difference between the predicted values is just based on the time difference of your last two value in the input time series."

    Right, this was the issue: with 3 days between the last two inputs, I ended up with skipped-date forecasts, e.g. 1st, 4th, 7th August 2020, etc.

    Maybe I am missing the point here again, but (depending on the size of the Step, so it's better to have a value like 1 or 5) if the data ends on 29th July and the Apply Forecast Horizon is 20 days and the Step is small, then future predictions in August can be made.

    That's why I thought I had everything sorted with the ARIMA model last week, until I went back to it and couldn't replicate those future consecutive August predictions using a myriad of Window and Step Size values (pls see my other post).

    Right, I'll take another look at breakpoints, but I still feel like something is not working right (my setup), and that is what is confusing me so much, because as you said, ARIMA doesn't have a lot of parameters to alter.

    "Have you placed anything inside the Optimize Parameter operator?"

    I haven't got that far because, as mentioned above, I don't know how to make the wizard work.

    Best regards,
    Sky Trader.