Univariate Forecasting
wendy
New Altair Community Member
Hello all,
I'm new to RapidMiner and am having trouble arranging my operators to carry out univariate time series forecasting. I need some help.
I have a dataset consisting of one attribute: the amount of beer production each month. There are approximately 476 rows in the dataset, and each row represents the beer production in one month. I divided this dataset manually into 70% for training and 30% for testing. After that, I set up the operators in RapidMiner as follows:
- Applying the Series2WindowExamples operator in order to apply windowing.
- Letting an algorithm (such as NeuralNet, LibSVMLearner, etc.) produce a model based on the training data. This is done in a cross-validation scheme.
- Assuming that the above steps give me a correct model, I load my testing dataset (the remaining 30%) and apply the stored model to that testing data.
Now that I come to think of it, I believe I have arranged the operators wrongly, but I don't know the correct way to do it. Regarding the cross-validation operator, I've read some threads in the forum and found out that it can lead to training the model on values that come after the values to be forecast. I guess I should use SlidingWindowValidation instead. Here is my current process:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\Wendy\Desktop\newelec.xls"/>
<parameter key="sheet_number" value="2"/>
</operator>
<operator name="Series2WindowExamples" class="Series2WindowExamples">
<parameter key="series_representation" value="encode_series_by_examples"/>
<parameter key="window_size" value="10"/>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="number_of_validations" value="2"/>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="NeuralNet" class="NeuralNet">
<list key="hidden_layer_types">
</list>
<parameter key="training_cycles" value="1000"/>
<parameter key="learning_rate" value="0.7"/>
<parameter key="momentum" value="0.7"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="C:\Documents and Settings\Wendy\Desktop\newelec_model.mod"/>
</operator>
</operator>
<operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="Performance" class="Performance">
</operator>
</operator>
</operator>
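<!-- testing part: read sheet 3 (the 30% test split), load the stored model, and apply it -->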
<operator name="ExcelExampleSource (2)" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\Wendy\Desktop\newelec.xls"/>
<parameter key="sheet_number" value="3"/>
</operator>
<operator name="ModelLoader" class="ModelLoader">
<parameter key="model_file" value="C:\Documents and Settings\Wendy\Desktop\newelec_model.mod"/>
</operator>
<operator name="ModelApplier (2)" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
</operator>
I realized that there's something wrong with my testing part (the one that begins with ExcelExampleSource (2), ModelLoader, and ModelApplier (2)). I'm supposed to forecast the values for the last 30% of the original dataset, but this testing dataset already contains the actual values; I only need those values at the very end, to compare them with the forecasts.
I'm confused about this. Should I not divide my original dataset at all? How do I let RapidMiner learn the model from the first 70% of the data so that it can produce forecast values for the following 30%?
I'm sorry if anything in my writing is unclear. I've searched the forum archive on this matter and I still don't understand. I'd be very grateful if anyone could help. (Sorry for the very long post ;p)
Thanks in advance,
Wendy
Answers
Hello Wendy,
using SlidingWindowValidation is definitely a better evaluation for univariate time series forecasting than a single split into training and test set. In other words, there is no need for a 70:30 training:test split.
SlidingWindowValidation simulates the passing of time by moving an evaluation window over the data: it always uses the data inside the training window (past data within the simulation) for training and the next test window on the time series for testing and evaluation. Inside the training window of the SlidingWindowValidation, you should use a Series2WindowExamples or a MultiVariateSeries2WindowExamples operator to create the training examples for your regression learner, e.g. a neural net, an SVM, or a linear regression. Since training and test data need to be represented the same way, you should also apply the corresponding windowing (Series2WindowExamples or MultiVariateSeries2WindowExamples, respectively) to the test set, i.e. to the data within the test window.
For more information about and examples of univariate and multivariate time series prediction set-ups, I can recommend the training course "Time Series Analysis with Statistical Methods" and the webinar "Time Series Analysis and Forecasts with RapidMiner". Here is an example process:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSource" class="ExampleSource" breakpoints="after">
<parameter key="attributes" value="data/standard_and_poors_end_of_day_value_time_series.aml"/>
</operator>
<operator name="ChangeAttributeRole" class="ChangeAttributeRole">
<parameter key="name" value="label"/>
</operator>
<operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
<parameter key="training_window_width" value="400"/>
<parameter key="training_window_step_size" value="200"/>
<parameter key="test_window_width" value="1"/>
<operator name="TrainingChain" class="OperatorChain" expanded="yes">
<operator name="TrainingWindowing" class="Series2WindowExamples">
<parameter key="series_representation" value="encode_series_by_examples"/>
<parameter key="window_size" value="20"/>
</operator>
<operator name="JMySVMLearner" class="JMySVMLearner">
<parameter key="C" value="1.0"/>
</operator>
</operator>
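<!-- test window: identical windowing, then model application and performance evaluation -->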
<operator name="TestingChain" class="OperatorChain" expanded="yes">
<operator name="TestingWindowing" class="Series2WindowExamples">
<parameter key="series_representation" value="encode_series_by_examples"/>
<parameter key="window_size" value="20"/>
</operator>
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="RegressionPerformance" class="RegressionPerformance">
<parameter key="root_mean_squared_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="relative_error" value="true"/>
<parameter key="root_relative_squared_error" value="true"/>
</operator>
</operator>
</operator>
</operator>
Webinar "Time Series Analysis and Forecasts with RapidMiner".
Best regards,
Ralf
Hi Ralf,
Thanks so much for your reply! It really helps me to better understand the windowing approach. By the way, the training course and the webinar are just too costly for a student like me. Besides, I'm located in Asia.
I'm a little bit curious about this ChangeAttributeRole operator that you used. What does it do, exactly? It changes the attribute role, as written, but what is the implication? Also, in this scheme, do I have to set the same values for horizon, window_size and step_size for both Series2WindowExamples operators in the training and testing stages?
I have a question about the SlidingWindowValidation operator. Suppose there are 476 rows in my dataset. I want RapidMiner to use the first 337 rows for training and forecast the remaining 139 rows. Initially, I set up the parameters of SlidingWindowValidation as follows:
- training_window_width = 337
- training_window_step_size = 1
- test_window_width = 139
- horizon = 1
However, I realized that in the testing stage, RapidMiner starts forecasting at row 358, because the test window starts at row 338 and the windowing (window_size = 20) uses rows 338 to 357 as inputs for the first forecast.
To solve this, should I set training_window_width = 317 and test_window_width = 159? But then the training could miss some of the rows for learning (especially rows 318 to 337), because they would be used as inputs to forecast the value at row 338. Can anyone help me with this?
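In XML terms, the alternative I mean would look something like this (just a sketch; only the parameters in question are shown, the inner operator chains stay as before):
<operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
    <!-- shrink the training window so that the test window already starts at row 318 -->
    <parameter key="training_window_width" value="317"/>
    <parameter key="training_window_step_size" value="1"/>
    <!-- widen the test window; with window_size 20, the first forecast is then row 338 -->
    <parameter key="test_window_width" value="159"/>
    <parameter key="horizon" value="1"/>
</operator>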
I'm sorry for asking a lot of questions.
Thanks again,
Wendy
Hi Wendy,
wendy wrote:
By the way, the training course and the webinar are just too costly for a student like me. Besides, I'm located in Asia.
Webinars can be attended from any computer worldwide with an internet connection, and some of the introductory webinars are free of charge.
wendy wrote:
I'm a little bit curious about this ChangeAttributeRole operator that you used. What does it do, exactly? It changes the attribute role, as written, but what is the implication?
Setting the attribute role of a (time series) attribute to label tells RapidMiner to use this attribute as the one to be predicted in the forecasts. In the case of multivariate time series, you may have many input series upon which the model can be built, but typically you intend to predict one particular series based on them. Correspondingly, you mark the target time series as the label.
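For illustration, assuming your production attribute were named "production" (a hypothetical name) and assuming the 4.x parameters "name" and "target_role", the set-up could look like this:
<operator name="ChangeAttributeRole" class="ChangeAttributeRole">
    <!-- "name" selects the attribute; "target_role" marks it as the series to be predicted -->
    <parameter key="name" value="production"/>
    <parameter key="target_role" value="label"/>
</operator>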
wendy wrote:
Also, in this scheme, do I have to set the same values for horizon, window_size and step_size for both Series2WindowExamples operators in the training and testing stages?
Yes, because otherwise the trained model does not fit the test data when you apply it. The window length and all other potential pre-processing steps have to be identical between training and test; otherwise the model is not appropriate for the test representation.
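As a minimal sketch (reusing the window size from the example process above, with the horizon parameter added for illustration), the two windowing operators mirror each other like this:
<!-- inside the training chain -->
<operator name="TrainingWindowing" class="Series2WindowExamples">
    <parameter key="series_representation" value="encode_series_by_examples"/>
    <parameter key="window_size" value="20"/>
    <parameter key="horizon" value="1"/>
</operator>
<!-- inside the testing chain: identical windowing parameters -->
<operator name="TestingWindowing" class="Series2WindowExamples">
    <parameter key="series_representation" value="encode_series_by_examples"/>
    <parameter key="window_size" value="20"/>
    <parameter key="horizon" value="1"/>
</operator>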
Best regards,
Ralf
Ralf wrote:
Webinars can be attended from any computer worldwide with an internet connection, and some of the introductory webinars are free of charge.
Thanks for the info! And many thanks for your help, Ralf! This is my first post here, and I already feel that you guys are so welcoming and supportive, even to a newbie like me. I'll try to visit more often.
Regards,
Wendy
I am a newcomer to this world. I installed RapidMiner 5.0 and am able to run XValidation fine, but I have problems with forecasting: I cannot find the MultiVariateSeries2WindowExamples operator or any of the Window* operators in my installation. I am using the free version (not the enterprise version). Can you kindly let me know whether these operators are not accessible from the free version at all?
If not, how can I make these operators usable for my models in the RapidMiner 5.0 beta?
Regards,
Partha
Hi Partha,
we have a Time Series Extension and decided to include these operators in the extension, rather than splitting things up with some of the operators in the core and the rest in the extension. So you will have to install the Time Series Extension to get access to these operators. It will be available for download with the new RapidMiner 5.0 release this week.
Greetings,
Sebastian
Hi Partha,
in other words: the RapidMiner time series data mining processes I posted earlier work fine as they are with RapidMiner 4.6 and its value series plugin, and they will also work fine with RapidMiner 5 and its time series extension, which will be released this week.
Best regards,
Ralf