Univariate Forecasting
wendy
New Altair Community Member
Hello all,
I'm new to RapidMiner and am having trouble arranging my operators to carry out univariate time series forecasting. I need some help.
I have a dataset consisting of one attribute: the amount of beer production each month. There are approximately 476 rows in the dataset, and each row represents the beer production in one month. I divided this dataset manually into 70% for training and 30% for testing. After that, I set up the operators in RapidMiner as follows:
- Applying the Series2WindowExamples operator in order to apply windowing.
- Letting an algorithm (such as NeuralNet, LibSVMLearner, etc.) produce a model based on the training data. This is done in a cross-validation scheme.
- Assuming that the above steps give me a correct model, I load my testing dataset (the remaining 30%) and apply the stored model to that testing data.
Now that I come to think of it, I believe I have arranged the operators wrongly, but I don't know the correct way to do it. Regarding the cross-validation operator, I've read some threads in the forum and found out that it can lead to training the model on values that come after the values to be forecast. I guess I should use SlidingWindowValidation instead. Here is my current process:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\Wendy\Desktop\newelec.xls"/>
<parameter key="sheet_number" value="2"/>
</operator>
<operator name="Series2WindowExamples" class="Series2WindowExamples">
<parameter key="series_representation" value="encode_series_by_examples"/>
<parameter key="window_size" value="10"/>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="number_of_validations" value="2"/>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="NeuralNet" class="NeuralNet">
<list key="hidden_layer_types">
</list>
<parameter key="training_cycles" value="1000"/>
<parameter key="learning_rate" value="0.7"/>
<parameter key="momentum" value="0.7"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="C:\Documents and Settings\Wendy\Desktop\newelec_model.mod"/>
</operator>
</operator>
<operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="Performance" class="Performance">
</operator>
</operator>
</operator>
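<!-- testing part: read sheet 3 (the 30% test split), load the stored model, and apply it -->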
<operator name="ExcelExampleSource (2)" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\Wendy\Desktop\newelec.xls"/>
<parameter key="sheet_number" value="3"/>
</operator>
<operator name="ModelLoader" class="ModelLoader">
<parameter key="model_file" value="C:\Documents and Settings\Wendy\Desktop\newelec_model.mod"/>
</operator>
<operator name="ModelApplier (2)" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
</operator>
I realized that there's something wrong with my testing part (the one that begins with ExcelExampleSource (2), ModelLoader, and ModelApplier (2)). I'm supposed to forecast the values for the last 30% of the original dataset, but this testing dataset already contains the actual values; I only need those values at the very end, to compare them with the forecasts.
I'm confused about this. Should I not divide my original dataset at all? How do I let RapidMiner learn the model from the first 70% of the data so that it can produce forecast values for the following 30%?
I'm sorry if anything in my writing is unclear. I've searched the forum archive on this matter and I still don't understand. I'd be very grateful if anyone could help. (Sorry for the very long post ;p)
Thanks in advance,
Wendy
Answers
Hello Wendy,
using SlidingWindowValidation is definitely a better evaluation for univariate time series forecasting than a single split into training and test set. In other words, there is no need for a 70:30 training:test split.
SlidingWindowValidation simulates the passing of time by moving an evaluation window over the data: it always uses the data inside the training window (past data within the simulation) for training and the next test window on the time series for testing and evaluation. Inside the training window of the SlidingWindowValidation, you should use a Series2WindowExamples or a MultiVariateSeries2WindowExamples operator to create the training examples for your regression learner, e.g. a neural net, an SVM, or a linear regression. Since training and test data need to be represented the same way, you should also apply the corresponding windowing (Series2WindowExamples or MultiVariateSeries2WindowExamples, respectively) to the test set, i.e. to the data within the test window.
For more information about and examples of univariate and multivariate time series prediction set-ups, I can recommend the training course "Time Series Analysis with Statistical Methods" and the webinar "Time Series Analysis and Forecasts with RapidMiner". Here is an example process:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSource" class="ExampleSource" breakpoints="after">
<parameter key="attributes" value="data/standard_and_poors_end_of_day_value_time_series.aml"/>
</operator>
<operator name="ChangeAttributeRole" class="ChangeAttributeRole">
<parameter key="name" value="label"/>
</operator>
<operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
<parameter key="training_window_width" value="400"/>
<parameter key="training_window_step_size" value="200"/>
<parameter key="test_window_width" value="1"/>
<operator name="TrainingChain" class="OperatorChain" expanded="yes">
<operator name="TrainingWindowing" class="Series2WindowExamples">
<parameter key="series_representation" value="encode_series_by_examples"/>
<parameter key="window_size" value="20"/>
</operator>
<operator name="JMySVMLearner" class="JMySVMLearner">
<parameter key="C" value="1.0"/>
</operator>
</operator>
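<!-- test window: identical windowing, then model application and performance evaluation -->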
<operator name="TestingChain" class="OperatorChain" expanded="yes">
<operator name="TestingWindowing" class="Series2WindowExamples">
<parameter key="series_representation" value="encode_series_by_examples"/>
<parameter key="window_size" value="20"/>
</operator>
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="RegressionPerformance" class="RegressionPerformance">
<parameter key="root_mean_squared_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="relative_error" value="true"/>
<parameter key="root_relative_squared_error" value="true"/>
</operator>
</operator>
</operator>
</operator>
Webinar "Time Series Analysis and Forecasts with RapidMiner".
Best regards,
Ralf
Hi Ralf,
Thanks so much for your reply! It really helps me to better understand the windowing approach. By the way, the training course and the webinar are just too costly for a student like me. Besides, I'm located in Asia.
I'm a little bit curious about this ChangeAttributeRole operator that you used. What does it do, exactly? It changes the attribute role, as written, but what is the implication? Also, in this scheme, do I have to set the same values for horizon, window_size and step_size for both Series2WindowExamples operators in the training and testing stages?
I have a question about the SlidingWindowValidation operator. Suppose there are 476 rows in my dataset. I want RapidMiner to use the first 337 rows for training and forecast the remaining 139 rows. Initially, I set up the parameters of SlidingWindowValidation as follows:
- training_window_width = 337
- training_window_step_size = 1
- test_window_width = 139
- horizon = 1
However, I realized that in the testing stage, RapidMiner starts forecasting at row 358, because the test window starts at row 338 and the windowing (window_size = 20) uses rows 338 to 357 as inputs for the first forecast.
To solve this, should I set training_window_width = 317 and test_window_width = 159? But then the training could miss some of the rows for learning (especially rows 318 to 337), because they would be used as inputs to forecast the value at row 338. Can anyone help me with this?
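In XML terms, the alternative I mean would look something like this (just a sketch; only the parameters in question are shown, the inner operator chains stay as before):
<operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
    <!-- shrink the training window so that the test window already starts at row 318 -->
    <parameter key="training_window_width" value="317"/>
    <parameter key="training_window_step_size" value="1"/>
    <!-- widen the test window; with window_size 20, the first forecast is then row 338 -->
    <parameter key="test_window_width" value="159"/>
    <parameter key="horizon" value="1"/>
</operator>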
I'm sorry for asking a lot of questions.
Thanks again,
Wendy
Hi Wendy,
wendy wrote:
By the way, the training course and the webinar are just too costly for a student like me. Besides, I'm located in Asia.
Webinars can be attended from any computer worldwide with an internet connection, and some of the introductory webinars are free of charge.
wendy wrote:
I'm a little bit curious about this ChangeAttributeRole operator that you used. What does it do, exactly? It changes the attribute role, as written, but what is the implication?
Setting the attribute role of a (time series) attribute to label tells RapidMiner to use this attribute as the one to be predicted in the forecasts. In the case of multivariate time series, you may have many input series upon which the model can be built, but typically you intend to predict one particular series based on them. Correspondingly, you mark the target time series as the label.
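For illustration, assuming your production attribute were named "production" (a hypothetical name) and assuming the 4.x parameters "name" and "target_role", the set-up could look like this:
<operator name="ChangeAttributeRole" class="ChangeAttributeRole">
    <!-- "name" selects the attribute; "target_role" marks it as the series to be predicted -->
    <parameter key="name" value="production"/>
    <parameter key="target_role" value="label"/>
</operator>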
wendy wrote:
Also, in this scheme, do I have to set the same values for horizon, window_size and step_size for both Series2WindowExamples operators in the training and testing stages?
Yes, because otherwise the trained model does not fit the test data when you apply it. The window length and all other potential pre-processing steps have to be identical between training and test; otherwise the model is not appropriate for the test representation.
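As a minimal sketch (reusing the window size from the example process above, with the horizon parameter added for illustration), the two windowing operators mirror each other like this:
<!-- inside the training chain -->
<operator name="TrainingWindowing" class="Series2WindowExamples">
    <parameter key="series_representation" value="encode_series_by_examples"/>
    <parameter key="window_size" value="20"/>
    <parameter key="horizon" value="1"/>
</operator>
<!-- inside the testing chain: identical windowing parameters -->
<operator name="TestingWindowing" class="Series2WindowExamples">
    <parameter key="series_representation" value="encode_series_by_examples"/>
    <parameter key="window_size" value="20"/>
    <parameter key="horizon" value="1"/>
</operator>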
Best regards,
Ralf
Ralf wrote:
Webinars can be attended from any computer worldwide with an internet connection, and some of the introductory webinars are free of charge.
Thanks for the info! And many thanks for your help, Ralf! This is my first post here, and I already feel that you guys are so welcoming and supportive, even to a newbie like me. I'll try to visit more often.
Regards,
Wendy
I am a newcomer to this world. I installed RapidMiner 5.0 and am able to run XValidation fine, but I have problems with forecasting: I cannot find the MultiVariateSeries2WindowExamples operator or any of the Window* operators in my installation. I am using the free version (not the enterprise version). Can you kindly let me know whether these operators are not accessible from the free version at all?
If not, how can I make these operators usable for my models in the RapidMiner 5.0 beta?
Regards,
Partha
Hi Partha,
we have a Time Series Extension and decided to include these operators in the extension, rather than splitting things up with some of the operators in the core and the rest in the extension. So you will have to install the Time Series Extension to get access to these operators. It will be available for download with the new RapidMiner 5.0 release this week.
Greetings,
Sebastian
Hi Partha,
in other words: the RapidMiner time series data mining processes I posted earlier work fine as they are with RapidMiner 4.6 and its value series plugin, and they will also work fine with RapidMiner 5 and its time series extension, which will be released this week.
Best regards,
Ralf