But are we really training on future examples, and testing on past examples?
With what you have, you can easily be training on examples that occur after the examples you are testing!
<operator name="Root" class="Process" expanded="yes"> <operator name="ArffExampleSource" class="ArffExampleSource"> <parameter key="data_file" value="../data/iris.arff"/> <parameter key="label_attribute" value="class"/> </operator> <operator name="IdTagging" class="IdTagging"> </operator> <operator name="FixedSplitValidation" class="FixedSplitValidation" expanded="yes"> <operator name="NearestNeighbors" class="NearestNeighbors" breakpoints="before"> </operator> <operator name="ApplierChain" class="OperatorChain" expanded="yes"> <operator name="Test" class="ModelApplier" breakpoints="after"> <list key="application_parameters"> </list> </operator> <operator name="Performance" class="Performance"> </operator> </operator> </operator></operator>
Hmm, I think what you are trying to say is:
Please be very careful about the validation operator you use: with what you have, you can easily be training on examples that occur after the examples you are testing! It makes more sense to train on the past, or am I missing something? I think I brought this up quite recently, http://rapid-i.com/rapidforum/index.php/topic,908.msg3395.html#msg3395, and amazingly here as well: http://rapid-i.com/rapidforum/index.php/topic,954.msg3593.html#msg3593, so I'm probably wasting my time bringing it up again.
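For what it's worth, here is a minimal plain-Python sketch (not RapidMiner code; the example list and split sizes are made up purely for illustration) of the difference between an arbitrary fixed/random split and a chronological split that only trains on the past:

import random

examples = list(range(100))  # stand-in for examples ordered by time (id = time index)

# Shuffled / arbitrary fixed split: training can contain examples that
# occur *after* examples in the test set.
shuffled = examples[:]
random.shuffle(shuffled)
train_bad, test_bad = shuffled[:70], shuffled[70:]
print(max(train_bad) > min(test_bad))  # almost always True -> temporal leakage

# Chronological split: every training example precedes every test example.
train_ok, test_ok = examples[:70], examples[70:]
print(max(train_ok) < min(test_ok))    # True -> train on the past, test on the future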
If it generates testing examples the same way as "MultivariateSeries2WindowExamples", then it should be the same?
<operator name="Root" class="Process" expanded="yes"> <operator name="ExampleSetGenerator" class="ExampleSetGenerator"> <parameter key="target_function" value="random"/> <parameter key="number_examples" value="1200"/> </operator> <operator name="IdTagging" class="IdTagging"> </operator> <operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes"> <operator name="NearestNeighbors" class="NearestNeighbors"> </operator> <operator name="OperatorChain" class="OperatorChain" expanded="yes"> <operator name="ModelApplier" class="ModelApplier" breakpoints="before"> <list key="application_parameters"> </list> </operator> <operator name="RegressionPerformance" class="RegressionPerformance"> <parameter key="root_mean_squared_error" value="true"/> </operator> </operator> </operator></operator>
Would it then do this? Take the whole series

|--------------------------------------------------------------------------------------------------------------|

and cut it into consecutive windows

|----------||----------||----------||----------||----------||----------||----------||----------||----------||----------|

And how does it split the inner windows into train and test?

|-------...|
train, test
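In case it helps to make the question concrete, here is a rough plain-Python sketch of how a sliding-window scheme typically cuts a series into (train, test) pairs. The function and parameter names (train_width, test_width, step, horizon) are assumptions for illustration only, not the actual parameters of the RapidMiner operator:

# Conceptual sketch of sliding-window validation (not the RapidMiner operator).
def sliding_windows(n, train_width, test_width, step=1, horizon=1):
    """Yield (train_ids, test_ids) pairs over a series of length n.

    The test window starts `horizon - 1` examples after the training window
    ends, and the whole (train, test) pair then slides forward by `step`.
    """
    start = 0
    while start + train_width + horizon - 1 + test_width <= n:
        train_ids = list(range(start, start + train_width))
        test_start = start + train_width + horizon - 1
        test_ids = list(range(test_start, test_start + test_width))
        yield train_ids, test_ids
        start += step

for train_ids, test_ids in sliding_windows(n=20, train_width=8, test_width=4, step=4):
    print("train", train_ids[0], "-", train_ids[-1],
          "| test", test_ids[0], "-", test_ids[-1])

Note that within every pair the highest training id is below the lowest test id, i.e. each model is trained only on the past relative to its test window.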
haddock wrote: Hi Wessel, I think my point was different... so let me try again. Here is an example of what I meant: it is simply the Fixed Split sample with IDs and breakpoints inserted, so you can see the numbers of the examples that are being used for both training and test. Just check the highest training ID against the lowest test ID. If the highest trainer exceeds the lowest tester, then training is taking place on examples that occur after a test case. With fixed split validation, such as you presented, this is inevitable, by definition. [...] Anyway, here's the demo. [...]
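The check haddock describes (highest training ID vs. lowest test ID) is also easy to script outside RapidMiner. A minimal sketch, assuming you have the two ID columns available as plain lists (the numbers below are placeholders, not real output):

# If the highest training id exceeds the lowest test id, training uses
# examples that occur after a test case.
def trains_on_the_future(train_ids, test_ids):
    return max(train_ids) > min(test_ids)

# The kind of split a fixed/random split can produce:
print(trains_on_the_future(train_ids=[3, 7, 120, 45], test_ids=[10, 11, 12]))  # True
# A chronological split:
print(trains_on_the_future(train_ids=[1, 2, 3, 4], test_ids=[5, 6, 7]))        # False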
1. How do I interpret these results?
- why isn't the correlation of Zero-R 0? (see the sketch after these questions)
- why isn't the relative_error of Zero-R 100%?
2. How do I output the models that were found in the final result?
- so that when I hit play, without setting breakpoints, the models still show up in the final result
3. Is this finally a fair time-series comparison?
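On the correlation question, here is a hedged plain-Python sketch (illustrative only, not RapidMiner internals): Pearson correlation needs variance in the predictions, so a strictly constant Zero-R prediction makes the denominator zero and the correlation undefined rather than 0; if each sliding window gets its own training mean, the pooled predictions do vary and can correlate with the label. The label and prediction values below are made up.

import statistics

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return None if den == 0 else num / den

labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

constant_pred = [3.5] * 6              # one global mean: no variance
print(pearson(constant_pred, labels))  # None -> correlation is undefined, not 0

window_means = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]  # a different training mean per window
print(pearson(window_means, labels))   # close to 1: the means track the trend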