how to predict?

ahanazi
ahanazi New Altair Community Member
edited November 5 in Community Q&A
I'm new to RM.

If I have the following attributes:
ticker, date, close

How can I predict the close price for the next day, or the next week?

BR

Answers

  • wessel
    wessel New Altair Community Member
    1. Save the text below as an XML file and open it in RM.
    2. Click on "CSVExampleSource" and change it so it loads your own dataset.
    3. My dataset also uses "close" as the label attribute, so this should be the same.
    If not, change all occurrences of close to your own label.
    4. Change W-REPTree to any regression learning algorithm that suits you.


    #### example.xml ####
    <operator name="Root" class="Process" expanded="yes">
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="D:\wessel\Desktop\testBook1.csv"/>
        </operator>
        <operator name="MultivariateSeries2WindowExamples (2)" class="MultivariateSeries2WindowExamples">
            <parameter key="window_size" value="2"/>
            <parameter key="label_attribute" value="close"/>
            <parameter key="add_incomplete_windows" value="true"/>
        </operator>
        <operator name="ChangeAttributeName" class="ChangeAttributeName">
            <parameter key="old_name" value="label"/>
            <parameter key="new_name" value="close"/>
        </operator>
        <operator name="FeatureNameFilter" class="FeatureNameFilter">
            <parameter key="skip_features_with_name" value="close-0"/>
        </operator>
        <operator name="FixedSplitValidation" class="FixedSplitValidation" expanded="yes">
            <parameter key="training_set_size" value="30"/>
            <operator name="W-REPTree" class="W-REPTree">
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                    <parameter key="create_view" value="true"/>
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                    <parameter key="root_mean_squared_error" value="true"/>
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="relative_error" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
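
    For intuition, here is a small Python sketch of what the windowing step does (an illustration with made-up prices, not RapidMiner code): each example carries the previous day's close as a regular attribute and the current close as the label, which is what the learner is trained to predict.

    #### window_sketch.py (illustration only, made-up data) ####
    # Roughly what MultivariateSeries2WindowExamples with window_size = 2 gives
    # you after the redundant close-0 column has been filtered out:
    # regular attribute "close-1" = yesterday's close, label = today's close.
    closes = [100.0, 101.5, 99.8, 102.3, 103.1, 102.7]   # made-up prices

    examples = []
    for i in range(1, len(closes)):                       # first row has no history
        examples.append({"close-1": closes[i - 1], "label": closes[i]})

    for ex in examples:
        print(ex)

    # To forecast further ahead (2 or 3 days), you would pair older closes with
    # the label instead; the windowing operator's horizon parameter is meant
    # for exactly that.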

  • ahanazi
    ahanazi New Altair Community Member
    Thanks a lot.

    Can you explain where exactly you set the prediction item, and how I can control it? I mean, how do I predict one day, two days, or three days ahead?
    Why do the other attributes come out as open-0, open-1, ...? What does this mean?

    I hope I'm not bothering you, as I'm new to RM.

    BR
  • haddock
    haddock New Altair Community Member
    Chaps,

    Please be very careful about the validation operator you use: with what you have, you can easily be training on examples that occur after the examples you are testing! It makes more sense to train on the past, or am I missing something? I think I brought this up quite recently, http://rapid-i.com/rapidforum/index.php/topic,908.msg3395.html#msg3395 , and amazingly here as well http://rapid-i.com/rapidforum/index.php/topic,954.msg3593.html#msg3593 , so I'm probably wasting my time bringing it up again.
  • wessel
    wessel New Altair Community Member
    Yes, very good point Haddock.

    But are we really training on future examples, and testing on past examples?

    Let's say we have a simple dataset, made by hand.
    It consists of 3 numerical attributes:
      t: "hour of the day"
      u: "barometer"
      v: "wind speed"
    And it only has 8 example instances.

    ## Dataset
    t      u      v
    --------------
    t1    u1    v1
    t2    u2    v2
    t3    u3    v3
    t4    u4    v4
    t5    u5    v5
    t6    u6    v6
    t7    u7    v7
    t8    u8    v8


    ## Forecasting v9
    Let's say we wish to predict the wind speed at the next hour, so the value of v+1.
    We already know the value of t+1, since time, or "hour of the day", is a fully deterministic attribute:
    t+1 = (t+0 + 1) mod 24
    We do not know the value of u+1, however, since we cannot look ahead into the future and see what our barometer will read.

    ## Start simple
    So we could start simple: throw away the barometer for now, and see how well we can forecast v+1 using just t. Intuitively I would do a “keep order” 66% split to test the performance.
    So train on:
    t1    v1
    t2    v2
    t3    v3
    t4    v4
    t5    v5

    And test on:
    t6    .
    t7    .
    t8    .


    But I guess training on a “random order” 66% split would work just as well.
    So train on: (randomly selected 6,4,8,1,2)
    t6    v6
    t4    v4
    t8    v8
    t1    v1
    t2    v2

    And test on:
    t7    .
    t5    .
    t3    .

    Or am I making some thinking error here?

    ## A bit harder
    Now when we make the problem a bit harder, by adding barometer information, I think this still holds.

    Let’s say we take a window size of 2, so we are adding the information of the barometer readings from the previous hour, u-1. Then our dataset would look like this:

    t-0      u-0      v-0      t-1    u-1    v-1
    --------------------------------------------
    t1        u1      v1      ?        ?        ?
    t2        u2      v2      t1      u1      v1
    t3        u3      v3      t2      u2      v2
    t4        u4      v4      t3      u3      v3
    t5        u5      v5      t4      u4      v4
    t6        u6      v6      t5      u5      v5
    t7        u7      v7      t6      u6      v6
    t8        u8      v8      t7      u7      v7

    ## Throw away
    I don’t think adding a training example with ? adds any useful information, so I think it’s best to just throw it away. But maybe I’m wrong here.
    Secondly, we can’t use training examples of the full form
    t-0      u-0      v-0      t-1    u-1    v-1
    because u-0 is future information
    (t-0 isn’t, because t is fully deterministic).
    So we have to throw u-0 away.
    But since t-0 and t-1 are 100% correlated, we might as well throw t-0 away too, without losing any information. So then we end up with a new dataset:
    v-0      t-1    u-1    v-1
    ---------------------------
    v2      t1      u1      v1
    v3      t2      u2      v2
    v4      t3      u3      v3
    v5      t4      u4      v4
    v6      t5      u5      v5
    v7      t6      u6      v6
    v8      t7      u7      v7
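
    If it helps, here is a small Python sketch of that windowing-and-dropping step (toy numbers, not RapidMiner): build the size-2 windows, throw away the incomplete first row, and keep only the past values plus the label, so no future information leaks into the regular attributes.

    #### leakage_sketch.py (toy data, illustration only) ####
    # Toy hourly series: t = hour of day, u = barometer, v = wind speed.
    t = [1, 2, 3, 4, 5, 6, 7, 8]
    u = [10.1, 10.3, 10.2, 10.5, 10.4, 10.6, 10.8, 10.7]   # made-up readings
    v = [3.0, 3.4, 3.1, 3.6, 3.9, 4.2, 4.0, 4.5]           # made-up readings

    examples = []
    for i in range(1, len(t)):                  # skip the incomplete first window
        examples.append({
            "t-1": t[i - 1], "u-1": u[i - 1], "v-1": v[i - 1],  # past values, allowed
            "label": v[i],                                      # v-0, what we predict
            # u-0 (and the redundant t-0) are deliberately not included: u-0 is
            # only known at the moment we are trying to predict.
        })

    for ex in examples:
        print(ex)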


    Now does it matter in what order you feed your training examples to your learning algorithm? Well that’s a hard question actually. I think it does, depending on the learning algorithm.

    And I guess you could test this quite easily:
    evaluate the performance of your learning algorithm three times, using cross-validation, a random-order percentage split, and a fixed-order percentage split, and see if there is a significant difference.
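
    A rough Python sketch of that comparison (toy data and a hand-rolled 1-nearest-neighbour regressor, purely illustrative) could look like this:

    #### split_compare_sketch.py (toy illustration, not RapidMiner) ####
    # Compare a fixed-order 66% split with a shuffled 66% split on a toy
    # windowed series (v-1 as the only attribute, v-0 as the label).
    import random

    v = [float(i) + random.uniform(-0.3, 0.3) for i in range(60)]   # trending toy series
    data = [(v[i - 1], v[i]) for i in range(1, len(v))]             # (v-1, label) pairs

    def one_nn(train, x):
        # predict with the label of the training example whose v-1 is closest to x
        return min(train, key=lambda ex: abs(ex[0] - x))[1]

    def mae(train, test):
        return sum(abs(one_nn(train, x) - y) for x, y in test) / len(test)

    cut = int(0.66 * len(data))

    ordered_train, ordered_test = data[:cut], data[cut:]            # train on the past
    shuffled = data[:]
    random.shuffle(shuffled)
    shuffled_train, shuffled_test = shuffled[:cut], shuffled[cut:]  # order ignored

    print("fixed-order split MAE:", mae(ordered_train, ordered_test))
    print("shuffled split MAE:  ", mae(shuffled_train, shuffled_test))
    # On a series with a clear trend the two estimates can differ substantially.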






  • haddock
    haddock New Altair Community Member
    Hi Wessel
    But are we really training on future examples, and testing on past examples?
    I think my point was different ...
    with what you have you can easily be training on examples that occur after the examples you are testing!
    So let me try again. Here is an example of what I meant: it is simply the Fixed Split sample with IDs and breakpoints inserted, so you can see the numbers of the examples being used for both training and testing. Just check the highest training ID against the lowest test ID. If the highest trainer exceeds the lowest tester, then training is taking place on examples that occur after a test case. With fixed split validation, such as you presented, this is inevitable by definition.

    While this doesn't matter if the domain, like the study of the humble Iris, exhibits constant properties, it is less clear that it is in any way appropriate with closing prices. Surely guessing 10 days hence is easier if you have information about day 9?

    Anyways, here's the demo.
    <operator name="Root" class="Process" expanded="yes">
       <operator name="ArffExampleSource" class="ArffExampleSource">
           <parameter key="data_file" value="../data/iris.arff"/>
           <parameter key="label_attribute" value="class"/>
       </operator>
       <operator name="IdTagging" class="IdTagging">
       </operator>
       <operator name="FixedSplitValidation" class="FixedSplitValidation" expanded="yes">
           <operator name="NearestNeighbors" class="NearestNeighbors" breakpoints="before">
           </operator>
           <operator name="ApplierChain" class="OperatorChain" expanded="yes">
               <operator name="Test" class="ModelApplier" breakpoints="after">
                   <list key="application_parameters">
                   </list>
               </operator>
               <operator name="Performance" class="Performance">
               </operator>
           </operator>
       </operator>
    </operator>
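
    The check itself is just a comparison of ID ranges; in Python terms (toy IDs, purely illustrative) it amounts to:

    #### id_check_sketch.py (illustration only) ####
    # The breakpoints above let you read off which example IDs land in the
    # training and test sets; the leak test is just this comparison.
    import random

    ids = list(range(1, 151))          # 150 examples, as in the Iris data
    random.shuffle(ids)                # what a shuffled split does to the order
    train_ids, test_ids = ids[:100], ids[100:]

    if max(train_ids) > min(test_ids):
        print("training uses examples that occur AFTER some test examples")
    else:
        print("all training examples precede the test examples")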
  • wessel
    wessel New Altair Community Member
    Hmm I think what you are trying to say is:

    Dataset:
    v-0      t-1     u-1     v-1
    ---------------------------
    v2       t1       u1      v1
    v3       t2       u2      v2
    v4       t3       u3      v3
    v5       t4       u4      v4
    v6       t5       u5      v5
    v7       t6       u6      v6
    v8       t7       u7      v7

    Good:
    Sampling_type == "linear"
    v2       t1       u1      v1    TRAIN
    v3       t2       u2      v2    
    v4       t3       u3      v3    
    v5       t4       u4      v4    
    v6       t5       u5      v5    

    v7       t6       u6      v6   TEST
    v8       t7       u7      v7  

    Bad:
    Sampling_type == "shuffled"
    v2       t1       u1      v1    TRAIN
    v8       t7       u7      v7  
    v5       t4       u4      v4    
    v4       t3       u3      v3    
    v6       t5       u5      v5    

    v7       t6       u6      v6   TEST
    v3       t2       u2      v2    


    I don't think this is either GOOD or BAD.
    I'm not sure for which learners the order of the training examples matters.
    I guess it doesn't matter for trees, nearest neighbour, linear regression, or Bayesian networks, because they take an entire example set as input.
    I guess it does matter for neural networks and updatable Bayesian networks, because they iterate over the example set.

    Of course it ALWAYS matters when you're trying to learn a non-static target function.
    Let's say we're trying to predict the temperature for the next day.
    Since the earth is getting hotter and hotter, it's no good learning from data from 1000 years ago.
    But then you just have to throw away the old data, I guess.
  • haddock
    haddock New Altair Community Member
    Hmm I think what you are trying to say is:
    I think what I'm trying to say is...
    Please be very careful about the validation operator you use: with what you have, you can easily be training on examples that occur after the examples you are testing! It makes more sense to train on the past, or am I missing something? I think I brought this up quite recently, http://rapid-i.com/rapidforum/index.php/topic,908.msg3395.html#msg3395 , and amazingly here as well http://rapid-i.com/rapidforum/index.php/topic,954.msg3593.html#msg3593 , so I'm probably wasting my time bringing it up again.
    ;)

    Or putting it more succinctly: if you want credible results in economic time-series forecasting, use sliding window validation.
  • wessel
    wessel New Altair Community Member
    How does sliding window validation work?
    If it generates testing examples the same way as "MultivariateSeries2WindowExamples", then it should be the same?

    The same as:
    Good:
    Sampling_type == "linear"
    v2      t1      u1      v1    TRAIN
    v3      t2      u2      v2   
    v4      t3      u3      v3   
    v5      t4      u4      v4   
    v6      t5      u5      v5   

    v7      t6      u6      v6  TEST
    v8      t7      u7      v7 
  • haddock
    haddock New Altair Community Member
    If it generates testing examples the same way as "MultivariateSeries2WindowExamples", then it should be the same?
    But it doesn't, it has a lookback horizon, as well as a lookforward horizon.
  • wessel
    wessel New Altair Community Member
    Okay, I agree, a sliding window validation would be ideal.

    But let's say I have 1200 hours of data.
    |--------------------------------------------------------------------------------------------------------------|

    And I take a
    training window width of 120
    training window step size of -1
    test window width of 120
    horizon of 1

    would it then do this?
    |--------------------------------------------------------------------------------------------------------------|
    ||----------||----------||----------||----------||----------||----------||----------||----------||----------||----------||

    And how is each inner window split into train and test?
    |-------...|
    train, test


  • haddock
    haddock New Altair Community Member
    Hi Wessel,

    As you have it set up

    1. The window first lands when it has filled its training window, so the last training example is 120.
    2. A model gets built on the training window.
    3. The model is run on the test window, in this case examples 121-240 (with a horizon of 1 the test window starts right after the training window).
    4. Then it advances the training window by the step size, which in your case means 120, because -1 makes the step size the same as the training window size; so the training window is now 121-240 and the test window 241-360.
    5. Then it goes back to step 2 and repeats until there are not enough examples left to fill the test set.

    If you run the following you'll see the sequence..
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="random"/>
            <parameter key="number_examples" value="1200"/>
        </operator>
        <operator name="IdTagging" class="IdTagging">
        </operator>
        <operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
            <operator name="NearestNeighbors" class="NearestNeighbors">
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier" breakpoints="before">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                    <parameter key="root_mean_squared_error" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
    would it then do this?
    |--------------------------------------------------------------------------------------------------------------|
    ||----------||----------||----------||----------||----------||----------||----------||----------||----------||----------||

    And how is each inner window split into train and test?
    |-------...|
    train, test
    In the diagram above |----------||----------| represents a training set and a test set |---Train---||---Test---|

    In the case of a horizon of 8 there would be 7 examples not used between the train and test set so like this..

    |---Train---|horizon|---Test---|

    So the window stops sliding when there are fewer than test window size + horizon - 1 examples left after the last training example.
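
    If it helps to see the numbers, here is a small Python sketch of the index arithmetic (not the RapidMiner internals, just the reading described above, where a horizon of h leaves h-1 unused examples between training and test):

    #### sliding_window_sketch.py (index arithmetic only) ####
    # Enumerate train/test example ranges for a sliding window validation.
    def windows(n_examples, train_width, test_width, step, horizon):
        start = 1                                   # 1-based example IDs
        while start + train_width + horizon + test_width - 2 <= n_examples:
            train = (start, start + train_width - 1)
            first_test = train[1] + horizon         # horizon-1 examples are skipped
            test = (first_test, first_test + test_width - 1)
            yield train, test
            start += step

    # Wessel's numbers: 1200 examples, 120/120 windows, horizon 1; a step size of
    # -1 means "use the training window width as the step", so step = 120 here.
    for train, test in windows(1200, 120, 120, 120, 1):
        print("train %4d-%4d   test %4d-%4d" % (train + test))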

    Hope that helps,

    Good weekend!
  • wessel
    wessel New Altair Community Member
    1. How do I interpret these results?
    - why isn't the correlation of Zero-R, 0?
    - why isn't the relative_error of Zero-R, 100%?

    2. How do I output my found models in the final result?
    - so when I hit play, without setting break points

    3. Is this finally a fair time-series comparison?

    Weka Zero-R
    PerformanceVector
    PerformanceVector: absolute_error: 155.000 +/- 0.000 (mikro: 155.000 +/- 6.922)
    relative_error: 19.53% +/- 8.29% (mikro: 19.53% +/- 8.32%)
    normalized_absolute_error: 25.833 +/- 0.000 (mikro: 25.833)
    correlation: 1.000
    prediction_average: 922.500 +/- 325.552 (mikro: 922.500 +/- 325.625)
    spearman_rho: 0.000 +/- 0.000 (mikro: 0.000)
    kendall_tau: 0.000 +/- 0.000 (mikro: 0.000)

    Linear Regression
    PerformanceVector
    PerformanceVector: absolute_error: 0.000 +/- 0.000 (mikro: 0.000 +/- 0.000)
    relative_error: 0.00% +/- 0.00% (mikro: 0.00% +/- 0.00%)
    normalized_absolute_error: 0.000 +/- 0.000 (mikro: 0.000)
    correlation: 1.000 +/- 0.000 (mikro: 1.000)
    prediction_average: 922.500 +/- 325.552 (mikro: 922.500 +/- 325.625)
    spearman_rho: 1.000 +/- 0.000 (mikro: 47.000)
    kendall_tau: 1.000 +/- 0.000 (mikro: 47.000)


    <?xml version="1.0" encoding="windows-1252"?>
    <process version="4.4">
      <operator name="Root" class="Process" expanded="yes">
          <parameter key="logverbosity" value="init"/>
          <parameter key="random_seed" value="2001"/>
          <parameter key="encoding" value="SYSTEM"/>
          <operator name="1500 examples" class="ExampleSetGenerator">
              <parameter key="target_function" value="random"/>
              <parameter key="number_examples" value="1500"/>
              <parameter key="number_of_attributes" value="1"/>
              <parameter key="attributes_lower_bound" value="-10.0"/>
              <parameter key="attributes_upper_bound" value="10.0"/>
              <parameter key="local_random_seed" value="-1"/>
              <parameter key="datamanagement" value="double_array"/>
          </operator>
          <operator name="IdTagging" class="IdTagging">
              <parameter key="create_nominal_ids" value="false"/>
          </operator>
          <operator name="make ID regular" class="ChangeAttributeRole">
              <parameter key="name" value="id"/>
              <parameter key="target_role" value="regular"/>
          </operator>
          <operator name="rename id to wind" class="ChangeAttributeName">
              <parameter key="old_name" value="id"/>
              <parameter key="new_name" value="wind"/>
          </operator>
          <operator name="wind only" class="FeatureNameFilter">
              <parameter key="filter_special_features" value="false"/>
              <parameter key="skip_features_with_name" value=".*"/>
              <parameter key="except_features_with_name" value="wind"/>
          </operator>
          <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
              <parameter key="series_representation" value="encode_series_by_examples"/>
              <parameter key="horizon" value="0"/>
              <parameter key="window_size" value="96"/>
              <parameter key="step_size" value="1"/>
              <parameter key="create_single_attributes" value="true"/>
              <parameter key="add_incomplete_windows" value="false"/>
          </operator>
          <operator name="remove horizon attributes" class="FeatureNameFilter">
              <parameter key="filter_special_features" value="false"/>
              <parameter key="skip_features_with_name" value="wind-([1-9]|1[0-9]|2[0-3])"/>
          </operator>
          <operator name="set label: wind-0" class="ChangeAttributeRole">
              <parameter key="name" value="wind-0"/>
              <parameter key="target_role" value="label"/>
          </operator>
          <operator name="IOMultiplier" class="IOMultiplier">
              <parameter key="number_of_copies" value="1"/>
              <parameter key="io_object" value="ExampleSet"/>
              <parameter key="multiply_type" value="multiply_one"/>
              <parameter key="multiply_which" value="1"/>
          </operator>
          <operator name="SlidingWindowValidation ZR" class="SlidingWindowValidation" expanded="yes">
              <parameter key="keep_example_set" value="false"/>
              <parameter key="create_complete_model" value="false"/>
              <parameter key="training_window_width" value="240"/>
              <parameter key="training_window_step_size" value="-1"/>
              <parameter key="test_window_width" value="24"/>
              <parameter key="horizon" value="24"/>
              <parameter key="cumulative_training" value="false"/>
              <parameter key="average_performances_only" value="true"/>
              <operator name="W-ZeroR" class="W-ZeroR">
                  <parameter key="keep_example_set" value="false"/>
                  <parameter key="D" value="false"/>
              </operator>
              <operator name="OperatorChain ZR" class="OperatorChain" expanded="yes">
                  <operator name="ModelApplier ZR" class="ModelApplier">
                      <parameter key="keep_model" value="true"/>
                      <list key="application_parameters">
                      </list>
                      <parameter key="create_view" value="false"/>
                  </operator>
                  <operator name="RegressionPerformance ZR" class="RegressionPerformance">
                      <parameter key="keep_example_set" value="true"/>
                      <parameter key="main_criterion" value="relative_error"/>
                      <parameter key="root_mean_squared_error" value="false"/>
                      <parameter key="absolute_error" value="true"/>
                      <parameter key="relative_error" value="true"/>
                      <parameter key="relative_error_lenient" value="false"/>
                      <parameter key="relative_error_strict" value="false"/>
                      <parameter key="normalized_absolute_error" value="true"/>
                      <parameter key="root_relative_squared_error" value="false"/>
                      <parameter key="squared_error" value="false"/>
                      <parameter key="correlation" value="true"/>
                      <parameter key="squared_correlation" value="false"/>
                      <parameter key="prediction_average" value="true"/>
                      <parameter key="spearman_rho" value="true"/>
                      <parameter key="kendall_tau" value="false"/>
                      <parameter key="skip_undefined_labels" value="true"/>
                      <parameter key="use_example_weights" value="true"/>
                  </operator>
              </operator>
          </operator>
          <operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
              <parameter key="keep_example_set" value="false"/>
              <parameter key="create_complete_model" value="false"/>
              <parameter key="training_window_width" value="240"/>
              <parameter key="training_window_step_size" value="-1"/>
              <parameter key="test_window_width" value="24"/>
              <parameter key="horizon" value="24"/>
              <parameter key="cumulative_training" value="false"/>
              <parameter key="average_performances_only" value="true"/>
              <operator name="LinearRegression" class="LinearRegression">
                  <parameter key="keep_example_set" value="false"/>
                  <parameter key="feature_selection" value="M5 prime"/>
                  <parameter key="eliminate_colinear_features" value="true"/>
                  <parameter key="use_bias" value="true"/>
                  <parameter key="min_standardized_coefficient" value="1.5"/>
                  <parameter key="ridge" value="1.0E-8"/>
              </operator>
              <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                  <operator name="ModelApplier" class="ModelApplier">
                      <parameter key="keep_model" value="true"/>
                      <list key="application_parameters">
                      </list>
                      <parameter key="create_view" value="false"/>
                  </operator>
                  <operator name="RegressionPerformance" class="RegressionPerformance">
                      <parameter key="keep_example_set" value="true"/>
                      <parameter key="main_criterion" value="relative_error"/>
                      <parameter key="root_mean_squared_error" value="false"/>
                      <parameter key="absolute_error" value="true"/>
                      <parameter key="relative_error" value="true"/>
                      <parameter key="relative_error_lenient" value="false"/>
                      <parameter key="relative_error_strict" value="false"/>
                      <parameter key="normalized_absolute_error" value="true"/>
                      <parameter key="root_relative_squared_error" value="false"/>
                      <parameter key="squared_error" value="false"/>
                      <parameter key="correlation" value="true"/>
                      <parameter key="squared_correlation" value="false"/>
                      <parameter key="prediction_average" value="true"/>
                      <parameter key="spearman_rho" value="true"/>
                      <parameter key="kendall_tau" value="false"/>
                      <parameter key="skip_undefined_labels" value="true"/>
                      <parameter key="use_example_weights" value="true"/>
                  </operator>
              </operator>
          </operator>
      </operator>
    </process>
  • stever1k
    stever1k New Altair Community Member
    haddock wrote:

    Hi Wessel

    I think my point was different ...

    So let me try again. Here is an example of what I meant: it is simply the Fixed Split sample with IDs and breakpoints inserted, so you can see the numbers of the examples being used for both training and testing. Just check the highest training ID against the lowest test ID. If the highest trainer exceeds the lowest tester, then training is taking place on examples that occur after a test case. With fixed split validation, such as you presented, this is inevitable by definition.
    [...]
    Anyways, here's the demo.
    [...]
    Hi, is your example meant to be a good or a bad one? :-) Because if I check the breakpoints, I get for the first breakpoint, which is the training set:

    [screenshot of the training set at the first breakpoint]

    The highest ID is 150, which is the number of samples in the data file.

    For the 2nd breakpoint I get:
    [screenshot of the test set at the second breakpoint]

    So the highest trainer is 150, but the lowest tester is somewhere around 1...5 or so.

  • haddock
    haddock New Altair Community Member
    Hi Wessel,

    ZeroR models predict that each label in the test set will have the average label value found in the training set.
    1. How do I interpret these results?
    By using ZeroR in a time-series you create a moving average predictor, lagged by the distance between the mid-point of the training set and the mid-point of the test set, put more prosaically as...

    Training window/2 + Horizon-1 + Test window/2

    So in your case

    Lag=240/2 + 24-1 +24/2 = 120+23+12 = 155
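
    You can sanity-check that lag on the synthetic series itself. A small Python sketch (the series here is simply v_t = t, mimicking the ID-based label in the process above) reproduces the 155:

    #### zeror_lag_check.py (toy reproduction, not RapidMiner) ####
    # The label increases by 1 per example (v_t = t). ZeroR predicts the mean
    # label of the training window, so the mean absolute error over the test
    # window equals the lag computed above.
    train_width, test_width, horizon = 240, 24, 24

    labels = list(range(1, 1501))
    train = labels[:train_width]                        # examples 1..240
    test_start = train_width + horizon - 1              # horizon-1 skipped examples
    test = labels[test_start:test_start + test_width]   # examples 264..287

    prediction = sum(train) / len(train)                # ZeroR: mean training label
    abs_error = sum(abs(y - prediction) for y in test) / len(test)
    print(abs_error)                                    # 155.0
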
    - why isn't the correlation of Zero-R, 0?
    As each label has a value one more than the one before, we would expect predictions to be, on average, less than the actual values by an amount equal to the lag, and that is what the absolute error shows. Moreover, changes in prediction plot linearly against changes in actual, providing a correlation of 1; for these changes to provide a correlation of 0.0 you'd need plots like the bottom row here..

    http://en.wikipedia.org/wiki/Correlation
    - why isn't the relative_error of Zero-R, 100%?
    Because only two conditions produce a relative error of 100%, the first when the prediction is 0.00, the second when the absolute value of the prediction is twice the actual value; in your set up neither scenario is possible.

    see also. http://en.wikipedia.org/wiki/Relative_error
    2. How do I output my found models in the final result?
    To use the models later within the same process: IOStore/IORetrieve; to use them in another process: ModelWriter/ModelLoader.
    - so when I hit play, without setting break points
    see above.
    3. Is this finally a fair time-series comparison?
    In this context, how is 'fair' defined?
  • wessel
    wessel New Altair Community Member
    @ stever1k
    My example is supposed to be a good one for time series forecasting.
    You are using the Iris dataset, which is a classification problem, not a forecasting problem.
    Using ID tagging and an example set generator I generate a univariate time series with target function v-0 = v-1 + 1.
    Or, perhaps more correctly written: v_t = v_(t-1) + 1.

    Yet the goal of my set-up is to compare the performance of two different approaches to time series forecasting.
    Currently it only shows the following, for the 2 approaches, in 2 different tabs.
    I would like this information to be in 1 single tab:
    absolute_error
    relative_error
    normalized_absolute_error
    correlation
    spearman_rho
    kendall_tau

    Also it would be nice if it reported the number of training / testing examples used.
    And the output of the model as text.
    And, if possible, whether the difference in performance is significant.