[Solved] Another kind of performance measurement for time series
qwertz
Especially in financial data mining, one would build a model not on the actual stock price but on the difference from the previous day.
Consequently, the result of a prediction process will be an estimate of the change in price from one day to the next.
The currently available "forecasting performance" operator for series determines whether the prediction trend is correct.
(e.g. delta[today] = 4; delta[prediction for tomorrow] = 6; delta[tomorrow] = 5 >> trend is true because tomorrow>today AND prediction>today)
In order to determine win/loss this is not sufficient.
(e.g. delta[today] = -4; delta[prediction for tomorrow] = -3; delta[tomorrow] = -2 >> trend is true but the share still loses value)
Hence, the main question is rather whether delta[tomorrow] will be positive or negative.
(e.g. delta[prediction for tomorrow] = -3; delta[tomorrow] = -2 >> trend should be true because prediction and tomorrow have the same sign)
(e.g. delta[prediction for tomorrow] = 4; delta[tomorrow] = -1 >> trend should be false)
(e.g. delta[prediction for tomorrow] = 1; delta[tomorrow] = 3 >> trend should be true)
Can anyone help me implement this kind of performance measure?
PS: With the existing operator I observed fairly good prediction trend accuracy rates of 0.7 to 0.8, but the overall win/loss simulation was only slightly above 0.5 because of the issue described above. So I was wondering whether different data preprocessing could help (e.g. transforming the stock values into binominal data like "up" and "down", but SVMs are not able to handle binominal data). So far I calculate the daily percentage change for all attributes and the label. The attributes that correlate best are then used to build an SVM model. Does anyone happen to know whether there are other essential preprocessing steps to improve prediction quality?
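For illustration, the two preprocessing variants mentioned in the PS (daily percentage change, and a binominal "up"/"down" encoding) could look roughly like this in plain Python. This is only a sketch outside RapidMiner: the function names and the price list are hypothetical, and counting a change of exactly zero as "down" is an arbitrary assumption.

```python
def daily_percent_change(prices):
    """Percentage change from each day to the next.

    The result is one element shorter than `prices`.
    Assumes no price is zero (otherwise the division fails).
    """
    return [100.0 * (curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def direction_labels(changes):
    """Map each change to "up" or "down" (zero is counted as "down" here)."""
    return ["up" if change > 0 else "down" for change in changes]


# Hypothetical closing prices, for illustration only.
prices = [100.0, 110.0, 99.0, 99.0]
changes = daily_percent_change(prices)
labels = direction_labels(changes)
```

Whether a zero change should count as "up", "down", or a third class depends on how the win/loss simulation treats a flat day.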
Thank you for your help!
Kind regards
Sachs
Answers
-
Hi,
you could probably use a combination of Generate Attributes and Aggregate to calculate any desired performance measure. Of course those operators work on example sets and write their results into an example set, but once you have the final value you can extract it as a performance measure with the Extract Performance operator with performance_type set to data_value.
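As a rough analogy outside RapidMiner, the three steps (compute a value per example, aggregate, extract one number) can be sketched in plain Python. The function name `custom_performance`, the dictionary keys, and the sample rows are hypothetical; the per-row measure shown (absolute error) is just one possible choice.

```python
from statistics import mean


def custom_performance(rows, row_measure, aggregate=mean):
    """Compute a user-defined performance value from labelled predictions."""
    # Step 1 (analogous to Generate Attributes): one value per example.
    per_row = [row_measure(row) for row in rows]
    # Steps 2 and 3 (analogous to Aggregate and Extract Performance):
    # collapse the per-example values into a single number and return it.
    return aggregate(per_row)


# Hypothetical rows holding the real delta and the predicted delta.
rows = [
    {"label": -2, "prediction": -3},
    {"label": -1, "prediction": 4},
    {"label": 3, "prediction": 1},
]

# Example measure: mean absolute error between label and prediction.
mae = custom_performance(rows, lambda r: abs(r["label"] - r["prediction"]))
```

Any other per-row formula can be passed as `row_measure` without changing the rest.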
Hope this helps!
Best regards,
Marius
-
Use the script operator?
Alternatively, convert your data to daily differences, so that the data points are actual deltas computed in advance?
-
Thank you for your replies!
I am not familiar with the script operator yet - so I am going to try the combination of Generate Attributes and Aggregate first.
I don't get the second part about converting to daily differences. The input data is already the difference. But a positive predicted trend doesn't necessarily mean a positive absolute result, as today's difference could be e.g. -5 and the prediction -3. So the trend is up, but the overall result is still negative.
Best regards
Sachs
-
If you write me some pseudo code, I can write the script operator code for you.
There is the operator called "Predict Series"
This gives you "real" and "predicted" for your label attribute.
So you have 2 arrays with N data points
real.length() = N and predicted.length() = N
Can you write pseudo code with these arrays?
-
Sorry, I think I don't get what you are saying. The main problem is about the evaluation. So far I used the "Forecasting Performance" operator for that.
"Predict Series" unfortunately won't work with my example because this operator requires univariate data. However, I believe that the prediction part of the model already works fine.
Any comments welcome...
Best regards
Sachs
-
Normally you run Predict Series and then use the script operator to manually do something very similar to Forecasting Performance.
As far as I'm aware you can run predict series and then implement a script that does something like:
(e.g. delta[prediction for tomorrow] = -3; delta[tomorrow] = -2 >> trend should be true because prediction and tomorrow have the same sign)
(e.g. delta[prediction for tomorrow] = 4; delta[tomorrow] = -1 >> trend should be false)
(e.g. delta[prediction for tomorrow] = 1; delta[tomorrow] = 3 >> trend should be true)
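A minimal sketch of such a check, written in plain Python rather than as actual script operator code. It assumes two equally long lists of real and predicted deltas; the function name `sign_agreement_accuracy` is hypothetical, and a product of exactly zero is counted as a match.

```python
def sign_agreement_accuracy(real, predicted):
    """Fraction of data points where real and predicted deltas agree in sign."""
    if len(real) != len(predicted):
        raise ValueError("real and predicted must have the same length")
    if not real:
        raise ValueError("need at least one data point")
    # A product >= 0 means both deltas point the same way (or one of them is zero).
    hits = sum(1 for r, p in zip(real, predicted) if r * p >= 0)
    return hits / len(real)


# The three examples above as (prediction, actual): (-3, -2), (4, -1), (1, 3).
real = [-2, -1, 3]
predicted = [-3, 4, 1]
accuracy = sign_agreement_accuracy(real, predicted)
```

For these three points the first and third count as correct and the second does not, so the result is 2/3.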
But maybe I misunderstood from the beginning, in this case I'm sorry.
Best regards,
Wessel
-
Hi Wessel,
Sorry that it took me so long to answer your kind offer. The long delay doesn't mean that it is any less important to me. I have been traveling for several months with very limited internet access.
I am going to write some pseudo code and post it in the next few days.
Thank you very much!
Sachs
-
I'm still here :P
-
After taking some time to think about the pseudo code, it turned out that the formula is similar to the one for prediction trend accuracy (PTA).
PTA is described in http://rapid-i.com/api/rapidminer-4.6/com/rapidminer/operator/performance/PredictionTrendAccuracy.html
Measures the number of times a regression prediction correctly determines the trend. This performance measure assumes that the attributes of each example represent the values of a time window, and that the label is a value after a certain horizon which should be predicted. All examples build a consecutive series description, i.e. the labels of all examples build the series itself (this is, for example, the case for a windowing step size of 1). This format will be delivered by the Series2ExampleSet operators provided by RapidMiner.
Example: Let's think of a series v1...v10 and a sliding window with window width 3, step size 1 and prediction horizon 1. The resulting example set is then
T1  T2  T3  L    P
------------------
v1  v2  v3  v4   p1
v2  v3  v4  v5   p2
v3  v4  v5  v6   p3
v4  v5  v6  v7   p4
v5  v6  v7  v8   p5
v6  v7  v8  v9   p6
v7  v8  v9  v10  p7
The second last column (L) corresponds to the label, i.e. the value which should be predicted and the last column (P) corresponds to the predictions. The columns T1, T2, and T3 correspond to the regular attributes, i.e. the points which should be used as learning input.
This performance measure then calculates the actual trend between the last time point in the series (T3 here) and the actual label (L) and compares it to the trend between T3 and the prediction (P), sums the products between both trends, and divides this sum by the total number of examples, i.e. [(if ((v4-v3)*(p1-v3)>=0), 1, 0) + (if ((v5-v4)*(p2-v4)>=0), 1, 0) +...] / 7 in this example.
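Read literally, the formula quoted above can be sketched in plain Python as follows. This is not RapidMiner's implementation; the function name and argument names are hypothetical, with `last_values` standing for the last window column (T3), `labels` for L and `predictions` for P.

```python
def prediction_trend_accuracy(last_values, labels, predictions):
    """PTA as quoted above: share of examples where the real trend (L - T3)
    and the predicted trend (P - T3) have a non-negative product."""
    n = len(labels)
    if not (len(last_values) == len(predictions) == n):
        raise ValueError("all three sequences must have the same length")
    if n == 0:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for t, l, p in zip(last_values, labels, predictions)
        if (l - t) * (p - t) >= 0
    )
    return hits / n


# The two delta examples from the first post, as (today, tomorrow, prediction):
# (4, 5, 6) and (-4, -2, -3).
pta = prediction_trend_accuracy([4, -4], [5, -2], [6, -3])
```

Both examples count as a correct trend here, including the second one where the real and predicted deltas are negative, which is the behaviour the first post objects to.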
In contrast to PTA, I need a formula which calculates [(if ((v4)*(p1)>=0), 1, 0) + (if ((v5)*(p2)>=0), 1, 0) +...] / 7
In other words: the subtraction is left out.
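Written out for the example above (7 examples, where example i has label v_{i+3}, last window value v_{i+2} and prediction p_i), the two measures differ only in the subtracted reference value. The name SignAcc is just a hypothetical label for the measure requested here:

```latex
\[
\mathrm{PTA} = \frac{1}{7}\sum_{i=1}^{7}
  \mathbf{1}\!\left[(v_{i+3}-v_{i+2})\,(p_i-v_{i+2}) \ge 0\right],
\qquad
\mathrm{SignAcc} = \frac{1}{7}\sum_{i=1}^{7}
  \mathbf{1}\!\left[v_{i+3}\,p_i \ge 0\right]
\]
```

Here 1[.] is 1 if the condition holds and 0 otherwise. For i = 1 the PTA term is (v4-v3)*(p1-v3) and the SignAcc term is v4*p1, matching the formulas quoted in the text.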
I would appreciate your help very much!!
Kind regards
Sachs
-
Thanks to Wessel, here is the piece of code which creates almost every performance measure one could imagine.
Cheers
Sachs
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="5.3.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30"/>
      <operator activated="true" class="subprocess" compatibility="5.3.008" expanded="true" height="76" name="Subprocess" width="90" x="179" y="30">
        <process expanded="true">
          <operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes" width="90" x="45" y="30">
            <list key="function_descriptions">
              <parameter key="new_performance" value="1*2"/>
            </list>
          </operator>
          <operator activated="true" class="extract_performance" compatibility="5.3.008" expanded="true" height="76" name="Performance" width="90" x="180" y="30">
            <parameter key="performance_type" value="statistics"/>
            <parameter key="attribute_name" value="new_performance"/>
          </operator>
          <connect from_port="in 1" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Subprocess" to_port="in 1"/>
      <connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>