Visit rate prediction

dusankat New Altair Community Member
edited November 5 in Community Q&A
Hi,
As part of my diploma thesis I am trying to predict the visit rate at various sport centres. I have a data set in the form:

date month day visit

My goal is to predict the visit rate at least 30 days into the future. Because the visit rate is a highly seasonal variable, I added month (1-12) and day of the week (1-7) to help the prediction.
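
For completeness, this is roughly how I derive those two helper columns before exporting the CSV (a plain pandas sketch, not part of the RM process; the file name and column names simply mirror my data):

import pandas as pd

# "hall1.csv" and the column names just match my data set.
df = pd.read_csv("hall1.csv", parse_dates=["date"])
df["month"] = df["date"].dt.month          # 1-12
df["day"] = df["date"].dt.dayofweek + 1    # 1 = Monday ... 7 = Sunday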

For about a week I have been trying to set up the process in RapidMiner and find out which operators to use (I have read almost every thread on this forum about time series prediction), but the prediction trend accuracy is still too low (about 60-70%) and the actual predictions don't look good :( I am a beginner in the field of data mining (and RM), so I don't know whether the problem is in the process or in the quality of the data.

Process:
<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
        <parameter key="filename" value="C:\Users\Excalibur\Desktop\hall1.csv"/>
        <parameter key="id_column" value="1"/>
    </operator>
    <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
        <parameter key="horizon" value="30"/>
        <parameter key="window_size" value="20"/>
        <parameter key="label_attribute" value="visit"/>
    </operator>
    <operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
        <parameter key="create_complete_model" value="true"/>
        <parameter key="training_window_width" value="120"/>
        <parameter key="training_window_step_size" value="1"/>
        <parameter key="test_window_width" value="60"/>
        <parameter key="horizon" value="30"/>
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <parameter key="svm_type" value="epsilon-SVR"/>
            <parameter key="gamma" value="0.0010"/>
            <parameter key="C" value="10.0"/>
            <list key="class_weights">
            </list>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="ForecastingPerformance" class="ForecastingPerformance">
                <parameter key="horizon" value="30"/>
                <parameter key="prediction_trend_accuracy" value="true"/>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <list key="log">
                  <parameter key="trend_accuracy" value="operator.ForecastingPerformance.value.prediction_trend_accuracy"/>
                  <parameter key="performance1" value="operator.SlidingWindowValidation.value.performance"/>
                  <parameter key="performance2" value="operator.SlidingWindowValidation.value.performance2"/>
                </list>
            </operator>
        </operator>
    </operator>
    <operator name="ModelWriter" class="ModelWriter">
        <parameter key="model_file" value="C:\Users\Excalibur\Desktop\hall1.mod"/>
        <parameter key="overwrite_existing_file" value="false"/>
    </operator>
</operator>
1) Should the parameter horizon match up in all operators?

2) I am not sure if I fully understand sliding window validation: in the first iteration the model is trained on 120 examples (1-120) and then applied to a test set (examples 151-210, because we skip the 30 (horizon) examples?). A prediction is calculated for each example in the test set and compared with the value 30 days ahead (e.g. the prediction for example 151 is compared with example 181?)
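
Just to make my assumption explicit, this is how I currently picture the split (a plain Python sketch of my understanding, not the actual operator; the numbers come from my parameters above):

def sliding_window_splits(n_examples, train_width=120, horizon=30, test_width=60, step=1):
    """Yield (train, test) index ranges the way I understand SlidingWindowValidation."""
    start = 0
    while start + train_width + horizon + test_width <= n_examples:
        train = range(start, start + train_width)            # examples 1-120 in my 1-based counting
        test_start = start + train_width + horizon           # skip 'horizon' examples
        test = range(test_start, test_start + test_width)    # examples 151-210 in my 1-based counting
        yield train, test
        start += step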

3) As suggested in one thread, I tried to use WindowExamples2ModelingData (right after MultivariateSeries2WindowExamples) to increase prediction accuracy. I set label_name_stem to 'visit', but I was confused by the parameter horizon. When I set horizon to 30, I got an error: The value '30' for the parameter 'horizon' cannot be used: the horizon has to be larger than the window width.
But horizon (30) is larger than the window width (20), or does window width mean something else in this context? I only got it to work when window_size in MultivariateSeries2WindowExamples was larger than horizon in WindowExamples2ModelingData.
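
For reference, this is what I currently think the windowing does (again only a Python sketch of my understanding of window size and horizon, not the operator's actual code):

def window_examples(series, window_size=20, horizon=30):
    """Build (window, label) pairs: the label is the value 'horizon'
    steps after the last value of each window, as I understand it."""
    examples = []
    for i in range(len(series) - window_size - horizon + 1):
        window = series[i:i + window_size]              # the 20 most recent values
        label = series[i + window_size + horizon - 1]   # the value 30 steps ahead
        examples.append((window, label))
    return examples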

4) I found only one way to get those 30 predictions (and I think it is not the right one): I added 30 new examples (date, month, day, blank visit) to the example set, applied the same preprocessing steps I used when creating the model, applied the model and got a new prediction attribute (label). But it seemed that the values of this attribute were shifted by 30 rows, e.g. the prediction in the row for 1.1. was in fact the prediction for 31.1. Is there an operator which can shift the values by horizon steps, so that I can easily compare real and predicted values?
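
In case it helps to show what I mean, the shift I did by hand looks roughly like this (pandas sketch on the example set exported from RM; "prediction" stands for whatever RM named the prediction attribute in my export):

import pandas as pd

horizon = 30
# A prediction produced in row t is a forecast for row t + horizon, so
# shifting the column down by 'horizon' rows lines it up with the real values.
df["prediction_aligned"] = df["prediction"].shift(horizon)
comparison = df[["date", "visit", "prediction_aligned"]].dropna()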

Thanks in advance,
Dusan

Answers

  • haddock New Altair Community Member
    Hi there Dusankat,

    Firstly, welcome to this corner of the lunatic asylum, where we get mesmerised by patterns, and imagine the significance they may have. Now, down to biz...

    It seems to me that there are two areas you might want to revisit, namely the SVM learning parameters and the horizon question. On the former, there is quite an amount of literature suggesting that results are sensitive to the values of C, gamma and, I think, epsilon in your case, so you could let a parameter tweaker check out the combos and see whether that improves things (a rough sketch of what I mean is at the end of this post). On the latter, the question of the horizon, it is important to understand the jargon. Let me explain...

    As you have it, each validation uses the last 120 examples to make a model, which it then tests on a period that starts after a gap of 30 examples and continues for 60 examples. So the first validation happens at example 120, on examples 150-209, as this alteration shows...
    <operator name="Root" class="Process" expanded="yes">
       <operator name="CSVExampleSource" class="CSVExampleSource" activated="no">
           <parameter key="filename" value="C:\Users\Excalibur\Desktop\hall1.csv"/>
           <parameter key="id_column" value="1"/>
       </operator>
       <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples" activated="no">
           <parameter key="horizon" value="30"/>
           <parameter key="window_size" value="20"/>
           <parameter key="label_attribute" value="visit"/>
       </operator>
       <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
           <parameter key="target_function" value="random"/>
           <parameter key="number_examples" value="210"/>
       </operator>
       <operator name="IdTagging" class="IdTagging">
       </operator>
       <operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
           <parameter key="create_complete_model" value="true"/>
           <parameter key="training_window_width" value="120"/>
           <parameter key="training_window_step_size" value="1"/>
           <parameter key="test_window_width" value="60"/>
           <parameter key="horizon" value="30"/>
           <operator name="LibSVMLearner" class="LibSVMLearner">
               <parameter key="svm_type" value="epsilon-SVR"/>
               <parameter key="gamma" value="0.0010"/>
               <parameter key="C" value="10.0"/>
               <list key="class_weights">
               </list>
           </operator>
           <operator name="OperatorChain" class="OperatorChain" expanded="yes">
               <operator name="ModelApplier" class="ModelApplier">
                   <list key="application_parameters">
                   </list>
               </operator>
               <operator name="ForecastingPerformance" class="ForecastingPerformance">
                   <parameter key="horizon" value="30"/>
                   <parameter key="keep_example_set" value="true"/>
                   <parameter key="prediction_trend_accuracy" value="true"/>
               </operator>
               <operator name="ProcessLog" class="ProcessLog" breakpoints="after">
                   <list key="log">
                     <parameter key="trend_accuracy" value="operator.ForecastingPerformance.value.prediction_trend_accuracy"/>
                     <parameter key="performance1" value="operator.SlidingWindowValidation.value.performance"/>
                     <parameter key="performance2" value="operator.SlidingWindowValidation.value.performance2"/>
                   </list>
               </operator>
           </operator>
       </operator>
       <operator name="ModelWriter" class="ModelWriter">
           <parameter key="model_file" value="C:\Users\Excalibur\Desktop\hall1.mod"/>
           <parameter key="overwrite_existing_file" value="false"/>
       </operator>
    </operator>
    If your task is to predict attendance tomorrow, given what you know today, then your horizon should be 1; if, on the other hand, it is to predict what attendance will be thirty days from now, given again what you know now, then your horizon should be 30.

    Intuitively you can see that the bigger the horizon, the more 'out of date' the model will be; exactly the same applies to long test windows. In your case the last test example in each validation would be 30+60 examples after the last training example. In dynamically unstable environments this could also be significant.
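
    To put numbers on that (just arithmetic, nothing RapidMiner-specific; the example positions follow my counting above):

    training_window_width = 120
    horizon = 30
    test_window_width = 60

    last_training_example = training_window_width                   # example 120
    first_test_example = last_training_example + horizon            # example 150
    last_test_example = first_test_example + test_window_width - 1  # example 209

    # By the time the last test example is scored, the model is
    # horizon + test_window_width = 90 examples old.
    print(first_test_example, last_test_example)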

    PS. The more I think about your problem, the more I wonder whether you really need sliding window validation at all. If a sunny Saturday in May is the same from year to year, in attendance terms, then you need not look at the problem as a time series. You only need to treat the data as a series if it is essential to maintain the ordering of the examples.
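
    And the parameter sweep I mentioned above amounts to nothing fancier than this (plain Python, with a hypothetical train_and_score helper standing in for one validation run; the value grids are only guesses):

    # 'train_and_score' is hypothetical: imagine it runs your sliding window
    # validation once and returns the prediction trend accuracy.
    best = None
    for C in [1, 10, 100, 1000]:
        for gamma in [0.0001, 0.001, 0.01, 0.1]:
            for epsilon in [0.01, 0.1, 0.5]:
                score = train_and_score(C=C, gamma=gamma, epsilon=epsilon)
                if best is None or score > best[0]:
                    best = (score, C, gamma, epsilon)
    print("best trend accuracy %.3f with C=%s, gamma=%s, epsilon=%s" % best)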
  • steffen New Altair Community Member
    Hello

    Here are some mixed thoughts:
    With only one non-label attribute (date) as data (plus the derived features day and month), the amount of information available for prediction is rather limited. You can of course add new features like...
    --- was this day a holiday in the area where the sport centre is located?
    --- what was the weather on this day / was there a long period of cold?
    but this may not help you. The question is what can actually be predicted from this data that is not already obvious and could be obtained by simply calculating the visit rate (by counting) for every month or season (and for every centre). Where is the potential hideout for uncovered knowledge?
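
    Just to make the "simply counting" baseline concrete (a rough pandas sketch; the file and column names only mirror what you posted):

    import pandas as pd

    df = pd.read_csv("hall1.csv", parse_dates=["date"])

    # Trivial baseline: average visit rate per (month, day-of-week) cell.
    # Any learned model should clearly beat this table to be worth the effort.
    baseline = df.groupby(["month", "day"])["visit"].mean()
    print(baseline)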

    On a side note: I guess that you learn the visit rate for each centre separately; otherwise I would also include the sport centre information in the data set (different centres may have different base visit rates).

    greetings,

    steffen

  • dusankat New Altair Community Member
    Hello,

    thanks for your help and suggestions. Maybe I didn't describe my problem clearly in my first post, so let me explain it in more detail:

    I would like to predict the visit rate for various sport facilities (e.g. tennis, squash, gym) in a sport centre. I also have weather data (wind speed, temperature, rainfall) from the sport centre's location, and I currently treat a holiday as a Sunday in the day attribute. Or should I add a new binary attribute holiday?
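
    If a separate attribute is the better way, I imagine something like this (pandas sketch; the holiday dates are made up just for illustration):

    import pandas as pd

    df = pd.read_csv("hall1.csv", parse_dates=["date"])

    # Example list only; I would use the real public holidays for the region.
    holidays = pd.to_datetime(["2009-12-24", "2009-12-25", "2010-01-01"])
    df["holiday"] = df["date"].isin(holidays).astype(int)   # 1 = holiday, 0 = normal day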

    I think I cannot predict the visit rate by simply averaging the visit rates of similar days in the past, because it depends on the weather and possibly other factors. If there wasn't a single sunny day in the first week of May, I wouldn't know how many people would come on a particular day in the first week of May if the weather were sunny. I hope that with data mining I can uncover the relationship between weather, day, month and visit rate and predict how many people would come...

    There are recurring patterns in the data (e.g. every Monday the visit rate is higher than on other days), so should I look at the problem as classification rather than time series? Should I convert my label to nominal and use some classification algorithm? To reduce the number of nominal values, I could use the same value for a whole range, e.g. Visit_0_5 for visit rates 0-5; is there an operator in RM that could help me achieve this? Thanks.
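
    Outside of RM, the binning I have in mind would look roughly like this (pandas sketch; the bin edges and class names are only an example):

    import pandas as pd

    # df is my example set loaded from the CSV as before.
    bins = [0, 5, 10, 20, 50, 1000]
    labels = ["Visit_0_5", "Visit_5_10", "Visit_10_20", "Visit_20_50", "Visit_50_plus"]
    # Turn the numeric visit rate into a nominal class label.
    df["visit_class"] = pd.cut(df["visit"], bins=bins, labels=labels, include_lowest=True)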

    Greetings,
    Dusan