How to forecast and improve model simultaneously
Hello!
I’m new to Data Science and RM.
I am asking for some help with the following task. I am building a model that would forecast energy consumption for every day. I have a lot of training data and I have already prepared the input parameters for one month of test data. Because the test data are from the past, I also have the exact energy consumption figures for the whole month. So, I would like to validate my model based on this test data.
Is there any function in RapidMiner that would predict energy consumption for the first day of the month, then take the exact consumption figure from an additional file and use it as a training data and after that predict energy consumption for the second day of the month? Then again, take the exact consumption for second day, use it as a training data and predict consumption for day three of the month, and again, and again, for the whole month.
What I actually need is an algorithm that would predict, then learn from some extra information (not previously known) and train again, repeat this whole task again.
I would appreciate some good advice, thank you in advance!
Best Answer
-
Hello @gp3354,
Welcome to the RapidMiner Community!
I am willing to help you but the scenario you describe can have a lot of variables. Hence, I sat down and made the experiment for myself. This is what I could come up with.
When I sit down to work with RapidMiner on a forecasting model, I start by writing down the question: "What will my energy consumption forecast be for today?" is a great beginning. Then I look at the data I have: you have already prepared it, and that's great too. Now, where is your data stored? There are three possibilities (well, there are more, but let's focus on the simple ones):
- Spreadsheet files.
- RapidMiner IOObjects.
- An SQL database.
If you have your data in spreadsheet files, it will be more difficult to keep them updated, as there is always the possibility of hitting the "Play" button twice. I recommend storing your data in either a RapidMiner object or an SQL database.
Your flow would be something like:
- Retrieve past training data using the Retrieve operator. (A month)
- Retrieve recently labeled data, also using the Retrieve operator (Yesterday)
- Prepare your labeled data to have the same structure as the past training data. (Select Attributes, Set Role, Generate Attributes, Rename and so on... there are many more operators for data preparation but if you kept your data simple, these are the ones I would take a look at)
- Combine both example sets to form the new training data, for example with the Union or Append operator (a Join would merge columns by ID rather than stack the rows).
- Remove the recently labeled data, so that it doesn't get duplicated (there is a Remove Example Set operator).
- Use your new training data (the combined example set) to train your algorithm (I don't know which algorithms you are using).
- Retrieve the unlabeled data (Today). If you have more data, you might want to filter examples at this point. It doesn't matter if you keep this one in files, since you are just reading that data. At this point, I think you know it's either Retrieve, Read Excel, Read CSV or Read Database.
- Apply the model to the unlabeled data. (Apply Model, that was easy!)
- Store the results as the new labeled data with the Store operator.
- Remove the unlabeled data (or mark it in some way so that you can filter it out and RapidMiner doesn't consume it again). I can't help you much with this, as I don't know where you store your data in the first place.
- You are ready for the day. The next day, the process will be the same: retrieve past training data...
I don't know of any function in RapidMiner that would do this recursively for you, except for a creative use of the Split Validation operator, maybe. But since you are a learner, I would refrain from going that route until you are confident.
Now, your second question: you want to validate and optimize your model before running it. That's wise of you, congratulations! That can be done with the Cross Validation operator (since you have data from only a month, you want to get the most out of it).
Remember the step where I told you to use your new training data to train your algorithm? You can either use the Multiply operator to feed a copy of the training data into a Cross Validation, or train your model inside the Cross Validation. I sense that the first one is better for your goals, but nothing beats experimentation.
Now, I don't have RapidMiner Studio on this computer, so I can't build an example for you, but I will happily check your XML if you are in doubt.
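To make the "predict, learn the real figure, train again" loop concrete, here is a minimal sketch in plain Python rather than RapidMiner. Everything in it is an assumption for illustration: the toy model (average consumption per weekday), the function names `train` and `walk_forward`, and the made-up numbers. It is not how any RapidMiner operator works internally.

```python
from collections import defaultdict
from statistics import mean


def train(history):
    """Fit a toy model: mean consumption per weekday (hypothetical stand-in
    for whatever learner you actually use)."""
    by_day = defaultdict(list)
    for weekday, kwh in history:
        by_day[weekday].append(kwh)
    overall = mean(kwh for _, kwh in history)
    table = {day: mean(values) for day, values in by_day.items()}
    # Fall back to the overall mean for weekdays never seen in training.
    return lambda weekday: table.get(weekday, overall)


def walk_forward(history, test_days):
    """Predict each test day, then add its actual value to the training data."""
    history = list(history)  # copy, so the caller's data is left untouched
    results = []
    for weekday, actual in test_days:
        model = train(history)              # retrain on everything known so far
        predicted = model(weekday)          # forecast the next day
        results.append((weekday, predicted, actual))
        history.append((weekday, actual))   # learn from the real outcome
    return results


# Made-up data: (weekday, consumption)
history = [("Mon", 100.0), ("Tue", 96.0), ("Mon", 102.0), ("Tue", 98.0)]
test_days = [("Mon", 110.0), ("Tue", 97.0), ("Mon", 104.0)]

results = walk_forward(history, test_days)
mae = mean(abs(predicted - actual) for _, predicted, actual in results)
```

Because each forecast only ever sees data from before that day, the mean absolute error over the month is an honest estimate of how the model would have performed day by day.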
Hope it helps,
5
Answers
-
Hello rfuentealba,
thank you for the elaborate answer, you're amazing.
I was hoping there might be a built-in function that would solve that problem recursively, but your answer was helpful anyway.
0 -
Hi @gp3354!
Glad it helped.
A little while after I replied, I thought about something else that you should take into consideration. As I don't know what your data looks like, I'll make something up to explain my point.
Let's say this is your data:
Monday, 101kw
Tuesday, 97kw
Wednesday, 98kw
Thursday, 94kw
Friday, 104kw
Saturday, 119kw
Sunday, 93kw.
Let's say you apply a decision tree (I don't care about the algorithm, so I chose this to make it easy), and that since it's Monday, the decision tree is confident that your consumption will be 101kw...
If you store this prediction as your new data, that's OK, but... what if on that Monday your brother showed up at home with some beers to watch a soccer game, your neighbour asked if she could use your washing machine, and you used the coffee machine more than expected because you couldn't sleep? That would result in more than the 101kw you predicted, yet you would still be reinforcing your algorithm with your prediction instead of with the new data, which may be different. Decide whether you want to feed back the prediction or the actual outcome, and fix the process accordingly if needed.
Never forget this rule (I forget it more often than not): Machine Learning isn't about forecasting the future but about using data to drive your decision making, by creating a mathematical idea of what will happen if the behaviour you are studying continues. I guess you already know how to use the operators I mentioned; these are enough to solve this minor inconvenience.
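A tiny Python sketch of the point about feeding back predictions versus outcomes. The numbers are made up and the "model" is just the average of past Mondays (an assumption for illustration, not the decision tree mentioned above):

```python
from statistics import mean

mondays = [101.0, 99.0, 103.0]   # made-up consumption on past Mondays
prediction = mean(mondays)       # toy model: average of past Mondays
actual = 125.0                   # hypothetical, unusually busy Monday

# Feeding the prediction back teaches the model nothing new ...
after_prediction = mean(mondays + [prediction])
# ... while feeding the real outcome back shifts the next forecast.
after_actual = mean(mondays + [actual])

print(prediction, after_prediction, after_actual)
```

With this toy model the forecast stays exactly where it was when the prediction is fed back, and only moves when the real figure is used.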
All the best,
Rodrigo.
1 -
Thank you so much Mr. rfuentealba .
It would be really helpful to the whole community if you share the xml version of that algorithm. I'll be grateful for your support.
Best Regards.
0 -
Hello @puserc,
Have you solved your problem? I haven't been in front of a computer for the past few days; if you want, I can send it tomorrow.
Best regards,
1 -
Hello everyone,
No Mr rfuentealba , I couldn't create it.
I would be grateful if you send it to me as you said .
Thank you so much.
Best regards.
0 -
Hello @puserc
Please find attached. There are three important processes:
02 Predict contains just the executable prediction and works as follows:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="productivity:execute_process" compatibility="8.2.001" expanded="true" height="68" name="Generate Unlabeled" width="90" x="45" y="136">
<parameter key="process_location" value="02-2 Generate Unlabeled Data"/>
<list key="macros"/>
</operator>
<operator activated="true" class="productivity:execute_process" compatibility="8.2.001" expanded="true" height="82" name="Generate Prediction" width="90" x="45" y="34">
<parameter key="process_location" value="02-1 Generate Prediction"/>
<list key="macros"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model" width="90" x="179" y="85">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="union" compatibility="8.2.001" expanded="true" height="82" name="Union" width="90" x="313" y="34"/>
<operator activated="true" class="store" compatibility="8.2.001" expanded="true" height="68" name="Store" width="90" x="447" y="34">
<parameter key="repository_entry" value="Consumption Training"/>
</operator>
<connect from_op="Generate Unlabeled" from_port="result 1" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Generate Prediction" from_port="result 1" to_op="Union" to_port="example set 1"/>
<connect from_op="Generate Prediction" from_port="result 2" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Union" to_port="example set 2"/>
<connect from_op="Union" from_port="union" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
02-1 Generate Prediction updates the historical information with recently scored information (a very rudimentary approach).
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve Consumption Training" width="90" x="112" y="391">
<parameter key="repository_entry" value="Consumption Training"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Only Labeled" width="90" x="246" y="391">
<list key="filters_list">
<parameter key="filters_entry_key" value="Level.is_not_missing."/>
</list>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve Consumption" width="90" x="45" y="85">
<parameter key="repository_entry" value="Consumption"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Only with KwH" width="90" x="179" y="85">
<list key="filters_list">
<parameter key="filters_entry_key" value="KwH.is_not_missing."/>
</list>
<description align="center" color="transparent" colored="false" width="126">I want to learn if my last prediction was good or not</description>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.2.001" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="85">
<list key="function_descriptions">
<parameter key="Level" value="if([KwH]&lt;=35,&quot;Base&quot;,if([KwH]&lt;=55,&quot;Low&quot;,if([KwH]&lt;=75,&quot;Normal&quot;,if([KwH]&lt;=95,&quot;High&quot;,&quot;Too High&quot;))))"/>
<parameter key="Day" value="date_str_custom(Date, &quot;E&quot;)"/>
</list>
<description align="center" color="transparent" colored="false" width="126">Seasonality by day of the week and properly labeling</description>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set Role" width="90" x="447" y="85">
<parameter key="attribute_name" value="Level"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles">
<parameter key="Date" value="id"/>
</list>
<description align="center" color="transparent" colored="false" width="126">Properly labeling</description>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set ID on Training" width="90" x="380" y="391">
<parameter key="attribute_name" value="Date"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="514" y="391"/>
<operator activated="true" breakpoints="after" class="set_minus" compatibility="8.2.001" expanded="true" height="82" name="Set Minus" width="90" x="648" y="85"/>
<operator activated="true" class="union" compatibility="8.2.001" expanded="true" height="82" name="Union" width="90" x="916" y="340"/>
<operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="1050" y="340">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Level|Day|Date"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply (2)" width="90" x="1184" y="340"/>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.2.001" expanded="true" height="103" name="Decision Tree" width="90" x="1452" y="85">
<parameter key="maximal_depth" value="5"/>
<parameter key="apply_pruning" value="false"/>
<parameter key="apply_prepruning" value="false"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model" width="90" x="1653" y="238">
<list key="application_parameters"/>
<description align="center" color="transparent" colored="false" width="126">I want prediction vs reality, don't I?</description>
</operator>
<operator activated="true" class="store" compatibility="8.2.001" expanded="true" height="68" name="Store" width="90" x="1787" y="85">
<parameter key="repository_entry" value="Consumption Training"/>
</operator>
<connect from_op="Retrieve Consumption Training" from_port="output" to_op="Only Labeled" to_port="example set input"/>
<connect from_op="Only Labeled" from_port="example set output" to_op="Set ID on Training" to_port="example set input"/>
<connect from_op="Retrieve Consumption" from_port="output" to_op="Only with KwH" to_port="example set input"/>
<connect from_op="Only with KwH" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
<connect from_op="Set ID on Training" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Set Minus" to_port="subtrahend"/>
<connect from_op="Multiply" from_port="output 2" to_op="Union" to_port="example set 2"/>
<connect from_op="Set Minus" from_port="example set output" to_op="Union" to_port="example set 1"/>
<connect from_op="Union" from_port="union" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Store" to_port="input"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="84"/>
<portSpacing port="sink_result 2" spacing="42"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="yellow" colored="false" height="317" resized="true" width="821" x="10" y="10">Integrating my data from &quot;yesterday&quot; to the scoring algorithm.</description>
<description align="center" color="orange" colored="true" height="384" resized="true" width="820" x="11" y="333">This is my historical data. I use it creatively to filter the data from &quot;yesterday&quot; that I already have predicted and scored with the current values.</description>
<description align="center" color="red" colored="true" height="495" resized="true" width="540" x="834" y="10">Mixing data from &quot;yesterday&quot; and from &quot;history&quot;.<br/><br/>At this point, both data objects have the same structure, except for the prediction.</description>
<description align="center" color="purple" colored="true" height="493" resized="true" width="577" x="1376" y="10">Predictive Algorithm and Storage: notice that I store data with the same structure.</description>
</process>
</operator>
</process>
02-2 Generate Unlabeled Data is just filters and negation queries (set differences). Every time you execute the process, your predictions for the future "improve".
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve Consumption Training" width="90" x="45" y="187">
<parameter key="repository_entry" value="Consumption Training"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="179" y="187">
<list key="filters_list">
<parameter key="filters_entry_key" value="Level.is_in.Base;High;Low;Normal;Too High"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="187">
<parameter key="attribute_name" value="Date"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve Consumption" width="90" x="45" y="85">
<parameter key="repository_entry" value="Consumption"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="85">
<parameter key="attribute_name" value="Date"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="set_minus" compatibility="8.2.001" expanded="true" height="82" name="Set Minus" width="90" x="447" y="136"/>
<operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="KwH.is_missing."/>
</list>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.2.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="136">
<list key="function_descriptions">
<parameter key="Day" value="date_str_custom(Date, &quot;E&quot;)"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="849" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Day|Date"/>
</operator>
<connect from_op="Retrieve Consumption Training" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
<connect from_op="Retrieve Consumption" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
<connect from_op="Set Minus" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="84"/>
<portSpacing port="sink_result 2" spacing="21"/>
<description align="center" color="yellow" colored="false" height="297" resized="true" width="662" x="24" y="36">Get only the data that hasn't already been scored by feature generation.</description>
<description align="center" color="orange" colored="true" height="290" resized="true" width="265" x="691" y="37">We just need a few parameters to perform scoring.</description>
</process>
</operator>
</process>
This process turned out far more complex than what I described. I am pretty sure it can be improved an awful lot, but at least you will have something to work with.
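For readers without RapidMiner at hand, the two data-preparation ideas in these processes can be sketched in plain Python. The thresholds and level names are copied from the Generate Attributes expression above; the function names, the dictionary layout of a row and the sample dates are assumptions made for this illustration only.

```python
from datetime import date

WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def level(kwh):
    """Bin a consumption figure into the label used for training."""
    if kwh <= 35:
        return "Base"
    if kwh <= 55:
        return "Low"
    if kwh <= 75:
        return "Normal"
    if kwh <= 95:
        return "High"
    return "Too High"


def day_abbrev(d):
    """Day-of-week feature, similar in spirit to date_str_custom(Date, "E")."""
    return WEEKDAYS[d.weekday()]


def unscored(rows, scored_dates):
    """Keep rows that are not yet in the training set and have no KwH figure,
    roughly what Set Minus on the Date id plus the 'KwH is missing' filter do."""
    return [r for r in rows
            if r["date"] not in scored_dates and r["kwh"] is None]


rows = [
    {"date": date(2018, 7, 2), "kwh": 101.0},  # already measured
    {"date": date(2018, 7, 3), "kwh": None},   # already scored earlier
    {"date": date(2018, 7, 4), "kwh": None},   # still to be predicted
]
todo = unscored(rows, scored_dates={date(2018, 7, 3)})
```

Here `todo` holds only the row for 4 July 2018, which is the one that still needs a prediction.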
All the best,
Rodrigo.
2 -
2