Error Apply Model [SOLVED]
Danyo83
New Altair Community Member
Hi,
I have a classification problem with 2 classes. Unfortunately one cannot access the prediction label when using a Feature Selection process. So I saved the attribute weights and started a new process with the loaded weights. I applied the model to the "unseen" test set and compared its performance with the performance of the FS process, which used the same weights. The performance of the applied model differs a lot from the test set performance of the FS process. Can you fix this bug, and maybe offer a possibility to access the prediction label via the simple FS process?
Furthermore I have to report that when saving the model in an XML file or similar and recalling it, the performance also differs a lot from the FS process performance. Can you fix it?
Thanks in advance Daniel
Answers
Hi Daniel,
can you please describe in detail what you are doing with the loaded weights, and how you are performing the Feature Selection? Most useful would be example processes.
Best,
Marius
Hi Marius,
thank you very much for your reply. Unfortunately I only just found out that you wrote to me; maybe you could offer an automatic notification email.
Anyway, I have a Feature Selection process (linear split validation). I need the prediction labels of the test set, but unfortunately this is currently not possible with RM. So I use the "save model", "load model" and "apply model" operators and run the process again, only on the test set, in order to get the predictions which I need for further processing. The problem is that the model is not at all the same as the one I saved before. The classification accuracies differ a lot, although the test set in the FS process and the test set the loaded model is applied to are identical. It's the same problem as here:
http://rapid-i.com/rapidforum/index.php/topic,3438.msg16533.html#msg16533
Can I send you my process?
Thanks in advance and again sorry for the late reply.
Daniel
I forgot something. Since the process via save and load model did not work, I built another process via the operators "save attribute weights" and "load attribute weights". The attribute weights are saved after the FS process and loaded when using the split validation with the same classifier, so the accuracy should be the same as on the test set of the FS process. But still, the accuracy is not exactly the same as on the test set of the FS process, only similar. It is hard to describe. I would appreciate being able to send you both processes.
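For what it's worth, the expected behaviour can be illustrated outside RapidMiner: if the very same feature subset is saved, reloaded, and used to retrain the same deterministic learner on the same data, the accuracy must match exactly; any remaining difference means something else in the two processes differs. A plain-Python sketch (the file name and the toy nearest-centroid learner are made up for illustration):

```python
import json, os, tempfile
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def fit_centroids(X, y):
    # Toy deterministic learner: one centroid per class.
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}

def accuracy(model, X, y):
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in (0, 1)])
    return float((d.argmin(axis=0) == y).mean())

selected = [0, 1]  # pretend this subset came out of a feature-selection run

# Save the selection to a file (stand-in for "save attribute weights")...
path = os.path.join(tempfile.mkdtemp(), "weights.json")
with open(path, "w") as f:
    json.dump(selected, f)

# ...and in a fresh run, reload it (stand-in for "load attribute weights").
with open(path) as f:
    reloaded = json.load(f)

acc_saved = accuracy(fit_centroids(X[:, selected], y), X[:, selected], y)
acc_loaded = accuracy(fit_centroids(X[:, reloaded], y), X[:, reloaded], y)
# Same data, same features, deterministic learner -> identical accuracy.
```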
Thanks in advance
Daniel
Hi Daniel,
you can post your processes here in the forum. Just open the process in RapidMiner, go to the XML tab on top of the process view and copy the XML code into your post, surrounding it with code tags via the "#" button above the input field here in the forum.
Best,
Marius
Hi Marius,
this is the code. Instead of using the "store (model)" and "recall (model)" operators, one can also use "write model" and "load model". Since I cannot directly access the prediction label (for the test set) of the feature selection process, I need to save the model built by the FS process in order to load it and apply it to the identical test data. Since I cannot see the predicted label and the performance evaluation at the same time, I need to run this process again, this time with a performance evaluation operator at the end, to be able to compare the performance results of the FS process with those of the built and applied model. Actually the performance should be the same, since the test set data is identical. But the results differ without any apparent reason. I have checked it a hundred times. Do you have an explanation?
P.S. Since the maximum number of characters was reached, I deleted some features in the code, but this shouldn't be a problem...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="995" width="846">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="179" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="keep_best" value="3"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="604" width="748">
<operator activated="true" class="split_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="split" value="absolute"/>
<parameter key="split_ratio" value="0.95"/>
<parameter key="training_set_size" value="2544"/>
<parameter key="test_set_size" value="260"/>
<parameter key="sampling_type" value="linear sampling"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="191" width="331">
<operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="148" y="30"/>
<connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="296" width="346">
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="remember" compatibility="5.1.017" expanded="true" height="60" name="Remember_Model" width="90" x="313" y="120">
<parameter key="name" value="Model_new"/>
<parameter key="io_object" value="Model"/>
</operator>
<operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="514" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_op="Remember_Model" to_port="store"/>
<connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="write_weights" compatibility="5.1.017" expanded="true" height="60" name="Write Weights" width="90" x="514" y="120">
<parameter key="attribute_weights_file" value="C:\Users\Node\daniel_att_weights.wgt"/>
</operator>
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV (2)" width="90" x="246" y="300">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="268" value="a268.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="recall" compatibility="5.1.017" expanded="true" height="60" name="Recall (2)_Model" width="90" x="246" y="210">
<parameter key="name" value="Model_new"/>
<parameter key="io_object" value="Model"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator activated="true" class="read_weights" compatibility="5.1.017" expanded="true" height="60" name="AttributeWeightsLoader (3)" width="90" x="380" y="345">
<parameter key="attribute_weights_file" value="C:\Users\Node\daniel_att_weights.wgt"/>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="514" y="300"/>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model" width="90" x="648" y="255">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV (3)" width="90" x="246" y="615">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="3" value="a3.true.real.attribute"/>
<parameter key="268" value="a268.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="recall" compatibility="5.1.017" expanded="true" height="60" name="Recall (3)_model" width="90" x="246" y="525">
<parameter key="name" value="Model_new"/>
<parameter key="io_object" value="Model"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator activated="true" class="read_weights" compatibility="5.1.017" expanded="true" height="60" name="AttributeWeightsLoader (2)" width="90" x="380" y="705">
<parameter key="attribute_weights_file" value="C:\Users\Node\daniel_att_weights.wgt"/>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="447" y="570"/>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="581" y="525">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="715" y="480">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="FS" to_port="example set in"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="Write Weights" to_port="input"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="Write Weights" from_port="through" to_port="result 5"/>
<connect from_op="Read CSV (2)" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
<connect from_op="Recall (2)_Model" from_port="result" to_op="Apply Model" to_port="model"/>
<connect from_op="AttributeWeightsLoader (3)" from_port="output" to_op="AttributeWeightSelection (2)" to_port="weights"/>
<connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 3"/>
<connect from_op="Read CSV (3)" from_port="output" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="Recall (3)_model" from_port="result" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="AttributeWeightsLoader (2)" from_port="output" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>
How much does the performance differ? Since you use a Validation operator in one case and not in the other, the performance does differ a bit, but it should be of the same magnitude.
By the way, you can simplify your process a bit (see below).
Best, Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="995" width="846">
<operator activated="true" class="read_csv" compatibility="5.2.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="recall" compatibility="5.2.000" expanded="true" height="60" name="Recall (3)_model" width="90" x="45" y="210">
<parameter key="name" value="Model_new"/>
<parameter key="io_object" value="Model"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.2.000" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_selection" compatibility="5.2.000" expanded="true" height="94" name="FS" width="90" x="514" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="keep_best" value="3"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="604" width="748">
<operator activated="true" class="split_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="split" value="absolute"/>
<parameter key="split_ratio" value="0.95"/>
<parameter key="training_set_size" value="2544"/>
<parameter key="test_set_size" value="260"/>
<parameter key="sampling_type" value="linear sampling"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="191" width="331">
<operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="148" y="30"/>
<connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="296" width="346">
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="remember" compatibility="5.2.000" expanded="true" height="60" name="Remember_Model" width="90" x="313" y="120">
<parameter key="name" value="Model_new"/>
<parameter key="io_object" value="Model"/>
</operator>
<operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="ProcessLog" width="90" x="514" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_op="Remember_Model" to_port="store"/>
<connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="380" y="255"/>
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Recall (3)_model" from_port="result" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/>
<connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Hi Marius,
thanks a lot, this really helps to simplify things. I only needed to add another CSV Reader (and disable the Multiply operator), since the model should only be applied to the test set of the data, not to the whole dataset, which also includes the training data...
I thought that, since the test set is identical to the set the model is applied to, the performance should not differ, right? The model is built after the validation process. How can it be that the test set is not classified identically? The accuracy sometimes differs by only 3% (67 vs. 64%), but sometimes by 22% (68 vs. 46%). The latter is the case when the validation process runs for a long time even though the performance does not improve. The strange thing is that the applied model predicted every data point into the same class, never into the other one (it is a 2-class case). That is why its accuracy is only 46%, while the accuracy on the test set of the forward selection process is 68%.
It is really annoying that I cannot fix it.
Can you help me?
Thanks in advance
Daniel
Hi Daniel,
I had another look at the process, and the way it was set up before does not make sense. The Forward Selection executes its subprocess for many candidate feature sets, and there is no guarantee that the last execution takes place on the best feature set; thus the last stored model is not necessarily the best one. You have to output the weights, apply them to the training data and then create the final model. Then you can apply it to the weighted test data.
By the way, you should exchange your Split Validation for an X-Validation for more reliable results, even though it will take more time to run.
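A generic illustration of this pattern (not RapidMiner code; a plain-Python sketch with a toy nearest-centroid learner, all names hypothetical): greedy forward selection picks a feature subset, and the final model is then retrained on that subset before being applied to the test data, instead of reusing a model remembered from inside the selection loop, which may belong to a different (non-optimal) feature set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2 classes, 6 features, only features 0 and 1 are informative.
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 6))
X[:, 0] += y * 2.0
X[:, 1] -= y * 2.0
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

def fit_centroids(X, y):
    # Toy learner: one centroid per class.
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}

def accuracy(model, X, y):
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in (0, 1)])
    return float((d.argmin(axis=0) == y).mean())

# Greedy forward selection, evaluated on the training data only.
selected, remaining, best_acc = [], list(range(X.shape[1])), 0.0
improved = True
while improved and remaining:
    improved, best_f = False, None
    for f in remaining:
        m = fit_centroids(X_train[:, selected + [f]], y_train)
        acc = accuracy(m, X_train[:, selected + [f]], y_train)
        if acc > best_acc:
            best_acc, best_f, improved = acc, f, True
    if improved:
        selected.append(best_f)
        remaining.remove(best_f)

# The crucial step: retrain the final model on the selected subset,
# instead of reusing whatever model the last selection iteration produced.
final_model = fit_centroids(X_train[:, selected], y_train)
test_acc = accuracy(final_model, X_test[:, selected], y_test)
```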
Best,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.001" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="539" width="768">
<operator activated="true" class="read_csv" compatibility="5.2.001" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="5.2.001" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_selection" compatibility="5.2.001" expanded="true" height="94" name="FS" width="90" x="380" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="keep_best" value="3"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="521" width="433">
<operator activated="true" class="split_validation" compatibility="5.2.001" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="split" value="absolute"/>
<parameter key="split_ratio" value="0.95"/>
<parameter key="training_set_size" value="2544"/>
<parameter key="test_set_size" value="260"/>
<parameter key="sampling_type" value="linear sampling"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="191" width="331">
<operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes" width="90" x="148" y="30"/>
<connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="296" width="346">
<operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.001" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="5.2.001" expanded="true" height="76" name="ProcessLog" width="90" x="313" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.2.001" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/>
<operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="380" y="210"/>
<operator activated="true" class="read_csv" compatibility="5.2.001" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.2.001" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/>
<operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.001" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="Naive Bayes (2)" to_port="training set"/>
<connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/>
<connect from_op="Naive Bayes (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
<connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/>
<connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Hi Marius,
thanks, that really makes sense. It works for the Naive Bayes classifier, but when I change it to Linear Discriminant Analysis the error still occurs. The test-set accuracy of the forward selection process is 71.15%, while the accuracy of the model applied to the identical dataset is only 46.15% (all examples are classified into the same class). The selected attributes are the same, so that is not the underlying error...
I really have no idea how this can occur.
Here is the process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in such a way that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search direction from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and replace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="539" width="768">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="5.1.017" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="380" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="keep_best" value="3"/>
<parameter key="maximum_number_of_generations" value="80"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="521" width="433">
<operator activated="true" class="split_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="split" value="absolute"/>
<parameter key="split_ratio" value="0.95"/>
<parameter key="training_set_size" value="2544"/>
<parameter key="test_set_size" value="260"/>
<parameter key="sampling_type" value="linear sampling"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="191" width="331">
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/>
<connect from_port="training" to_op="LDA" to_port="training set"/>
<connect from_op="LDA" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="296" width="346">
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="313" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/>
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/>
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="3" value="a3.true.real.attribute"/>
<parameter key="267" value="a267.true.real.attribute"/>
<parameter key="268" value="a268.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="LDA (2)" to_port="training set"/>
<connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/>
<connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
<connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/>
<connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
I just saw that your Validation is set to "linear sampling". That means it uses the first X examples for training and the remaining ones for testing. If your data is sorted in some way, training and test set end up with the wrong class distribution. You should switch the sampling mode to stratified sampling; that way the class ratio of positive and negative examples is guaranteed to be identical in training and test set.
The outer Model training does not suffer from that problem, since it uses unsampled data. But even after that fix the performances won't be exactly the same, because the outer Train/Apply combination uses the whole dataset both for training and for testing, whereas the FS uses only a part of the data for training and the other part for testing.
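The effect of linear vs. stratified sampling can be sketched outside RapidMiner. The following is a hypothetical Python toy (sorted labels, 80/20 split), not part of the process above: with sorted data, a linear (in-order) split can leave one class entirely out of the test set, while a stratified split preserves the class ratio in both parts.

```python
# Toy labels, sorted by class -- a worst case for linear sampling.
labels = ["pos"] * 80 + ["neg"] * 20  # all "pos" examples come first

def linear_split(ys, train_size):
    # "linear sampling": first train_size examples train, the rest test
    return ys[:train_size], ys[train_size:]

def stratified_split(ys, train_frac):
    # split each class separately so the class ratio is preserved
    train, test = [], []
    for cls in set(ys):
        members = [y for y in ys if y == cls]
        cut = int(len(members) * train_frac)
        train += members[:cut]
        test += members[cut:]
    return train, test

_, test = linear_split(labels, 80)
print(test.count("pos") / len(test))   # 0.0 -- the test set contains no "pos" at all

_, test = stratified_split(labels, 0.8)
print(test.count("pos") / len(test))   # 0.8 -- same class ratio as the full data
```

If Feature_Set_final.dat happened to be sorted by class, this is exactly how a linear split would train and test on very different distributions.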
Btw, in the Log operator you should log the "performance" of the Validation, not of the FS.
And I still urgently suggest exchanging the Split Validation for an X-Validation.
Best, Marius
The CSV files aren't the same. The first CSV file, which is used for the FS, comprises both the training set and the test set. I need the linear sampling since the data has to stay in order: the first 2544 points need to be the training set and the following 260 points need to be the test set, so it must not be shuffled. Since I cannot directly access the prediction label of the test set (via the FS process), I built the model applier in order to be able to access the prediction label.
The second CSV file therefore comprises only the test set, hence the 260 data points. That is why I think the classification performance on the test set of the FS process should be at least nearly the same as the performance of the second process.
In that case you need a Filter Example Range operator in front of the outer LDA to select only the first 2544 examples.
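What Filter Example Range does here can be sketched in plain Python. This is a hypothetical toy using the row counts from this thread, not RapidMiner code; the operator's first_example/last_example parameters are, to my understanding, 1-based and inclusive, so the outer LDA would see exactly the 2544 rows the Split Validation used for training.

```python
# Hypothetical dataset of 2544 training + 260 test rows, in original order.
rows = list(range(1, 2805))  # row ids 1..2804

def filter_example_range(examples, first, last):
    # mimic RapidMiner's Filter Example Range: 1-based, inclusive bounds
    return examples[first - 1:last]

train_rows = filter_example_range(rows, 1, 2544)
test_rows = rows[2544:]                  # the remaining 260 examples
print(len(train_rows), len(test_rows))   # 2544 260
```

With this in place, the model trained after the FS uses the same training partition as the linear split inside the Validation, which is why the two accuracies finally agree.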
Hi Marius,
thanks a lot. Now it works without any difference between the two classification accuracies. I have implemented the mentioned filter and put the Process Log operator into the Validation process.
Is this code correct?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in such a way that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search direction from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and replace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="539" width="768">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="5.1.017" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="380" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="maximum_number_of_generations" value="80"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="521" width="433">
<operator activated="true" class="split_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="split" value="absolute"/>
<parameter key="split_ratio" value="0.95"/>
<parameter key="training_set_size" value="2544"/>
<parameter key="test_set_size" value="260"/>
<parameter key="sampling_type" value="linear sampling"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="191" width="331">
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/>
<connect from_port="training" to_op="LDA" to_port="training set"/>
<connect from_op="LDA" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="296" width="480">
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="313" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="5.1.017" expanded="true" height="76" name="Filter Example Range" width="90" x="45" y="210">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="2544"/>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/>
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="LDA (2)" to_port="training set"/>
<connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/>
<connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/>
<connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Yes, it seems to be fine. Regarding the Log operator you got me wrong, though: it should stay at the FS, but its configuration should log the performance of the Validation, as below.
Best, Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in such a way that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search direction from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and replace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="539" width="768">
<operator activated="true" class="read_csv" compatibility="5.2.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="5.2.000" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_selection" compatibility="5.2.000" expanded="true" height="94" name="FS" width="90" x="380" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="maximum_number_of_generations" value="80"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="521" width="681">
<operator activated="true" class="split_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="split" value="absolute"/>
<parameter key="split_ratio" value="0.95"/>
<parameter key="training_set_size" value="2544"/>
<parameter key="test_set_size" value="260"/>
<parameter key="sampling_type" value="linear sampling"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="191" width="331">
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.2.000" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/>
<connect from_port="training" to_op="LDA" to_port="training set"/>
<connect from_op="LDA" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="296" width="480">
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="ProcessLog" width="90" x="447" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.Validation.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="read_csv" compatibility="5.2.000" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="5.2.000" expanded="true" height="76" name="Filter Example Range" width="90" x="45" y="210">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="2544"/>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/>
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.2.000" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/>
<operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/>
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="LDA (2)" to_port="training set"/>
<connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/>
<connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/>
<connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>0 -
Thanks, you have helped me a lot. Since the FS and the split validation optimize the accuracy with regard to the test set and not only with regard to the training set, the result is biased; for my research I really need the test set to be unknown. Is there any other possibility than using X-Validation or split validation? Of course I could follow your advice and use X-Validation. I would take the original training set of 2544 data points for the X-Validation (10-fold stratified, so that a priori the distribution and the probability of each class is 50:50). Then I would pass the selected attribute weights to the outer model in order to apply it to the unseen set of 260 data points. That way I can guarantee that the 260 data points were never seen or used for optimization. Is this the most reliable approach, or do you have any other suggestion?
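The intended protocol can be sketched in plain Python with toy data (a stand-in majority-class "model" replaces the FS operator and LDA; this is purely illustrative, not RapidMiner's actual operators): all optimization happens on the training split, and the held-out rows are touched exactly once at the end.

```python
def train_majority(labels):
    # stand-in "model": just remember the most frequent training label
    return max(set(labels), key=labels.count)

def accuracy(prediction, labels):
    # fraction of held-out labels matching the frozen prediction
    return sum(1 for y in labels if y == prediction) / len(labels)

# fixed split: the last 3 examples form the held-out test set
labels = ["A"] * 6 + ["B"] * 3 + ["A", "A", "B"]
train_labels, test_labels = labels[:9], labels[9:]

model = train_majority(train_labels)   # all optimization on training only
print(accuracy(model, test_labels))    # test labels used exactly once
```

The point of the sketch is only the ordering: nothing about the test rows feeds back into the choice of model or features.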
Is the following process correct for this approach?<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search direction from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and replace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="539" width="768">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="3" value="a3.true.real.attribute"/>
<parameter key="4" value="a4.true.real.attribute"/>
<parameter key="267" value="a267.true.real.attribute"/>
<parameter key="268" value="a268.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="5.1.017" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345">
<parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="2" value="a2.true.real.attribute"/>
<parameter key="3" value="a3.true.real.attribute"/>
<parameter key="268" value="a268.true.real.attribute"/>
<parameter key="269" value="a269.true.real.attribute"/>
<parameter key="270" value="a270.true.integer.attribute"/>
<parameter key="271" value="a271.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="5.1.017" expanded="true" height="76" name="Filter Example Range" width="90" x="45" y="210">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="2544"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="5.1.017" expanded="true" height="76" name="Filter Example Range (2)" width="90" x="313" y="30">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="2544"/>
</operator>
<operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="514" y="30">
<parameter key="generations_without_improval" value="40"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="maximum_number_of_generations" value="80"/>
<parameter key="normalize_weights" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="521" width="480">
<operator activated="true" class="x_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="258" width="353">
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/>
<connect from_port="training" to_op="LDA" to_port="training set"/>
<connect from_op="LDA" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="258" width="353">
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30">
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="correlation" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/>
<connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="380" y="75">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.Validation.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/>
<operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/>
<operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210">
<parameter key="classification_error" value="true"/>
<parameter key="absolute_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range (2)" to_port="example set input"/>
<connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
<connect from_op="Filter Example Range (2)" from_port="example set output" to_op="FS" to_port="example set in"/>
<connect from_op="FS" from_port="example set out" to_port="result 2"/>
<connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/>
<connect from_op="FS" from_port="performance" to_port="result 1"/>
<connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="LDA (2)" to_port="training set"/>
<connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/>
<connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/>
<connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/>
<connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>0 -
Yes, your process looks fine. One word on the stratification in the X-Validation: stratification in RapidMiner means leaving the class priors untouched, not creating a class distribution of 50:50. So stratified sampling only guarantees a 50:50 class distribution if your training data already has that distribution.
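In other words, stratified sampling splits each class separately so the original class proportions survive in every part. A minimal plain-Python sketch of that idea (not RapidMiner's actual implementation):

```python
from collections import Counter
import random

def stratified_split(examples, labels, test_fraction, seed=42):
    """Split so each part keeps the original class proportions
    (stratification), rather than forcing a 50:50 distribution."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)                       # sample within each class
        n_test = round(len(xs) * test_fraction)
        test += [(x, y) for x in xs[:n_test]]
        train += [(x, y) for x in xs[n_test:]]
    return train, test

# 70:30 class priors stay 70:30 in both parts
labels = ["pos"] * 70 + ["neg"] * 30
train, test = stratified_split(list(range(100)), labels, test_fraction=0.2)
print(Counter(y for _, y in train))   # Counter({'pos': 56, 'neg': 24})
```

So with a 70:30 data set, both the training and the test part are again 70:30; only a 50:50 data set yields 50:50 parts.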
Best, Marius0 -
Thanks Marius,
since I really need the test set to be unseen and the split validation optimizes with regard to the test set, is there a possibility to train and optimize on the training set only and, AFTER having optimized it, apply the model to the unseen test set? Because otherwise I am cheating myself, since the test set is not really unseen...
0 -
In your last process I can't find a Split Validation, and if I see it correctly, the first Filter Example Range filters the data such that the Feature Selection only sees your training data and never gets a glimpse of your test data. Or am I missing something?
All the best,
Marius0 -
You are right, the last process uses X-Validation. But the problem is that it takes too much time. So I was wondering if there is something like split validation where the model is optimized only on the training set and then applied to the unseen test set. Is that possible?
I mean, with split validation the accuracy on the test set is enormous, but it is optimized with regard to the test set, and since I really need the test set to be unseen, I would be cheating myself...0
Sure, you could also use a Split Validation on the training set only; just replace the X-Validation with the Split Validation. However, the X-Validation delivers more accurate performance estimates. If a 10-fold X-Validation is too slow for you, you could reduce the number of iterations... but of course the Split Validation will also do the job.0
-
Ok thanks,
btw, what is the algorithm behind the feature selection? If I use forward selection it starts with an empty feature set, but how does the algorithm choose the next feature(s) for the following generation? Greedy, hill climbing, random? How does "keep best" work? Does it add a certain number of random features and keep the best x of them, or how does it work?0
Forward feature selection adds exactly one attribute in each generation. It is chosen by the following algorithm:
1. For each remaining feature: add the feature to the current feature set, evaluate the performance, then remove the feature again and continue with the next one.
2. Add the feature with the best performance to the feature set.
3. Continue with step 1 until no features are left or the maximum number of features is reached.
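The steps above amount to a greedy search and can be sketched in a few lines of Python (the `evaluate` callback stands in for the inner cross-validation; a real run would typically also stop early when performance no longer improves):

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy forward selection: in each generation, score every candidate
    extension of the current set and keep the single best feature."""
    selected = []
    remaining = list(features)
    limit = max_features or len(features)
    while remaining and len(selected) < limit:
        # step 1: evaluate each remaining feature added to the current set
        scores = {f: evaluate(selected + [f]) for f in remaining}
        # step 2: permanently add the best-scoring feature
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    # step 3: loop until no features are left or the limit is reached
    return selected

# toy scoring function: prefers feature 'a', then 'b', then 'c'
ranking = {"a": 3, "b": 2, "c": 1}
score = lambda feats: sum(ranking[f] for f in feats)
print(forward_selection(["c", "a", "b"], score, max_features=2))  # ['a', 'b']
```

Note each generation re-evaluates every remaining candidate, which is why wrapper approaches get expensive for large attribute sets.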
Best,
Marius
0 -
Many thanks Marius,
great job!0