Problem applying model to unseen data
dataminer99
New Altair Community Member
First off, I'm new to RapidMiner: I used version 4.6 extensively over the holiday and am now converting all my work to RC 5. Kudos to the RM team for a great product!
I built a classification model in RC 5 but am now having a problem scoring an unseen dataset. I split my original data in two: one dataset with 70% of the data for training and testing, and a second with the remaining 30% for validation. I used FeatureSelection to determine the attribute weights and write them out, then learned a NeuralNet model using XValidation and wrote the model to a file. Happy with the training and testing results, I wanted to validate my model on the unseen 30%, so I loaded the weights and applied them to the unseen data, then loaded and applied the model. However, after I apply the model, the output doesn't show a predicted label. The ClassificationPerformance operator fails with an error telling me I need a predicted label, yet I can't seem to get the Model Applier to add one. Is this a bug or am I doing something wrong? Any help is appreciated.
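For readers following along, the workflow described above can be sketched outside RapidMiner roughly as follows. This is only an illustration in plain Python, not RapidMiner's API: the function names (`split_holdout`, `select_by_weights`, `train_centroids`, `apply_model`, `accuracy`), the toy data, and the nearest-centroid learner standing in for the neural net are all assumptions made for the example. It shows the 70/30 split, applying stored weights to the unseen part, scoring it, and why a performance measure needs both the true label and a predicted label.

```python
import random

def split_holdout(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a training part and an unseen part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def select_by_weights(rows, weights, label="label"):
    """Keep only attributes whose weight is 1.0 (assumed selection rule), plus the label."""
    keep = [name for name, w in weights.items() if w == 1.0]
    return [{**{k: r[k] for k in keep}, label: r[label]} for r in rows]

def train_centroids(rows, label="label"):
    """Toy learner (stand-in for the neural net): one mean vector per class."""
    sums, counts = {}, {}
    for r in rows:
        cls = r[label]
        counts[cls] = counts.get(cls, 0) + 1
        acc = sums.setdefault(cls, {})
        for k, v in r.items():
            if k != label:
                acc[k] = acc.get(k, 0.0) + v
    return {cls: {k: s / counts[cls] for k, s in acc.items()}
            for cls, acc in sums.items()}

def apply_model(model, rows):
    """Return copies of the rows with a 'prediction' entry added."""
    scored = []
    for r in rows:
        best = min(model, key=lambda cls: sum((r[k] - c) ** 2
                                              for k, c in model[cls].items()))
        scored.append({**r, "prediction": best})
    return scored

def accuracy(rows, label="label"):
    """Needs both the true label and the prediction on every row."""
    if any("prediction" not in r for r in rows):
        raise ValueError("example set has no predicted label")
    return sum(r["prediction"] == r[label] for r in rows) / len(rows)

# Toy data: 'x1' separates the classes cleanly, 'noise' carries no information.
rng = random.Random(1)
data = []
for i in range(200):
    cls = "yes" if i % 2 == 0 else "no"
    centre = 1.0 if cls == "yes" else -1.0
    data.append({"x1": centre + rng.uniform(-0.25, 0.25),
                 "noise": rng.uniform(-1.0, 1.0),
                 "label": cls})

weights = {"x1": 1.0, "noise": 0.0}   # hypothetical stored feature weights

train, holdout = split_holdout(data)                 # 70% / 30%
train_sel = select_by_weights(train, weights)
holdout_sel = select_by_weights(holdout, weights)    # same weights on unseen data

model = train_centroids(train_sel)
scored = apply_model(model, holdout_sel)
holdout_accuracy = accuracy(scored)
```

Calling `accuracy(holdout_sel)` on the unscored hold-out rows raises the `ValueError`, which is the analogue of the performance operator complaining about a missing predicted label.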
Hi,
thank you for your kind words.
Anyway, this should be possible. Could you please post your process here? I will then take a look at it. If possible, please replace your data with a data generator or sample data, so that I can easily execute the process.
Greetings,
Sebastian
Thanks for your help Sebastian!
Below I've pasted the XML of the three processes I created, with my datasets replaced by data generators: 1) FeatureSelection, 2) NeuralNet, 3) ClassificationPerformance scoring. One difference I noticed: in step 3 I still get the error "Cannot check whether input example set has special attribute prediction", but with the generated data the process completes and produces the PerformanceVector. With my real data it does not; the process stops at the error. Any and all help appreciated! Thanks! Mike
1) Feature Selection
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="391" width="570">
<operator activated="true" class="generate_direct_mailing_data" expanded="true" height="60" name="Generate Direct Mailing Data" width="90" x="45" y="30">
<parameter key="number_examples" value="3000"/>
</operator>
<operator activated="true" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="315" y="30">
<parameter key="attributes" value="taker"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="450" y="30">
<list key="columns"/>
</operator>
<operator activated="true" class="normalize" expanded="true" height="94" name="Normalize" width="90" x="45" y="120">
<parameter key="method" value="range transformation"/>
</operator>
<operator activated="true" class="optimize_selection" expanded="true" height="94" name="FS" width="90" x="180" y="120">
<parameter key="show_population_plotter" value="true"/>
<parameter key="constraint_draw_range" value="true"/>
<process expanded="true" height="604" width="415">
<operator activated="true" class="x_validation" expanded="true" height="112" name="XValidation" width="90" x="45" y="30">
<process expanded="true" height="604" width="165">
<operator activated="true" class="k_nn" expanded="true" height="76" name="NearestNeighbors" width="90" x="45" y="30">
<parameter key="k" value="5"/>
</operator>
<connect from_port="training" to_op="NearestNeighbors" to_port="training set"/>
<connect from_op="NearestNeighbors" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="604" width="300">
<operator activated="true" class="apply_model" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="180" y="30"/>
<connect from_port="model" to_op="Applier" to_port="model"/>
<connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
<connect from_op="Applier" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" expanded="true" height="76" name="ProcessLog" width="90" x="180" y="30">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
</operator>
<connect from_port="example set" to_op="XValidation" to_port="training"/>
<connect from_op="XValidation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
<connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="write_weights" expanded="true" height="60" name="Write Weights" width="90" x="313" y="210">
<parameter key="attribute_weights_file" value="L:\Mike\Strategy\Dayton\RM Dayton Prospect Model\Sample\Sample Prospect Weights.wgt"/>
</operator>
<connect from_op="Generate Direct Mailing Data" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="FS" to_port="example set in"/>
<connect from_op="FS" from_port="example set out" to_port="result 1"/>
<connect from_op="FS" from_port="weights" to_op="Write Weights" to_port="input"/>
<connect from_op="FS" from_port="performance" to_port="result 2"/>
<connect from_op="Write Weights" from_port="through" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
2) NeuralNet
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="440" width="840">
<operator activated="true" class="read_weights" expanded="true" height="60" name="Read Weights" width="90" x="45" y="120">
<parameter key="attribute_weights_file" value="L:\Mike\Strategy\Dayton\RM Dayton Prospect Model\Sample\Sample Prospect Weights.wgt"/>
</operator>
<operator activated="true" class="generate_direct_mailing_data" expanded="true" height="60" name="Generate Direct Mailing Data" width="90" x="45" y="30">
<parameter key="number_examples" value="3000"/>
</operator>
<operator activated="true" class="select_by_weights" expanded="true" height="94" name="Select by Weights" width="90" x="246" y="30">
<parameter key="weight_relation" value="equals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="514" y="30">
<parameter key="attributes" value="taker|sampling|region|lifestage|dwellingtype|dontsms|dontmail|dontemail|dncfederal|dnccbw|dnccbt|directvv|category"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="45" y="210">
<list key="columns"/>
</operator>
<operator activated="true" class="normalize" expanded="true" height="94" name="Normalize" width="90" x="179" y="210">
<parameter key="method" value="range transformation"/>
</operator>
<operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="210">
<process expanded="true" height="389" width="212">
<operator activated="true" class="neural_net" expanded="true" height="76" name="Neural Net" width="90" x="45" y="30">
<list key="hidden_layers"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="389" width="362">
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="write_model" expanded="true" height="60" name="Write Model" width="90" x="45" y="165">
<parameter key="model_file" value="L:\Mike\Strategy\Dayton\RM Dayton Prospect Model\Sample\Sample Prospect Model.mod"/>
</operator>
<operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="179" y="30">
<parameter key="main_criterion" value="accuracy"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
<operator activated="true" class="write_performance" expanded="true" height="60" name="Write Performance" width="90" x="179" y="300">
<parameter key="performance_file" value="L:\Mike\Strategy\Dayton\RM Dayton Prospect Model\Sample\Sample Prospect Performance.per"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Apply Model" from_port="model" to_op="Write Model" to_port="input"/>
<connect from_op="Performance" from_port="performance" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="averagable 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Write Performance" to_port="input"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Weights" from_port="output" to_op="Select by Weights" to_port="weights"/>
<connect from_op="Generate Direct Mailing Data" from_port="output" to_op="Select by Weights" to_port="example set input"/>
<connect from_op="Select by Weights" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
3) ClassificationPerformance
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="362" width="1557">
<operator activated="true" class="read_weights" expanded="true" height="60" name="Read Weights" width="90" x="45" y="120">
<parameter key="attribute_weights_file" value="L:\Mike\Strategy\Dayton\RM Dayton Prospect Model\Sample\Sample Prospect Weights.wgt"/>
</operator>
<operator activated="true" class="generate_direct_mailing_data" expanded="true" height="60" name="Generate Direct Mailing Data" width="90" x="45" y="30">
<parameter key="number_examples" value="10000"/>
</operator>
<operator activated="true" class="select_by_weights" expanded="true" height="94" name="Select by Weights" width="90" x="179" y="30">
<parameter key="weight_relation" value="equals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="313" y="30">
<parameter key="attributes" value="taker|sampling|region|lifestage|dwellingtype|dontsms|dontmail|dontemail|dncfederal|dnccbw|dnccbt|directvv|category"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
<list key="columns"/>
</operator>
<operator activated="true" class="normalize" expanded="true" height="94" name="Normalize" width="90" x="581" y="30">
<parameter key="method" value="range transformation"/>
</operator>
<operator activated="true" class="read_model" expanded="true" height="60" name="Read Model" width="90" x="45" y="210">
<parameter key="model_file" value="L:\Mike\Strategy\Dayton\RM Dayton Prospect Model\Sample\Sample Prospect Model.mod"/>
</operator>
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="715" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="849" y="30">
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read Weights" from_port="output" to_op="Select by Weights" to_port="weights"/>
<connect from_op="Generate Direct Mailing Data" from_port="output" to_op="Select by Weights" to_port="example set input"/>
<connect from_op="Select by Weights" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Read Model" from_port="output" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hi,
sorry, but which process is causing the trouble? I'm a bit confused now...
If it works with the data generators, I cannot reproduce the error, so I cannot tell you why it occurs. Sounds logical, doesn't it?
By the way: I would appreciate it a lot if you could post the processes inside a code environment next time; it makes the thread much more readable. You can create one by pressing the # button above.
Greetings,
Sebastian