What's wrong with this model application?

michaelhecht · June 2009

Hello,

I want to find the optimal parameter set on labeled data and then apply the resulting model to new, unlabeled data.
The data has two columns, x and y (y is called OCM here). Reading a file with only the x column when applying the model
failed: RM told me that two columns are needed (I'm sure this is a beginner's error). I therefore supplied
a two-column file in which I set all y-values to zero. As a result, I got no predictions for the x-values; all predicted values were zero.
Hmmm ... I don't really understand how RM "thinks", so what's wrong?

Here is the code:


<operator name="Root" class="Process" expanded="yes">
    <operator name="MemoryCleanUp" class="MemoryCleanUp">
    </operator>
    <operator name="SimpleExampleSource" class="SimpleExampleSource">
        <parameter key="filename"	value="X:\HE\ModelleUntersuchungen\DataMining\PolyNomApproximation\ozm_svm.txt"/>
        <parameter key="read_attribute_names"	value="true"/>
        <parameter key="label_name"	value="OCM"/>
        <parameter key="label_column"	value="2"/>
    </operator>
    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
            <list key="parameters">
              <parameter key="Learner.N"	value="true,false"/>
              <parameter key="Learner.U"	value="true,false"/>
              <parameter key="Learner.R"	value="true,false"/>
              <parameter key="Learner.M"	value="[4.0;8.0;4;linear]"/>
              <parameter key="Learner.L"	value="true,false"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="keep_example_set"	value="true"/>
                <operator name="Learner" class="W-M5P">
                    <parameter key="keep_example_set"	value="true"/>
                    <parameter key="M"	value="8.0"/>
                </operator>
                <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
    <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
        <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
            <parameter key="filename"	value="X:\HE\ModelleUntersuchungen\DataMining\PolyNomApproximation\ozm_svmTest.txt"/>
            <parameter key="read_attribute_names"	value="true"/>
            <parameter key="label_name"	value="OCM"/>
            <parameter key="label_column"	value="2"/>
        </operator>
        <operator name="ParameterSetter" class="ParameterSetter">
            <list key="name_map">
              <parameter key="Learner"	value="Applier"/>
            </list>
        </operator>
        <operator name="Applier" class="W-M5P">
            <parameter key="keep_example_set"	value="true"/>
            <parameter key="N"	value="true"/>
            <parameter key="U"	value="true"/>
            <parameter key="R"	value="true"/>
            <parameter key="L"	value="true"/>
        </operator>
        <operator name="ModelApplier (2)" class="ModelApplier">
            <list key="application_parameters">
            </list>
            <parameter key="create_view"	value="true"/>
        </operator>
    </operator>
</operator>

And here is the really simple data:

keith · June 2009

You need to think about your process as having three steps:

1) Find the parameters
2) Build the model with the optimal parameters
3) Apply the model to new data

Right now you have the optimal parameters from step 1). But you don't yet have the model in step 2) that enables you to generate predictions in step 3).

Something like this might be more what you're looking for. (Your original dataset was put into file RM_test_data.txt. Your new data for prediction (without OCM) was created as RM_test_data2.txt.)


<operator name="Root" class="Process" expanded="yes">
    <operator name="MemoryCleanUp" class="MemoryCleanUp">
    </operator>
    <operator name="Read in training data set" class="SimpleExampleSource">
        <parameter key="filename"	value="c:\temp\RM_test_data.txt"/>
        <parameter key="read_attribute_names"	value="true"/>
        <parameter key="label_name"	value="OCM"/>
        <parameter key="label_column"	value="2"/>
    </operator>
    <operator name="Determine optimal parameters" class="OperatorChain" expanded="no">
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
            <list key="parameters">
              <parameter key="Learner.N"	value="true,false"/>
              <parameter key="Learner.U"	value="true,false"/>
              <parameter key="Learner.R"	value="true,false"/>
              <parameter key="Learner.M"	value="[4.0;8.0;4;linear]"/>
              <parameter key="Learner.L"	value="true,false"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="keep_example_set"	value="true"/>
                <operator name="Learner" class="W-M5P">
                    <parameter key="keep_example_set"	value="true"/>
                    <parameter key="R"	value="true"/>
                </operator>
                <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
    <operator name="Build model with optimal parameters" class="OperatorChain" expanded="yes">
        <operator name="ParameterSetter" class="ParameterSetter">
            <list key="name_map">
              <parameter key="Learner"	value="Applier"/>
            </list>
        </operator>
        <operator name="Applier" class="W-M5P">
            <parameter key="N"	value="true"/>
            <parameter key="U"	value="true"/>
            <parameter key="R"	value="true"/>
            <parameter key="L"	value="true"/>
        </operator>
    </operator>
    <operator name="Apply model to new data" class="OperatorChain" expanded="yes">
        <operator name="Read in new data for prediction" class="SimpleExampleSource">
            <parameter key="filename"	value="c:\temp\RM_test_data2.txt"/>
            <parameter key="read_attribute_names"	value="true"/>
        </operator>
        <operator name="Generate predictions" class="ModelApplier">
            <list key="application_parameters">
            </list>
            <parameter key="create_view"	value="true"/>
        </operator>
    </operator>
</operator>

michaelhecht · June 2009

Thank you, keith,

so my mistake was that I applied the right operators in the wrong order, wasn't it?

What still doesn't work is applying the model to a file with only one column, i.e.
without a y-column. Is there a trick, or do I always have to provide a dummy y-column?

michaelhecht · June 2009

Hmmm,

I was happy too early. I forgot to change back the original file, which I had modified to make my
workflow work.

After I removed the OCM column, I get the error again:
Could not read file 'c:\temp\RM_test_data2.txt': Number of columns in line 1 was unexpected, was: 1, expected: 2

So nothing changed?!

If I apply the workflow with the right number of columns I get the error:
Applier: Missing input: ExampleSet
in the Applier.

I use RM 4.4. So where is my problem?

keith · June 2009

No, I think the problem with your original is that you tried to go from step 1 to step 3 without having done step 2. :-)

Your first OperatorChain returns the optimal parameters for your model, but not the model itself. Even though you have a W-M5P learner buried in the XValidation inside the GridParameterOptimization, the model doesn't get passed back out of the XValidation node. That learner is "used up" just coming up with the parameters that you want to use with your eventual model. This is step 1.

Once you have the parameters, you need another W-M5P learner node downstream in the process for the ParameterSetter to work on. That's where the "real" model object gets created. You need the full dataset, including the label, to train this model. This is step 2.

Once you have the model (with the optimized parameters) trained (with the training data), you're ready to predict new values. Unlike the example set used for parameter optimization or model training, the example set of new data you want to generate predictions on doesn't need a column for the label. It just needs the columns that are the inputs to the model. Applying the model to the new data (ModelApplier) generates the prediction (label) column. This is step 3.

In your original description, it seemed like you thought that step 1 (parameter optimization) also generated a model that could be used for prediction, which it doesn't. You still needed to train the model on the original data set, with the parameters set to the values found in step 1.
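To make the separation of the three steps concrete, here is a minimal sketch in plain Python. It is only an analogy, not RapidMiner code: the nearest-neighbour learner, the function names and the toy data are all assumptions made up for illustration, standing in for W-M5P, GridParameterOptimization/XValidation and ModelApplier.

```python
# Illustrative analogy only: a plain-Python stand-in, not RapidMiner code.
# The learner, function names and toy data are hypothetical.

def knn_predict(train, x, k):
    """Predict y for x as the mean label of the k nearest training points."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cv_error(data, k):
    """Leave-one-out mean squared error for a given k.

    The models built here are thrown away; only the error is returned.
    """
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total += (knn_predict(rest, x, k) - y) ** 2
    return total / len(data)

# Labeled training data as (x, y) pairs (toy values).
training = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.4), (0.3, 0.6), (0.4, 0.8), (0.5, 1.0)]

# Step 1: find the parameters. Only the best parameter value comes out of this.
best_k = min([1, 2, 3], key=lambda k: cv_error(training, k))

# Step 2: build the model with the optimal parameters on the full labeled set.
def model(x):
    return knn_predict(training, x, best_k)

# Step 3: apply the model to new data, which has no label column.
new_x = [0.05, 0.25, 0.45]
predictions = [model(x) for x in new_x]
```

Note that the parameter search in step 1 returns nothing but `best_k`; the object that actually produces predictions is only created in step 2, and the new data in step 3 consists of x-values alone.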

Does that help clear things up?

Keith

keith · June 2009

michaelhecht wrote:

Hmmm,
After I removed the OCM column, I get again the error:
Could not read file 'c:\temp\RM_test_data2.txt': Number of columns in line 1 was unexpected, was: 1, expected: 2

So nothing changed?!

If I apply the workflow with the right number of columns I get the error:
Applier: Missing input: ExampleSet
in the Applier.

I use RM 4.4. So where is my problem?

Just a guess -- Did you clear the label_name and label_column parameters in the SimpleExampleSource node that reads in RM_test_data2.txt ? If you created the node by copying-pasting the node earlier in the process, you might have those values carried over, which aren't applicable since the new example set doesn't have a label.

Keith

michaelhecht · June 2009

Ok, now I'm a step further. I understood what was missing in my learning
and application chain. Well, now it's obvious.

What was missing in your example (in my opinion) was the example set in

"Build model with optimal parameters"

After adding it, the process works with the dummy column (see code below).
You can also see in the code that I did not forget to remove the label parameters
from the SimpleExampleSource that reads the new data.

Nevertheless I get:

Error in: Read in new data for prediction (SimpleExampleSource)
Could not read file 'c:\temp\RM_test_data2.txt': Number of columns in line 1 was unexpected, was: 1, expected: 2.
The given file could not be read. Please make sure that the file exists and that the RapidMiner process has sufficient privileges.

I have attached the data file again below.


<operator name="Root" class="Process" expanded="yes">
    <operator name="MemoryCleanUp" class="MemoryCleanUp">
    </operator>
    <operator name="Read in training data set" class="SimpleExampleSource">
        <parameter key="filename"	value="c:\temp\RM_test_data.txt"/>
        <parameter key="read_attribute_names"	value="true"/>
        <parameter key="label_name"	value="OCM"/>
        <parameter key="label_column"	value="2"/>
    </operator>
    <operator name="Determine optimal parameters" class="OperatorChain" expanded="yes">
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
            <list key="parameters">
              <parameter key="Learner.N"	value="true,false"/>
              <parameter key="Learner.U"	value="true,false"/>
              <parameter key="Learner.R"	value="true,false"/>
              <parameter key="Learner.M"	value="[4.0;8.0;4;linear]"/>
              <parameter key="Learner.L"	value="true,false"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="keep_example_set"	value="true"/>
                <operator name="Learner" class="W-M5P">
                    <parameter key="keep_example_set"	value="true"/>
                    <parameter key="M"	value="8.0"/>
                </operator>
                <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
    <operator name="Build model with optimal parameters" class="OperatorChain" expanded="yes">
        <operator name="Read in new data for prediction (2)" class="SimpleExampleSource">
            <parameter key="filename"	value="c:\temp\RM_test_data.txt"/>
            <parameter key="read_attribute_names"	value="true"/>
            <parameter key="label_name"	value="OCM"/>
            <parameter key="label_column"	value="2"/>
        </operator>
        <operator name="ParameterSetter" class="ParameterSetter">
            <list key="name_map">
              <parameter key="Learner"	value="Applier"/>
            </list>
        </operator>
        <operator name="Applier" class="W-M5P">
            <parameter key="N"	value="true"/>
            <parameter key="U"	value="true"/>
            <parameter key="R"	value="true"/>
            <parameter key="L"	value="true"/>
        </operator>
    </operator>
    <operator name="Apply model to new data" class="OperatorChain" expanded="yes">
        <operator name="Read in new data for prediction" class="SimpleExampleSource">
            <parameter key="filename"	value="c:\temp\RM_test_data2.txt"/>
            <parameter key="read_attribute_names"	value="true"/>
        </operator>
        <operator name="Generate predictions" class="ModelApplier">
            <list key="application_parameters">
            </list>
            <parameter key="create_view"	value="true"/>
        </operator>
    </operator>
</operator>

Data without y column.


CAE	
0.00
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.20
0.21
0.22
0.23
0.25
0.26
0.27
0.46
0.47
0.48
0.49

michaelhecht · June 2009

So this is the last post from me (I hope).

I have now found the error, which has more to do with a strange behaviour of RM!

I had a blank after the column name of the x-column. This forced the
SimpleExampleSource to "think" that there is more than one column!

After removing all trailing blanks it works.
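For anyone hitting the same message, a small check outside RapidMiner can reveal this kind of problem. The sketch below is plain Python and rests on an assumption: that the reader treats every single blank or tab as a column separator. How SimpleExampleSource actually splits lines is not confirmed here, and the function names are made up for illustration.

```python
# Sketch: compare column counts per line, then strip trailing whitespace.
# Assumption: each blank or tab counts as one column separator.
# Function names are hypothetical, not part of any RapidMiner API.
import re

def column_counts(lines, separator=r"[ \t]"):
    """Count the fields on each line when splitting on every separator character."""
    return [len(re.split(separator, line.rstrip("\r\n"))) for line in lines]

def strip_trailing_whitespace(lines):
    """Remove trailing blanks, tabs and line endings from each line."""
    return [line.rstrip() for line in lines]

# Header with a trailing tab, followed by single-column data rows.
raw = ["CAE\t\n", "0.00\n", "0.02\n"]

before = column_counts(raw)               # header counts as two fields
cleaned = strip_trailing_whitespace(raw)
after = column_counts(cleaned)            # every line now has one field
```

Under that assumption the header yields two fields while the data rows yield one, which would match "was: 1, expected: 2". To clean a real file, read its lines with `open(path).readlines()`, pass them through `strip_trailing_whitespace`, and write them back joined with newlines.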

So thank you again for the training on RM.

keith · June 2009

Glad you got it working...

michaelhecht wrote:

Ok, now I'm a step further. I understood what was missing in my learning
and application chain. Well, now it's obvious.
What was missing in your example (in my opinion) was the example set in

"Build model with optimal parameters"

After implementing this, it works with the dummy column (see code below).

Interesting. I re-ran the code as I originally posted it (without a SimpleExampleSource node inside the "Build model with optimal parameters" chain) and it worked fine as is. Because "keep_example_set" is set on the XValidation node from the previous step, you shouldn't need to load it again.

It's true that RM's "Validate" shows it as missing an example set, but it has been known to be wrong, as it was in this case. "Validate" is helpful, but it also gets confused easily. It should be taken as a suggestion, not as the absolute truth.

You should be able to run the process I posted, even with that apparent error message.

Keith
