"Impute missing values using a saved model"

jmrichardson New Altair Community Member
edited November 5 in Community Q&A
Hello,

I am trying to impute missing values using a k-NN learner. I am working with a large dataset and have saved the model. Now I want to use the saved model in the Impute Missing Values operator for new (unseen) data, because the new data is a much smaller sample. Unfortunately, I cannot get the saved model to impute the dataset. Can someone please help me? Here is what I am trying to do, but it does not work:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Labor-Negotiations" width="90" x="313" y="30">
        <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
      </operator>
      <operator activated="true" breakpoints="after" class="impute_missing_values" compatibility="5.3.008" expanded="true" height="60" name="Impute Missing Values" width="90" x="514" y="30">
        <process expanded="true">
          <operator activated="true" class="read_model" compatibility="5.3.008" expanded="true" height="60" name="Read Model" width="90" x="246" y="30">
            <parameter key="model_file" value="C:\Users\John Richardson\Desktop\test"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="380" y="165">
            <list key="application_parameters"/>
          </operator>
          <connect from_port="example set source" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Read Model" from_port="output" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Labor-Negotiations" from_port="output" to_op="Impute Missing Values" to_port="example set in"/>
      <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Thanks in advance!
John

Answers

  • Hello John

    I got the process below to work, so I am not sure what your error was:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Labor-Negotiations" width="90" x="179" y="30">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="k_nn" compatibility="5.3.013" expanded="true" height="76" name="k-NN" width="90" x="179" y="120"/>
          <operator activated="true" class="remember" compatibility="5.3.013" expanded="true" height="60" name="Remember" width="90" x="380" y="30">
            <parameter key="name" value="m"/>
            <parameter key="io_object" value="Model"/>
          </operator>
          <operator activated="true" class="impute_missing_values" compatibility="5.3.013" expanded="true" height="60" name="Impute Missing Values" width="90" x="380" y="120">
            <process expanded="true">
              <operator activated="true" class="recall" compatibility="5.3.013" expanded="true" height="60" name="Recall" width="90" x="313" y="30">
                <parameter key="name" value="m"/>
                <parameter key="io_object" value="Model"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="5.3.013" expanded="true" height="76" name="Apply Model" width="90" x="447" y="165">
                <list key="application_parameters"/>
              </operator>
              <connect from_port="example set source" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Recall" from_port="result" to_op="Apply Model" to_port="model"/>
              <connect from_op="Apply Model" from_port="model" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Labor-Negotiations" from_port="output" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_op="Remember" to_port="store"/>
          <connect from_op="k-NN" from_port="exampleSet" to_op="Impute Missing Values" to_port="example set in"/>
          <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards

    Andrew
  • jmrichardson New Altair Community Member
    Hi Andrew,

    Thank you for your quick reply. I checked your code and it does work. However, I am not sure it does what I am trying to accomplish. I would like to save the model that is built inside the impute operator so that it can be used later. Here is the code (based on the tutorial), which appears to save the model correctly:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Labor-Negotiations" width="90" x="112" y="30">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="impute_missing_values" compatibility="5.3.008" expanded="true" height="60" name="Impute Missing Values" width="90" x="313" y="30">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="5.3.008" expanded="true" height="76" name="k-NN" width="90" x="246" y="30"/>
              <operator activated="true" class="write_model" compatibility="5.3.008" expanded="true" height="60" name="Write Model" width="90" x="447" y="30">
                <parameter key="model_file" value="C:\save.mod"/>
              </operator>
              <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_op="Write Model" to_port="input"/>
              <connect from_op="Write Model" from_port="through" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Labor-Negotiations" from_port="output" to_op="Impute Missing Values" to_port="example set in"/>
          <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    Now, here is the code I am using to read the saved model and impute the same data set with it. This appears to work on all fields except the two binominal ones (education-allowance and longterm-disability-assistance). Every other field was imputed, but these two seem to be skipped for some reason.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Labor-Negotiations" width="90" x="313" y="30">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="impute_missing_values" compatibility="5.3.008" expanded="true" height="60" name="Impute Missing Values" width="90" x="514" y="30">
            <process expanded="true">
              <operator activated="true" class="read_model" compatibility="5.3.008" expanded="true" height="60" name="Read Model" width="90" x="246" y="30">
                <parameter key="model_file" value="C:\save.mod"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="380" y="165">
                <list key="application_parameters"/>
              </operator>
              <connect from_port="example set source" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Read Model" from_port="output" to_op="Apply Model" to_port="model"/>
              <connect from_op="Apply Model" from_port="model" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Labor-Negotiations" from_port="output" to_op="Impute Missing Values" to_port="example set in"/>
          <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Thanks again for your help,
    John
  • Hello John

    As it happens, the Impute Missing Values operator is a complex beast, and my example is not likely to be of much use.

    The operator iterates over all attributes that contain missing values and, for each one, builds a prediction model using that attribute as the label. In this case, it will iterate 16 times. One of the parameters is "learn on complete cases", which means that only examples with no missing values are used to train the model. For this dataset, only one example meets that criterion.

    The net result is that each iteration creates a model based on one row of training data, and only the last one remains stored. When that single model is then reused for every iteration of the later imputation, its behaviour is hard to predict, because the attributes used to build it will generally differ from those needed for the attribute being imputed. I suspect that binominal predictions are difficult when only a single row of training data is used. The lack of training data also invites overfitting, but that is a separate problem.
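
To make the overwrite problem concrete, here is a minimal Python sketch (not RapidMiner code). It assumes toy data with hypothetical column names and uses a stub in place of a real learner; the only point is that writing every per-attribute model to one fixed file name leaves just the last model on disk, trained on the single complete row.

```python
import os
import pickle
import tempfile

# Toy rows; None marks a missing value. Column names are hypothetical.
rows = [
    {"wage": 2.0, "hours": 40.0, "pension": "none"},
    {"wage": None, "hours": 38.0, "pension": "none"},
    {"wage": 4.5, "hours": None, "pension": None},
]

def attributes_with_missing(rows):
    """Attributes that contain at least one missing value, in column order."""
    return [a for a in rows[0] if any(r[a] is None for r in rows)]

def complete_cases(rows):
    """Rows without any missing value (the 'learn on complete cases' idea)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def train_stub(label, training_rows):
    """Stand-in for a learner: records the label and the training set size."""
    return {"label": label, "n_train": len(training_rows)}

# One fixed file name, as in a process that always writes to the same path.
path = os.path.join(tempfile.mkdtemp(), "save.mod")
for label in attributes_with_missing(rows):
    model = train_stub(label, complete_cases(rows))
    with open(path, "wb") as fh:
        pickle.dump(model, fh)  # each iteration overwrites the previous model

with open(path, "rb") as fh:
    survivor = pickle.load(fh)
print(survivor)
```

In this toy run, three models are trained but only the one for the last attribute survives in the file.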

    There are two things to do. Firstly, create and increment a macro that keeps track of the attribute that is being used temporarily as a label and use that in the name of the model to be saved. Later in the second imputation loop, increment another macro to ensure the correct model is recalled.

    Secondly, uncheck the "learn on complete cases" parameter. This will bring in more training data, but care is needed if the model is poor at handling missing values. I believe k-NN is neither particularly good nor particularly bad at handling missing values. As usual with data mining, how to get the best from your data depends on what you are trying to achieve.
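
The two suggestions above can be sketched outside RapidMiner as follows. This is a hedged Python illustration under several assumptions: the data and column names are made up, the "model" is a simple mean of the known label values rather than k-NN, and files are named after the label attribute (a loop counter would serve the same purpose). Training uses every row where the label is known, which is how I understand the effect of unchecking "learn on complete cases".

```python
import os
import pickle
import tempfile

# Hypothetical training data and new data; None marks a missing value.
train_rows = [
    {"wage": 2.0, "hours": 40.0},
    {"wage": None, "hours": 38.0},
    {"wage": 4.0, "hours": None},
]
new_rows = [{"wage": None, "hours": 36.0}, {"wage": 3.0, "hours": None}]

model_dir = tempfile.mkdtemp()

def model_path(attribute):
    """One file per label attribute, so no model overwrites another."""
    return os.path.join(model_dir, "model_" + attribute + ".pkl")

def fit_and_save(rows):
    """First pass: build and save one model per attribute with missing values."""
    for attr in rows[0]:
        if not any(r[attr] is None for r in rows):
            continue
        # Use every row where the label is known, not only complete rows.
        known = [r[attr] for r in rows if r[attr] is not None]
        model = {"label": attr, "fill": sum(known) / len(known)}
        with open(model_path(attr), "wb") as fh:
            pickle.dump(model, fh)

def load_and_impute(rows):
    """Second pass: fill each missing value with the model saved for that attribute."""
    out = []
    for r in rows:
        filled = dict(r)
        for attr, value in r.items():
            if value is None:
                with open(model_path(attr), "rb") as fh:
                    filled[attr] = pickle.load(fh)["fill"]
        out.append(filled)
    return out

fit_and_save(train_rows)
imputed = load_and_impute(new_rows)
print(imputed)
```

Keying the files by attribute means the second pass cannot pick up a model that was trained for a different label.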

    Hope that helps...

    regards

    Andrew
  • jmrichardson New Altair Community Member
    Hi Andrew,

    OK, I understand your solution. However, I am having trouble building the process in RapidMiner. I have attached the "extract macro" operator to the base learner of the impute operator to try to extract the label for use in the Write Model operator on each iteration. However, I cannot figure out how to extract the label into a macro for each iteration. Can you please help me with this? You have been so helpful so far, and I am hoping you can get me past this last hurdle.

    Thanks again,
    John
  • Hello John

    OK - here's a simple example that uses macros to cause differently named models to be written to c:\temp for later reading.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve Labor-Negotiations" width="90" x="112" y="75">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="5.3.013" expanded="true" height="76" name="Generate Macro" width="90" x="112" y="165">
            <list key="function_descriptions">
              <parameter key="loop" value="1"/>
            </list>
          </operator>
          <operator activated="true" class="impute_missing_values" compatibility="5.3.013" expanded="true" height="60" name="Impute Missing Values" width="90" x="246" y="75">
            <parameter key="learn_on_complete_cases" value="false"/>
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="5.3.013" expanded="true" height="76" name="k-NN" width="90" x="179" y="75"/>
              <operator activated="true" class="write_model" compatibility="5.3.013" expanded="true" height="60" name="Write Model" width="90" x="313" y="75">
                <parameter key="model_file" value="c:\temp\model%{loop}"/>
              </operator>
              <operator activated="true" class="generate_macro" compatibility="5.3.013" expanded="true" height="76" name="Generate Macro (2)" width="90" x="447" y="75">
                <list key="function_descriptions">
                  <parameter key="loop" value="%{loop}+1"/>
                </list>
              </operator>
              <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_op="Write Model" to_port="input"/>
              <connect from_op="Write Model" from_port="through" to_op="Generate Macro (2)" to_port="through 1"/>
              <connect from_op="Generate Macro (2)" from_port="through 1" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve Labor-Negotiations (2)" width="90" x="112" y="300">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="5.3.013" expanded="true" height="76" name="Generate Macro (4)" width="90" x="112" y="390">
            <list key="function_descriptions">
              <parameter key="loop" value="1"/>
            </list>
          </operator>
          <operator activated="true" class="impute_missing_values" compatibility="5.3.013" expanded="true" height="60" name="Impute Missing Values (2)" width="90" x="246" y="300">
            <parameter key="learn_on_complete_cases" value="false"/>
            <process expanded="true">
              <operator activated="true" class="read_model" compatibility="5.3.013" expanded="true" height="60" name="Read Model" width="90" x="179" y="75">
                <parameter key="model_file" value="c:\temp\model%{loop}"/>
              </operator>
              <operator activated="true" class="generate_macro" compatibility="5.3.013" expanded="true" height="76" name="Generate Macro (3)" width="90" x="380" y="75">
                <list key="function_descriptions">
                  <parameter key="loop" value="%{loop}+1"/>
                </list>
              </operator>
              <connect from_op="Read Model" from_port="output" to_op="Generate Macro (3)" to_port="through 1"/>
              <connect from_op="Generate Macro (3)" from_port="through 1" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Labor-Negotiations" from_port="output" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_op="Impute Missing Values" to_port="example set in"/>
          <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
          <connect from_op="Retrieve Labor-Negotiations (2)" from_port="output" to_op="Generate Macro (4)" to_port="through 1"/>
          <connect from_op="Generate Macro (4)" from_port="through 1" to_op="Impute Missing Values (2)" to_port="example set in"/>
          <connect from_op="Impute Missing Values (2)" from_port="example set out" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
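
One thing worth checking with counter-based names: the read loop only matches the write loop if both iterate over the same attributes in the same order. The short Python sketch below illustrates this under the assumption that the operator loops only over attributes that contain missing values; the attribute names are hypothetical.

```python
def counter_names(attributes_with_missing):
    """Map each attribute to the file name a 1-based loop counter would give it."""
    return {
        attr: "model" + str(i)
        for i, attr in enumerate(attributes_with_missing, start=1)
    }

# Write pass: three attributes had missing values in the training data.
written = counter_names(["wage", "hours", "pension"])

# Read pass on the same data lines up with the write pass.
read_same = counter_names(["wage", "hours", "pension"])

# Read pass on new data where "wage" has no missing values: the counter shifts.
read_subset = counter_names(["hours", "pension"])

mismatched = {
    attr: (written[attr], name)
    for attr, name in read_subset.items()
    if written[attr] != name
}
print(mismatched)
```

If the new data can have missing values in a different set of attributes, naming the files after the attribute rather than a counter would avoid this shift.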
    regards

    Andrew
  • jmrichardson New Altair Community Member
    Hi Andrew,

    THANK YOU, THANK YOU!

    You are awesome! :)
    John