[SOLVED] error imputing missing values using linear regression

cgkolar
cgkolar New Altair Community Member
edited November 5 in Community Q&A
Hi. I assumed this would be a straightforward thing to do. I have a dataset with surprisingly few missing values, in just a few of the cases, and I want to impute them. There is an ID field in the data but no label. I set up the following process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
   <process expanded="true" height="550" width="748">
     <operator activated="true" breakpoints="after" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="76" y="158">
       <parameter key="repository_entry" value="c14 lcq for imputation short b"/>
     </operator>
     <operator activated="true" breakpoints="after" class="impute_missing_values" compatibility="5.1.014" expanded="true" height="60" name="Impute Missing Values" width="90" x="313" y="255">
       <parameter key="value_type" value="numeric"/>
       <process expanded="true" height="617" width="950">
         <operator activated="true" breakpoints="after" class="linear_regression" compatibility="5.1.014" expanded="true" height="94" name="Linear Regression" width="90" x="444" y="270">
           <parameter key="feature_selection" value="none"/>
         </operator>
         <connect from_port="example set source" to_op="Linear Regression" to_port="training set"/>
         <connect from_op="Linear Regression" from_port="model" to_port="model sink"/>
         <portSpacing port="source_example set source" spacing="0"/>
         <portSpacing port="sink_model sink" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" breakpoints="after" class="write_excel" compatibility="5.1.014" expanded="true" height="60" name="Write Excel" width="90" x="514" y="255">
       <parameter key="excel_file" value="C:\Documents and Settings\ckolar\My Documents\data model\lcq\c14 missing values mputed.xls"/>
     </operator>
     <connect from_op="Retrieve" from_port="output" to_op="Impute Missing Values" to_port="example set in"/>
     <connect from_op="Impute Missing Values" from_port="example set out" to_op="Write Excel" to_port="input"/>
     <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
It appears to run, and in debug mode it shows me the regression results for each of the 26 variables, but when it gets to the end it throws this error:
Dec 6, 2011 6:05:32 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Dec 6, 2011 6:05:32 PM SEVERE: Here:           Process[1] (Process)
          subprocess 'Main Process'
            +- Retrieve[1] (Retrieve)
            +- Impute Missing Values[1] (Impute Missing Values)
          subprocess 'Replacement Learning'
      ==>   |     +- Linear Regression[26] (Linear Regression)
            +- Write Excel[0] (Write Excel)
Dec 6, 2011 6:05:32 PM FINER: Parameter 'send_mail' is not set. Using default ('never').
Dec 6, 2011 6:05:32 PM SEVERE: java.lang.NullPointerException
That's all I get in verbose mode. Any suggestions would be appreciated; this is my first time trying to impute missing values, so much of this is a learning exercise for me. Thanks, CK

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    in your posted XML code the last lines are missing. Can you please post your complete process setup?

    Kind regards,
    Marius
  • MariusHelf
    MariusHelf New Altair Community Member
    I replaced k-NN with Linear Regression and I can't reproduce your errors. Since Linear Regression can only handle real-valued or binominal labels, in the Labor example it only replaces the real-valued attributes.

    Are you using a current version of RapidMiner? If so, the problem probably only occurs with your data, and a minimal dataset that reproduces the error would be helpful. Another helpful thing is the "Show Details" button in the error dialog you should get in debug mode. Please click it and paste the stack trace here.

    Cheers, Marius
  • cgkolar
    cgkolar New Altair Community Member
    I have a small dataset, all real. What is strange is that when I look at the imputation operation and hover over the example set output of the Linear Regression operator, it shows no more missing values (they do show up on the example set input). It seems to be getting hung up somehow while trying to get out of the Impute Missing Values operator.

    Still not seeing an obvious mistake. Here is one moment of brokenness from the log window:
    Dec 7, 2011 11:35:36 AM FINE: Executing subprocess Impute Missing Values.Replacement Learning. Execution order is: [Linear Regression (Linear Regression)]
    Dec 7, 2011 11:35:36 AM FINE: Starting application 18 of operator Linear Regression
    Dec 7, 2011 11:35:36 AM FINER: Linear Regression called 18th time with input:
     training setConditionedExampleSet:
    217 examples,
    25 regular attributes,
    special attributes = {
       label = #17: i31 (real/single_value)
    }
    Dec 7, 2011 11:35:36 AM FINER: Parameter 'use_bias' is not set. Using default ('true').
    Dec 7, 2011 11:35:36 AM FINER: Parameter 'eliminate_colinear_features' is not set. Using default ('true').
    Dec 7, 2011 11:35:36 AM FINER: Parameter 'ridge' is not set. Using default ('1.0E-8').
    Dec 7, 2011 11:35:36 AM FINER: Parameter 'min_tolerance' is not set. Using default ('0.05').
    Dec 7, 2011 11:35:36 AM FINER: Parameter 'feature_selection' is not set. Using default ('M5 prime').
    Dec 7, 2011 11:35:37 AM FINE: Completed application 18 of operator Linear Regression
    Dec 7, 2011 11:35:37 AM FINER: Linear Regression returned with output:
     model  0.167 * i2
    - 0.064 * i4
    + 0.063 * i5
    + 0.120 * i14
    + 0.054 * i15
    + 0.040 * i16
    + 0.178 * i17
    - 0.040 * i20
    - 0.159 * i23
    - 0.129 * i24
    + 0.082 * i25
    - 0.037 * i28
    - 0.085 * i29
    - 0.107 * i33
    - 0.086 * i34
    - 0.031 * i37
    + 0.261 * i38
    - 0.129 * i39
    + 0.130 * i40
    + 0.078 * i41
    - 0.170 * i42
    + 2.964
     exampleSetConditionedExampleSet:
    217 examples,
    25 regular attributes,
    special attributes = {
       label = #17: i31 (real/single_value)
    }
     weights-/-
    Dec 7, 2011 11:35:37 AM FINEST: Linear Regression: execution time was 78 ms
    Dec 7, 2011 11:35:37 AM FINE: Impute Missing Values: Imputating missing values in attribute i31.
    Dec 7, 2011 11:35:37 AM WARNING: Impute Missing Values: Unable to impute 1 missing values in attribute i31.
    The missing XML ending is:

       </process>
     </operator>
    </process>

    C

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi, I still can't reproduce the NullPointerException on my data. Do you get a message box stating that something went wrong? If so, there should be a button to submit a bug. Please use the bug report wizard to report it to our bug tracker. Some information will be included automatically, which will help us track down the bug.

    Best regards,
    Marius

    EDIT: just saw your PM, trying with your data right now.
  • haddock
    haddock New Altair Community Member
    Hi Folks,
    The operator MissingValueImpution imputes missing values by learning models for each attribute (except the label) and applying those models to the data set.
    Models built by regression also need labels in the data they model, but....
    There is an ID field in the data but no label.
    Just a thought

    PS  MissingValueImpution should read MissingValueImputation
  • MariusHelf
    MariusHelf New Altair Community Member
    Hi haddock,

    generally you are right; of course a regression needs a label. The Impute Missing Values operator, however, iterates over the attributes with missing values. It temporarily defines the current attribute as the label, splits the dataset into examples with and without missing values, learns a model on the complete examples, and applies it to the examples with missing values.
    When all attributes with missing values have been treated, the original label (if present) is restored.
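
The loop described above can be sketched roughly as follows. This is only an illustration in plain Python, not RapidMiner's actual implementation: the names `impute_missing` and `knn_learner` are made up for this example, missing values are represented as `None`, and a tiny k-NN averaging learner stands in for Linear Regression to keep the sketch free of dependencies. It also assumes every model is trained on the original values only, and it has no notion of roles, so the "restore the original label" step is not modelled.

```python
import math


def knn_learner(train_rows, label_col, k=3):
    """Stand-in learner: returns predict(row) = mean label of the k nearest rows."""

    def distance(row, query):
        # Compare only non-label attributes that are known in both rows.
        shared = [
            (row[j] - query[j]) ** 2
            for j in range(len(row))
            if j != label_col and row[j] is not None and query[j] is not None
        ]
        # Rows with nothing in common are treated as infinitely far away.
        return math.sqrt(sum(shared)) if shared else math.inf

    def predict(query):
        nearest = sorted(train_rows, key=lambda row: distance(row, query))[:k]
        return sum(row[label_col] for row in nearest) / len(nearest)

    return predict


def impute_missing(rows, learner=knn_learner):
    """Fill None cells column by column; the input rows are left untouched."""
    result = [list(row) for row in rows]
    for col in range(len(rows[0])):
        # The current attribute acts as the temporary label.
        complete = [row for row in rows if row[col] is not None]
        missing = [i for i, row in enumerate(rows) if row[col] is None]
        if not missing or not complete:
            continue  # nothing to impute, or nothing to learn from
        predict = learner(complete, col)
        for i in missing:
            result[i][col] = predict(rows[i])
    return result


# Hypothetical toy data: two attributes, one missing value in each.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [None, 8.0]]
filled = impute_missing(data)
```

Any function with the same shape as `knn_learner` (training rows and a label column in, a `predict` function out) could be passed as `learner`, which mirrors how the inner subprocess of the operator accepts different learners.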

    The problem was indeed that cgkolar's dataset did not contain a label, which exposed a bug in Impute Missing Values. I just fixed that bug; the fix will be included in the next release. Until then, the process below can be used as a workaround.

    Cheers,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.014">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
        <process expanded="true" height="459" width="681">
          <operator activated="false" breakpoints="after" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="165">
            <parameter key="repository_entry" value="c14 lcq for imputation short b"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="30">
            <parameter key="repository_entry" value="c14 lcq for imputation short c real"/>
          </operator>
          <operator activated="true" class="generate_empty_attribute" compatibility="5.1.014" expanded="true" height="76" name="Generate Empty Attribute" width="90" x="179" y="30">
            <parameter key="name" value="fake_label"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.1.014" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
            <parameter key="name" value="fake_label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="impute_missing_values" compatibility="5.1.014" expanded="true" height="60" name="Impute Missing Values" width="90" x="447" y="30">
            <parameter key="value_type" value="numeric"/>
            <process expanded="true" height="459" width="564">
              <operator activated="false" class="linear_regression" compatibility="5.1.014" expanded="true" height="94" name="Linear Regression" width="90" x="179" y="165">
                <parameter key="feature_selection" value="none"/>
              </operator>
              <operator activated="true" class="k_nn" compatibility="5.1.014" expanded="true" height="76" name="k-NN" width="90" x="313" y="30"/>
              <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.1.014" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="fake_label"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <connect from_op="Retrieve (2)" from_port="output" to_op="Generate Empty Attribute" to_port="example set input"/>
          <connect from_op="Generate Empty Attribute" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Impute Missing Values" to_port="example set in"/>
          <connect from_op="Impute Missing Values" from_port="example set out" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • cgkolar
    cgkolar New Altair Community Member
    Thank you Marius and haddock.  To be honest, I am glad that it was a bug and not something going wrong in my head.  I appreciate all of the attention.  Problem solved.  CK