"Remove Samples With Many Missing Values"

dragoljub
dragoljub New Altair Community Member
edited November 5 in Community Q&A
I would like to filter out examples that have more than a certain number of missing values, so I can then apply attribute filtering and not lose my entire dataset.

Is there any way to do this? The filter operator currently removes samples with any missing values.

Thanks,
-Gagi

Answers

  • land
    land New Altair Community Member
    Hi,
    you have already posted in another thread that presented the solution to this problem. Why did you open yet another one?

    Greetings,
      Sebastian
  • dragoljub
    dragoljub New Altair Community Member
    Actually the other thread was about removing 'attributes'. I want to remove 'examples'.

    For example: Filter all examples that have more than 10 missing values for the current set of attributes.

    This way we can do data shrinking from both dimensions. I have a lot of samples so this would be useful.  ;D

    Thanks,
    -Gagi
  • land
    land New Altair Community Member
    Hi,
    excuse me, I should have read more carefully, especially since I was surprised that you were suddenly starting to spam the forum. Well, back to your original question:
    The computer scientist in me wants to answer: Transpose the ExampleSet and solve the old attribute problem, transpose it again and the things are cleared. But it doesn't seem to be appropriate :) So let's see...

    After trying for quite some time, I didn't come up with a really satisfying solution. Here's the basic idea: count the missing values of each example in a new attribute and then filter the examples accordingly. Here's the process for counting the missings:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="251" width="748">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="rename_by_replacing" expanded="true" height="76" name="Rename by Replacing" width="90" x="179" y="30">
            <parameter key="replace_what" value="-"/>
            <parameter key="replace_by" value="_"/>
          </operator>
          <operator activated="true" class="generate_empty_attribute" expanded="true" height="76" name="Generate Empty Attribute" width="90" x="313" y="30">
            <parameter key="name" value="numberOfMissings"/>
          </operator>
          <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="numberOfMissings"/>
            <parameter key="default" value="zero"/>
            <list key="columns"/>
          </operator>
          <operator activated="true" class="loop_attributes" expanded="true" height="60" name="Loop Attributes" width="90" x="581" y="30">
            <parameter key="iteration_macro" value="currentAttribute"/>
            <process expanded="true" height="581" width="764">
              <operator activated="true" class="generate_attributes" expanded="true" height="76" name="Generate Attributes" width="90" x="45" y="30">
                <list key="function_descriptions">
                  <parameter key="newNumberOfMissings" value="numberOfMissings + if(%{currentAttribute}!=%{currentAttribute},1,0)"/>
                </list>
              </operator>
              <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="numberOfMissings"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="rename" expanded="true" height="76" name="Rename" width="90" x="313" y="30">
                <parameter key="old_name" value="newNumberOfMissings"/>
                <parameter key="new_name" value="numberOfMissings"/>
              </operator>
              <connect from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
              <connect from_op="Rename" from_port="example set output" to_port="example set"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Rename by Replacing" to_port="example set input"/>
          <connect from_op="Rename by Replacing" from_port="example set output" to_op="Generate Empty Attribute" to_port="example set input"/>
          <connect from_op="Generate Empty Attribute" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
          <connect from_op="Replace Missing Values" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
          <connect from_op="Loop Attributes" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Why isn't this satisfying? Because it only works on real-valued attributes. So whether this is a solution depends on what you need...
    This again is a valuable feature that we should include in future versions. Please add it to the bug tracker as a feature request.
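
The counting idea in the process above can be sketched outside RapidMiner as well. The following is a minimal Python sketch, not the RapidMiner implementation: the function names, the attribute names, and the sample values are hypothetical, and missing numeric values are assumed to be represented as NaN.

```python
def count_missings(example, attributes):
    """Accumulate the missing-value count attribute by attribute."""
    number_of_missings = 0  # counter starts at zero, like the zero-filled attribute
    for name in attributes:  # one pass per attribute, like Loop Attributes
        value = example[name]
        # Same test as if(a != a, 1, 0): only NaN is unequal to itself.
        number_of_missings += 1 if value != value else 0
    return number_of_missings


def filter_examples(examples, attributes, max_missings):
    """Keep only the examples with at most max_missings missing values."""
    return [ex for ex in examples if count_missings(ex, attributes) <= max_missings]


nan = float("nan")
attributes = ["duration", "wage", "hours"]  # hypothetical attribute names
examples = [
    {"duration": 1.0, "wage": 5.0, "hours": 40.0},
    {"duration": nan, "wage": 4.5, "hours": nan},
    {"duration": nan, "wage": nan, "hours": nan},
]

counts = [count_missings(ex, attributes) for ex in examples]
kept = filter_examples(examples, attributes, max_missings=1)
print(counts)     # [0, 2, 3]
print(len(kept))  # 1

# A missing non-numeric value modelled as None is not detected by this test.
nominal_count = count_missings({"vacation": None}, ["vacation"])
print(nominal_count)  # 0
```

The last call is only an analogy for the real-valued limitation mentioned above: in this sketch the self-inequality test catches NaN but nothing else, so a different check would be needed for non-numeric data.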

    Greetings,
      Sebastian
  • ui3o
    ui3o New Altair Community Member
    Hi Sebastian,

    could you say a word about the if(%{currentAttribute}!=%{currentAttribute}...) statement? It seems to be very helpful. What's behind it?
    Thx


    ui3o
  • dragoljub
    dragoljub New Altair Community Member
    Thanks Sebastian,

    I was able to remove attributes and then filter examples for a similar effect.

    The if statement looks like it follows this syntax: if(statement, true_action, false_action), correct?
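
As a rough illustration of that reading, here is a small Python sketch. The helper rm_if is a hypothetical stand-in for the three-argument if, not a RapidMiner function, and it assumes that a missing numeric value behaves like an IEEE 754 NaN, which compares as unequal to everything, including itself.

```python
def rm_if(condition, true_value, false_value):
    """Hypothetical stand-in for if(condition, true_value, false_value)."""
    return true_value if condition else false_value


nan = float("nan")

# Comparing an attribute with itself is only "not equal" when the value is NaN.
missing_flag = rm_if(nan != nan, 1, 0)
present_flag = rm_if(3.5 != 3.5, 1, 0)

print(missing_flag, present_flag)  # 1 0
```

Under that assumption, adding the flag to a running counter once per attribute yields the number of missing values per example.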

    Is there a list of supported functions in RM that we can use in expressions?

    Thanks Again,
    -Gagi
  • haddock
    haddock New Altair Community Member
    Greets Seb,

    It's a late night waiting for US calls, so I thought I'd take a look at making a missing-value count column and then bolting it onto the original data, like this:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros>
          <macro>
            <key/>
            <value/>
          </macro>
        </macros>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="353" width="808">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
          <operator activated="true" class="generate_id" expanded="true" height="76" name="Original Data" width="90" x="380" y="120"/>
          <operator activated="true" class="subprocess" expanded="true" height="76" name="Missing Count Column" width="90" x="380" y="30">
            <process expanded="true" height="353" width="748">
              <operator activated="true" class="transpose" expanded="true" height="76" name="Transpose" width="90" x="180" y="30"/>
              <operator activated="true" class="loop_attributes" expanded="true" height="60" name="Loop Attributes" width="90" x="315" y="30">
                <process expanded="true" height="353" width="808">
                  <operator activated="true" class="extract_macro" expanded="true" height="60" name="Count unknowns" width="90" x="45" y="30">
                    <parameter key="macro" value="Count"/>
                    <parameter key="macro_type" value="statistics"/>
                    <parameter key="statistics" value="unknown"/>
                    <parameter key="attribute_name" value="%{loop_attribute}"/>
                    <parameter key="example_index" value="%{b}"/>
                  </operator>
                  <operator activated="true" class="provide_macro_as_log_value" expanded="true" height="76" name="Provide Count as Log Value" width="90" x="180" y="30">
                    <parameter key="macro_name" value="Total"/>
                  </operator>
                  <operator activated="true" class="log" expanded="true" height="76" name="Log" width="90" x="494" y="30">
                    <list key="log">
                      <parameter key="Missings" value="operator.Count unknowns.value.macro_value"/>
                    </list>
                  </operator>
                  <connect from_port="example set" to_op="Count unknowns" to_port="example set"/>
                  <connect from_op="Count unknowns" from_port="example set" to_op="Provide Count as Log Value" to_port="through 1"/>
                  <connect from_op="Provide Count as Log Value" from_port="through 1" to_op="Log" to_port="through 1"/>
                  <connect from_op="Log" from_port="through 1" to_port="example set"/>
                  <portSpacing port="source_example set" spacing="0"/>
                  <portSpacing port="sink_example set" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="log_to_data" expanded="true" height="94" name="Log to Data" width="90" x="179" y="165"/>
              <operator activated="true" class="parse_numbers" expanded="true" height="76" name="Parse Numbers" width="90" x="313" y="165"/>
              <operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID" width="90" x="447" y="165"/>
              <connect from_port="in 1" to_op="Transpose" to_port="example set input"/>
              <connect from_op="Transpose" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
              <connect from_op="Loop Attributes" from_port="example set" to_op="Log to Data" to_port="through 1"/>
              <connect from_op="Log to Data" from_port="exampleSet" to_op="Parse Numbers" to_port="example set input"/>
              <connect from_op="Parse Numbers" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
              <connect from_op="Generate ID" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="join" expanded="true" height="76" name="Data plus Missing Count" width="90" x="565" y="31"/>
          <connect from_op="Retrieve" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Missing Count Column" to_port="in 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Original Data" to_port="example set input"/>
          <connect from_op="Original Data" from_port="example set output" to_op="Data plus Missing Count" to_port="right"/>
          <connect from_op="Missing Count Column" from_port="out 1" to_op="Data plus Missing Count" to_port="left"/>
          <connect from_op="Data plus Missing Count" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Sebastian wrote: "The computer scientist in me wants to answer: Transpose the ExampleSet and solve the old attribute problem, transpose it again and the things are cleared. But it doesn't seem to be appropriate"
    Be wary of the part of you which doubted the appropriateness of transposition  :D
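
The transposition approach can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions, not a translation of the process: the function names and sample data are hypothetical, missing values are assumed to be NaN, and the IDs simply number the examples from 1.

```python
def missing_count_column(rows):
    """Transpose the data, count NaNs per transposed column, return (id, count) pairs."""
    transposed = list(zip(*rows))  # attributes become rows, examples become columns
    id_counts = []
    for col in range(len(rows)):  # one transposed column per original example
        count = sum(1 for attr_values in transposed
                    if attr_values[col] != attr_values[col])  # NaN != NaN
        id_counts.append((col + 1, count))  # IDs start at 1 (assumption)
    return id_counts


def join_on_id(rows, id_counts):
    """Attach each count to the original example that carries the same ID."""
    with_ids = {i + 1: row for i, row in enumerate(rows)}  # IDs for the original data
    return [list(with_ids[i]) + [count] for i, count in id_counts]


nan = float("nan")
rows = [
    [1.0, nan, 3.0],
    [nan, nan, 2.0],
    [4.0, 5.0, 6.0],
]

id_counts = missing_count_column(rows)
joined = join_on_id(rows, id_counts)
print(id_counts)                # [(1, 1), (2, 2), (3, 0)]
print([r[-1] for r in joined])  # [1, 2, 0]
```

Because the counts are computed on a separate copy of the data, the join on a generated ID is what ties each count back to its original example.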

  • dragoljub
    dragoljub New Altair Community Member
    Thanks Haddock,

    This will be very useful.

    The transpose is quite smart and lets you play within the confines of the RM space.  ;)

    -Gagi