"Remove Samples With Many Missing Values"
dragoljub
New Altair Community Member
I would like to filter out examples that have more than a certain number of missing values, so I can then apply attribute filtering and not lose my entire dataset.
Is there any way to do this? The filter operator currently removes every example that has any missing value.
Thanks,
-Gagi
Answers
Hi,
you have already posted in another thread that presented the solution for this problem. Why did you open yet another one?
Greetings,
Sebastian
Actually the other thread was about removing 'attributes'. I want to remove 'examples'.
For example: Filter all examples that have more than 10 missing values for the current set of attributes.
This way we can do data shrinking from both dimensions. I have a lot of samples so this would be useful. ;D
Thanks,
-Gagi
Hi,
excuse me, I should have read more carefully, especially since I was surprised that you would suddenly start spamming the forum. Well, back to your original question:
The computer scientist in me wants to answer: Transpose the ExampleSet and solve the old attribute problem, transpose it again and the things are cleared. But it doesn't seem to be appropriate. So let's see...
After trying for quite some time, I didn't come up with a really satisfying solution. Here's the basic idea: count the missing values of each example into a new attribute and then filter the examples accordingly.
Why isn't this satisfying? Because it only works on real-valued attributes. So whether this is a solution depends on what you need...
Here's the process for counting the missings:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="251" width="748">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
</operator>
<operator activated="true" class="rename_by_replacing" expanded="true" height="76" name="Rename by Replacing" width="90" x="179" y="30">
<parameter key="replace_what" value="-"/>
<parameter key="replace_by" value="_"/>
</operator>
<operator activated="true" class="generate_empty_attribute" expanded="true" height="76" name="Generate Empty Attribute" width="90" x="313" y="30">
<parameter key="name" value="numberOfMissings"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="numberOfMissings"/>
<parameter key="default" value="zero"/>
<list key="columns"/>
</operator>
<operator activated="true" class="loop_attributes" expanded="true" height="60" name="Loop Attributes" width="90" x="581" y="30">
<parameter key="iteration_macro" value="currentAttribute"/>
<process expanded="true" height="581" width="764">
<operator activated="true" class="generate_attributes" expanded="true" height="76" name="Generate Attributes" width="90" x="45" y="30">
<list key="function_descriptions">
<parameter key="newNumberOfMissings" value="numberOfMissings + if(%{currentAttribute}!=%{currentAttribute},1,0)"/>
</list>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="numberOfMissings"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="rename" expanded="true" height="76" name="Rename" width="90" x="313" y="30">
<parameter key="old_name" value="newNumberOfMissings"/>
<parameter key="new_name" value="numberOfMissings"/>
</operator>
<connect from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Rename by Replacing" to_port="example set input"/>
<connect from_op="Rename by Replacing" from_port="example set output" to_op="Generate Empty Attribute" to_port="example set input"/>
<connect from_op="Generate Empty Attribute" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Again, this is a valuable feature that we should include in future versions. Please add it to the bug tracker as a feature request.
Greetings,
Sebastian
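The idea described above (count the missing values of each example, then filter on that count) can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions: the names `count_missing`, `drop_sparse_examples` and `max_missing` are hypothetical, the data is made up, and missing values are assumed to be represented as NaN in purely numeric rows.

```python
import math

def count_missing(row):
    """Count the NaN entries in one example (a list of floats)."""
    return sum(1 for v in row if isinstance(v, float) and math.isnan(v))

def drop_sparse_examples(rows, max_missing):
    """Keep only the examples with at most max_missing missing values."""
    return [row for row in rows if count_missing(row) <= max_missing]

nan = math.nan
rows = [
    [1.0, 2.0, 3.0],   # 0 missing
    [nan, 2.0, nan],   # 2 missing
    [nan, nan, nan],   # 3 missing
]

# Hypothetical threshold: tolerate at most one missing value per example.
kept = drop_sparse_examples(rows, max_missing=1)
print(len(kept))  # 1
```

Like the process above, this sketch only covers numeric values; nominal attributes would need their own notion of "missing".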
Hi Sebastian,
can you share a word on the if(%{currentAttribute}!=%{currentAttribute}...) statement. Seems to be very helpful. What's behind it?
Thx
ui3o
Thanks Sebastian,
I was able to remove attributes and then filter examples for a similar effect.
The if statement looks like it follows this syntax: if(statement, true_action, false_action), correct?
Is there a list of supported functions in RM that we can use in expressions?
Thanks Again,
-Gagi
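On the `if(x != x, 1, 0)` question: a likely explanation, assuming missing numerical values are stored as IEEE 754 NaN (this is not confirmed anywhere in the thread), is that NaN is the only floating-point value that compares unequal to itself. The expression would then yield 1 exactly when the value is missing, which would also fit the remark that the approach only works on real-valued attributes. A minimal Python sketch of the same test, with the hypothetical helper name `is_missing`:

```python
import math

def is_missing(value: float) -> int:
    """Return 1 if value is NaN, else 0, using the self-inequality test."""
    # Under IEEE 754, NaN != NaN is true; every other float equals itself.
    return 1 if value != value else 0

values = [3.5, math.nan, 0.0, math.nan]
flags = [is_missing(v) for v in values]
print(flags)  # [0, 1, 0, 1]
```

Summing such flags over all attributes of an example gives its number of missing values.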
Greets Seb,
Late night waiting for US calls so I thought I'd take a look at making a missing value count column, and then bolting that on to the original data, like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros>
<macro>
<key/>
<value/>
</macro>
</macros>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="353" width="808">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
</operator>
<operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="generate_id" expanded="true" height="76" name="Original Data" width="90" x="380" y="120"/>
<operator activated="true" class="subprocess" expanded="true" height="76" name="Missing Count Column" width="90" x="380" y="30">
<process expanded="true" height="353" width="748">
<operator activated="true" class="transpose" expanded="true" height="76" name="Transpose" width="90" x="180" y="30"/>
<operator activated="true" class="loop_attributes" expanded="true" height="60" name="Loop Attributes" width="90" x="315" y="30">
<process expanded="true" height="353" width="808">
<operator activated="true" class="extract_macro" expanded="true" height="60" name="Count unknowns" width="90" x="45" y="30">
<parameter key="macro" value="Count"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="statistics" value="unknown"/>
<parameter key="attribute_name" value="%{loop_attribute}"/>
<parameter key="example_index" value="%{b}"/>
</operator>
<operator activated="true" class="provide_macro_as_log_value" expanded="true" height="76" name="Provide Count as Log Value" width="90" x="180" y="30">
<parameter key="macro_name" value="Total"/>
</operator>
<operator activated="true" class="log" expanded="true" height="76" name="Log" width="90" x="494" y="30">
<list key="log">
<parameter key="Missings" value="operator.Count unknowns.value.macro_value"/>
</list>
</operator>
<connect from_port="example set" to_op="Count unknowns" to_port="example set"/>
<connect from_op="Count unknowns" from_port="example set" to_op="Provide Count as Log Value" to_port="through 1"/>
<connect from_op="Provide Count as Log Value" from_port="through 1" to_op="Log" to_port="through 1"/>
<connect from_op="Log" from_port="through 1" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log_to_data" expanded="true" height="94" name="Log to Data" width="90" x="179" y="165"/>
<operator activated="true" class="parse_numbers" expanded="true" height="76" name="Parse Numbers" width="90" x="313" y="165"/>
<operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID" width="90" x="447" y="165"/>
<connect from_port="in 1" to_op="Transpose" to_port="example set input"/>
<connect from_op="Transpose" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_op="Log to Data" to_port="through 1"/>
<connect from_op="Log to Data" from_port="exampleSet" to_op="Parse Numbers" to_port="example set input"/>
<connect from_op="Parse Numbers" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="join" expanded="true" height="76" name="Data plus Missing Count" width="90" x="565" y="31"/>
<connect from_op="Retrieve" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Missing Count Column" to_port="in 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Original Data" to_port="example set input"/>
<connect from_op="Original Data" from_port="example set output" to_op="Data plus Missing Count" to_port="right"/>
<connect from_op="Missing Count Column" from_port="out 1" to_op="Data plus Missing Count" to_port="left"/>
<connect from_op="Data plus Missing Count" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Be wary of the part of you which doubted the appropriateness of transposition:
"The computer scientist in me wants to answer: Transpose the ExampleSet and solve the old attribute problem, transpose it again and the things are cleared. But it doesn't seem to be appropriate"
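The transposition trick can be illustrated with a small Python sketch: once the table is transposed, each original example becomes a column, so any per-attribute missing-value count doubles as a per-example count. The function names and the data are hypothetical, and missing values are assumed to be NaN in a purely numeric table.

```python
import math

def missing_per_column(table):
    """Count NaN values in each column of a list-of-rows table."""
    return [sum(1 for v in col if math.isnan(v)) for col in zip(*table)]

def missing_per_row_via_transpose(table):
    """Transpose first, so the per-column counter yields per-row counts."""
    transposed = [list(col) for col in zip(*table)]
    return missing_per_column(transposed)

nan = math.nan
table = [
    [1.0, nan, 3.0],
    [nan, nan, 6.0],
]

print(missing_per_row_via_transpose(table))  # [1, 2]
```

The resulting list lines up with the original row order, which is what makes it possible to join the counts back onto the data by a generated ID, as in the process above.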
Thanks Haddock,
This will be very useful.
The transpose is quite smart and lets you play within the confines of the RM space.
-Gagi