Imputing Nulls using Grouped by values

Hello Community,
Can we impute Null values with Grouped by some column of Non Null Values?
Eg : In titanic problem, I want to impute Age column with the mean age grouped by Sex. Now, this logic can be as complex as we want. Is there a functionality in RM to write our own rules for imputing Nulls?
P.S. I have seen the Imputing Missing Values operator. KNN may not work for all cases so we may need custom rules.
Thanks
Best Answer
-
Hi,
what about loop values, filter example, replace missings?
Edit: a quicker way uses Aggregate and join. An example is attached.
~Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
<operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="34">
<list key="aggregation_attributes">
<parameter key="Age" value="average"/>
</list>
<parameter key="group_by_attributes" value="Sex"/>
</operator>
<operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
<parameter key="join_type" value="right"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="Sex" value="Sex"/>
</list>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="136">
<list key="function_descriptions">
<parameter key="Age" value="if(missing(Age),[average(Age)],Age)"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="916" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="average(Age)"/>
<parameter key="invert_selection" value="true"/>
</operator>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>2
Answers
-
Hi,
what about loop values, filter example, replace missings?
Edit: a quicker way uses Aggregate and join. An example is attached.
~Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
<operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="34">
<list key="aggregation_attributes">
<parameter key="Age" value="average"/>
</list>
<parameter key="group_by_attributes" value="Sex"/>
</operator>
<operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
<parameter key="join_type" value="right"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="Sex" value="Sex"/>
</list>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="136">
<list key="function_descriptions">
<parameter key="Age" value="if(missing(Age),[average(Age)],Age)"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="916" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="average(Age)"/>
<parameter key="invert_selection" value="true"/>
</operator>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>2 -
Thanks Man! That Worked!
Sub Process
Just one more question
If I wanted to do imputation using group by on two columns (Sex, Passenger Class), will I need to build two Loop Value operators?
EDIT:
Your edit answers my second query as well. Thanks
0 -
If you want to go for the Loop Values way with more than one group by col you can create an attribute like
concat(group_by_att_a,group_by_att_b)
and use it on it.
While loop values is a bit more flexible i would recommend the aggregate way if possible. It's simply faster.
~Martin
0