Imputing Nulls using Grouped by values

New Altair Community Member

Jan 17, 2017

Updated Nov 5, 2024 by Jocelyn

Hello Community,

Can we impute Null values with Grouped by some column of Non Null Values?

Eg : In titanic problem, I want to impute Age column with the mean age grouped by Sex. Now, this logic can be as complex as we want. Is there a functionality in RM to write our own rules for imputing Nulls?

P.S. I have seen the Imputing Missing Values operator. KNN may not work for all cases so we may need custom rules.

Thanks

Find more posts tagged with

AI Studio

Getting Started

Sort by:

1 - 3 of 31

MartinLiebig

Altair Employee

Accepted Answer

Jan 17, 2017

Hi,

what about loop values, filter example, replace missings?

Edit: a quicker way uses Aggregate and join. An example is attached.

~Martin

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
      <operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="34">
        <list key="aggregation_attributes">
          <parameter key="Age" value="average"/>
        </list>
        <parameter key="group_by_attributes" value="Sex"/>
      </operator>
      <operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
        <parameter key="join_type" value="right"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Sex" value="Sex"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="136">
        <list key="function_descriptions">
          <parameter key="Age" value="if(missing(Age),[average(Age)],Age)"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="916" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="average(Age)"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

garg2409

New Altair Community Member

Jan 17, 2017

Thanks Man! That Worked!

Sub Process

Just one more question

If I wanted to do imputation using group by on two columns (Sex, Passenger Class), will I need to build two Loop Value operators?

EDIT:

Your edit answers my second query as well. Thanks

Altair Employee

If you want to go for the Loop Values way with more than one group by col you can create an attribute like

concat(group_by_att_a,group_by_att_b)

and use it on it.

While loop values is a bit more flexible i would recommend the aggregate way if possible. It's simply faster.

~Martin

Sort by:

1 - 1 of 11

MartinLiebig

Altair Employee

Accepted Answer

Jan 17, 2017

Hi,

what about loop values, filter example, replace missings?

Edit: a quicker way uses Aggregate and join. An example is attached.

~Martin

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
      <operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="34">
        <list key="aggregation_attributes">
          <parameter key="Age" value="average"/>
        </list>
        <parameter key="group_by_attributes" value="Sex"/>
      </operator>
      <operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
        <parameter key="join_type" value="right"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Sex" value="Sex"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="136">
        <list key="function_descriptions">
          <parameter key="Age" value="if(missing(Age),[average(Age)],Age)"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="916" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="average(Age)"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

View in context

Imputing Nulls using Grouped by values

Find more posts tagged with

Quick Links