Fill in Missing Values by average of whole row

msacs09
msacs09 New Altair Community Member
edited November 2024 in Community Q&A
I'm trying to fill in the missing values by averaging the rows or possibly use some row level computation like regression. I'm wondering if i need to do un-pivot it? I would greatly appreciate a sample process that can do this . Below is my sample row level data 


Answers

  • SGolbert
    SGolbert New Altair Community Member
    edited February 2019

    You can do it with the Generate Aggregation operator and a Loop. Something like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"><br>  <context><br>    <input/><br>    <output/><br>    <macros/><br>  </context><br>  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"><br>    <parameter key="logverbosity" value="init"/><br>    <parameter key="random_seed" value="2001"/><br>    <parameter key="send_mail" value="never"/><br>    <parameter key="notification_email" value=""/><br>    <parameter key="process_duration_for_mail" value="30"/><br>    <parameter key="encoding" value="SYSTEM"/><br>    <process expanded="true"><br>      <operator activated="true" class="read_excel" compatibility="9.1.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34"><br>        <parameter key="excel_file" value="C:\Users\sgolbert\Downloads\Sample_Data_1.xlsx"/><br>        <parameter key="sheet_selection" value="sheet number"/><br>        <parameter key="sheet_number" value="1"/><br>        <parameter key="imported_cell_range" value="A1"/><br>        <parameter key="encoding" value="SYSTEM"/><br>        <parameter key="first_row_as_names" value="true"/><br>        <list key="annotations"/><br>        <parameter key="date_format" value=""/><br>        <parameter key="time_zone" value="SYSTEM"/><br>        <parameter key="locale" value="English (United States)"/><br>        <parameter key="read_all_values_as_polynominal" value="false"/><br>        <list key="data_set_meta_data_information"><br>          <parameter key="0" value="ID.true.integer.id"/><br>          <parameter key="1" value="2017-01.true.integer.attribute"/><br>          <parameter key="2" value="2017-02.true.integer.attribute"/><br>          <parameter key="3" value="2017-03.true.integer.attribute"/><br>          <parameter key="4" value="2017-04.true.integer.attribute"/><br>          <parameter key="5" value="2017-05.true.integer.attribute"/><br>          <parameter key="6" value="2017-06.true.integer.attribute"/><br>          <parameter key="7" value="2017-07.true.integer.attribute"/><br>          <parameter key="8" value="2017-08.true.integer.attribute"/><br>          <parameter key="9" value="2017-09.true.integer.attribute"/><br>          <parameter key="10" value="2017-10.true.integer.attribute"/><br>          <parameter key="11" value="2017-11.true.integer.attribute"/><br>          <parameter key="12" value="2017-12.true.integer.attribute"/><br>          <parameter key="13" value="2018-01.true.integer.attribute"/><br>          <parameter key="14" value="2018-02.true.integer.attribute"/><br>          <parameter key="15" value="2018-03.true.integer.attribute"/><br>          <parameter key="16" value="2018-04.true.integer.attribute"/><br>          <parameter key="17" value="2018-05.true.integer.attribute"/><br>          <parameter key="18" value="2018-06.true.integer.attribute"/><br>          <parameter key="19" value="2018-07.true.integer.attribute"/><br>          <parameter key="20" value="2018-08.true.integer.attribute"/><br>          <parameter key="21" value="2018-09.true.integer.attribute"/><br>          <parameter key="22" value="2018-10.true.integer.attribute"/><br>          <parameter key="23" value="2018-11.true.integer.attribute"/><br>          <parameter key="24" value="2018-12.true.integer.attribute"/><br>        </list><br>        <parameter key="read_not_matching_values_as_missings" value="false"/><br>        <parameter key="datamanagement" value="double_array"/><br>        <parameter key="data_management" value="auto"/><br>      </operator><br>      <operator activated="true" class="generate_aggregation" compatibility="9.1.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34"><br>        <parameter key="attribute_name" value="row_mean"/><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="attribute_value"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="time"/><br>        <parameter key="block_type" value="attribute_block"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_matrix_row_start"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>        <parameter key="aggregation_function" value="average"/><br>        <parameter key="concatenation_separator" value="|"/><br>        <parameter key="keep_all" value="true"/><br>        <parameter key="ignore_missings" value="true"/><br>        <parameter key="ignore_missing_attributes" value="false"/><br>      </operator><br>      <operator activated="true" class="numerical_to_real" compatibility="9.1.000" expanded="true" height="82" name="Numerical to Real" width="90" x="380" y="34"><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="numeric"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="real"/><br>        <parameter key="block_type" value="value_series"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_series_end"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>      </operator><br>      <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro (2)" width="90" x="514" y="34"><br>        <parameter key="macro" value="sample_size"/><br>        <parameter key="macro_type" value="number_of_examples"/><br>        <parameter key="statistics" value="average"/><br>        <parameter key="attribute_name" value=""/><br>        <list key="additional_macros"/><br>      </operator><br>      <operator activated="true" class="concurrency:loop" compatibility="9.1.000" expanded="true" height="82" name="Loop" width="90" x="648" y="34"><br>        <parameter key="number_of_iterations" value="%{sample_size}"/><br>        <parameter key="iteration_macro" value="iteration"/><br>        <parameter key="reuse_results" value="false"/><br>        <parameter key="enable_parallel_execution" value="true"/><br>        <process expanded="true"><br>          <operator activated="true" class="filter_example_range" compatibility="9.1.000" expanded="true" height="82" name="Filter Example Range" width="90" x="112" y="34"><br>            <parameter key="first_example" value="%{iteration}"/><br>            <parameter key="last_example" value="%{iteration}"/><br>            <parameter key="invert_filter" value="false"/><br>          </operator><br>          <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34"><br>            <parameter key="macro" value="replacement"/><br>            <parameter key="macro_type" value="data_value"/><br>            <parameter key="statistics" value="average"/><br>            <parameter key="attribute_name" value="row_mean"/><br>            <parameter key="example_index" value="1"/><br>            <list key="additional_macros"/><br>          </operator><br>          <operator activated="true" class="replace_missing_values" compatibility="9.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="648" y="34"><br>            <parameter key="return_preprocessing_model" value="false"/><br>            <parameter key="create_view" value="false"/><br>            <parameter key="attribute_filter_type" value="all"/><br>            <parameter key="attribute" value=""/><br>            <parameter key="attributes" value=""/><br>            <parameter key="use_except_expression" value="false"/><br>            <parameter key="value_type" value="attribute_value"/><br>            <parameter key="use_value_type_exception" value="false"/><br>            <parameter key="except_value_type" value="time"/><br>            <parameter key="block_type" value="attribute_block"/><br>            <parameter key="use_block_type_exception" value="false"/><br>            <parameter key="except_block_type" value="value_matrix_row_start"/><br>            <parameter key="invert_selection" value="false"/><br>            <parameter key="include_special_attributes" value="false"/><br>            <parameter key="default" value="value"/><br>            <list key="columns"/><br>            <parameter key="replenishment_value" value="%{replacement}"/><br>          </operator><br>          <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/><br>          <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro" to_port="example set"/><br>          <connect from_op="Extract Macro" from_port="example set" to_op="Replace Missing Values" to_port="example set input"/><br>          <connect from_op="Replace Missing Values" from_port="example set output" to_port="output 1"/><br>          <portSpacing port="source_input 1" spacing="0"/><br>          <portSpacing port="source_input 2" spacing="0"/><br>          <portSpacing port="sink_output 1" spacing="0"/><br>          <portSpacing port="sink_output 2" spacing="0"/><br>        </process><br>      </operator><br>      <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="82" name="Append" width="90" x="782" y="34"><br>        <parameter key="datamanagement" value="double_array"/><br>        <parameter key="data_management" value="auto"/><br>        <parameter key="merge_type" value="all"/><br>      </operator><br>      <connect from_op="Read Excel" from_port="output" to_op="Generate Aggregation" to_port="example set input"/><br>      <connect from_op="Generate Aggregation" from_port="example set output" to_op="Numerical to Real" to_port="example set input"/><br>      <connect from_op="Numerical to Real" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/><br>      <connect from_op="Extract Macro (2)" from_port="example set" to_op="Loop" to_port="input 1"/><br>      <connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/><br>      <connect from_op="Append" from_port="merged set" to_port="result 1"/><br>      <portSpacing port="source_input 1" spacing="0"/><br>      <portSpacing port="sink_result 1" spacing="0"/><br>      <portSpacing port="sink_result 2" spacing="0"/><br>    </process><br>  </operator><br></process><br><br>

    It is a bit clumsy, but I didn't find an easier way.

    Regards,
    Sebastian


  • msacs09
    msacs09 New Altair Community Member
    @SGolbert Thank you. If I may ask, how do i replace the average here with a  regression model such as KNN or some other learner, so that we impute with a regression value as opposed to average, such that we do not repeat the SAME value for every missing row for that ID?

    Thank you for your valuable time.
  • SGolbert
    SGolbert New Altair Community Member
    Hi msacs,

    you could do something very similar, but just use a prediction column (after applying the model) instead of the row_mean column. Another option is the Impute Missing Values operator, that does practically the same, all in one go.

    Do you want to try it out?

    Regards,
    Sebastian

  • msacs09
    msacs09 New Altair Community Member
    It appears that the imputation operator has a bug. It complains about missing exampleset as reported here. So I was curious if your approach would resolve that issue. 

    https://community.rapidminer.com/discussion/54470/imputing-challenge-with-a-best-possible-ensemble#latest
  • sgenzer
    sgenzer
    Altair Employee
    nope no bug. See my response in other thread.

    Scott