Remove attributes with missing values exceeding a given threshold (percentage)

f_laperna
f_laperna New Altair Community Member
edited November 2024 in Community Q&A

Hi, I'm new to Rapid Miner. I'm trying to do something very simple but I'm stuck with it. Given my data collection with many attributes I want to remove columns in which there are more than a given percentage of missing values (because I would not be able to use fixed values or infer their values). I tried the Remove Useless Attributes node but still I have columns with almost 90% of missing values so it didn't work as I wanted. Can you help me achieve what I want? It should be something trivial, I remember in Knime there was a specific option in the filter node to specify the percentage threshold.

 

Thank you!

Answers

  • FBT
    FBT New Altair Community Member

    There are probably a few different ways of doing it, but the easiest I can come up with is using the "Remove Useless Attributes" operator. Please take a look at the example process below (just copy it and paste it into your XML panel, then click the green checkmark):

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Golf" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="136">
    <list key="function_descriptions">
    <parameter key="Missing Value" value="if([Outlook] == &quot;sunny&quot;,1, MISSING_NOMINAL)"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Outlook|Wind"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" breakpoints="after" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="first 10 examples" width="90" x="447" y="136">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="10"/>
    </operator>
    <operator activated="true" class="remove_useless_attributes" compatibility="7.6.001" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="581" y="136">
    <parameter key="nominal_useless_above" value="0.6"/>
    <parameter key="nominal_useless_below" value="0.5"/>
    </operator>
    <connect from_op="Golf" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="first 10 examples" to_port="example set input"/>
    <connect from_op="first 10 examples" from_port="example set output" to_op="Remove Useless Attributes" to_port="example set input"/>
    <connect from_op="Remove Useless Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="90"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • mortiz
    mortiz New Altair Community Member
    I have the same question. If I have 100 attributes and 20 of them are missing 60% of the values, how can I easily scrub them out? The remove useless attribute operator doesn't seem to help with this.
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @mortiz,

    It is very easy with TURBO PREP : 

     - Open your dataset with Turbo Prep
     - Click on CLEANSE
     - Click on REMOVE LOW QUALITY
     - Set the Max missing (%)
     - Click on COMMIT CLEANSE 

    Hope this helps,

    Regards,

    Lionel
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi again @mortiz,

    If you don't have access to TURBO PREP, your task can be easily performed by a very simple Python script.
    To execute this process, you will need to : 

     - Install Python on your computer.
     - Install the Python Scripting extension from the MarketPlace.
     - Set the  Max Missing (%) values in a attribute (for this set the threshold called thr in the Set Macros operator).

    The Process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.4.000-BETA" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Att_1,Att_2&#10;1,1&#10;2,2&#10;3,3&#10;4,4&#10;5,5&#10;6,&#10;7,&#10;8,&#10;9,&#10;10,"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="set_macros" compatibility="9.4.000-BETA" expanded="true" height="82" name="Set Macros" width="90" x="246" y="85">
            <list key="macros">
              <parameter key="thr" value="0.6"/>
            </list>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.001" expanded="true" height="103" name="Execute Python" width="90" x="380" y="85">
            <parameter key="script" value="import pandas as pd&#10;&#10;threshold = %{thr}&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  data = data.dropna(thresh=threshold*len(data), axis=1)&#10;    &#10;&#10;    # connect 2 output ports to see the results&#10;  return data"/>
            <parameter key="notebook_cell_tag_filter" value=""/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Set Macros" to_port="through 1"/>
          <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Regards,

    Lionel



  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi @Moritz, @lionelderkrikor,

    there is a operator in toolbox called Select Attributes (Missings) or something like that which does the trick.

    BR
    Martin
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Nice suggestion @mschmitz

    The operator is called : Filter Attributes with Missing Values.

    Thanks,

    Regards,

    Lionel
  • mortiz
    mortiz New Altair Community Member
    Thank you for the help. The Turbo Prep option seemed to help