Altair RISE

A program to recognize and reward our most engaged community members

Nominate Yourself Now!

Remove attributes with missing values exceeding a given threshold (percentage)

Hi, I'm new to Rapid Miner. I'm trying to do something very simple but I'm stuck with it. Given my data collection with many attributes I want to remove columns in which there are more than a given percentage of missing values (because I would not be able to use fixed values or infer their values). I tried the Remove Useless Attributes node but still I have columns with almost 90% of missing values so it didn't work as I wanted. Can you help me achieve what I want? It should be something trivial, I remember in Knime there was a specific option in the filter node to specify the percentage threshold.

Thank you!

Find more posts tagged with

AI Studio

Getting Started

Missing Values

Accepted answers

All comments

FBT

There are probably a few different ways of doing it, but the easiest I can come up with is using the "Remove Useless Attributes" operator. Please take a look at the example process below (just copy it and paste it into your XML panel, then click the green checkmark):

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Golf" width="90" x="45" y="136">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="136">
        <list key="function_descriptions">
          <parameter key="Missing Value" value="if([Outlook] == &quot;sunny&quot;,1, MISSING_NOMINAL)"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Outlook|Wind"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="true" breakpoints="after" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="first 10 examples" width="90" x="447" y="136">
        <parameter key="first_example" value="1"/>
        <parameter key="last_example" value="10"/>
      </operator>
      <operator activated="true" class="remove_useless_attributes" compatibility="7.6.001" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="581" y="136">
        <parameter key="nominal_useless_above" value="0.6"/>
        <parameter key="nominal_useless_below" value="0.5"/>
      </operator>
      <connect from_op="Golf" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="first 10 examples" to_port="example set input"/>
      <connect from_op="first 10 examples" from_port="example set output" to_op="Remove Useless Attributes" to_port="example set input"/>
      <connect from_op="Remove Useless Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

mortiz

I have the same question. If I have 100 attributes and 20 of them are missing 60% of the values, how can I easily scrub them out? The remove useless attribute operator doesn't seem to help with this.

lionelderkrikor

Hi @mortiz,

It is very easy with TURBO PREP :

- Open your dataset with Turbo Prep
- Click on CLEANSE
- Click on REMOVE LOW QUALITY
- Set the Max missing (%)
- Click on COMMIT CLEANSE

Hope this helps,

Regards,

Lionel

lionelderkrikor

Hi again @mortiz,

If you don't have access to TURBO PREP, your task can be easily performed by a very simple Python script.
To execute this process, you will need to :

- Install Python on your computer.
- Install the Python Scripting extension from the MarketPlace.
- Set the Max Missing (%) values in a attribute (for this set the threshold called thr in the Set Macros operator).

The Process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.4.000-BETA" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="Att_1,Att_2&#10;1,1&#10;2,2&#10;3,3&#10;4,4&#10;5,5&#10;6,&#10;7,&#10;8,&#10;9,&#10;10,"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="9.4.000-BETA" expanded="true" height="82" name="Set Macros" width="90" x="246" y="85">
        <list key="macros">
          <parameter key="thr" value="0.6"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.001" expanded="true" height="103" name="Execute Python" width="90" x="380" y="85">
        <parameter key="script" value="import pandas as pd&#10;&#10;threshold = %{thr}&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  data = data.dropna(thresh=threshold*len(data), axis=1)&#10;    &#10;&#10;    # connect 2 output ports to see the results&#10;  return data"/>
        <parameter key="notebook_cell_tag_filter" value=""/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Macros" to_port="through 1"/>
      <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,

Lionel

MartinLiebig

Hi @Moritz, @lionelderkrikor,

there is a operator in toolbox called Select Attributes (Missings) or something like that which does the trick.

BR
Martin

lionelderkrikor

Nice suggestion @mschmitz !

The operator is called : Filter Attributes with Missing Values.

Thanks,

Regards,

Lionel

mortiz

Thank you for the help. The Turbo Prep option seemed to help