Remove attributes with missing values exceeding a given threshold (percentage)
Hi, I'm new to RapidMiner. I'm trying to do something very simple but I'm stuck. Given a data set with many attributes, I want to remove the columns in which more than a given percentage of values are missing (because I cannot replace them with fixed values or infer them). I tried the Remove Useless Attributes operator, but I still have columns with almost 90% missing values, so it didn't work as I wanted. Can you help me achieve this? It should be trivial: I remember that in KNIME the filter node had a specific option for the percentage threshold.
Thank you!
Hi @mortiz,
It is very easy with TURBO PREP:
- Open your dataset with Turbo Prep
- Click on CLEANSE
- Click on REMOVE LOW QUALITY
- Set the Max missing (%)
- Click on COMMIT CLEANSE
Hope this helps,
Regards,
Lionel
Hi again @mortiz,
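If you don't have access to TURBO PREP, a very simple Python script can do the job.
To run this process, you will need to:
- Install Python on your computer.
- Install the Python Scripting extension from the Marketplace.
- Set the threshold in the macro called thr in the Set Macros operator, as a fraction between 0 and 1 (0.6 in the example). Note that the script passes thr * (number of rows) to the thresh argument of pandas' dropna, which is the minimum number of non-missing values a column needs in order to be kept. So to allow at most 60% missing values, set thr to 0.4.
The process: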
<?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.4.000-BETA" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="Att_1,Att_2&#10;1,1&#10;2,2&#10;3,3&#10;4,4&#10;5,5&#10;6,&#10;7,&#10;8,&#10;9,&#10;10,"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="9.4.000-BETA" expanded="true" height="82" name="Set Macros" width="90" x="246" y="85">
        <list key="macros">
          <parameter key="thr" value="0.6"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.001" expanded="true" height="103" name="Execute Python" width="90" x="380" y="85">
        <parameter key="script" value="import pandas as pd&#10;threshold = %{thr}&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = data.dropna(thresh=threshold*len(data), axis=1)&#10;    # connect 2 output ports to see the results&#10;    return data"/>
        <parameter key="notebook_cell_tag_filter" value=""/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Macros" to_port="through 1"/>
      <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
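For anyone who wants to try the idea outside RapidMiner first, here is a minimal standalone pandas sketch of the same kind of filter, written so that the parameter is directly the maximum allowed share of missing values. The names drop_sparse_columns and max_missing are made up for this illustration and are not part of RapidMiner or pandas; the toy data mirrors the Create ExampleSet operator in the process above.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Keep only the columns whose share of missing values is <= max_missing.

    max_missing is a fraction between 0 and 1 (hypothetical helper, for illustration).
    """
    missing_share = df.isna().mean()  # per-column fraction of missing values
    return df.loc[:, missing_share <= max_missing]


# Same toy data as the example process: Att_2 is missing in 5 of 10 rows.
data = pd.DataFrame({
    "Att_1": range(1, 11),
    "Att_2": [1, 2, 3, 4, 5] + [None] * 5,
})

print(data.isna().mean())                            # share of missing values per column
print(list(drop_sparse_columns(data, 0.6).columns))  # 50% missing <= 60%, so both columns stay
print(list(drop_sparse_columns(data, 0.4).columns))  # 50% missing > 40%, so Att_2 is dropped
```

Inside Execute Python, the body of rm_main could probably call such a helper with the macro value, but that wiring is an assumption and has not been tested here.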
Regards,
Lionel
Hi @Moritz, @lionelderkrikor,
There is an operator in the Toolbox extension called Select Attributes (Missings), or something like that, which does the trick.
BR
Martin
Nice suggestion, @mschmitz!
The operator is called Filter Attributes with Missing Values.
Thanks,
Regards,
Lionel
There are probably a few different ways of doing it, but the easiest I can come up with is using the "Remove Useless Attributes" operator. Please take a look at the example process below (just copy it and paste it into your XML panel, then click the green checkmark):