Easiest way to remove empty rows and attributes?

kayman
kayman New Altair Community Member
edited November 2024 in Community Q&A

 

Hi,

What would be the easiest way to remove empty columns / rows in a dataset?

My data is imported from source data that I can't easily modify before load so I prefer to deal with it within rapidminer.

 

What I'm looking at is a simple way to remove both Col1 and Row 1 in below example, but in reality I can have more than a few empty rows and columns in much more heavy tables. I've tried to use the filter missing / no missing attributes but it basically removes everything as soon as it finds a missing value so not exactly what I need.

 

  Col1 Col2 Col3
1 ? ? ?
2 ? ? Some data
3 ? ? Some data
4 ? ? Some data
5 ? Some data ?

 

For now I have a fairly heavy workflow where I first create an id, then multiply my data and have one set where I replace all missing values with a 0, everything else with 1 and then use aggregates to sum and remove every row where sum = 0. Next I loop through all my attributes, do something similar and remove all columns where the aggregated sum is 0 again. 

 

It does the trick but seems a bit overkill, so I'm wondering if I'm missing some easy way to deal with this.

Tagged:

Best Answer

  • tftemme
    tftemme New Altair Community Member
    Answer ✓

    Hi @kayman, Hi @mschmitz,

     

    From my point of view this would be a handy addition to the Toolbox.

    I will have a look, if I find time for this.

     

    Best regards,
    Fabian

Answers

  • Maerkli
    Maerkli New Altair Community Member

    Hallo Kayman,

     

    Could Select Attributes be a solution?

    Maerkli

  • kayman
    kayman New Altair Community Member

    Hi Maerkly.

     

    I don't thinks so, the behaviour of the 'no_missing_values' seems to be that as soon as one value is missing you (rightfully) get a hit, so in this case also Col2 and Col3 would be impacted since they have at least one missing value. 

     

    I'd need kind of the opposite, as soon as there is a value it's ok, otherwise skip. Or something like 'if all values are missing' skip, otherwise keep.

  • Maerkli
    Maerkli New Altair Community Member

    I am back, Kayman. It is just a suggestion: have a look at RM Studio/Repository/Training Resources/ETL-Advanced/Aggregations/Hotel App Aggregations & Pivoting Solution. With Pivot, you get another perspective on your data and you can apply another treatment like Replace Missing Values. Keep us posted.

    Maerkli

  • Telcontar120
    Telcontar120 New Altair Community Member

    I think you may be making this more complicated than necessary.  There are two simple ETL operators that can solve these issues. Remove Useless Attributes will do the trick for you to get rid of attributes (columns) that are all missing (either nominal or numerical), and Filter Examples has a pre-defined condition in the dropdown for "no missing attributes" that should get rid of all your blank examples (rows).

  • kayman
    kayman New Altair Community Member

    Not really, as the missing attributes operator just looks if there is at least one missing value, where I need to have the option to see if ALL values are missing. I don't care for the occasional missing value, just if my whole example or column is missing. 

     

    So if I use the missing attributes operator and loop it through my collection, only the sets without any single missing attribute are returned, which is what the operator promised but not what I wanted to see...

  • Telcontar120
    Telcontar120 New Altair Community Member

    @kayman Got it, but I think it can still be done with the two operators I describe.  In Filter Examples you just have to set the conditions to each attribute to be "is missing" with "match all" logic and then invert the operator.  No looping is needed with either.  I think that the the attached process does exactly what you want to do (using an exampleset similar to your original sample) with just the two operators I referenced.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" breakpoints="after" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Row,Att1,Att2,Att3&#10;1,4,7,&#10;2,6,,&#10;3,5,8,&#10;,,,&#10;5,2,,&#10;"/>
    </operator>
    <operator activated="true" breakpoints="after" class="remove_useless_attributes" compatibility="9.0.003" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="246" y="34"/>
    <operator activated="true" breakpoints="after" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
    <parameter key="invert_filter" value="true"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Row.is_missing."/>
    <parameter key="filters_entry_key" value="Att1.is_missing."/>
    <parameter key="filters_entry_key" value="Att2.is_missing."/>
    </list>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Remove Useless Attributes" to_port="example set input"/>
    <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • Telcontar120
    Telcontar120 New Altair Community Member

    And if you don't want to declare a separate condition for each attribute (maybe you have a lot of attributes in your real dataset) then you can get around that pretty easily by using Generate Aggregation to count the nonmissings and then filter purely based on that, as in the attached process.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" breakpoints="after" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Row,Att1,Att2,Att3,Att4&#10;1,4,7,,&quot;yes&quot;&#10;2,6,,,&quot;no&quot;&#10;3,5,8,,&quot;yes&quot;&#10;,,,,&#10;5,,2,,&quot;green&quot;&#10;"/>
    </operator>
    <operator activated="true" breakpoints="after" class="trim" compatibility="9.0.003" expanded="true" height="82" name="Trim" width="90" x="246" y="34"/>
    <operator activated="true" breakpoints="after" class="remove_useless_attributes" compatibility="9.0.003" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="380" y="34"/>
    <operator activated="true" breakpoints="after" class="generate_aggregation" compatibility="9.0.003" expanded="true" height="82" name="Generate Aggregation" width="90" x="514" y="34">
    <parameter key="attribute_name" value="count"/>
    <parameter key="aggregation_function" value="count"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="648" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="count.gt.0"/>
    </list>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Trim" to_port="example set input"/>
    <connect from_op="Trim" from_port="example set output" to_op="Remove Useless Attributes" to_port="example set input"/>
    <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
    <connect from_op="Generate Aggregation" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • kayman
    kayman New Altair Community Member

    Yeah, but my data is unfortunatly not that simple :-)

    I do not know up front how many attributes I have, let alone the name, so I'm back to looping through my content again.

     

    Also, the remove useless atributes has the unwanted side effect that when my column actually contains too many times the same content it will also be removed.

    So if I have something like

     

    COL1,COL2,COL3

    ?,1,2

    ?,2,2

     

    both COL1 and COL3 will be removed. Which is correct behaviour for the operator but again a bit too greedy for my purpose.

    Seems afterall that for a simple problem it requires a complex solution... 

     

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

    i usually go for something like aggregate without missings - #examples and check if this is 0. If yes, i do data to weights and got something.

     

    Maybe this is not too straight foward and we should put something into toolbox for it? 

    BR,

    Martin

     

    CC: @tftemme

  • kayman
    kayman New Altair Community Member

    Hi Martin, that's my current strategy also (and even having some mongo in the middle :-))

    I was just wondering if there were easier ways, as this looks to me like quite a common scenario.

     

    I can help myself, but if you can add it in the toolbox it would be handy indeed.

  • tftemme
    tftemme New Altair Community Member
    Answer ✓

    Hi @kayman, Hi @mschmitz,

     

    From my point of view this would be a handy addition to the Toolbox.

    I will have a look, if I find time for this.

     

    Best regards,
    Fabian

  • tftemme
    tftemme New Altair Community Member
    Hi @kayman ,

    We just released version 1.6.0 of the operator toolbox extension with the new operators
    - Filter Attributes with Missing Values
    - Filter Examples with Missing Values

    They offer quite some possiblities to configure which Attributes/Examples should be kept (e.g. maximum absolut or relative number of missing, one non-missing etc) and should catch your problem.

    Feel free to update ;-)

    Here is the marketplace link: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox

    Best regards
    Fabian
  • kayman
    kayman New Altair Community Member
    Works like a dream, thanks a lot !