Altair RISE

A program to recognize and reward our most engaged community members

Nominate Yourself Now!

Hi All, I need to remove duplicates from each cell of an attribute/column. Do we have a quick soluti

I am new to rapidminer and i'm working on a huge project in my company, hence require your help here.

The below is what i need to implement

Problem:
COLUMN
A|B|V|A|B
C|V|B|C
E|R|T|Y|E

Solution required:
COLUMN
A|B|V
C|V|B
E|R|T|Y

I need a solution as above where i am removing the duplicate entries in the cell separated by "|".

Appreciate your help on this.

Find more posts tagged with

AI Studio

Accepted answers

All comments

lionelderkrikor

Hi @Achint
Does this process answer to your need ?

Image: https://us.v-cdn.net/6030995/uploads/editor/qy/acjeq8wpxmxa.png

I add "missing" because when I try to concatenate a valid attribute with an attribute containing a missing value (?)
the result is "?".

The process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.7.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="85">
        <parameter key="generator_type" value="comma_separated_text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="input_csv_text" value="Id,Att1&#10;1,A|B|V|A|B&#10;2,C|V|B|C&#10;3,E|R|T|Y|E"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="split" compatibility="9.1.000" expanded="true" height="82" name="Split" width="90" x="179" y="85">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="split_pattern" value="[|]"/>
        <parameter key="split_mode" value="ordered_split"/>
      </operator>
      <operator activated="true" class="transpose" compatibility="9.1.000" expanded="true" height="82" name="Transpose" width="90" x="313" y="85"/>
      <operator activated="true" class="concurrency:loop_attributes" compatibility="9.1.000" expanded="true" height="82" name="Loop Attributes" width="90" x="447" y="85">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="attribute_name_macro" value="loop_attribute"/>
        <parameter key="reuse_results" value="false"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="112" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="%{loop_attribute}"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="9.1.000" expanded="true" height="103" name="Remove Duplicates" width="90" x="313" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="%{loop_attribute}"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="treat_missing_values_as_duplicates" value="false"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.1.000" expanded="true" height="82" name="Transpose (2)" width="90" x="447" y="85"/>
          <connect from_port="input 1" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
          <connect from_op="Transpose (2)" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="subprocess" compatibility="9.1.000" expanded="true" height="82" name="Union Append" origin="GENERATED_COMMUNITY" width="90" x="581" y="85">
        <process expanded="true">
          <operator activated="true" class="loop_collection" compatibility="9.1.000" expanded="true" height="82" name="Output (4)" origin="GENERATED_COMMUNITY" width="90" x="45" y="34">
            <parameter key="set_iteration_macro" value="true"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="false" breakpoints="after" class="select" compatibility="9.1.000" expanded="true" height="68" name="Select (5)" origin="GENERATED_COMMUNITY" width="90" x="112" y="34">
                <parameter key="index" value="%{iteration}"/>
                <parameter key="unfold" value="false"/>
              </operator>
              <operator activated="true" class="branch" compatibility="9.1.000" expanded="true" height="82" name="Branch (2)" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
                <parameter key="condition_type" value="expression"/>
                <parameter key="expression" value="%{iteration}==1"/>
                <parameter key="io_object" value="ANOVAMatrix"/>
                <parameter key="return_inner_output" value="true"/>
                <process expanded="true">
                  <connect from_port="condition" to_port="input 1"/>
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="recall" compatibility="9.1.000" expanded="true" height="68" name="Recall (5)" origin="GENERATED_COMMUNITY" width="90" x="45" y="187">
                    <parameter key="name" value="LoopData"/>
                    <parameter key="io_object" value="ExampleSet"/>
                    <parameter key="remove_from_store" value="true"/>
                  </operator>
                  <operator activated="true" class="union" compatibility="9.1.000" expanded="true" height="82" name="Union (2)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34"/>
                  <connect from_port="condition" to_op="Union (2)" to_port="example set 1"/>
                  <connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/>
                  <connect from_op="Union (2)" from_port="union" to_port="input 1"/>
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="remember" compatibility="9.1.000" expanded="true" height="68" name="Remember (5)" origin="GENERATED_COMMUNITY" width="90" x="581" y="34">
                <parameter key="name" value="LoopData"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="store_which" value="1"/>
                <parameter key="remove_from_process" value="true"/>
              </operator>
              <connect from_port="single" to_op="Branch (2)" to_port="condition"/>
              <connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/>
              <connect from_op="Remember (5)" from_port="stored" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select" compatibility="9.1.000" expanded="true" height="68" name="Select (6)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34">
            <parameter key="index" value="%{iteration}"/>
            <parameter key="unfold" value="false"/>
          </operator>
          <connect from_port="in 1" to_op="Output (4)" to_port="collection"/>
          <connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/>
          <connect from_op="Select (6)" from_port="selected" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="9.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="715" y="85">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default" value="value"/>
        <list key="columns"/>
        <parameter key="replenishment_value" value="missing"/>
      </operator>
      <operator activated="true" class="generate_concatenation" compatibility="9.1.000" expanded="true" height="82" name="Generate Concatenation" width="90" x="849" y="85">
        <parameter key="first_attribute" value="Att1_1"/>
        <parameter key="second_attribute" value="Att1_2"/>
        <parameter key="separator" value="|"/>
        <parameter key="trim_values" value="false"/>
      </operator>
      <operator activated="true" class="generate_concatenation" compatibility="9.1.000" expanded="true" height="82" name="Generate Concatenation (2)" width="90" x="983" y="85">
        <parameter key="first_attribute" value="Att1_1|Att1_2"/>
        <parameter key="second_attribute" value="Att1_3"/>
        <parameter key="separator" value="|"/>
        <parameter key="trim_values" value="false"/>
      </operator>
      <operator activated="true" class="generate_concatenation" compatibility="9.1.000" expanded="true" height="82" name="Generate Concatenation (3)" width="90" x="1117" y="85">
        <parameter key="first_attribute" value="Att1_1|Att1_2|Att1_3"/>
        <parameter key="second_attribute" value="Att1_4"/>
        <parameter key="separator" value="|"/>
        <parameter key="trim_values" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="1251" y="85">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Att1_1|Att1_2|Att1_3|Att1_4"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="sort" compatibility="9.1.000" expanded="true" height="82" name="Sort" width="90" x="1385" y="85">
        <parameter key="attribute_name" value="id"/>
        <parameter key="sorting_direction" value="increasing"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="Loop Attributes" to_port="input 1"/>
      <connect from_op="Loop Attributes" from_port="output 1" to_op="Union Append" to_port="in 1"/>
      <connect from_op="Union Append" from_port="out 1" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Generate Concatenation" to_port="example set input"/>
      <connect from_op="Generate Concatenation" from_port="example set output" to_op="Generate Concatenation (2)" to_port="example set input"/>
      <connect from_op="Generate Concatenation (2)" from_port="example set output" to_op="Generate Concatenation (3)" to_port="example set input"/>
      <connect from_op="Generate Concatenation (3)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Hope it helps,

Regards,

LIonel

IngoRM

Hi,

That's a nice challenge - I will send you my consulting bill ;-)

The solution is to:

Split the column along "|"
Transpose the data so that the column contents become rows
Loop over the Attributes (the former rows)
Inside the loop, Remove Duplicates for each attribute (using the loop_attribute macro)
Only keep the resulting attribute with Select Attributes
Filter Examples to remove the rows with missing (which is a result of the different lengths of the cell contents)
Transpose the data back
Generate Aggregate to concatenate the result back into one cell
Outside of the loop, Append all the individual results

I have attached an Excel file with the data from your original post as well as the complete process as .rmp file which you can import in File -> Import Process... in RapidMiner Studio.

Hope this helps,

Ingo

SplitDuplicatesProcess.rmp

SplitDuplicates.xlsx

IngoRM

@lionelderkrikor has been beating me to it

But then again my process handles the missings and works for arbitrary amount of items. So I consider this an "even"

lionelderkrikor

Hi @IngoRM,

Let's be objective : I avour my defeat ...

Great process !

I did not think about Generate Aggregate.

Regards and ... Congratulations

Lionel

IngoRM

Haha

That's how I constantly feel: there is always another operator I did not think of

Achint

@IngoRM: Thanks alot for the solution to this along with the attachements. Its quite a process to follow but understandable.

Achint

@lionelderkrikor : Thank you as well for your time on this solution.

Achint

Hi @IngoRM:

Hope you are doing well.

Thanks for the solution you provided me earlier. The example you have sent me with data and RMP file contains only "One column" in the data file. Although i have multiple columns in the data file but need to remove duplicate from only the specified column. How do we select that particular attribute for removing the duplicates and loop attribute only for that column not all?
Please find the attached excel file with multiple columns and with the column to be worked upon highlighted in Yellow.(two highlighted in Orange are concatenated to new column in Yellow)

Attached in the rmp file you provided, great if you can provided the change required in it when connected to the example set as attached.

Hoping to find a way to make this possible as well from your side.

Looking forward to hearing from you! Thanks a lot and a happy new year.

Regards,
Achint Kr

SplitDuplicatesProcess.rmp

Example set.xlsx

Telcontar120

You can try using Ingo's solution inside a Loop Attributes operator and specify only the set of attributes that are relevant with either the subset selection or using a regular expression if they have a similar naming convention.

IngoRM

Hi again,

Yes, you could use a loop. But since you mentioned that you only have one column for which you want to perform the "deduplication", it might be easier just to divide the data into two parts (one with that column only and one with all the other columns). You can then perform the process above on the selected column and join the results later on.

Attached is the modified process including some annotations as well as the data this process runs on (it wasn't immediately obvious to me in your data which one is the column you want to work on so I used the original data here again - I am sure you can adapt to your data).

Hope this helps,
Ingo

SplitDuplicatesProcessMultiple.rmp

SplitDuplicates.xlsx