Hi All, I need to remove duplicates from each cell of an attribute/column. Do we have a quick soluti

Achint
Achint New Altair Community Member
edited November 2024 in Community Q&A
I am new to RapidMiner and I'm working on a large project at my company, so I need your help here.

Below is what I need to implement:

Problem:
COLUMN
A|B|V|A|B
C|V|B|C
E|R|T|Y|E

Solution required:
COLUMN
A|B|V
C|V|B
E|R|T|Y

I need a solution like the one above, where the duplicate entries within each cell are removed. The entries are separated by "|".

Appreciate your help on this.
Answers

  • IngoRM
    IngoRM New Altair Community Member
    edited December 2018
    Hi,
    That's a nice challenge - I will send you my consulting bill ;-)
    The solution is to:
    1. Split the column along "|"
    2. Transpose the data so that the column contents become rows
    3. Loop over the Attributes (the former rows)
    4. Inside the loop, Remove Duplicates for each attribute (using the loop_attribute macro)
    5. Only keep the resulting attribute with Select Attributes
    6. Filter Examples to remove the rows with missing values (these result from the different lengths of the cell contents)
    7. Transpose the data back
    8. Generate Aggregate to concatenate the result back into one cell
    9. Outside of the loop, Append all the individual results
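
    For readers who want to check the expected result outside RapidMiner, the logic behind the steps above (split on "|", drop repeats, concatenate again) can be sketched in plain Python. This is only an illustration and not the attached process; the function name `dedupe_cell` is hypothetical, and the sketch assumes the first occurrence of each item should be kept in its original order.

```python
def dedupe_cell(cell, sep="|"):
    """Remove repeated items from one delimited cell, keeping first-seen order.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    if cell is None:
        # Pass missing cells through unchanged.
        return None
    # Split the cell and ignore empty fragments such as the one in "A||B".
    items = [item for item in cell.split(sep) if item != ""]
    # dict.fromkeys keeps insertion order (Python 3.7+), so each item
    # survives once, at the position of its first occurrence.
    return sep.join(dict.fromkeys(items))


# Data from the original post.
column = ["A|B|V|A|B", "C|V|B|C", "E|R|T|Y|E"]
result = [dedupe_cell(cell) for cell in column]
print(result)  # ['A|B|V', 'C|V|B', 'E|R|T|Y']
```

    Because each cell is handled on its own, cells of different lengths need no special treatment in this sketch.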
    I have attached an Excel file with the data from your original post as well as the complete process as .rmp file which you can import in File -> Import Process... in RapidMiner Studio.

    Hope this helps,
    Ingo
  • IngoRM
    IngoRM New Altair Community Member
    edited December 2018
    @lionelderkrikor beat me to it :blush:
    But then again, my process handles the missing values and works for an arbitrary number of items. So I consider this "even" :wink:
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @IngoRM,

    Let's be objective: I admit my defeat ... ;) Great process!

    I did not think about Generate Aggregate.

    Regards and ... Congratulations

    Lionel
  • IngoRM
    IngoRM New Altair Community Member
    Haha :smiley:  That's how I constantly feel: there is always another operator I did not think of :smiley:
  • Achint
    Achint New Altair Community Member
    @IngoRM: Thanks a lot for the solution to this, along with the attachments. It's quite a process to follow, but understandable. :)
  • Achint
    Achint New Altair Community Member
    @lionelderkrikor : Thank you as well for your time on this solution. :) 
  • Achint
    Achint New Altair Community Member
    edited January 2019
    Hi @IngoRM:

    Hope you are doing well. 

    Thanks for the solution you provided earlier. The example you sent me, with the data and the .rmp file, contains only one column in the data file. My data file has multiple columns, but I need to remove duplicates from only one specified column. How do we select that particular attribute for removing the duplicates, so that the attribute loop runs only for that column and not for all of them?
    Please find attached an Excel file with multiple columns, with the column to be worked on highlighted in yellow. (The two columns highlighted in orange are concatenated into the new column in yellow.)

    Also attached is the .rmp file you provided. It would be great if you could show the change it needs when it is connected to the attached example set.

    Hoping to find a way to make this possible as well from your side.

    Looking forward to hearing from you! Thanks a lot and a happy new year.   

    Regards,
    Achint Kr
  • Telcontar120
    Telcontar120 New Altair Community Member
    You can try using Ingo's solution inside a Loop Attributes operator and specify only the relevant attributes, either with the subset selection or with a regular expression if they share a naming convention.
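
    To illustrate the idea of restricting the work to attributes whose names match a regular expression, here is a rough Python sketch. It does not show how Loop Attributes is configured; the column names (`tags_a`, `tags_b`, `note`) and both function names are made up for the example.

```python
import re


def dedupe_cell(cell, sep="|"):
    """Drop repeated items in one delimited cell, keeping first-seen order."""
    return sep.join(dict.fromkeys(cell.split(sep)))


def dedupe_matching_columns(rows, pattern, sep="|"):
    """De-duplicate only the columns whose whole name matches `pattern`.

    `rows` is a list of dicts (one dict per row); other columns are copied as-is.
    """
    regex = re.compile(pattern)
    cleaned = []
    for row in rows:
        cleaned.append({
            name: dedupe_cell(value, sep) if regex.fullmatch(name) else value
            for name, value in row.items()
        })
    return cleaned


# Hypothetical data: only the "tags_*" columns should be touched.
rows = [{"id": "1", "tags_a": "A|B|A", "tags_b": "X|X", "note": "a|a"}]
out = dedupe_matching_columns(rows, r"tags_.*")
print(out)  # [{'id': '1', 'tags_a': 'A|B', 'tags_b': 'X', 'note': 'a|a'}]
```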
  • IngoRM
    IngoRM New Altair Community Member
    Hi again,
    Yes, you could use a loop.  But since you mentioned that you only have one column for which you want to perform the "deduplication", it might be easier just to divide the data into two parts (one with that column only and one with all the other columns).  You can then perform the process above on the selected column and join the results later on.
    Attached is the modified process including some annotations as well as the data this process runs on (it wasn't immediately obvious to me in your data which one is the column you want to work on so I used the original data here again - I am sure you can adapt to your data).
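
    The split-and-join idea described above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the attached process: it assumes every row has a unique key column (called `id` here), and the names `dedupe_one_column`, `name`, and `COLUMN` are chosen only for the example.

```python
def dedupe_cell(cell, sep="|"):
    """Drop repeated items in one delimited cell, keeping first-seen order."""
    return sep.join(dict.fromkeys(cell.split(sep)))


def dedupe_one_column(rows, target, key="id", sep="|"):
    """Split off `target`, de-duplicate it, and join it back on `key`.

    Assumes `key` is unique per row (an assumption of this sketch).
    """
    # Part 1: the key together with the column to be cleaned.
    target_part = {row[key]: row[target] for row in rows}
    # Part 2: all other columns (the key stays so the parts can be joined).
    other_part = [{k: v for k, v in row.items() if k != target} for row in rows]
    # Work only on part 1.
    cleaned = {k: dedupe_cell(v, sep) for k, v in target_part.items()}
    # Join the cleaned column back to the untouched columns via the key.
    return [{**row, target: cleaned[row[key]]} for row in other_part]


rows = [
    {"id": 1, "name": "x", "COLUMN": "A|B|V|A|B"},
    {"id": 2, "name": "y", "COLUMN": "C|V|B|C"},
]
joined = dedupe_one_column(rows, target="COLUMN")
print(joined)
```

    The other columns are never passed through the de-duplication step, so their contents stay exactly as they were.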
    Hope this helps,
    Ingo