Remove Duplicate Examples

User: "gracewei"
New Altair Community Member
Updated by Jocelyn
Hi, 

I'm working on some genetics data. I have 6151 examples and 157 attributes. My attributes are patient IDs and my examples are gene names. My goal is to transpose the matrix table. Here is a sample of my data set: 


My problem is that I can't use the "Transpose" operator because there are duplicate row/example names. After transposing, the example names become attribute names, and attribute names must be unique. I would like to find all the examples that share the same name and edit their names. I was thinking about using a loop, but I don't really know where to start or which operators to use to change the row names. Can somebody give me some advice on how to achieve this?

Thank you! 

    User: "cdaponte"
    New Altair Community Member
    Accepted Answer
    You can use the "Remove Duplicates" operator and select the output that delivers the duplicate examples. Once you have the duplicates, you can rename them with the "Rename" or "Replace" operator.
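
    As a rough illustration of that first step (listing which names are duplicated before renaming them), here is a minimal sketch in plain Python. It is not the Remove Duplicates operator itself, only the same idea outside RapidMiner; the gene names are made-up sample values.

```python
from collections import Counter

# Hypothetical sample of example names; real data would come from the gene name column.
gene_names = ["BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1"]

# Count every name, then keep only those that occur more than once.
counts = Counter(gene_names)
duplicated = {name: n for name, n in counts.items() if n > 1}

print(duplicated)  # {'BRCA1': 3, 'TP53': 2}
```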
    User: "lionelderkrikor"
    New Altair Community Member
    Accepted Answer
    Hi @gracewei,

    Nice challenge, but honestly, I don't see any way to do what you want automatically with RapidMiner's native operators...
    ... however, there is a (relatively) simple solution using a Python script.
    Basically, the script appends a number to each duplicated name, and this number is incremented with every further occurrence of that name.
    Concretely, the output example set looks like this:



    After executing this process, all the values of the "gene_name" attribute are unique, and thus you can transpose your example set...
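
    For comparison, the same renaming followed by the transpose can be sketched directly in pandas, outside RapidMiner. This is not the script used in the process below but an alternative technique (`groupby(...).cumcount()`). It assumes a column called `gene_name`, and the patient columns and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one row per gene, one column per patient.
df = pd.DataFrame({
    "gene_name": ["BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1"],
    "patient_1": [1, 2, 3, 4, 5, 6],
    "patient_2": [6, 5, 4, 3, 2, 1],
})

# Flag every row whose name occurs more than once (all occurrences, not just repeats).
is_dup = df.duplicated("gene_name", keep=False)

# cumcount() numbers the rows within each name group starting at 0, so add 1.
position = df.groupby("gene_name").cumcount() + 1

# Append "_<n>" to duplicated names only; unique names stay untouched.
df.loc[is_dup, "gene_name"] = (
    df.loc[is_dup, "gene_name"] + "_" + position[is_dup].astype(str)
)

# With unique names, they can serve as the index and the table can be transposed.
transposed = df.set_index("gene_name").T
print(transposed)
```

    Note that a generated name such as `BRCA1_1` could still collide with a name that already exists in the data, so checking `df["gene_name"].is_unique` before transposing is a sensible safeguard.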

    To execute this process, you need to:
     - Install Python on your computer
     - Install the Python Scripting extension in RapidMiner (from the Marketplace)

    The process:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="read_excel" compatibility="9.4.000-BETA" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
            <parameter key="excel_file" value="D:\Lionel\Formations_DataScience\Rapidminer\Tests_Rapidminer\Rename_Duplicates\Rename_Duplicates.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="gene_name.true.polynominal.attribute"/>
              <parameter key="1" value="Target.true.integer.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.000" expanded="true" height="103" name="Execute Python" width="90" x="313" y="34">
            <parameter key="script" value="import pandas&#10;from collections import Counter # Counter counts the number of occurrences of each item&#10;from itertools import tee, count&#10;&#10;def uniquify(seq, suffs = None):&#10;    &quot;&quot;&quot;Make all the items unique by adding a suffix (1, 2, etc).&#10;&#10;    `seq` is a mutable sequence of strings, modified in place.&#10;    `suffs` is an optional alternative suffix iterable.&#10;    &quot;&quot;&quot;&#10;    if suffs is None:&#10;        suffs = count(1) # fresh counter per call, not shared between calls&#10;    not_unique = [k for k,v in Counter(seq).items() if v&gt;1] # names that occur more than once&#10;    # suffix generator dict: one independent generator per duplicated name&#10;    suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))  &#10;    for idx,s in enumerate(seq):&#10;        try:&#10;            suffix = str(next(suff_gens[s]))&#10;        except KeyError:&#10;            # s was unique&#10;            continue&#10;        else:&#10;            seq[idx] += suffix&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  mylist = data['gene_name'].tolist() # plain Python list, safe to edit in place&#10;  uniquify(mylist, (f'_{x!s}' for x in count(1))) # unbounded suffixes: _1, _2, ...&#10;  data['gene_name']  = mylist&#10;&#10;  # the returned data frame is delivered at the first output port&#10;  return data"/>
            <parameter key="notebook_cell_tag_filter" value=""/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
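
    Because the script is stored XML-escaped inside the `script` parameter, it is hard to read in place. Below is a minimal standalone sketch of its suffixing idea (one independent suffix stream per duplicated name, via `itertools.tee`) that can be tried on a plain list outside RapidMiner. The sample names are made up, the `suffixes` parameter name is my own, and `rm_main` and the data frame handling are left out.

```python
from collections import Counter
from itertools import count, tee


def uniquify(seq, suffixes=None):
    """Append a running suffix to every item of `seq` that occurs more than once.

    `seq` is a mutable list of strings and is modified in place.
    `suffixes` is an optional iterable of suffix strings (default: _1, _2, ...).
    """
    if suffixes is None:
        suffixes = (f"_{n}" for n in count(1))
    # Names that appear more than once.
    not_unique = [name for name, n in Counter(seq).items() if n > 1]
    # tee() gives each duplicated name its own independent copy of the suffix stream.
    generators = dict(zip(not_unique, tee(suffixes, len(not_unique))))
    for idx, name in enumerate(seq):
        if name in generators:
            seq[idx] += next(generators[name])


# Hypothetical sample names.
names = ["BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1"]
uniquify(names)
print(names)  # ['BRCA1_1', 'TP53_1', 'BRCA1_2', 'EGFR', 'TP53_2', 'BRCA1_3']
```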
    
    Hope this will help in the future ...

    Regards,

    Lionel