filter all duplicate examples

Updated Nov 5, 2024 by Jocelyn

I'm a newbie in RapidMiner. I want to filter out every example that has a duplicate value. I use the process below, but if a name appears 5 times, the result still shows 4 of them. How can I filter out all 5 and still keep the other attributes in my result...

lionelderkrikor

New Altair Community Member

Hi @neginz,

Can you share your dataset and your process?

Can you also explain, with an example, what you get now and what you want to obtain?

Regards,

Lionel

New Altair Community Member

If I understand you correctly, you want to eliminate any records that have duplicates. Here's a simple technique I have used to do this in the past. First, use Aggregate to group by name (or whatever constitutes the unique key that defines a duplicate; note this can be more than one field) and take the count of name, which gives you how many times each name appears. Then use Filter Examples on that set to keep only the records whose count equals one, and Join (using Inner Join) back to the original dataset. Presto: you should then have only the records that appeared once!

New Altair Community Member

Hi @lionelderkrikor,

My data are customer comments, and I want to extract the rows whose author has commented more than once. In the process I created, when there are, for example, 2 rows with the same author, the result shows only one of them (the case where the absolute count in the picture = 2). I think this is because of the Remove Duplicates operator: it removes only the duplicate copies, not every row whose value has duplicates, so one row per author always remains in its output.

screenshot of data

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="free_memory" compatibility="8.2.001" expanded="true" height="68" name="Free Memory" width="90" x="782" y="646"/>
      <operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve tablet-300-eng-f" width="90" x="112" y="340">
        <parameter key="repository_entry" value="../../data/Digikala-Data/tablet-300-eng-f"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="112" y="34">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Author.contains.guest"/>
        </list>
      </operator>
      <operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="313" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="Author"/>
        <parameter key="attributes" value="Comment id|Author|Content"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="Author"/>
        <parameter key="attributes" value="Comment id|Author|Content"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="8.2.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="514" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Author"/>
      </operator>
      <operator activated="true" class="set_minus" compatibility="8.2.001" expanded="true" height="82" name="Set Minus" width="90" x="715" y="238"/>
      <connect from_op="Retrieve tablet-300-eng-f" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
      <connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

3.png

7.PNG

New Altair Community Member

Hi @Telcontar120,

Thanks for the help. I tried that before without the joining part, and the result has only 2 attributes: one is the count and the other is the attribute I grouped by. Could you please explain the joining part in more detail?

result without inner join operator

6.PNG

New Altair Community Member

If you post a small data sample, it would be easier to help you.

Basically you want to take the output you are showing, but filter it for those records that only have a count of 1.

Then you will use that to join back to the original full dataset that has all the duplicates, but the inner join will only keep the records that have a count of one.

New Altair Community Member

@Telcontar120

Sorry, but how can I post Excel data here? It gives a file-extension error even when I use .rar.

New Altair Community Member

Just post it as csv or txt

New Altair Community Member

Thanks, and sorry, @Telcontar120.

This is a small sample of my data. I want my result to include the "Comment id" attribute.

Sorry for my English.

forum.csv

New Altair Community Member

Here is a process that does what you describe in your original post. It removes posts from authors that have more than one comment (i.e., it removes all items that belong to a duplicate set by author). You should be able to adapt this to your needs very easily. The first operator will need to have the path to your data file modified, of course.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="9.0.000-BETA" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
        <parameter key="csv_file" value="C:\Users\brian\Downloads\forum.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="skip_comments" value="true"/>
        <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
        <list key="annotations"/>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Comment id.true.polynominal.attribute"/>
          <parameter key="1" value="Author.true.polynominal.attribute"/>
          <parameter key="2" value="Title.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.0.000-BETA" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
        <list key="aggregation_attributes">
          <parameter key="Comment id" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="Author"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.0.000-BETA" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="count(Comment id).eq.1"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.0.000-BETA" expanded="true" height="82" name="Join" width="90" x="514" y="85">
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Author" value="Author"/>
        </list>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

New Altair Community Member

New Microsoft PowerPoint Presentation.jpg

@Telcontar120 Thanks for your help, but that is not what I wanted. I need a result like the one in the picture below. I hope you'll get it now.

New Altair Community Member

Accepted Answer

Simple, this is just the complement of what I already posted. Simply change the Filter Examples condition to count>1 rather than =1 and you will get ONLY the duplicates. I thought you did NOT want the duplicates.
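
As a rough sketch of that one change, assuming the attribute names from the process posted earlier (Aggregate produces `count(Comment id)`) and assuming `gt` is the keyword the filter list uses for "greater than" (in the same way the earlier process uses `eq`), the Filter Examples operator would look something like this. Everything else in the process stays as it was.

```xml
<!-- Sketch only: drop-in replacement for the Filter Examples operator in the process above. -->
<!-- Assumption: "gt" is the greater-than keyword, mirroring the "eq" used in the original. -->
<operator activated="true" class="filter_examples" compatibility="9.0.000-BETA" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
  <list key="filters_list">
    <!-- Keep only authors whose comment count is above 1 (the original kept count = 1). -->
    <parameter key="filters_entry_key" value="count(Comment id).gt.1"/>
  </list>
</operator>
```

With that filter in place, the inner Join on Author should return every original row whose author has more than one comment, so an author who appears 5 times contributes all 5 rows, each with its Comment id and other attributes.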
