Identify Duplicate examples

Hi,
I have a data set in which I want to identify duplicates (unlike Remove Duplicates, I want to keep only the duplicated examples).
For example, given the data below:
Month Name Amount
Jul-15 John 10$
Aug-15 Alex 15$
Sep-15 John 5$
Jul-15 John 10$
If the table above is my input, I want only the rows below in my results:
Month Name Amount
Jul-15 John 10$
Jul-15 John 10$
Best Answer
If you don't actually need the duplicated examples themselves, but rather a count of how many times each one appears, this is how I would handle it:
1 - aggregate the table (Aggregate operator: group by all attributes and count on one of them)
2 - filter examples for all count(attribute) > 1
I'm assuming that, since there is no unique identifier you are ignoring, you don't really need each duplicate repeated as many times as it appears, but it might be useful to know how many times each one appears!
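To make the two steps concrete, here is a minimal sketch in plain Python rather than RapidMiner. It only illustrates the logic (group by every column, count, keep counts above 1); the function names `duplicate_counts` and `duplicated_rows` are made up for this example, and the rows are the sample data from the question.

```python
from collections import Counter

# Sample rows from the question: (Month, Name, Amount)
rows = [
    ("Jul-15", "John", "10$"),
    ("Aug-15", "Alex", "15$"),
    ("Sep-15", "John", "5$"),
    ("Jul-15", "John", "10$"),
]


def duplicate_counts(rows):
    """Step 1: group by all attributes and count; step 2: keep counts > 1."""
    counts = Counter(rows)  # each full row acts as the group key
    return {row: n for row, n in counts.items() if n > 1}


def duplicated_rows(rows):
    """Hypothetical extra step: return every copy of a row that occurs more than once."""
    counts = Counter(rows)
    return [row for row in rows if counts[row] > 1]


print(duplicate_counts(rows))
print(duplicated_rows(rows))
```

The second function is not part of the two steps above. It shows that, once the counts exist, looking them up against the original rows gives back both copies of the duplicated row, which is the result the question asks for.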
Answers
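Hi, that was a good puzzle. I would do it this way: generate an ID, run Remove Duplicates on a copy of the data, then use Set Minus to keep the examples that Remove Duplicates dropped.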
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_id" compatibility="7.2.002" expanded="true" height="82" name="Generate ID" width="90" x="179" y="136"/>
<operator activated="true" class="multiply" compatibility="7.2.002" expanded="true" height="103" name="Multiply" width="90" x="313" y="136"/>
<operator activated="true" class="remove_duplicates" compatibility="7.2.002" expanded="true" height="82" name="Remove Duplicates" width="90" x="514" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Amount|Month|Name"/>
</operator>
<operator activated="true" class="set_minus" compatibility="7.2.002" expanded="true" height="82" name="Set Minus" width="90" x="715" y="136"/>
<connect from_port="input 1" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Set Minus" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
<connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Scott
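For readers who want to trace what the process does, here is a rough plain-Python sketch of the same three steps. It assumes that Remove Duplicates keeps the first example of each group and that Set Minus matches examples by their ID; the function name `extra_copies` is made up for this illustration.

```python
# Sample rows from the question: (Month, Name, Amount)
rows = [
    ("Jul-15", "John", "10$"),
    ("Aug-15", "Alex", "15$"),
    ("Sep-15", "John", "5$"),
    ("Jul-15", "John", "10$"),
]


def extra_copies(rows):
    """Mimic Generate ID -> Remove Duplicates -> Set Minus on a list of tuples."""
    # Generate ID: tag each row with a 1-based id
    with_ids = list(enumerate(rows, start=1))

    # Remove Duplicates (on a copy): keep the first row for each
    # (Month, Name, Amount) combination and remember its id
    seen = set()
    kept_ids = set()
    for row_id, row in with_ids:
        if row not in seen:
            seen.add(row)
            kept_ids.add(row_id)

    # Set Minus: from the full set, drop every id that survived de-duplication
    return [(row_id, row) for row_id, row in with_ids if row_id not in kept_ids]


print(extra_copies(rows))
```

Under those assumptions the output contains only the second and later copies of each duplicated row, so a row that appears twice shows up once in the result.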