filter all duplicate examples
Hi,
I'm a newbie in RapidMiner. I want to filter all the examples that have a duplicate value. I use the process below, but if a name appears 5 times, the result shows only 4 of them. How can I get all 5 and still keep the other attributes in my result?
Answers
-
Hi @neginz,
Can you share your dataset and your process?
Can you also explain, with an example, what you get now and what you want to obtain?
Regards,
Lionel
0 -
If I understand you correctly, you want to eliminate any records that have duplicates. Here's a simple technique I have used to do this in the past. First, use Aggregate to group by name (or whatever constitutes the unique key that defines a duplicate, and note this can be more than one field) and compute the count of name, which will give you a count of how many times each name appears. Then use Filter Examples on that result to keep only the records with a count equal to one, and Join (using Inner Join) back to the original dataset. Presto: you should then have only the records that appeared once!
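In case the operator chain is easier to follow as code, here is a minimal plain-Python sketch of the same aggregate, filter, inner-join idea. It is not RapidMiner code, just an illustration of the logic; the sample rows and the field names `id`, `author` and `content` are made up.

```python
from collections import Counter

# Hypothetical sample data: one dict per example (row).
rows = [
    {"id": 1, "author": "ann", "content": "good"},
    {"id": 2, "author": "bob", "content": "ok"},
    {"id": 3, "author": "ann", "content": "bad"},
    {"id": 4, "author": "cat", "content": "fine"},
]

# Aggregate: group by author and count the rows in each group.
counts = Counter(row["author"] for row in rows)

# Filter Examples: keep only the groups whose count is exactly 1.
unique_authors = {author for author, n in counts.items() if n == 1}

# Inner Join: keep the original rows whose key survived the filter,
# so the other attributes (id, content) come back with them.
singles = [row for row in rows if row["author"] in unique_authors]

print([row["id"] for row in singles])  # [2, 4]
```

The aggregated table on its own only holds the key and the count, which is why the join back to the original data is needed to recover the remaining attributes.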
0 -
My data are customer comments, and I want to extract the rows whose author commented more than once. In the process I created, when we have, for example, 2 rows with the same author, the result shows only one of them (when the absolute count in the picture is 2). I think it's because of the "Remove Duplicates" operator: it removes only the duplicate values, not all of the values that have duplicates, so one of them always remains and is not removed.
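A rough plain-Python model of this effect, for illustration only. It assumes that Remove Duplicates keeps one row per author (taken here to be the first one encountered) and that Set Minus drops rows by their id; the data and field names are made up.

```python
# Hypothetical data: author "ann" appears 3 times, "bob" once.
rows = [
    {"id": 1, "author": "ann"},
    {"id": 2, "author": "ann"},
    {"id": 3, "author": "bob"},
    {"id": 4, "author": "ann"},
]

# Model of Remove Duplicates on Author: keep one row per author
# (assumed to be the first occurrence).
seen = set()
deduplicated = []
for row in rows:
    if row["author"] not in seen:
        seen.add(row["author"])
        deduplicated.append(row)

# Model of Set Minus: drop every row whose id is in the subtrahend.
kept_ids = {row["id"] for row in deduplicated}
difference = [row for row in rows if row["id"] not in kept_ids]

print([row["id"] for row in difference])  # [2, 4]
```

Under these assumptions an author with n rows ends up with n - 1 rows in the result, which matches the "5 times becomes 4 times" behaviour.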
screenshot of data
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="free_memory" compatibility="8.2.001" expanded="true" height="68" name="Free Memory" width="90" x="782" y="646"/>
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve tablet-300-eng-f" width="90" x="112" y="340">
<parameter key="repository_entry" value="../../data/Digikala-Data/tablet-300-eng-f"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="112" y="34">
<parameter key="invert_filter" value="true"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="Author.contains.guest"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
<operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="313" y="289">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value="Author"/>
<parameter key="attributes" value="Comment id|Author|Content"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value="Author"/>
<parameter key="attributes" value="Comment id|Author|Content"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="8.2.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="514" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Author"/>
</operator>
<operator activated="true" class="set_minus" compatibility="8.2.001" expanded="true" height="82" name="Set Minus" width="90" x="715" y="238"/>
<connect from_op="Retrieve tablet-300-eng-f" from_port="output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
<connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0 -
Thanks for the help. I tried that before without the joining part, and the result has only 2 attributes: one is the count and the other is the attribute I grouped by. Could you please explain the joining part in more detail?
result without inner join operator
0 -
If you post a small data sample, it would be easier to help you.
Basically you want to take the output you are showing, but filter it for those records that only have a count of 1.
Then you will use that to join back to the original full dataset that has all the duplicates, but the inner join will only keep the records that have a count of one.
0 -
Sorry, but how can I post Excel data here? I get a file extension error even when I use .rar.
0 -
Just post it as csv or txt
0 -
Thanks, and sorry, @Telcontar120.
This is a small sample of my data. I want my result to include the "Comment id" attribute.
Sorry for my English.
0 -
Here is a process that does what you describe in your original post. It removes posts from authors that have more than one comment (i.e., it removes all items included in duplicate sets by author). You should be able to adapt this to your needs very easily. The first operator will of course need the path to your data file modified.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="9.0.000-BETA" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
<parameter key="csv_file" value="C:\Users\brian\Downloads\forum.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="skip_comments" value="true"/>
<parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Comment id.true.polynominal.attribute"/>
<parameter key="1" value="Author.true.polynominal.attribute"/>
<parameter key="2" value="Title.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="aggregate" compatibility="9.0.000-BETA" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
<list key="aggregation_attributes">
<parameter key="Comment id" value="count"/>
</list>
<parameter key="group_by_attributes" value="Author"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.0.000-BETA" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="count(Comment id).eq.1"/>
</list>
</operator>
<operator activated="true" class="concurrency:join" compatibility="9.0.000-BETA" expanded="true" height="82" name="Join" width="90" x="514" y="85">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="Author" value="Author"/>
</list>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
1 -
@Telcontar120 thanks for your help, but that is not what I wanted. I need a result like the one in the picture below. I hope it is clear now.
0 -
Simple, this is just the complement of what I already posted. Simply change the Filter Examples condition to count>1 rather than =1 and you will get ONLY the duplicates. I thought you did NOT want the duplicates.
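To make the "complement" point concrete, here is a small plain-Python sketch (not RapidMiner code; the rows are invented). The join step is folded into the list comprehensions, and only the comparison on the count changes between the two results.

```python
from collections import Counter

# Hypothetical data: (comment id, author) pairs.
rows = [(1, "ann"), (2, "bob"), (3, "ann"), (4, "cat"), (5, "ann")]

# Count how many comments each author has.
counts = Counter(author for _, author in rows)

# Only the filter condition differs between the two variants.
only_duplicates = [r for r in rows if counts[r[1]] > 1]   # count > 1
only_singles = [r for r in rows if counts[r[1]] == 1]     # count = 1

print(only_duplicates)  # [(1, 'ann'), (3, 'ann'), (5, 'ann')]
print(only_singles)     # [(2, 'bob'), (4, 'cat')]
```

Every row lands in exactly one of the two lists, and with count > 1 all rows of a repeated author are kept, not all but one.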
3 -
Yes, it works. Thanks a lot for your help! My mistake was that I counted the author instead of the comment id.
1 -
Hi @Telcontar120,
I will be strict:
From an Ambassador and beta tester of RM 9, I would expect you to do this task with the new "Turbo Prep" tool: it is feasible!
Dataset :
Result :
...I'm joking, of course!
Have a nice day and happy experimentations,
Regards,
Lionel
2