Adding a column that performs a distinct count
Hi all,
Very new to Rapidminer!
I am needing to add a column into my table that runs a distinct count.
This is a sample of my table currently:
CaseNumber | Date |
1A | 1/1/2015 |
1A | 1/2/2015 |
1A | 1/2/2015 |
2A | 2/1/2015 |
3A | 3/1/2015 |
2A | 4/1/2015 |
The Distinct Count would perform a distinct count on Case Number so:
CaseNumber | Date | DistinctCount |
1A | 1/1/2015 | 1 |
1A | 1/2/2015 | 0 |
1A | 1/2/2015 | 0 |
2A | 2/1/2015 | 1 |
2A | 3/1/2015 | 0 |
3A | 4/1/2015 | 1 |
Basically I just want to count how many unique case numbers there are.
So casenumber 1A occurs 3 times, but it's still just the 1 case number.
Same thing for casenumber 2A, it occurs twice, but still just the 1 case number.
Answers
-
Hi,
That's pretty simple to do if you use the Aggregate Operator and select "Only Distinct"Here's a sample process attached.
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Products" width="90" x="246" y="85">
<parameter key="repository_entry" value="//Samples/data/Products"/>
</operator>
<operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="447" y="85">
<list key="aggregation_attributes">
<parameter key="Product Name" value="count"/>
</list>
<parameter key="group_by_attributes" value="Product Name"/>
<parameter key="only_distinct" value="true"/>
</operator>
<connect from_op="Retrieve Products" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>0 -
Hi Thomas,
Thanks for the reply.
I might missing something here, but I don't want just the distinct count. I want to keep all my existing data and add a new column that performs a distinct count on my CaseNumber field.
When I run the Aggregate Operator and select "Only Distinct", I'm give just the distinct count and nothing else.
0 -
Why not just use the "Remove Duplicate" operator? Then your dataset will only contain the non-duplicated entries, and the total record count will equal the distinct record count. You can specify the fields that define a unique record in that operator, so you can use both case number and date or any other combination of attributes.
Best,
0 -
There are a number of additional fields in this dataset that I need.
In this dataset, "casenumber" refers to a specific survey of about 50 questions, therefore there will be duplicate casenumbers because there are a number of different questions for each casenumber.
Unfortunatley removing duplicates is not an option.
0 -