Adding a column that performs a distinct count

darrenvermaak

New Altair Community Member

Sep 16, 2016

Updated Nov 5, 2024 by Jocelyn

Hi all,

Very new to Rapidminer!

I am needing to add a column into my table that runs a distinct count.

This is a sample of my table currently:

CaseNumber	Date
1A	1/1/2015
1A	1/2/2015
1A	1/2/2015
2A	2/1/2015
3A	3/1/2015
2A	4/1/2015

The Distinct Count would perform a distinct count on Case Number so:

CaseNumber	Date	DistinctCount
1A	1/1/2015	1
1A	1/2/2015	0
1A	1/2/2015	0
2A	2/1/2015	1
2A	3/1/2015	0
3A	4/1/2015	1

Basically I just want to count how many unique case numbers there are.

So casenumber 1A occurs 3 times, but it's still just the 1 case number.

Same thing for casenumber 2A, it occurs twice, but still just the 1 case number.

Find more posts tagged with

AI Studio

Sort by:

1 - 5 of 51

Thomas_Ott

New Altair Community Member

Sep 16, 2016

Hi,

That's pretty simple to do if you use the Aggregate Operator and select "Only Distinct"

Here's a sample process attached.

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Products" width="90" x="246" y="85">
        <parameter key="repository_entry" value="//Samples/data/Products"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="447" y="85">
        <list key="aggregation_attributes">
          <parameter key="Product Name" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="Product Name"/>
        <parameter key="only_distinct" value="true"/>
      </operator>
      <connect from_op="Retrieve Products" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

darrenvermaak

New Altair Community Member

Sep 16, 2016

Hi Thomas,

Thanks for the reply.

I might missing something here, but I don't want just the distinct count. I want to keep all my existing data and add a new column that performs a distinct count on my CaseNumber field.

When I run the Aggregate Operator and select "Only Distinct", I'm give just the distinct count and nothing else.

Telcontar120

New Altair Community Member

Sep 16, 2016

Why not just use the "Remove Duplicate" operator? Then your dataset will only contain the non-duplicated entries, and the total record count will equal the distinct record count. You can specify the fields that define a unique record in that operator, so you can use both case number and date or any other combination of attributes.

Best,

darrenvermaak

New Altair Community Member

Sep 16, 2016

There are a number of additional fields in this dataset that I need.

In this dataset, "casenumber" refers to a specific survey of about 50 questions, therefore there will be duplicate casenumbers because there are a number of different questions for each casenumber.

Unfortunatley removing duplicates is not an option.

darrenvermaak

New Altair Community Member

Sep 16, 2016

I found a way to do this:

Capture.PNG

Adding a column that performs a distinct count

Find more posts tagged with

Quick Links