Adding a column that performs a distinct count

darrenvermaak
darrenvermaak New Altair Community Member
edited November 5 in Community Q&A

Hi all,

Very new to Rapidminer!

I am needing to add a column into my table that runs a distinct count.

This is a sample of my table currently:

 

CaseNumber     Date
1A 1/1/2015
1A 1/2/2015
1A 1/2/2015
2A 2/1/2015
3A 3/1/2015
2A 4/1/2015

 

The Distinct Count would perform a distinct count on Case Number so:

CaseNumber  Date DistinctCount
1A  1/1/2015    1
1A  1/2/2015  0
1A  1/2/2015  0
2A  2/1/2015  1
2A  3/1/2015  0
3A  4/1/2015  1

 

Basically I just want to count how many unique case numbers there are.

So casenumber 1A occurs 3 times, but it's still just the 1 case number.

Same thing for casenumber 2A, it occurs twice, but still just the 1 case number.

Tagged:

Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Hi,


    That's pretty simple to do if you use the Aggregate Operator and select "Only Distinct"

     

    Here's a sample process attached.

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Products" width="90" x="246" y="85">
    <parameter key="repository_entry" value="//Samples/data/Products"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="447" y="85">
    <list key="aggregation_attributes">
    <parameter key="Product Name" value="count"/>
    </list>
    <parameter key="group_by_attributes" value="Product Name"/>
    <parameter key="only_distinct" value="true"/>
    </operator>
    <connect from_op="Retrieve Products" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • darrenvermaak
    darrenvermaak New Altair Community Member

    Hi Thomas,

    Thanks for the reply.

    I might missing something here, but I don't want just the distinct count. I want to keep all my existing data and add a new column that performs a distinct count on my CaseNumber field.

    When I run the Aggregate Operator and select "Only Distinct", I'm give just the distinct count and nothing else.

  • Telcontar120
    Telcontar120 New Altair Community Member

    Why not just use the "Remove Duplicate" operator?  Then your dataset will only contain the non-duplicated entries, and the total record count will equal the distinct record count.   You can specify the fields that define a unique record in that operator, so you can use both case number and date or any other combination of attributes.

     

    Best,

     

     

  • darrenvermaak
    darrenvermaak New Altair Community Member

    There are a number of additional fields in this dataset that I need. 

    In this dataset, "casenumber" refers to a specific survey of about 50 questions, therefore there will be duplicate casenumbers because there are a number of different questions for each casenumber.

    Unfortunatley removing duplicates is not an option.

  • darrenvermaak
    darrenvermaak New Altair Community Member

    I found a way to do this:

     

    Capture.PNG