sum up column value with group by

mengkoon007
mengkoon007 New Altair Community Member
edited November 2024 in Community Q&A

Hi,

 

I have the following columns:

a) id

b) month

c) year

d) value_1

e) value_2

with many other columns.

 

All the values in 'value_1' column, I want to sum up with group by id,month,year.

All the values in 'value_2' column, I want to sum up with group by id,month,year.

 

========

Example Set:

id   month  year     value_1    value_2       

1   Jan       2017     5               6

1   Jan       2017     7               8

1   Feb       2017     8               9

============

 

=============

Expected result:

id   month  year     value_1    value_2       

1   Jan       2017     12             14

1   Feb       2017     8               9

===========

 

I tried to use 1 Aggregate operator with the following input:

  • aggregation attribute (value_1) and aggregation functions (sum), then add entry
  • aggregation attribute (value_2) and aggregation functions (sum), 
  • then select group by attribute id,month,year

 

The following is the result I gotten which is incorrect,

1) There is now only 3 columns - id,value_1,value_2

2) It seems to add up all values with the same id => I need id,month,year matches then sum up

Tagged:

Welcome!

It looks like you're new here. Sign in or register to get started.

Best Answer

  • land
    land New Altair Community Member
    Answer ✓

    Hi,

    if you aggregate a dataset all other columns need to be removed because their value would be undefined. 

    In your scenario, you will have multiple rows of the original data set that are belonging to one id / month / year combination. Each of these combinations will be shown as one row in the resulting data set. For the defined aggreagtion functions you will get a column, because the average of any number of rows is defined. But what to do with the other columns? You need to drop them, because you don't know which value to assign for this particular group.

    If you want to calculate the average for all of them, you can use the default aggregation parameter in RapidMiner. It will then use one aggregation function for every attribute.

     

    Greetings,

      Sebastian

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member

    The Aggregate operator should be doing exactly what you are requesting.  I tested it with a short process and it seems to be working fine.  See the attached XML (you'll need to redirect the file path to read in your own data):

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="7.3.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
    <parameter key="csv_file" value="C:\Users\brian\Downloads\test.txt"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="id.true.integer.attribute"/>
    <parameter key="1" value="month.true.polynominal.attribute"/>
    <parameter key="2" value="year.true.integer.attribute"/>
    <parameter key="3" value="value_1.true.integer.attribute"/>
    <parameter key="4" value="value_2 .true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="313" y="85">
    <list key="aggregation_attributes">
    <parameter key="value_1" value="sum"/>
    <parameter key="value_2" value="sum"/>
    </list>
    <parameter key="group_by_attributes" value="id|month|year"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <connect from_op="Aggregate" from_port="original" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

     

     

  • land
    land New Altair Community Member

    Hi,

    this is indeed how the operator is supposed to work. I guess you simply have a spelling error or something else in the group by selection. It seems to me, only id was correct and if you only group by id, the result is correct.

    Here's how it should look like if you select id and label of the iris dataset.

     

    Unbenannt.png

     

    Greetings, 

    Sebastian

  • IngoRM
    IngoRM New Altair Community Member

    Hi,

     

    It looks like you have defined only one group-by attribute but you will need all three of them.  Just move them all three over to the right side in the dialog.  If that is not possible (in case the meta data is not available), then add each one individually and press the "plus" icon in the dialog before adding the next.

     

    If this is not the reason for your problem, please post the process you have used (at least the relevant part) and - if possible - some data so we can have a closer look.

     

    Best,

    Ingo

  • IngoRM
    IngoRM New Altair Community Member

    Brilliant - three times the same answer within 3 minutes :-)

  • mengkoon007
    mengkoon007 New Altair Community Member

    Dear All,

     

    Thanks for the prompt help.

     

    I have used the method as per the screenshot, I think I might have figured why I only gotten id as the grouping.

     

    Before using the aggregate function, I have written an SQL in the read database operation by join 2 tables together. When using the screenshot, my selection actually shows

    • id
    • month(t2.date) => I suppose rapidminer does not recognised this format, thus I update the sql by given it a name 'Month' which it now group correctly
    • year(t2.date) => I suppose rapidminer does not recognised this format, thus I update the sql by given it a name 'Year' which it now group correctly

    ====

    SELECT t1.id as 'id', month(t2.date) as 'Month', year(t2.date) as 'Year'

    FROM table_1 t1

    LEFT JOIN table_2 t2

       on t1.common=t2.common

    =====

     

    However, I have another question.

     

    Before I applying the aggregate function, I have 10 attributes. After applying the function, it only shows those affected attributes. Does the aggregate function automatically removed all the non-related attributes?

  • land
    land New Altair Community Member
    Answer ✓

    Hi,

    if you aggregate a dataset all other columns need to be removed because their value would be undefined. 

    In your scenario, you will have multiple rows of the original data set that are belonging to one id / month / year combination. Each of these combinations will be shown as one row in the resulting data set. For the defined aggreagtion functions you will get a column, because the average of any number of rows is defined. But what to do with the other columns? You need to drop them, because you don't know which value to assign for this particular group.

    If you want to calculate the average for all of them, you can use the default aggregation parameter in RapidMiner. It will then use one aggregation function for every attribute.

     

    Greetings,

      Sebastian

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.