"Aggregate and remove attributes"
Jony
New Altair Community Member
Hi,
I have a data set with 2000 columns (attributes), all holding real-valued numbers. I want to aggregate (sum) three attributes into a new attribute, then remove the last two of the three, leaving only the first attribute and the resulting sum. I can do that for one group using the operators generate attribute (sum) and remove range. But I want to apply this procedure to all 2000 attributes: three attributes are summed, leaving two attributes (including the resulting one), then the next three are summed, leaving two attributes, and so on. Is there a way to apply generate attribute and remove range to the whole data set automatically, or is there another way to do it? Please let me know if you need more details.
Jony
Answers
-
Hi Jony,
If your attributes follow a naming scheme (like att_1 ... att_2000), you could probably use the Loop operator with the iteration macro enabled. You can then use this macro to select which attributes you want to sum up.
I hope this gives you an idea how to proceed with your problem, if not feel free to ask.
Best regards,
David
-
Hi David,
Thanks for your reply. I knew a loop operator would be needed, but the problem is that I am not an expert in macros.
Every time I use a loop, it just gives me the same result 1000 times if I set 1000 iterations. What I want is a single result table that contains, for every group of three attribute columns, the first column of the group and the sum of all three, and so on.
If the inputs are 1,2,3,4.. I want the output to be 1,(1+2+3),4,(4+5+6)...
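The requested layout can be sketched in plain Python, purely as an illustration of the target output and not as a RapidMiner process. The function name `keep_first_and_sum` is hypothetical, and dropping a trailing incomplete group is an assumption the thread does not address.

```python
def keep_first_and_sum(row, group_size=3):
    """For each consecutive group, emit the group's first value and its sum."""
    out = []
    # Assumption: values left over after the last full group are dropped.
    last_full = len(row) - len(row) % group_size
    for start in range(0, last_full, group_size):
        group = row[start:start + group_size]
        out.extend([group[0], sum(group)])
    return out


print(keep_first_and_sum([1, 2, 3, 4, 5, 6]))  # [1, 6, 4, 15]
```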
Renaming the generated attributes (the sums of three) is also a problem in a loop. The name needs to be variable so that it changes in every iteration.
Do you think you could build such a process and share it with me? Then I can understand it properly.
regards
Jony
-
Btw, yes, my attributes follow a naming scheme like time 01, time 02, time 03...
-
Here is a simple example process showing how you can generate the needed aggregations, including a simple naming scheme for the attribute names.
The result is a collection of the desired aggregations. What you need to do next is extract these values and simply add them to your dataset as fits your requirements.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="5.3.009" expanded="true" height="60" name="Generate Data" width="90" x="28" y="74">
<parameter key="number_of_attributes" value="16"/>
</operator>
<operator activated="true" class="loop" compatibility="5.3.009" expanded="true" height="76" name="Loop" width="90" x="313" y="75">
<parameter key="set_iteration_macro" value="true"/>
<parameter key="macro_start_value" value="0"/>
<parameter key="iterations" value="4"/>
<process expanded="true">
<operator activated="true" class="generate_macro" compatibility="5.3.009" expanded="true" height="76" name="Generate Macro" width="90" x="112" y="30">
<list key="function_descriptions">
<parameter key="i_1" value="%{iteration}*3+1"/>
<parameter key="i_2" value="%{iteration}*3+2"/>
<parameter key="i_3" value="%{iteration}*3+3"/>
</list>
</operator>
<operator activated="true" class="generate_aggregation" compatibility="5.3.009" expanded="true" height="76" name="Generate Aggregation" width="90" x="313" y="30">
<parameter key="attribute_name" value="a_sum_%{i_1}_%{i_2}_%{i_3}"/>
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|att%{i_1}|att%{i_2}|att%{i_3}"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.3.009" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="a_sum_%{i_1}_%{i_2}_%{i_3}"/>
</operator>
<connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Generate Macro" from_port="through 1" to_op="Generate Aggregation" to_port="example set input"/>
<connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Loop" to_port="input 1"/>
<connect from_op="Loop" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Hi David,
I have 132 attributes, so I set the number of iterations to 44. Now I get 44 results, the last one being a_sum_130_131_132, which is fine. The problem is that all the aggregation results are the same: from the first aggregation 1_2_3 to the last 130_131_132, only the first three attributes are being added. So the 1_2_3 result is right, but all the others are wrong because they are identical (in the attributes parameter of the generate attribute operator I selected the first three). Can you please tell me how to solve this? I also wanted to name the generated attribute sum_time 1_time 2_time 3, since my attributes are named time 1, time 2, time 3... Can you please tell me how I can do that?
thanks
Jony
-
A possible problem might be that you are not using the macros correctly when selecting the attributes in the Generate Aggregation operator.
In my example process I used the macros i_1, i_2 and i_3, which are set in each iteration to the numbers you want to aggregate, like {1,2,3}, {4,5,6} and so on. These macros are then used in the attributes selection of the Generate Aggregation operator.
The entries there have to look like this:
att%{i_1}, att%{i_2} and att%{i_3}, so that for each iteration the values of the macros are evaluated and the correct attributes are selected.
The attribute name is simply defined in the field named "attribute name" of the Generate Aggregation operator. Here again the current values of the macros are substituted in each iteration.
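To make the substitution concrete, here is a small sketch in plain Python that imitates what happens to the two parameter values in each iteration. It is only an illustration, not RapidMiner code: `expand` and `iteration_macros` are hypothetical helper names, while the formulas and templates are the ones from the example process.

```python
def expand(template, macros):
    """Replace every %{name} in the template with the macro's current value."""
    for name, value in macros.items():
        template = template.replace("%{" + name + "}", str(value))
    return template


def iteration_macros(iteration):
    """Same formulas as in the example process: iteration*3+1, +2, +3."""
    return {f"i_{k}": iteration * 3 + k for k in (1, 2, 3)}


for iteration in range(2):
    macros = iteration_macros(iteration)
    name = expand("a_sum_%{i_1}_%{i_2}_%{i_3}", macros)
    selection = expand("|att%{i_1}|att%{i_2}|att%{i_3}", macros)
    print(name, "<-", selection)
```

If the selection is fixed to three concrete attribute names instead of the macro template, only the generated name changes between iterations while the summed attributes stay the same, which matches the symptom described above.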
I hope this clarifies everything for you.
Best Regards,
David
-
Hi David,
I understand your point, but I am still having trouble solving my problem.
When I use att%{i_1}, att%{i_2} and att%{i_3} in the parameter of my generate aggregation operator, it does not give me any result. The attributes are generated as att1, att2, and so on (in the operator's parameter, not in the result), and these do not exist in my data. At first I thought that att1 would automatically select the first attribute, but it seems it does not. Btw, my first attribute is ID, which is not supposed to be aggregated; I want my aggregation to start from the second attribute. I have also used the macro names i_1, i_2, i_3 like yours, and the same functions as well.
I can send you my data, but I can't find any way to attach it.
regards
jony
-
As I told you earlier, I get 44 results, but all of them hold the same values with different names when I select my first three attributes in the generate attribute parameter. But when I enter att%{i_1}, att%{i_2} and att%{i_3}, I get no result, just one table with the row numbers in it. So I think my iterations work and everything is fine; I am just not able to make the system understand that att%{i_1}, att%{i_2} and att%{i_3} refer to my attributes. It seems the system cannot resolve my attributes from att%{i_1}, att%{i_2} and att%{i_3}. I don't know how to solve this; it seems I am just one or two steps away, but I cannot find them.
regards
Jony
-
I'm not quite sure the solution to your problem is this simple, but could it be that you're just using the wrong attribute names?
In my example the attributes have the default names, which are att1, att2, ..., and for your data they are time_1, time_2, ...
So you have to use these names in the selection for your aggregation. Instead of selecting att%{i_1}, it should read time_%{i_1}.
RapidMiner refers to the actual attribute names, not to some metadata. As a consequence, you don't need to worry about the ID attribute, because it has a different name.
regards,
David
-
The problem is that my attributes start with 'Time 12-31-22', then 'Time 01-01-00', then 'Time 01-01-02', then 'Time 01-01-04'.. How do I do it then? It would be fine if my attributes were just time_1, time_2 and so on, but these are only even numbers and they start with 12-31-22.
-
Handling even numbers is quite simple: you just adjust the numbers generated in your macros i_1, i_2 and i_3 according to this scheme:
i_1 is (%{iteration}*3)*2
i_2 is (%{iteration}*3+1)*2 and
i_3 is (%{iteration}*3+2)*2
so you get all even numbers in blocks of 3, starting with 0.
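As a quick check of this scheme, the following plain Python sketch lists the numbers produced per iteration. The helper names `even_block` and `attribute_names` are hypothetical, and both the 'Time 01-01-' prefix and the two-digit zero padding are assumptions taken from the attribute names quoted in this thread.

```python
def even_block(iteration):
    """Values of i_1, i_2, i_3 for one iteration under the even-number scheme."""
    base = iteration * 3
    return (base * 2, (base + 1) * 2, (base + 2) * 2)


def attribute_names(iteration, prefix="Time 01-01-"):
    """Assumed naming: fixed prefix plus a zero-padded two-digit number."""
    return [f"{prefix}{n:02d}" for n in even_block(iteration)]


for iteration in range(4):
    print(iteration, even_block(iteration), attribute_names(iteration))
```

Iterations 0 to 3 cover the even numbers 0 to 22. From iteration 4 onward the scheme keeps counting (24, 26, 28, ...), so names that wrap around after 22 are not produced by these formulas alone.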
The case of the first attribute, named 12-31-22, I would just handle separately.
-
The attribute names also change in the middle part and the first part, and those parts are not even numbers.
For example, after Time 01-01-22 I have Time 01-02-00, Time 01-02-02, and after Time 01-02-22 there is Time 01-03-00, etc. How do I handle those?
-
You could try the 'Rename by Generic Names' operator to change all of your field names from Time 01-01-01, Time 01-01-02, etc. into att_1, att_2, etc.
Then David's suggested process should work.
(A word of caution: make sure that all of your columns are in the order you want before you start, otherwise once they are replaced by generic names it won't be clear which column is which.)
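One way to keep track of which column is which is to record the position-based mapping before renaming. The sketch below is plain Python for illustration only, not a RapidMiner operator: `generic_names` is a hypothetical helper, the `att_` prefix follows the names mentioned above, and the sample column names are the ones quoted earlier in the thread.

```python
def generic_names(original, prefix="att_"):
    """Map generic names (att_1, att_2, ...) to original names by column position."""
    return {f"{prefix}{i}": name for i, name in enumerate(original, start=1)}


original = ["Time 12-31-22", "Time 01-01-00", "Time 01-01-02"]
mapping = generic_names(original)

# Build a name for a summed column from the original names of its parts.
parts = ["att_1", "att_2", "att_3"]
sum_name = "sum_" + "_".join(mapping[g] for g in parts)
print(sum_name)
```

Because the mapping depends only on column order, it is only meaningful if the order is fixed before the rename, which is the point of the caution above.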
Hope that helps.
-
OK, but in the output I need the names to match the input. Will it be possible to get the original names back?
-
Yes, if you add an ID you can do your processing on one thread with generic attribute names and then join it with your original dataset once this is completed.
Is that what you meant?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.012">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_direct_mailing_data" compatibility="5.3.012" expanded="true" height="60" name="Generate Direct Mailing Data" width="90" x="45" y="30"/>
<operator activated="true" class="generate_id" compatibility="5.3.012" expanded="true" height="76" name="Generate ID" width="90" x="112" y="210"/>
<operator activated="true" class="rename_by_generic_names" compatibility="5.3.012" expanded="true" height="76" name="Rename by Generic Names" width="90" x="179" y="120"/>
<operator activated="true" class="subprocess" compatibility="5.3.012" expanded="true" height="76" name="YourProcessingHere" width="90" x="246" y="30">
<process expanded="true">
<connect from_port="in 1" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="join" compatibility="5.3.012" expanded="true" height="76" name="Join" width="90" x="313" y="165">
<list key="key_attributes"/>
</operator>
<connect from_op="Generate Direct Mailing Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Rename by Generic Names" to_port="example set input"/>
<connect from_op="Rename by Generic Names" from_port="example set output" to_op="YourProcessingHere" to_port="in 1"/>
<connect from_op="Rename by Generic Names" from_port="original" to_op="Join" to_port="right"/>
<connect from_op="YourProcessingHere" from_port="out 1" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Hi,
This helps, but the thing is that after aggregating with the new names I get aggregated columns named, for example, sum_1_2_3, whereas in the output I want column names like sum_time-12-31-22_time-01-01-00_time-01-01-04 and so on. I mean that I also want my original names in the result. If I replace the names and then do the aggregation, the result (obviously) carries the new names. How do I replace those with the original names?