Find the lowest value of an attribute in relation to the other
Hello.
I'm just getting started to learn Radpidminer and would like some help to my first project.
I've managed to import my data into RM and made a predictive process on some training data and can predict my unlabeled data based on the trained data.
Now, I've trained my model with the raw training data, but I had realized that I need to cleanse the data first. See I only want my model to be trained with the data with the lowest possible value of the column B.
So here is my setup;
Column A - Labeled data
Column B - Data where i only want the lowest value
Coumn C - Data
Example:
First row
Column A: 100
Column B: 20
Column C: 50
Second row
Column A: 100
Column B: 19
Column C: 50
Now, since A+C is matching I would only want my model to be trained with the second row, since the first row has a higher value of column B.
Is this achiveable in Rapidminer?
I do not hope that my explanation is too confusing
Best Answers
-
I only want to maintain the lowest value if the other columns are the same. If I invent an example with a car, it might make more sense .
I have the following data from the car in my database;
Speed, gear selection, fuel consumption.
I've got alot of rows with data from the car, for example;
100 km / h, 5th gear, 0.3 l / km
100 km / h, 4th gear, 0.35 km / l
Gear selection is my label. In this case, I keep only the row with the 5th gear because it gives me the lowest fuel consumption.
But if I have a row with the following data;
95 km / h, 4th gear, 0.35 km / l
I would still keep this row, becuase now speed has changed.
0 -
Mathias,
have a look at the attached process, i think this is what you want to do.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="7.2.001" expanded="true" height="82" name="Subprocess" width="90" x="45" y="85">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
<list key="attribute_values">
<parameter key="speed" value="100"/>
<parameter key="gear" value="5"/>
<parameter key="fuel consumption" value="0.3"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="136">
<list key="attribute_values">
<parameter key="speed" value="100"/>
<parameter key="gear" value="4"/>
<parameter key="fuel consumption" value="0.35"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="238">
<list key="attribute_values">
<parameter key="speed" value="95"/>
<parameter key="gear" value="4"/>
<parameter key="fuel consumption" value="0.35"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="append" compatibility="7.2.001" expanded="true" height="124" name="Append" width="90" x="179" y="85"/>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
<connect from_op="Append" from_port="merged set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Get sample data</description>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
<operator activated="true" class="aggregate" compatibility="7.2.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="85">
<list key="aggregation_attributes">
<parameter key="fuel consumption" value="minimum"/>
</list>
<parameter key="group_by_attributes" value="speed"/>
</operator>
<operator activated="true" class="join" compatibility="7.2.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="minimum(fuel consumption)" value="fuel consumption"/>
<parameter key="speed" value="speed"/>
</list>
</operator>
<connect from_op="Subprocess" from_port="out 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>0 -
I think there may be an even easier solution here, using only one aggregate operator. Just group by the two attributes that define a unique case and set the third attribute to capture the minimum value. Take a look at the attached sample process and dataset and see if it works.
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.2.002" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
<parameter key="csv_file" value="C:\Users\brian\Downloads\dedupe test.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Att_A.true.integer.attribute"/>
<parameter key="1" value="Att_B.true.integer.attribute"/>
<parameter key="2" value="Att_C.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
<list key="aggregation_attributes">
<parameter key="Att_C" value="minimum"/>
<parameter key="Att_C" value="count"/>
</list>
<parameter key="group_by_attributes" value="Att_A|Att_B"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<connect from_op="Aggregate" from_port="original" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>0
Answers
-
Hi Mathias,
welcome to the community!
For sure it is doable, but i am not totally sure what you want do.
For your given example, i guess you would first calculate the min per example (horizontal aggregation)
so you would get
id min(A,B,C)
1 20
2 19
And then take the min of the mins (vertical aggregation), which yields to take the 2?
If yes, its a combination of: Generate Aggregation to get min(A,B,C), Aggregation to get min of min(A,B,C) and i think two joins.
Best,
Martin
0 -
I only want to maintain the lowest value if the other columns are the same. If I invent an example with a car, it might make more sense .
I have the following data from the car in my database;
Speed, gear selection, fuel consumption.
I've got alot of rows with data from the car, for example;
100 km / h, 5th gear, 0.3 l / km
100 km / h, 4th gear, 0.35 km / l
Gear selection is my label. In this case, I keep only the row with the 5th gear because it gives me the lowest fuel consumption.
But if I have a row with the following data;
95 km / h, 4th gear, 0.35 km / l
I would still keep this row, becuase now speed has changed.
0 -
Mathias,
have a look at the attached process, i think this is what you want to do.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="7.2.001" expanded="true" height="82" name="Subprocess" width="90" x="45" y="85">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
<list key="attribute_values">
<parameter key="speed" value="100"/>
<parameter key="gear" value="5"/>
<parameter key="fuel consumption" value="0.3"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="136">
<list key="attribute_values">
<parameter key="speed" value="100"/>
<parameter key="gear" value="4"/>
<parameter key="fuel consumption" value="0.35"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="238">
<list key="attribute_values">
<parameter key="speed" value="95"/>
<parameter key="gear" value="4"/>
<parameter key="fuel consumption" value="0.35"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="append" compatibility="7.2.001" expanded="true" height="124" name="Append" width="90" x="179" y="85"/>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
<connect from_op="Append" from_port="merged set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Get sample data</description>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
<operator activated="true" class="aggregate" compatibility="7.2.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="85">
<list key="aggregation_attributes">
<parameter key="fuel consumption" value="minimum"/>
</list>
<parameter key="group_by_attributes" value="speed"/>
</operator>
<operator activated="true" class="join" compatibility="7.2.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="minimum(fuel consumption)" value="fuel consumption"/>
<parameter key="speed" value="speed"/>
</list>
</operator>
<connect from_op="Subprocess" from_port="out 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>0 -
I think there may be an even easier solution here, using only one aggregate operator. Just group by the two attributes that define a unique case and set the third attribute to capture the minimum value. Take a look at the attached sample process and dataset and see if it works.
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.2.002" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
<parameter key="csv_file" value="C:\Users\brian\Downloads\dedupe test.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Att_A.true.integer.attribute"/>
<parameter key="1" value="Att_B.true.integer.attribute"/>
<parameter key="2" value="Att_C.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
<list key="aggregation_attributes">
<parameter key="Att_C" value="minimum"/>
<parameter key="Att_C" value="count"/>
</list>
<parameter key="group_by_attributes" value="Att_A|Att_B"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<connect from_op="Aggregate" from_port="original" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>0 -
Thank you for your help - just what I was looking for
0