"Cluster/group by attribute ranges and classification density"
Marin
New Altair Community Member
I have a transactional data set with 20 attributes, a1 to a20 (numerical, nominal, binominal), and 4000 examples. Attribute a20 is the classification (binominal, with value 1 or 0). I have to group/cluster the data in the following way:
1. 3 to 6 groups/clusters (it can be a fixed number, e.g. 5)
2. groups have to be sorted ranges of numeric attribute a1. Values of a1 are between 100 and 10000. (sounds like discretization operators can be used)
3. the main criterion is the min/max density of negatively classified examples (those with a20 equal to 0) in a given range.
The example set has roughly 10% of examples classified as negative. One example of a solution would be:
C1, group where a1 ∈ [100, 1000], density of negative = 30% (if there were 500 examples, 150 would be negative)
C2, group where a1 ∈ (1000, 2000], density of negative = 5% (if there were 100 examples, 5 would be negative)
C3, group where a1 ∈ (2000, 3000], density of negative = x (if there were y examples, x*y would be negative)
C4, group where a1 ∈ (3000, 6000], density of negative = x (if there were y examples, x*y would be negative)
C5, group where a1 ∈ (6000, 10000], density of negative = x (if there were y examples, x*y would be negative)
The goal is to group the examples into a few ranges of a1 such that each group contains as many or as few negative examples as possible.
Which approach/process could solve this grouping/discretization problem? I have been unsuccessfully trying to cluster it for some time now.
Answers
-
Hi Marin!
Hope you guys had a ball in Dortmund; sorry I wasn't able to attend. Consider yourselves spared! As to your problem, I'm sure there are better ways to solve this, but let the following kick off proceedings...
I've looked at this as a regression optimisation problem: you need to minimise the difference between the average value of a20 in each group and the best it could be, which is 1 in all cases. So you use the per-group averages as your prediction and check the difference, like this...
Now you can get very fancy about this, change the type of binning and so on, but the thing to notice is how simple it is to do optimisations in RM and, more importantly, how easy it is to alter them!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.10" expanded="true" name="Process">
<process expanded="true" height="358" width="748">
<operator activated="true" class="generate_data" compatibility="5.0.10" expanded="true" height="60" name="Generate Data" width="90" x="45" y="210">
<parameter key="number_of_attributes" value="1"/>
<parameter key="attributes_lower_bound" value="100.0"/>
<parameter key="attributes_upper_bound" value="10000.0"/>
</operator>
<operator activated="true" class="generate_copy" compatibility="5.0.10" expanded="true" height="76" name="Generate Range Attribute" width="90" x="179" y="210">
<parameter key="attribute_name" value="att1"/>
<parameter key="new_name" value="att1_range"/>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="5.0.10" expanded="true" height="112" name="Optimise Range Attribute" width="90" x="313" y="210">
<list key="parameters">
<parameter key="Fill range attribute.number_of_bins" value="[2.0;10;10;linear]"/>
</list>
<process expanded="true" height="358" width="744">
<operator activated="true" class="discretize_by_bins" compatibility="5.0.10" expanded="true" height="94" name="Fill range attribute" width="90" x="108" y="29">
<parameter key="create_view" value="true"/>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="att1_range"/>
<parameter key="number_of_bins" value="10"/>
<parameter key="range_name_type" value="short"/>
</operator>
<operator activated="true" class="aggregate" compatibility="5.0.10" expanded="true" height="76" name="Aggregate" width="90" x="246" y="30">
<list key="aggregation_attributes">
<parameter key="label" value="average"/>
</list>
<parameter key="group_by_attributes" value="att1_range"/>
</operator>
<operator activated="true" class="generate_empty_attribute" compatibility="5.0.10" expanded="true" height="76" name="Generate Meta label" width="90" x="380" y="30">
<parameter key="name" value="max"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.0.10" expanded="true" height="94" name="Set Meta Label to max" width="90" x="112" y="165">
<parameter key="default" value="value"/>
<list key="columns"/>
<parameter key="replenishment_value" value="1"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.0.10" expanded="true" height="76" name="Average is Prediction" width="90" x="246" y="165">
<parameter key="name" value="average(label)"/>
<parameter key="target_role" value="prediction"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.0.10" expanded="true" height="76" name="Meta Label is 1" width="90" x="380" y="165">
<parameter key="name" value="max"/>
<parameter key="target_role" value="label"/>
</operator>
<operator activated="true" class="performance_regression" compatibility="5.0.10" expanded="true" height="76" name="Performance is difference!" width="90" x="581" y="30">
<parameter key="main_criterion" value="absolute_error"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="absolute_error" value="true"/>
</operator>
<connect from_port="input 1" to_op="Fill range attribute" to_port="example set input"/>
<connect from_op="Fill range attribute" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Generate Meta label" to_port="example set input"/>
<connect from_op="Generate Meta label" from_port="example set output" to_op="Set Meta Label to max" to_port="example set input"/>
<connect from_op="Set Meta Label to max" from_port="example set output" to_op="Average is Prediction" to_port="example set input"/>
<connect from_op="Average is Prediction" from_port="example set output" to_op="Meta Label is 1" to_port="example set input"/>
<connect from_op="Meta Label is 1" from_port="example set output" to_op="Performance is difference!" to_port="labelled data"/>
<connect from_op="Performance is difference!" from_port="performance" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate Range Attribute" to_port="example set input"/>
<connect from_op="Generate Range Attribute" from_port="example set output" to_op="Optimise Range Attribute" to_port="input 1"/>
<connect from_op="Optimise Range Attribute" from_port="performance" to_port="result 1"/>
<connect from_op="Optimise Range Attribute" from_port="parameter" to_port="result 2"/>
<connect from_op="Optimise Range Attribute" from_port="result 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Have fun..
;D
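For readers without RapidMiner at hand, the idea behind the process above can be roughly sketched in Python with NumPy. This is only an illustration under assumptions, not a translation of the operators: the function names are made up here, the bins are equal-width over the fixed range 100 to 10000 rather than the observed minimum and maximum, empty bins are simply ignored, and the candidate bin counts 2 to 10 stand in for the grid in the process.

```python
import numpy as np

def bin_averages(a1, label, n_bins, lo=100.0, hi=10000.0):
    """Average label in each of n_bins equal-width bins over [lo, hi]."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # Comparing against the interior edges yields bin indices 0 .. n_bins-1.
    idx = np.digitize(a1, edges[1:-1])
    sums = np.bincount(idx, weights=label, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    means = np.full(n_bins, np.nan)  # empty bins stay NaN
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return edges, means, counts

def absolute_error_to_one(means):
    """Mean absolute difference between the bin averages and the target 1."""
    return float(np.nanmean(np.abs(1.0 - means)))

def best_bin_count(a1, label, candidates=range(2, 11)):
    """Try each candidate bin count and keep the one with the smallest error."""
    scores = {k: absolute_error_to_one(bin_averages(a1, label, k)[1])
              for k in candidates}
    return min(scores, key=scores.get), scores
```

Calling `best_bin_count(a1, a20)` on two NumPy arrays returns the chosen number of bins together with the error for every candidate, which plays the role of the performance and parameter outputs of the grid optimisation.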
-
Ho, ho. Lookie who's here. You bet we had an awesome time (check it out at http://s786.photobucket.com/albums/yy147/MarinM/RCOMM 2010/ ). Saying we should consider ourselves spared just means you haven't met us all yet.
Your approach gave me good ideas, and I thank you for it. I forgot to mention that it won't work on this problem as it stands: 1 in all cases is not the optimal (best they could be) solution, because this is a binominal problem (a maximum of one class in one bin means it should be at a minimum in the adjacent bin). The best solution is one where the averages are as close to 1 as possible in odd bins and as close to 0 as possible in even bins (or vice versa). Anyway, I used your process in a similar fashion to obtain a good solution: I created 400 bins, and by logging I could see the average for each of them. Merging them, I got 40 bins (plotting these values made it a simple task, since all bins are the same size). Afterwards I repeated the process and manually made 5 bins. Thanks for your time.
Cheerz,
Marin
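The fine-binning-then-merging workflow described in this reply could look roughly like the following in Python with NumPy. It is a sketch under assumptions rather than the process actually used: the function names are hypothetical, the fine bins are equal-width over 100 to 10000, merging simply sums groups of adjacent fine bins, and the final ranges are passed in by hand with right-closed intervals as in the example from the question.

```python
import numpy as np

def fine_bin_counts(a1, a20, n_bins=400, lo=100.0, hi=10000.0):
    """Count all examples and negative examples (a20 == 0) per fine bin."""
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.digitize(a1, edges[1:-1])
    is_negative = (np.asarray(a20) == 0).astype(float)
    total = np.bincount(idx, minlength=n_bins)
    negative = np.bincount(idx, weights=is_negative, minlength=n_bins)
    return edges, total, negative

def merge_adjacent(edges, total, negative, factor=10):
    """Merge every `factor` neighbouring bins; len(total) must divide evenly."""
    n = len(total) // factor
    return (edges[::factor],
            total.reshape(n, factor).sum(axis=1),
            negative.reshape(n, factor).sum(axis=1))

def density(total, negative):
    """Share of negatives per bin; NaN where a bin has no examples."""
    out = np.full(len(total), np.nan)
    return np.divide(negative, total, out=out, where=total > 0)

def density_in_ranges(a1, a20, edges):
    """Negative density for hand-picked ranges [e0, e1], (e1, e2], ..."""
    edges = np.asarray(edges, dtype=float)
    idx = np.digitize(a1, edges[1:-1], right=True)  # right-closed bins
    n = len(edges) - 1
    is_negative = (np.asarray(a20) == 0).astype(float)
    total = np.bincount(idx, minlength=n)
    negative = np.bincount(idx, weights=is_negative, minlength=n)
    return density(total, negative)
```

A typical run would call `fine_bin_counts`, merge with `factor=10` to go from 400 to 40 bins, plot `density(...)` for the merged bins, and then pass the visually chosen cut points to `density_in_ranges`.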