Bagging for Imbalanced dataset
Hi everybody ,
I have a very imbalanced dataset (t: 14% , f:86%) , I want to use bagging in a way that I can sample roughly 1/3 of f class and union it with true class and train naive bayes on it ,
I have two question :
1.How can I do this kind of sampling (like what happens in bagging tool in rapid miner but not sampling the whole dataset but only the major class)
2.what type of naive bayes do you suggest me to use inside baggin ? because there are different implementation of various types of naive bayes in rapidminer ? should it be reweightable ? should it be updateable ?
Thanks
I have a very imbalanced dataset (t: 14% , f:86%) , I want to use bagging in a way that I can sample roughly 1/3 of f class and union it with true class and train naive bayes on it ,
I have two question :
1.How can I do this kind of sampling (like what happens in bagging tool in rapid miner but not sampling the whole dataset but only the major class)
2.what type of naive bayes do you suggest me to use inside baggin ? because there are different implementation of various types of naive bayes in rapidminer ? should it be reweightable ? should it be updateable ?
Thanks
Tagged:
0
Answers
-
Sort by label and split to get 2 data sets with only "t" or only "f".
Use the sampling operator to get data sets of the same size.
Append back together and use the normal Naive Bayes.
On a side note, Naive Bayes is perfectly capable of dealing with skewed data sets.
So this procedure is a bit weird.
The other Naive Bayes implementations are likely to perform worse, because they are designed for different purposes.
Especially the up-datable one. If you wish you can use W-NaiveBayes this one should be as good as the Naive Bayes from Rapid miner.0 -
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="390" width="547">
<operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Rock" width="90" x="180" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=Rock"/>
</operator>
<operator activated="true" class="sample" compatibility="5.2.008" expanded="true" height="76" name="SampleR" width="90" x="313" y="30">
<parameter key="sample_size" value="50"/>
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Mine" width="90" x="179" y="120">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=Rock"/>
<parameter key="invert_filter" value="true"/>
</operator>
<operator activated="true" class="sample" compatibility="5.2.008" expanded="true" height="76" name="SampleM" width="90" x="313" y="120">
<parameter key="sample_size" value="50"/>
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="112" y="210"/>
<operator activated="true" class="naive_bayes" compatibility="5.2.008" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="210"/>
<connect from_op="Retrieve" from_port="output" to_op="Rock" to_port="example set input"/>
<connect from_op="Rock" from_port="example set output" to_op="SampleR" to_port="example set input"/>
<connect from_op="Rock" from_port="original" to_op="Mine" to_port="example set input"/>
<connect from_op="SampleR" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Mine" from_port="example set output" to_op="SampleM" to_port="example set input"/>
<connect from_op="SampleM" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0 -
Thanks for the reply ,
I know what you did but it's not exactly what I meant ,
ok , let's figure it out for this example :
T class : 14%
F class : 86%
I want to split the F class into 3 classes , each 29% (F1,F2,F3) and I want to train our algorithm (e.g. Naive bayes or DTree) over (T U F1) & (T U F2) & (T U F3) and test it .... (the rest is just like how it is done in Bagging , but the problem with bagging is that it splits the WHOLE dataset, it doesn't keep part of a dataset and split the rest like what I explained.
0 -
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="450" width="567">
<operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Rock" width="90" x="180" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=Rock"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="112" name="Multiply" width="90" x="447" y="30"/>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Mine" width="90" x="45" y="120">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=Rock"/>
<parameter key="invert_filter" value="true"/>
</operator>
<operator activated="true" class="split_data" compatibility="5.2.008" expanded="true" height="112" name="Split Data" width="90" x="45" y="210">
<enumeration key="partitions">
<parameter key="ratio" value="0.34"/>
<parameter key="ratio" value="0.33"/>
<parameter key="ratio" value="0.33"/>
</enumeration>
</operator>
<operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append (3)" width="90" x="179" y="300"/>
<operator activated="true" class="naive_bayes" compatibility="5.2.008" expanded="true" height="76" name="Naive Bayes (3)" width="90" x="313" y="300"/>
<operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append (2)" width="90" x="179" y="210"/>
<operator activated="true" class="naive_bayes" compatibility="5.2.008" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="313" y="210"/>
<operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="120"/>
<operator activated="true" class="naive_bayes" compatibility="5.2.008" expanded="true" height="76" name="Naive Bayes" width="90" x="313" y="120"/>
<connect from_op="Retrieve" from_port="output" to_op="Rock" to_port="example set input"/>
<connect from_op="Rock" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Rock" from_port="original" to_op="Mine" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Append (2)" to_port="example set 1"/>
<connect from_op="Multiply" from_port="output 3" to_op="Append (3)" to_port="example set 1"/>
<connect from_op="Mine" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Append" to_port="example set 2"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Append (2)" to_port="example set 2"/>
<connect from_op="Split Data" from_port="partition 3" to_op="Append (3)" to_port="example set 2"/>
<connect from_op="Append (3)" from_port="merged set" to_op="Naive Bayes (3)" to_port="training set"/>
<connect from_op="Naive Bayes (3)" from_port="model" to_port="result 3"/>
<connect from_op="Append (2)" from_port="merged set" to_op="Naive Bayes (2)" to_port="training set"/>
<connect from_op="Naive Bayes (2)" from_port="model" to_port="result 2"/>
<connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="90"/>
<portSpacing port="sink_result 2" spacing="72"/>
<portSpacing port="sink_result 3" spacing="72"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
0 -
The problem is 3 models are resulted here, but I want to have only 1 model like what happens in bagging , bagging 1 model is resulted which is aggregation of inner models
0 -
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="450" width="567">
<operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Rock" width="90" x="180" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=Rock"/>
</operator>
<operator activated="true" class="remember" compatibility="5.2.008" expanded="true" height="60" name="Remember" width="90" x="313" y="30">
<parameter key="name" value="R"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Mine" width="90" x="45" y="120">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=Rock"/>
<parameter key="invert_filter" value="true"/>
</operator>
<operator activated="true" class="bagging" compatibility="5.2.008" expanded="true" height="76" name="Bagging" width="90" x="293" y="173">
<parameter key="iterations" value="3"/>
<process expanded="true" height="450" width="435">
<operator activated="true" class="recall" compatibility="5.2.008" expanded="true" height="60" name="Recall" width="90" x="45" y="30">
<parameter key="name" value="R"/>
<parameter key="io_object" value="ExampleSet"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="180" y="30"/>
<operator activated="true" class="naive_bayes" compatibility="5.2.008" expanded="true" height="76" name="Naive Bayes" width="90" x="315" y="30"/>
<connect from_port="training set" to_op="Append" to_port="example set 2"/>
<connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Rock" to_port="example set input"/>
<connect from_op="Rock" from_port="example set output" to_op="Remember" to_port="store"/>
<connect from_op="Rock" from_port="original" to_op="Mine" to_port="example set input"/>
<connect from_op="Mine" from_port="example set output" to_op="Bagging" to_port="training set"/>
<connect from_op="Bagging" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="72"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0