Stratification: How to get the same number of examples for each class?
JohnQuest
New Altair Community Member
I have a data set with two labels: label A (6000 items) and label B (500 items).
I want to run a 10-fold cross-validation, but with sampling. For example, the 1st fold has 600 examples of label A and 50 of label B. We want to sample 50 examples of label A and create a new 1st fold with 50 of label A and 50 of label B. The same process applies to the other 8 training folds; we then use the 9 sampled folds together for training and 1 fold of non-sampled data for testing. The process loops through the entire data set and collects the performance.
So far I can only do the above one fold at a time, which is time-consuming. I was hoping to set up a process that does it automatically.
Thanks in advance for your support.
John Quest
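Outside RapidMiner, the procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not a RapidMiner process: the function name `balanced_cv_splits` is hypothetical, folds are built by stratified round-robin assignment, and every class is assumed to have at least `k` examples.

```python
import random
from collections import Counter, defaultdict


def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Folds are stratified. Each training fold is downsampled to the size of
    its smallest class; the test fold is left unsampled.
    Assumption: every class has at least k examples.
    """
    rng = random.Random(seed)

    # Group example indices by class and shuffle within each class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)

    # Deal each class round-robin into k folds (stratification).
    folds = [defaultdict(list) for _ in range(k)]
    for label, indices in by_class.items():
        for pos, idx in enumerate(indices):
            folds[pos % k][label].append(idx)

    for test_fold in range(k):
        train_idx = []
        for f, fold in enumerate(folds):
            if f == test_fold:
                continue
            # Downsample every class in this fold to the fold's minority size.
            n = min(len(members) for members in fold.values())
            for members in fold.values():
                train_idx.extend(rng.sample(members, n))
        test_idx = [i for members in folds[test_fold].values() for i in members]
        yield train_idx, test_idx


# The numbers from the question: 6000 examples of A, 500 of B.
labels = ["A"] * 6000 + ["B"] * 500
for train_idx, test_idx in balanced_cv_splits(labels, k=10):
    train_counts = Counter(labels[i] for i in train_idx)
    test_counts = Counter(labels[i] for i in test_idx)
```

With these numbers every training set holds 450 examples of each label, and every test fold keeps its original 600 of label A and 50 of label B.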
Answers
Hi,
There is no need to repeat your question. What is the difference between doing what you describe and using standard XValidation with stratified sampling, applied on an example set with 50% label A and 50% label B? If you post your XML people will take more interest.
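One way to look at this question, using only the numbers from the original post: the two approaches give the same training sets in size, but differ in what the test fold looks like. The small arithmetic sketch below is illustrative only; the function name `fold_sizes` is hypothetical and the class sizes are assumed to be divisible by the number of folds.

```python
def fold_sizes(n_a, n_b, k, balance_before_split):
    """Return per-class (A, B) counts for the training and test part of one fold."""
    minority = min(n_a, n_b)
    if balance_before_split:
        # Downsample the whole set to 50/50 first, then run stratified k-fold.
        test = (minority // k, minority // k)
    else:
        # Split first; only the training folds are downsampled.
        test = (n_a // k, n_b // k)
    per_fold = minority // k
    train = ((k - 1) * per_fold, (k - 1) * per_fold)
    return {"train": train, "test": test}


balanced_first = fold_sizes(6000, 500, 10, balance_before_split=True)
sampled_in_fold = fold_sizes(6000, 500, 10, balance_before_split=False)
```

Here `balanced_first["test"]` is `(50, 50)`, while `sampled_in_fold["test"]` is `(600, 50)`, so only the second variant measures performance on the original class distribution.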
0 -
My setup is as follows. I am wondering how to make the "Sample" operator automatically set its sample size to the number of examples returned by the "Filter Examples" operator that uses the parameter setting correctness=correct.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="386" width="681">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="38" y="77">
<parameter key="repository_entry" value="../data talbe/157000_85"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="75">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="back_freq|back_avg_distance|candidate_len|freq_keyword|snippets|suppE|suppC|keyword_id_ch|correctness|roverd|ranking|dis|lift|front_freq"/>
</operator>
<operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="75">
<process expanded="true" height="431" width="373">
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="112" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=wrong"/>
</operator>
<operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="246" y="30">
<parameter key="sample_size" value="5661"/>
</operator>
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="165">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=correct"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="246" y="165"/>
<operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="300"/>
<connect from_port="training" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="414" width="373">
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="51" y="43">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="227" y="44">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_port="result 2"/>
<connect from_op="Validation" from_port="training" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
0 -
Hi,
This clearly goes far beyond the scope of this board (and actually of this forum as well). A process like this isn't built in a minute.
However, I have created a process for the desired task and uploaded it with the Community Extension of RapidMiner under the name "Same Number of Examples per Class (Stratification; Loops and Macros)". Just download and install the Community Extension and search for the process (search this forum for more information; some information can also be found in my signature below).
Cheers,
Ingo
0 -
Greetings O Pointy One,
You beat me to it! Drat! Can we not have a badge/smiley pointing folks there, lest we have to repeat ourselves (this exact question about balancing data comes up repeatedly)?
0 -
I might have been faster, but the solution can still be optimized ;D A good idea would be to extract the label automatically without having the user define it via a macro. The second thing is that I lose one example in the minority class ::)
Anyway, I moved the discussion into this board here and made it also sticky so that we can easily link to this one in future.
Cheers,
Ingo
0 -
Hi,
I think this covers the points you made - must say I found the 'Append' operator placement a challenge, still it does show the world of collections at work.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="335" width="791">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="120">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="120">
<parameter key="macro" value="exs"/>
</operator>
<operator activated="true" class="loop_values" expanded="true" height="76" name="Loop Values" width="90" x="313" y="120">
<parameter key="attribute" value="class"/>
<process expanded="true" height="453" width="809">
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="141" y="94">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="class=%{loop_value}"/>
</operator>
<operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro (2)" width="90" x="313" y="75">
<parameter key="macro" value="subexs"/>
</operator>
<operator activated="true" class="generate_macro" expanded="true" height="76" name="Generate Macro" width="90" x="447" y="75">
<list key="function_descriptions">
<parameter key="exs" value="min(%{subexs},%{exs})"/>
</list>
</operator>
<connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
<connect from_op="Extract Macro (2)" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Generate Macro" from_port="through 1" to_port="out 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="loop_collection" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="120">
<parameter key="unfold" value="true"/>
<parameter key="parallelize_iteration" value="true"/>
<process expanded="true" height="353" width="809">
<operator activated="true" class="sample" expanded="true" height="76" name="Sample" width="90" x="269" y="53">
<parameter key="sample_size" value="%{exs}"/>
</operator>
<connect from_port="single" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" expanded="true" height="76" name="Append" width="90" x="581" y="120"/>
<connect from_op="Retrieve" from_port="output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Loop Values" to_port="example set"/>
<connect from_op="Loop Values" from_port="out 1" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
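For readers who find the macro flow hard to follow, the logic of the process above can be mirrored in a few lines of Python. This is a rough sketch, not a translation produced by RapidMiner: the function name `balance_by_smallest_class` and the `label_of` callback are hypothetical, and it assumes "Extract Macro" is left at its default of storing the number of examples.

```python
import random
from collections import defaultdict


def balance_by_smallest_class(examples, label_of, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)

    # "Extract Macro": exs starts as the total number of examples.
    exs = len(examples)

    # "Loop Values" + "Filter Examples": one subset per class value.
    subsets = defaultdict(list)
    for example in examples:
        subsets[label_of(example)].append(example)

    # "Extract Macro (2)" + "Generate Macro": exs = min(subexs, exs).
    for subset in subsets.values():
        subexs = len(subset)
        exs = min(subexs, exs)

    # "Loop Collection" + "Sample": draw exs examples from each subset.
    # "Append": concatenate the sampled subsets into one example set.
    balanced = []
    for subset in subsets.values():
        balanced.extend(rng.sample(subset, exs))
    return balanced


# Toy data: (value, label) pairs with unequal class sizes.
data = [(i, "a") for i in range(30)] + [(i, "b") for i in range(7)]
balanced = balance_by_smallest_class(data, label_of=lambda ex: ex[1])
```

The running minimum is the key step: after the first loop, `exs` holds the size of the smallest class, so the sample size no longer needs to be typed in by hand.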
0 -
Thanks, I will try it out.
John
0 -
Dear all,
I am still having some trouble understanding the last XML post by haddock; I cannot connect the macros to two outputs.
My question still concerns my XML post of 10 June. I have made it simpler and focus only on the problem this time; please see the XML below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="396" width="779">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
<parameter key="repository_entry" value="//Project CE/cep8/data talbe/157000_85"/>
</operator>
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="179" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=wrong"/>
</operator>
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="165">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=correct"/>
</operator>
<operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="380" y="30">
<parameter key="sample_size" value="1662"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="514" y="120"/>
<connect from_op="Retrieve" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
We want the "Sample (Stratified)" operator to take exactly as many examples as the "Filter Examples" operator with correctness=correct returns, instead of a hard-coded sample size. Any ideas? Thanks in advance for your support.
John
0 -
Did you try the process I have uploaded with the Community Extension? Could help here...
Cheers,
Ingo
0 -
Dear Ingo
Sorry for this question, but how do I access the files uploaded with the Community Extension? Thanks.
Best regards
John
0 -
Hi,
No problem. You can find some explanations here in the forum:
- Look here: http://rapid-i.com/rapidforum/index.php/topic,1992.0.html (first hit in the forum search for "Community Extension", by the way...)
- Or here: http://rapid-i.com/rapidforum/index.php/topic,2254.msg8888.html#msg8888
- Follow the description and the link in my signature (yes, the small text under each of my posts)
Cheers,
Ingo
0 -
Dear Ingo Mierswa
Thanks, and sorry for the late reply. Sometimes it is difficult to get back to my posts: apart from "show new replies", the only way I can find my post is through my profile. Could you tell me another way? Thanks.
I found your process named "Same Number of Examples per Class", but I cannot understand what "Extract Macro" and "loop process" do, since there is no output after "loop process". Thanks in advance for your support.
John Quest
0 -
Dear John Quest,
(I thought we were already at the stage of using "John" and "Ingo")
What exactly do you not understand? The first Loop Values operator is only used to calculate the size of the smallest class and to store that size in a macro.
Cheers,
Ingo (Mierswa)
0 -
Dear Ingo
Thanks. I may modify it into something more interesting and upload it to the community. I may need your help if I run into problems; thanks in advance for your support.
Best Regards
John