Stacking with bagging as meta learner
djafarsidik
New Altair Community Member
Hi..
I am a newbie, and I would like to ask about the Stacking method in RapidMiner.
What I want to do is build a stacking ensemble using Decision Tree and Naive Bayes as base learners, and for the meta learner I want to use Bagging with a Decision Tree in its inner process.
For validation I am using 10-fold cross validation, but I want to get the performance result for each fold in addition to the overall result.
More or less, the scheme is like this (see the attached image).
This is my design:
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.4.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.4.001" expanded="true" height="68" name="Retrieve bandung_L2" width="90" x="112" y="238">
        <parameter key="repository_entry" value="//Thesis/data/bandung_L2"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.4.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="187">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="10"/>
        <parameter key="sampling_type" value="stratified sampling"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="stacking" compatibility="9.4.001" expanded="true" height="68" name="Stacking" width="90" x="112" y="34">
            <parameter key="keep_all_attributes" value="true"/>
            <parameter key="keep_confidences" value="false"/>
            <process expanded="true">
              <operator activated="true" class="naive_bayes" compatibility="9.4.001" expanded="true" height="82" name="Naive Bayes" width="90" x="112" y="85">
                <parameter key="laplace_correction" value="true"/>
              </operator>
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.4.001" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="187">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set 1" to_op="Naive Bayes" to_port="training set"/>
              <connect from_port="training set 2" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Naive Bayes" from_port="model" to_port="base model 1"/>
              <connect from_op="Decision Tree" from_port="model" to_port="base model 2"/>
              <portSpacing port="source_training set 1" spacing="0"/>
              <portSpacing port="source_training set 2" spacing="0"/>
              <portSpacing port="source_training set 3" spacing="0"/>
              <portSpacing port="sink_base model 1" spacing="0"/>
              <portSpacing port="sink_base model 2" spacing="0"/>
              <portSpacing port="sink_base model 3" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="bagging" compatibility="9.4.001" expanded="true" height="82" name="Bagging" width="90" x="112" y="34">
                <parameter key="sample_ratio" value="0.9"/>
                <parameter key="iterations" value="10"/>
                <parameter key="average_confidences" value="true"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
                <process expanded="true">
                  <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.4.001" expanded="true" height="103" name="Decision Tree (2)" width="90" x="313" y="85">
                    <parameter key="criterion" value="gain_ratio"/>
                    <parameter key="maximal_depth" value="10"/>
                    <parameter key="apply_pruning" value="true"/>
                    <parameter key="confidence" value="0.1"/>
                    <parameter key="apply_prepruning" value="true"/>
                    <parameter key="minimal_gain" value="0.01"/>
                    <parameter key="minimal_leaf_size" value="2"/>
                    <parameter key="minimal_size_for_split" value="4"/>
                    <parameter key="number_of_prepruning_alternatives" value="3"/>
                  </operator>
                  <connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
                  <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
                  <portSpacing port="source_training set" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                </process>
              </operator>
              <connect from_port="stacking examples" to_op="Bagging" to_port="training set"/>
              <connect from_op="Bagging" from_port="model" to_port="stacking model"/>
              <portSpacing port="source_stacking examples" spacing="0"/>
              <portSpacing port="sink_stacking model" spacing="0"/>
            </process>
          </operator>
          <connect from_port="training set" to_op="Stacking" to_port="training set"/>
          <connect from_op="Stacking" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.4.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="85">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="9.4.001" expanded="true" height="82" name="Performance" width="90" x="246" y="136">
            <parameter key="manually_set_positive_class" value="false"/>
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="false"/>
            <parameter key="AUC (optimistic)" value="false"/>
            <parameter key="AUC" value="false"/>
            <parameter key="AUC (pessimistic)" value="false"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
            <parameter key="lift" value="false"/>
            <parameter key="fallout" value="false"/>
            <parameter key="f_measure" value="false"/>
            <parameter key="false_positive" value="false"/>
            <parameter key="false_negative" value="false"/>
            <parameter key="true_positive" value="false"/>
            <parameter key="true_negative" value="false"/>
            <parameter key="sensitivity" value="false"/>
            <parameter key="specificity" value="false"/>
            <parameter key="youden" value="false"/>
            <parameter key="positive_predictive_value" value="false"/>
            <parameter key="negative_predictive_value" value="false"/>
            <parameter key="psep" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve bandung_L2" from_port="output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Could you kindly advise whether my design is correct for the above purpose (design & data attached)? And is it possible to display/store the test example set for each fold? Any comment is highly appreciated.
Thank you in advance.
Answers
-
Hi @djafarsidik,
1/ Extracting the performance and the example set for each fold of the cross-validation:
It's very easy: you have to put two Store operators in the Testing part of your Cross Validation operator and use the macro %{execution_count} to name the different files.
See process_1.rmp in the attached file.
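For illustration, here is a minimal, untested sketch of how the Testing subprocess could be wired with the two Store operators. The //Thesis/results/... repository paths are placeholders you would adapt to your own repository; the Store operator passes its input through, so it can sit between Apply Model and Performance:
<!-- store the labelled test set of the current fold; %{execution_count} expands to the fold number -->
<operator activated="true" class="store" compatibility="9.4.001" expanded="true" height="68" name="Store ExampleSet" width="90" x="246" y="34">
  <parameter key="repository_entry" value="//Thesis/results/test_set_fold_%{execution_count}"/>
</operator>
<!-- store the performance vector of the current fold -->
<operator activated="true" class="store" compatibility="9.4.001" expanded="true" height="68" name="Store Performance" width="90" x="380" y="136">
  <parameter key="repository_entry" value="//Thesis/results/performance_fold_%{execution_count}"/>
</operator>
<connect from_op="Apply Model" from_port="labelled data" to_op="Store ExampleSet" to_port="input"/>
<connect from_op="Store ExampleSet" from_port="through" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_op="Store Performance" to_port="input"/>
<connect from_op="Store Performance" from_port="through" to_port="performance 1"/>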
2/ Meta-learner(s)
It's difficult for me to check the set-up of your process because I don't understand which meta-learner technique you want to use.
In your process you have a mix of Stacking and Bagging, but from my point of view the schema you shared shows a Voting meta-learner...
If that is the case, you have to use the Vote operator and put inside it:
- a Decision Tree model and
- a Naive Bayes model
See Process_2.rmp in the attached file for the implementation.
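For reference, a minimal sketch of what the Vote set-up could look like in process XML. This is untested: the learner parameters are copied from your process, and the inner port names are assumed to follow the same convention as in your Stacking base-learner panel, so check them against Process_2.rmp:
<operator activated="true" class="vote" compatibility="9.4.001" expanded="true" height="68" name="Vote" width="90" x="112" y="34">
  <process expanded="true">
    <!-- first base learner: Naive Bayes, same parameter as in the original process -->
    <operator activated="true" class="naive_bayes" compatibility="9.4.001" expanded="true" height="82" name="Naive Bayes (V)" width="90" x="112" y="34">
      <parameter key="laplace_correction" value="true"/>
    </operator>
    <!-- second base learner: Decision Tree, same parameters as in the original process -->
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.4.001" expanded="true" height="103" name="Decision Tree (V)" width="90" x="112" y="187">
      <parameter key="criterion" value="gain_ratio"/>
      <parameter key="maximal_depth" value="10"/>
      <parameter key="apply_pruning" value="true"/>
      <parameter key="confidence" value="0.1"/>
    </operator>
    <connect from_port="training set 1" to_op="Naive Bayes (V)" to_port="training set"/>
    <connect from_port="training set 2" to_op="Decision Tree (V)" to_port="training set"/>
    <connect from_op="Naive Bayes (V)" from_port="model" to_port="base model 1"/>
    <connect from_op="Decision Tree (V)" from_port="model" to_port="base model 2"/>
  </process>
</operator>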
I hope this helps,
Regards,
Lionel5
-
Like Lionel, I am a bit confused about your desired setup.
Stacking and bagging are two different approaches to ensemble modeling. Bagging builds multiple independent models using the same base learner (think of a random forest built from underlying decision trees), while Stacking uses a master model to decide between different underlying models depending on local performance. Vote is simpler than both: it lets you use multiple different models and then just combines their predictions. Your picture appears to depict the Vote approach.
Also, Naive Bayes is a very "generalized" model with no tuning parameters; it is not likely to vary significantly from subset to subset, so it is not usually used in any kind of bagging approach.
Finally, why do you want the performance for each fold separately in any case? Is your dataset very small, so that you anticipate large deviations in performance from fold to fold? Looking at fold-specific performance is not going to help you compare or evaluate the individual learners you are using in the ensemble.
-
Thank you very much @Telcontar120 and @lionelderkrikor for the advice and comments, and also for the implementation files; they are very helpful.
First, please ignore the schematic image that I sent. Actually, my idea is a kind of experiment, which I could describe like this (please give advice / corrections; see also the sketch after this list):
- The main ensemble method I want to use is stacking, which in my understanding is a combination of several base learners, each of which produces its own model and results (all methods using the same dataset).
- The results of each base model are then used as input for the next method (the meta learner).
- For the meta learner, I want to use another ensemble method, in this case bagging, which in my understanding draws different samples of the input data, trains the same algorithm on each of them, and then predicts the final answer via simple majority voting.
- I don't know whether what I have described can be simplified by using voting, or whether it is exactly voting itself. What I would like to ask is: is it possible to build such a design without using the "ready to use" Vote operator? Of course, I will also try voting for comparison.
- The validation method I want to use is k-fold cross validation. I need performance results for each fold as well as overall, and if possible I want to display the example input data for the meta-learner part for each fold, because I am required to show every detail of the process (as a kind of proof of concept).
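For the last point, here is a rough, untested sketch of what I have in mind inside the Stacking operator's meta-learning subprocess: branch the stacking examples with a Multiply operator and store one copy per fold (the repository path is a placeholder):
<operator activated="true" class="multiply" compatibility="9.4.001" expanded="true" height="82" name="Multiply" width="90" x="45" y="34"/>
<!-- store the meta-learner input of the current fold; %{execution_count} gives the fold number -->
<operator activated="true" class="store" compatibility="9.4.001" expanded="true" height="68" name="Store Stacking Input" width="90" x="179" y="136">
  <parameter key="repository_entry" value="//Thesis/results/stacking_input_fold_%{execution_count}"/>
</operator>
<connect from_port="stacking examples" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Bagging" to_port="training set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Store Stacking Input" to_port="input"/>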
-
Dear @lionelderkrikor,
Could you give advice on how to mix Stacking and Bagging correctly in RapidMiner? Or is the last design I sent correct?
Thank you.
-
@djafarsidik,
I don't see anything incorrect in your design.
However, I would follow the usual data-science methodology:
I would build all the envisaged models (simple Voting, simple Bagging, simple Stacking, and your set-up) and retain only the best one (the one with the highest performance).
Hope this helps,
Regards,
Lionel5