How to apply SMOTE upsampling

hanaabdalrahman · March 2018

Hello, sorry, I am new to data mining. I have a project on loan default classification, and my data is imbalanced.

Should I apply SMOTE upsampling before splitting the data or after? My data set is not large, only 1030 samples.

YYH · March 2018

Do you split the data for validation? You can upsample with SMOTE before the split/cross validation. If you like, you can also use stratified sampling to set aside a 10% holdout test set before SMOTE upsampling. The stratified holdout sample keeps a class distribution similar to the original imbalanced data, so it can be considered a 'good' representative set for real-life data. You may want to know how well the model performs on the upsampled, balanced set and, more importantly, how well it fits future unseen data from real life.

My example process for handling imbalanced data is attached for reference.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Customer Churn Data" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/Templates/Lift Chart/Customer Churn Data"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="8.1.001" expanded="true" height="103" name="Split Data" width="90" x="246" y="238">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.1"/>
          <parameter key="ratio" value="0.9"/>
        </enumeration>
        <parameter key="sampling_type" value="stratified sampling"/>
        <description align="center" color="transparent" colored="false" width="126">get 10% hold out for testing</description>
      </operator>
      <operator activated="true" class="operator_toolbox:smote" compatibility="1.0.000" expanded="true" height="82" name="Smote Upsampling" width="90" x="380" y="34">
        <description align="center" color="transparent" colored="false" width="126">use 90% data for model validation</description>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="8.1.001" expanded="true" height="145" name="Cross Validation with SMOTE data" width="90" x="581" y="34">
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.1.001" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="34"/>
          <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="8.1.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="8.1.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
            <parameter key="classification_error" value="true"/>
            <parameter key="AUC" value="true"/>
            <parameter key="recall" value="true"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="sensitivity" value="true"/>
            <parameter key="specificity" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">automatically split data by cross validation</description>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="8.1.001" expanded="true" height="145" name="Cross Validation without SMOTE" width="90" x="581" y="340">
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.1.001" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="34"/>
          <connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
          <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="8.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="8.1.001" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
            <parameter key="classification_error" value="true"/>
            <parameter key="AUC" value="true"/>
            <parameter key="recall" value="true"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="sensitivity" value="true"/>
            <parameter key="specificity" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="8.1.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="916" y="238">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_binominal_classification" compatibility="8.1.001" expanded="true" height="82" name="Performance on 10% holdout" width="90" x="1050" y="238">
        <parameter key="classification_error" value="true"/>
        <parameter key="AUC" value="true"/>
        <parameter key="precision" value="true"/>
        <parameter key="recall" value="true"/>
      </operator>
      <connect from_op="Retrieve Customer Churn Data" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Apply Model (3)" to_port="unlabelled data"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Smote Upsampling" to_port="exa"/>
      <connect from_op="Smote Upsampling" from_port="ups" to_op="Cross Validation with SMOTE data" to_port="example set"/>
      <connect from_op="Smote Upsampling" from_port="ori" to_op="Cross Validation without SMOTE" to_port="example set"/>
      <connect from_op="Cross Validation with SMOTE data" from_port="model" to_op="Apply Model (3)" to_port="model"/>
      <connect from_op="Cross Validation with SMOTE data" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Cross Validation without SMOTE" from_port="performance 1" to_port="result 2"/>
      <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance on 10% holdout" to_port="labelled data"/>
      <connect from_op="Performance on 10% holdout" from_port="performance" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
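
The Performance operators in the process above report classification error, recall, precision, specificity and F-measure. As a rough illustration of why those matter more than plain accuracy on imbalanced data, here is a minimal Python sketch that derives them from a confusion matrix. The function name `binary_metrics` and the counts are hypothetical and not taken from the process or from any real result.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "classification_error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }

# Hypothetical imbalanced holdout: 100 negatives, 3 positives,
# scored by a model that predicts "negative" for every example.
metrics = binary_metrics(tp=0, fp=0, fn=3, tn=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is high while recall is zero, which is the kind of result that only shows up when the holdout keeps the original class ratio.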

Cheers,

YY

kypexin · March 2018

Hi @hanaabdalrahman @yyhuang

I personally wouldn't upsample before splitting, for the simple reason that you would end up with synthetic examples in the test set, which could then distort the test results. So I would follow common sense, which suggests that upsampling is meant for artificially balancing the data used for training the model, while the model should still be tested on the original unbalanced sample to show its true performance. In this sense YY's process is the one you'd need to use.
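
For readers outside RapidMiner, the idea above (take the stratified holdout first, then upsample only the training part) can be sketched in plain Python. This is a simplified, SMOTE-style interpolation written for illustration; it is not the Operator Toolbox implementation. The helper names, the toy data and the class ratio are all assumptions.

```python
import random
from collections import Counter

def stratified_split(labels, holdout_ratio, rng):
    """Return (train_idx, holdout_idx), sampling each class separately
    so the holdout keeps roughly the original class ratio."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, holdout_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_hold = max(1, round(len(idx) * holdout_ratio))
        holdout_idx += idx[:n_hold]
        train_idx += idx[n_hold:]
    return train_idx, holdout_idx

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbours.
    Assumes at least two minority points."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

rng = random.Random(42)
# Hypothetical toy data: 1000 majority (label 0) and 30 minority (label 1) rows.
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
     + [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(30)])
y = [0] * 1000 + [1] * 30

# 1) Stratified 10% holdout first, so it contains only original rows.
train_idx, holdout_idx = stratified_split(y, 0.1, rng)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_hold, y_hold = [X[i] for i in holdout_idx], [y[i] for i in holdout_idx]

# 2) Upsample the minority class in the training part only.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
X_bal = X_train + smote_like(minority, n_new, k=5, rng=rng)
y_bal = y_train + [1] * n_new

print("holdout :", Counter(y_hold))
print("training:", Counter(y_bal))
```

A model would then be trained on `X_bal`/`y_bal` and scored on `X_hold`/`y_hold`, which still has the original imbalance.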

Thomas_Ott · March 2018

Just to chime in here, I think @kypexin's approach is correct. Upsampling during model building is the approach I would use too.

hanaabdalrahman · March 2018

Thanks, but if I apply it after splitting the data, the result is still not OK. You can see the confusion matrix before and after the split.

YYH · March 2018

Hi @hanaabdalrahman,

In your process you are doing split validation to check the performance of the DT model on test data. You will have to upsample before the split.

In my example process, I have 2 splits. The first split comes before the upsampling and sets aside the 10% holdout; the other split is inside the validation, which uses the upsampled data. With a validated model trained on balanced data, you can be more confident applying it to the 10% holdout.

In your case, you may have to upsample before split validation.

YY

fiddinyusfida · September 2019

Hi @yyhuang

Where can I find the SMOTE feature in RM 9.3.1? I searched the Marketplace but nothing showed up. Thanks

varunm1 · September 2019

Hello @fiddinyusfida

The SMOTE operator is in the "Operator Toolbox" extension, which needs to be installed from the Marketplace in RapidMiner.

fiddinyusfida · September 2019

Hi @varunm1

Thank you so much, I found it.
