[Solved] UNBALANCED DATA - Newbie Question

dynera
dynera New Altair Community Member
edited November 5 in Community Q&A
Hello All,

I am new to this forum. I have read through previous posts, but I don't understand the basic steps needed to set up a process that balances data.

I have a label with the following split: 97% = Y, 3% = N. I have used WEKA's "resample" filter in the past, which does what I would like to do in RapidMiner: essentially, it lets you expand the under-represented value to match the over-represented one. My question is, which operator(s) should I use, and with which settings?

Sorry for the rookie question,

Paul

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hey Paul,

    If you can live with both classes being sampled with replacement, you can use the Sample (Bootstrapping) operator with weighted sampling: assign a higher weight to the minority class so that its examples are more likely to be drawn. Create the weight attribute beforehand with the Generate Attributes operator, then use Set Role to assign it the role "weight". Please have a look at the attached process for the details, and come back here if you have any questions left.

    For alternatives, have a look at this thread, which discusses the topic at length: http://rapid-i.com/rapidforum/index.php/topic,2190.0.html

    All the best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="386" width="748">
          <operator activated="true" class="subprocess" compatibility="5.3.000" expanded="true" height="76" name="Create imbalanced data" width="90" x="45" y="30">
            <process expanded="true" height="506" width="821">
              <operator activated="true" class="generate_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
                <parameter key="number_examples" value="1000"/>
                <parameter key="number_of_attributes" value="1"/>
                <parameter key="attributes_lower_bound" value="0.0"/>
                <parameter key="attributes_upper_bound" value="1.0"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes" width="90" x="246" y="30">
                <list key="function_descriptions">
                  <parameter key="label" value="if(att1&gt;0.9,1,0)"/>
                </list>
              </operator>
              <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="246" y="30">
            <list key="function_descriptions">
              <parameter key="weight" value="if(label==1,10,1)"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.000" expanded="true" height="76" name="Set Role" width="90" x="380" y="30">
            <parameter key="name" value="weight"/>
            <parameter key="target_role" value="weight"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="sample_bootstrapping" compatibility="5.3.000" expanded="true" height="76" name="Sample (Bootstrapping)" width="90" x="514" y="30">
            <parameter key="sample" value="absolute"/>
            <parameter key="sample_size" value="1000"/>
          </operator>
          <connect from_op="Create imbalanced data" from_port="out 1" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Sample (Bootstrapping)" to_port="example set input"/>
          <connect from_op="Sample (Bootstrapping)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
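
    To make the idea behind the process easier to follow outside RapidMiner, here is a minimal Python sketch of weighted sampling with replacement. It is not what the Sample (Bootstrapping) operator runs internally; the function names `balancing_weight` and `weighted_bootstrap` and the 970/30 toy data are assumptions chosen to mirror the 97%/3% split from the question. It also shows one plausible way to pick the minority weight: the ratio of the class counts, so that both classes carry the same total weight.

```python
import random
from collections import Counter


def balancing_weight(n_majority, n_minority):
    """Per-example minority weight that gives both classes equal total weight
    (majority examples are assumed to keep weight 1)."""
    return n_majority / n_minority


def weighted_bootstrap(labels, weights, sample_size, seed=0):
    """Draw sample_size items with replacement; each item is picked with
    probability proportional to its weight."""
    rng = random.Random(seed)
    return rng.choices(labels, weights=weights, k=sample_size)


# Hypothetical data mirroring the question: 97% "Y", 3% "N".
labels = ["Y"] * 970 + ["N"] * 30

# 970 / 30 is roughly 32.3, so each "N" example counts about 32 times as much.
w = balancing_weight(970, 30)
weights = [w if lab == "N" else 1.0 for lab in labels]

sample = weighted_bootstrap(labels, weights, sample_size=1000)
counts = Counter(sample)
print(counts)  # the two classes should come out roughly balanced
```

    In the attached process the minority weight is fixed at 10, which suits the demo data; for a 97/3 split a weight of roughly 32 would play the role of `w` here. Because the draw is with replacement, the few minority examples are duplicated many times over, which is the trade-off mentioned above.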
  • dynera
    dynera New Altair Community Member
    Thanks Marius - Much appreciated!  ;D