[Solved] UNBALANCED DATA - Newbie Question

Hello All,

I am new to this forum. I have read through previous posts, but I still don't understand the basic steps needed to set up a process to balance data.

I have a label with the following split (97% = Y, 3% = N). In the past I have used WEKA's "resample" filter, which does what I would like to do in RapidMiner: essentially, you can expand your under-represented value to match your over-represented value. My question is, which operator(s) should I use, and with which settings?

Sorry for the rookie question,

Paul

MariusHelf

Hey Paul,

If you can live with both classes being sampled with replacement, you can use the Sample (Bootstrapping) operator with weighted sampling: assign a higher weight to the minority class so that its examples are more likely to be drawn. Create the weight attribute beforehand with the Generate Attributes operator, then assign it the role "weight" with the Set Role operator. Please have a look at the attached process for the details, and come back here if you have any questions left.
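For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of weighted sampling with replacement. It is not how the Sample (Bootstrapping) operator is implemented; the function name `weighted_bootstrap` and the toy data are assumptions made for illustration, and only the thresholds (att1 > 0.9, weight 10 vs. 1, 1000 examples) mirror the attached process.

```python
import random
from collections import Counter


def weighted_bootstrap(examples, weights, sample_size, seed=None):
    """Draw sample_size examples with replacement.

    Each example is picked with probability proportional to its weight.
    (Hypothetical helper, not a RapidMiner API.)
    """
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=sample_size)


# Toy imbalanced data: label is 1 only when att1 > 0.9.
data_rng = random.Random(0)
examples = [{"att1": data_rng.random()} for _ in range(1000)]
for ex in examples:
    ex["label"] = 1 if ex["att1"] > 0.9 else 0
    # Give the minority class a higher weight so it is drawn more often.
    ex["weight"] = 10 if ex["label"] == 1 else 1

sample = weighted_bootstrap(
    examples, [ex["weight"] for ex in examples], sample_size=1000, seed=42
)

print("before:", Counter(ex["label"] for ex in examples))
print("after: ", Counter(ex["label"] for ex in sample))
```

Because the draw is with replacement, minority examples appear several times in the result, which is the trade-off mentioned above.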

For alternatives, please have a look at this thread, which discusses the topic at length: http://rapid-i.com/rapidforum/index.php/topic,2190.0.html

All the best,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="386" width="748">
      <operator activated="true" class="subprocess" compatibility="5.3.000" expanded="true" height="76" name="Create imbalanced data" width="90" x="45" y="30">
        <process expanded="true" height="506" width="821">
          <operator activated="true" class="generate_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="attributes_lower_bound" value="0.0"/>
            <parameter key="attributes_upper_bound" value="1.0"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes" width="90" x="246" y="30">
            <list key="function_descriptions">
              <parameter key="label" value="if(att1&gt;0.9,1,0)"/>
            </list>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="246" y="30">
        <list key="function_descriptions">
          <parameter key="weight" value="if(label==1,10,1)"/>
        </list>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.000" expanded="true" height="76" name="Set Role" width="90" x="380" y="30">
        <parameter key="name" value="weight"/>
        <parameter key="target_role" value="weight"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="sample_bootstrapping" compatibility="5.3.000" expanded="true" height="76" name="Sample (Bootstrapping)" width="90" x="514" y="30">
        <parameter key="sample" value="absolute"/>
        <parameter key="sample_size" value="1000"/>
      </operator>
      <connect from_op="Create imbalanced data" from_port="out 1" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Sample (Bootstrapping)" to_port="example set input"/>
      <connect from_op="Sample (Bootstrapping)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
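The process above uses a weight of 10 for the minority class, which makes up roughly 10% of the generated examples if att1 is approximately uniform between 0 and 1 (an assumption about Generate Data). For a different split, such as the 97% / 3% split in the question, the weight has to be adapted. The sketch below works through that arithmetic; the helper names are made up for illustration and the result is the expected class share, not a guaranteed one for any single sample.

```python
def balancing_weight(majority_fraction, minority_fraction):
    """Weight for minority examples (majority weight = 1) that gives
    both classes the same total weight."""
    return majority_fraction / minority_fraction


def expected_minority_share(minority_fraction, minority_weight, majority_weight=1.0):
    """Expected fraction of minority examples in a weighted bootstrap sample."""
    minority_mass = minority_fraction * minority_weight
    majority_mass = (1.0 - minority_fraction) * majority_weight
    return minority_mass / (minority_mass + majority_mass)


# Split from the question: 97% majority, 3% minority.
w = balancing_weight(0.97, 0.03)
print(round(w, 2))                                   # about 32.33
print(round(expected_minority_share(0.03, w), 3))    # about 0.5

# Settings from the attached process: ~10% minority, weight 10.
print(round(expected_minority_share(0.10, 10), 3))   # about 0.526
```

In other words, a minority weight of about 32 would be needed to bring a 97/3 split to roughly 50/50, whereas a weight of 10 would leave the minority class at about 24% of the sample.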

dynera

Thanks Marius - Much appreciated! ;D