Random Forest (and other Trees) producing total bias result

RMBP · September 2017

Hello! I am predicting whether an example will be LOW or HIGH, based on several attributes.

The challenge I am seeing is that the classifier always predicts HIGH for every example.

I have seen this before in other programs when the model is biased, sometimes because of a variable or an algorithm setting. I have checked these and can't find the issue.

Does anyone have experience with a completely biased prediction like this, and what might be causing it? I checked with other data sets and the same result is produced. Thanks!

UPDATE: I opened a new project, fed the Titanic Training dataset directly into a default Random Forest classifier (RFC), and it predicts that none of the examples will survive. So it appears something is happening outside of the data or settings themselves...

FBT · September 2017

Is your data imbalanced? I.e. does the number of examples with one label exceed the number of examples of the other label significantly? Other reasons most likely relate to data preparation, e.g. feature selection or just plain and simple cleaning of the input data.
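A quick way to check for imbalance is to count the labels and see how well "always predict the majority class" would score. This is a minimal Python sketch using made-up labels; the function names are hypothetical and the numbers are not from your data.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: fraction of examples} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / sum(counts.values())


# Hypothetical label column: 8 HIGH and 2 LOW examples (made-up data).
labels = ["HIGH"] * 8 + ["LOW"] * 2
print(label_distribution(labels))
print(majority_baseline(labels))
```

If the majority baseline is already very high, a model that predicts a single class for everything can look accurate while learning nothing useful.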

You may want to compare the results of a regular Decision Tree on the Titanic dataset with that of a RF and then work your way up by spending some time on data preparation.

JEdward · September 2017

Can you share your process so we can have a look?

I expect this is not a problem with the classifier, but with how the data was prepared. Taking your Titanic example, passenger fare is one of the key predictors, but it is not normalized relative to the rest of the attributes, which can leave the model weighted heavily towards these numerical values.

Remember that a Random Forest is essentially a collection of decision trees, each trained on a random sample of the examples and attributes, whose predictions are combined by voting. (In a loosely similar way, a deep network can be thought of as many logistic-regression-like units stacked together.) Your feature selection is very important, and you should spend most of your time there and NOT in the modelling stage. Throw a discretization into your data prep stage and see what happens.
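To give a rough idea of what an entropy-based discretization does to a numeric attribute such as fare, here is a simplified Python sketch that picks the single cut point leaving the purest label groups on either side. It is not the Discretize by Entropy operator itself (which, I assume, applies further cuts and a stopping criterion); the function names and the toy data are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return the threshold that minimises the weighted entropy of the two bins."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the cut halfway between the two neighbouring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold


def discretize(value, threshold):
    """Map a numeric value to a nominal bin name."""
    return "low" if value <= threshold else "high"


# Made-up fares and outcomes, for illustration only.
fares = [7.25, 8.05, 7.9, 13.0, 71.3, 53.1, 26.0, 83.2]
survived = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
cut = best_cut(fares, survived)
print(cut)
print([discretize(f, cut) for f in fares])
```

After this step the trees see a nominal attribute with two values instead of a raw number, which is the kind of input the discretized process works with.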

My "default" RF settings give me a result of 2 Survive, 390 do not survive.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_random_forest" compatibility="7.6.001" expanded="true" height="82" name="Random Forest" width="90" x="380" y="136"/>
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Titanic Unlabeled" width="90" x="179" y="340">
        <parameter key="repository_entry" value="//Samples/data/Titanic Unlabeled"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="187">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Random Forest" to_port="training set"/>
      <connect from_op="Random Forest" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Retrieve Titanic Unlabeled" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

My "default" RF settings with a discretization give me a result of 144 Survive, 248 do not survive.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="discretize_by_entropy" compatibility="7.6.001" expanded="true" height="103" name="Discretize" width="90" x="179" y="34"/>
      <operator activated="true" class="concurrency:parallel_random_forest" compatibility="7.6.001" expanded="true" height="82" name="Random Forest" width="90" x="313" y="34"/>
      <operator activated="true" class="group_models" compatibility="7.6.001" expanded="true" height="103" name="Group Models" width="90" x="447" y="34"/>
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Titanic Unlabeled" width="90" x="313" y="289">
        <parameter key="repository_entry" value="//Samples/data/Titanic Unlabeled"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="136">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Discretize" from_port="example set output" to_op="Random Forest" to_port="training set"/>
      <connect from_op="Discretize" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
      <connect from_op="Random Forest" from_port="model" to_op="Group Models" to_port="models in 2"/>
      <connect from_op="Group Models" from_port="model out" to_op="Apply Model" to_port="model"/>
      <connect from_op="Retrieve Titanic Unlabeled" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
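
To put the two runs side by side, this small Python sketch turns the prediction counts quoted above into shares. Only the four counts come from the results above; the function name is hypothetical.

```python
def predicted_share(survive, not_survive):
    """Fraction of examples predicted as 'survive'."""
    return survive / (survive + not_survive)


# Counts quoted above: default RF, then RF with discretization.
runs = {"default RF": (2, 390), "RF with discretization": (144, 248)}
for name, (survive, not_survive) in runs.items():
    share = predicted_share(survive, not_survive)
    print(f"{name}: {share:.1%} predicted to survive")
```

Both runs score the same 392 unlabeled examples, so the shares are directly comparable.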
