Problem with overfitting

Hello,

I have a problem with overfitting.

It is a classification task with 8 label values and 6 attributes, each with about 5.5 million values.

With 10-fold cross validation, my decision tree reaches an accuracy of about 93%. Unfortunately, when I apply the model to new data, I only get a test accuracy of 33%.

Can anyone tell me how to prevent overfitting on the training data?

I have chosen the following parameters for the decision tree:

criterion: information gain

maximum depth: 30

apply pruning: yes

confidence: 0.24

apply prepruning: yes

minimum gain: 0.0

minimum leaf size: 1

minimum size for split: 1

number of prepruning alternatives: 0

Greetings

Simon
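
For a rough sense of how little these settings restrict the tree, here is a small back-of-the-envelope sketch. It assumes binary splits (typical for numeric attributes) and that a maximum depth of d allows at most 2^(d-1) leaves. Both are assumptions made for illustration, and the helper names are hypothetical.

```python
def max_leaves(max_depth):
    """Upper bound on the leaf count of a binary tree.

    Assumption: a tree of depth 1 is a single node, so depth d allows
    at most 2 ** (d - 1) leaves.
    """
    return 2 ** (max_depth - 1)


def depth_for_one_leaf_per_example(n_examples):
    """Smallest depth at which one leaf per example becomes possible."""
    depth = 1
    while max_leaves(depth) < n_examples:
        depth += 1
    return depth


n_examples = 5_500_000        # dataset size mentioned in the post
leaf_cap = max_leaves(30)     # depth limit from the post
binding = depth_for_one_leaf_per_example(n_examples)

print(f"leaf capacity at depth 30: {leaf_cap:,}")
print(f"one leaf per example is possible from depth {binding}")
```

Under these assumptions, a depth limit of 30 leaves room for far more leaves than there are examples. With a minimum leaf size of 1 and a minimum gain of 0.0, pre-pruning then does little to stop the tree from isolating single examples.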

Hi,

Are there duplicates or pseudo-duplicates in your data?

Let's say you have production data for items, and the items are created in batches. Then two items from the same machine are virtually the same. Cross validation may put one of them in the training set and the other in the test set, and you 'fool' your validation.

Best,

Martin
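
The effect described above can be made concrete with a small sketch. It uses plain Python and made-up data: 100 hypothetical batches of 20 near-identical rows each. The function names are illustrative only. Random fold assignment scatters each batch across folds, while assigning whole batches to folds keeps them together.

```python
import random
from collections import defaultdict


def plain_folds(n_rows, k, seed=0):
    """Assign each row to one of k folds at random, ignoring any grouping."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    folds = [0] * n_rows
    for position, row in enumerate(order):
        folds[row] = position % k
    return folds


def grouped_folds(groups, k):
    """Assign whole groups to folds so that no group is split across folds."""
    unique = sorted(set(groups))
    fold_of_group = {g: j % k for j, g in enumerate(unique)}
    return [fold_of_group[g] for g in groups]


def leaking_groups(groups, folds):
    """Count groups whose rows ended up in more than one fold."""
    folds_seen = defaultdict(set)
    for g, f in zip(groups, folds):
        folds_seen[g].add(f)
    return sum(1 for fs in folds_seen.values() if len(fs) > 1)


# Hypothetical data: 100 batches with 20 near-identical rows each.
groups = [batch for batch in range(100) for _ in range(20)]

random_leaks = leaking_groups(groups, plain_folds(len(groups), k=10))
grouped_leaks = leaking_groups(groups, grouped_folds(groups, k=10))
print("groups split across folds (random): ", random_leaks)
print("groups split across folds (grouped):", grouped_leaks)
```

Every group that is split across folds gives the model a chance to see a near-copy of a test row during training. In Python, scikit-learn's `GroupKFold` offers the grouped variant as a ready-made splitter.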

Hello @mschmitz,

My project is about combustion. The model is supposed to predict emissions. It may well be that some operating conditions occur more than once.

Does the Remove Duplicates operator help here?

Regards

Simon

Hard to say. Do you have more than one combustion engine/device, and is your test set from a different engine? That would totally explain it, because your model may have overfitted to the engine.

Best,

Martin

No, it is waste combustion.

I use the data from 2010 - 2020 as training data and the data from 2021 as test data.

I have also tried training the model on only 2/3 of the training data and testing it on the remaining 1/3 (to rule out that something has changed in the process since 2021), but with the same result (low test accuracy).

Regards

Simon
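
Since the model is trained on 2010 to 2020 and applied to 2021, one option is to make the validation mimic that setup by always validating on data that is later than the training data. Below is a minimal sketch in plain Python. The records (a `day` field and a single attribute `a1`) are hypothetical, and the function names are illustrative.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Split rows into (train, test): train strictly before cutoff, test from cutoff on."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test


def forward_chaining_folds(rows, years):
    """Yield (year, train, test) where the test year follows all training years."""
    for year in years:
        train = [r for r in rows if r["day"].year < year]
        test = [r for r in rows if r["day"].year == year]
        if train and test:
            yield year, train, test


# Hypothetical records: one measurement per year, 2010 through 2021.
rows = [{"day": date(y, 1, 1), "a1": float(y - 2010)} for y in range(2010, 2022)]

train, test = chronological_split(rows, cutoff=date(2021, 1, 1))
folds = list(forward_chaining_folds(rows, years=range(2018, 2021)))

for year, fold_train, fold_test in folds:
    print(year, len(fold_train), len(fold_test))
```

If accuracy under such forward-looking folds is already close to the 2021 test accuracy, that would suggest the 93% from shuffled folds is optimistic because neighbouring time points end up on both sides of the split. That is a guess about the data, not something this thread confirms.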

Hi,

Maybe have a look at this older blog post of mine: https://towardsdatascience.com/when-cross-validation-fails-9bd5a57f07b5. That could be it.

Best,

Martin

Hi @mschmitz

I have now carried out the cross validation with a batch attribute, but with the same result.

I have attached my training dataset (1), my test dataset (2), and the XML of my process. The 6 attributes (a1-a6) are used to build a model (decision tree) to predict the label. I get a validation accuracy of 92.33% but only a test accuracy of 37%.

Is there another way to avoid overfitting?

Regards

Simon

<?xml version="1.0" encoding="UTF-8"?><process version="9.9.000">

<context>

<input/>

<output/>

<macros/>

</context>

<operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">

<parameter key="logverbosity" value="init"/>

<parameter key="random_seed" value="2001"/>

<parameter key="send_mail" value="never"/>

<parameter key="notification_email" value=""/>

<parameter key="process_duration_for_mail" value="30"/>

<parameter key="encoding" value="SYSTEM"/>

<process expanded="true">

<operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve 1" width="90" x="45" y="85">

<parameter key="repository_entry" value="../data/1"/>

</operator>

<operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">

<parameter key="attribute_name" value="label"/>

<parameter key="target_role" value="label"/>

<list key="set_additional_roles"/>

</operator>

<operator activated="true" class="model_simulator:generate_batch" compatibility="9.9.000" expanded="true" height="68" name="Generate Batch" width="90" x="313" y="85">

<parameter key="batch attribute name" value="batch"/>

<parameter key="number of batches" value="5"/>

</operator>

<operator activated="true" class="concurrency:cross_validation" compatibility="9.9.000" expanded="true" height="145" name="Cross Validation" width="90" x="447" y="85">

<parameter key="split_on_batch_attribute" value="true"/>

<parameter key="leave_one_out" value="false"/>

<parameter key="number_of_folds" value="10"/>

<parameter key="sampling_type" value="automatic"/>

<parameter key="use_local_random_seed" value="false"/>

<parameter key="local_random_seed" value="1992"/>

<parameter key="enable_parallel_execution" value="true"/>

<process expanded="true">

<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.9.000" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="34">

<parameter key="criterion" value="information_gain"/>

<parameter key="maximal_depth" value="30"/>

<parameter key="apply_pruning" value="true"/>

<parameter key="confidence" value="1.0E-7"/>

<parameter key="apply_prepruning" value="true"/>

<parameter key="minimal_gain" value="0.0"/>

<parameter key="minimal_leaf_size" value="1"/>

<parameter key="minimal_size_for_split" value="1"/>

<parameter key="number_of_prepruning_alternatives" value="0"/>

</operator>

<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>

<connect from_op="Decision Tree" from_port="model" to_port="model"/>

<portSpacing port="source_training set" spacing="0"/>

<portSpacing port="sink_model" spacing="0"/>

<portSpacing port="sink_through 1" spacing="0"/>

</process>

<process expanded="true">

<operator activated="true" class="apply_model" compatibility="9.9.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">

<list key="application_parameters"/>

<parameter key="create_view" value="false"/>

</operator>

<operator activated="true" class="performance" compatibility="9.9.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34">

<parameter key="use_example_weights" value="true"/>

</operator>

<connect from_port="model" to_op="Apply Model" to_port="model"/>

<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>

<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>

<connect from_op="Performance" from_port="performance" to_port="performance 1"/>

<portSpacing port="source_model" spacing="0"/>

<portSpacing port="source_test set" spacing="0"/>

<portSpacing port="source_through 1" spacing="0"/>

<portSpacing port="sink_test set results" spacing="0"/>

<portSpacing port="sink_performance 1" spacing="0"/>

<portSpacing port="sink_performance 2" spacing="0"/>

</process>

</operator>

<operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve 2" width="90" x="45" y="238">

<parameter key="repository_entry" value="../data/2"/>

</operator>

<operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role (2)" width="90" x="447" y="238">

<parameter key="attribute_name" value="label"/>

<parameter key="target_role" value="label"/>

<list key="set_additional_roles"/>

</operator>

<operator activated="true" class="model_simulator:model_simulator" compatibility="9.9.000" expanded="true" height="103" name="Model Simulator" width="90" x="648" y="34"/>

<connect from_op="Retrieve 1" from_port="output" to_op="Set Role" to_port="example set input"/>

<connect from_op="Set Role" from_port="example set output" to_op="Generate Batch" to_port="example set"/>

<connect from_op="Generate Batch" from_port="example set" to_op="Cross Validation" to_port="example set"/>

<connect from_op="Cross Validation" from_port="model" to_op="Model Simulator" to_port="model"/>

<connect from_op="Cross Validation" from_port="example set" to_op="Model Simulator" to_port="training data"/>

<connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>

<connect from_op="Retrieve 2" from_port="output" to_op="Set Role (2)" to_port="example set input"/>

<connect from_op="Set Role (2)" from_port="example set output" to_op="Model Simulator" to_port="test data"/>

<connect from_op="Model Simulator" from_port="simulator output" to_port="result 1"/>

<portSpacing port="source_input 1" spacing="0"/>

<portSpacing port="sink_result 1" spacing="0"/>

<portSpacing port="sink_result 2" spacing="0"/>

<portSpacing port="sink_result 3" spacing="0"/>

</process>

</operator>

</process>

Before we go deeper: Are you sure that your training and test sets stem from the same distribution?

Best,

Martin

Yes, they definitely come from the same distribution.

Regards

Simon
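
One way to check this rather than assume it is to compare the label shares in the training and test data, and to compute the accuracy of always predicting the most frequent training label. The sketch below uses made-up labels (`c1` to `c3`), and the function names are illustrative.

```python
from collections import Counter


def label_shares(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def majority_baseline_accuracy(train_labels, test_labels):
    """Test accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


# Made-up labels for illustration only.
train_labels = ["c1"] * 70 + ["c2"] * 20 + ["c3"] * 10
test_labels = ["c1"] * 30 + ["c2"] * 30 + ["c3"] * 40

tv = total_variation(label_shares(train_labels), label_shares(test_labels))
baseline = majority_baseline_accuracy(train_labels, test_labels)
print(f"total variation distance: {tv:.2f}")
print(f"majority baseline accuracy on test: {baseline:.2f}")
```

A large distance between the label shares, or a test accuracy close to the majority baseline, would point to a shift between the two datasets. The same comparison can be repeated per attribute after binning the values.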
