Problem with overfitting
SimonK
New Altair Community Member
Hello,
I have a problem with overfitting.
It is a classification with 8 label values and 6 attributes with about 5.5 million values each.
With 10-fold cross validation, my decision tree reaches an accuracy of about 93%. Unfortunately, when I apply the model to new data, I only get a test accuracy of 33%.
Can anyone tell me how to prevent overfitting on the training data?
I have chosen the following parameters for the decision tree:
criterion: information gain
maximum depth: 30
apply pruning: yes
confidence: 0.24
apply prepruning: yes
minimum gain: 0.0
minimum leaf size: 1
minimum size for split: 1
number of prepruning alternatives: 0
Greetings
Simon
Answers
-
Hi,
are there duplicates or pseudo-duplicates in your data? Let's say you have production data for items, and the items are created in batches. Then two items from the same machine are virtually the same. Cross validation may separate them into the training and test sets, and you 'fool' your validation.
Best,
Martin
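To illustrate the point about pseudo-duplicates: a minimal sketch in plain Python of a group-aware fold assignment, where all rows sharing a group id land in the same fold, so near-identical rows cannot end up on both sides of a split. The function name `group_k_fold` and the `batch_ids` values are hypothetical and only serve the illustration; this is not the RapidMiner implementation.

```python
from collections import defaultdict


def group_k_fold(groups, n_folds):
    """Yield (train_idx, test_idx) so that no group id appears on both sides.

    `groups` holds one group id per row (e.g. a hypothetical batch id).
    """
    rows_by_group = defaultdict(list)
    for idx, group_id in enumerate(groups):
        rows_by_group[group_id].append(idx)

    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_folds)]
    for group_id in sorted(rows_by_group, key=lambda g: -len(rows_by_group[g])):
        smallest = min(folds, key=len)
        smallest.extend(rows_by_group[group_id])

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical example: 8 rows from 4 batches, 2 rows each.
batch_ids = ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"]
splits = list(group_k_fold(batch_ids, n_folds=2))
```

If the accuracy drops sharply once the folds respect such groups, the earlier estimate was probably inflated by near-duplicates being shared between training and test folds.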
-
Hi @SimonK,
hard to say. Do you have more than one combustion engine/device, and does your test set come from a different engine? That would totally explain it, because your model may have overfitted on the engine.
Best,
Martin
-
@mschmitz
No, it is waste combustion.
I use the data from 2010 - 2020 as training data and the data from 2021 as test data.
I have also tried to train the model with only 2/3 of the training data and test it with the remaining 1/3 (to rule out that something has changed in the process since 2021), but with the same result (the low test accuracy).
Regards
Simon
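Since the test set here is a later year than the training data, one way to make the validation resemble that setup is to split by time instead of randomly. Below is a minimal sketch in plain Python of an expanding-window split: train on all rows before a given year, test on that year. It assumes each row carries a year value; the function name `expanding_window_splits` and the example years are hypothetical.

```python
def expanding_window_splits(years, first_test_year):
    """Yield (test_year, train_idx, test_idx) for each year >= first_test_year.

    Training rows are strictly earlier than the test year, so the model
    never sees data from the period it is evaluated on.
    """
    test_years = sorted({y for y in years if y >= first_test_year})
    for test_year in test_years:
        train_idx = [i for i, y in enumerate(years) if y < test_year]
        test_idx = [i for i, y in enumerate(years) if y == test_year]
        yield test_year, train_idx, test_idx


# Hypothetical example: one year value per row.
years = [2010, 2010, 2011, 2012, 2012, 2013]
splits = list(expanding_window_splits(years, first_test_year=2012))
for test_year, train_idx, test_idx in splits:
    print(test_year, len(train_idx), len(test_idx))
```

If the accuracy under such a split is much closer to the low test accuracy than to the cross-validation figure, that would suggest the random folds are mixing rows from the same time period into both sides.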
-
Hi,
maybe have a look at this older blog post of mine: https://towardsdatascience.com/when-cross-validation-fails-9bd5a57f07b5
That could be it.
Best,
Martin
-
I have attached my training dataset (1), my test dataset (2), and the XML of my process. The 6 attributes (a1-a6) are used to build a model (decision tree) to predict the label. I get a validation accuracy of 92.33% but only a test accuracy of 37%.
Is there another way to avoid overfitting?
Regards
Simon
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.9.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve 1" width="90" x="45" y="85">
        <parameter key="repository_entry" value="../data/1"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="model_simulator:generate_batch" compatibility="9.9.000" expanded="true" height="68" name="Generate Batch" width="90" x="313" y="85">
        <parameter key="batch attribute name" value="batch"/>
        <parameter key="number of batches" value="5"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.9.000" expanded="true" height="145" name="Cross Validation" width="90" x="447" y="85">
        <parameter key="split_on_batch_attribute" value="true"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="10"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.9.000" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="34">
            <parameter key="criterion" value="information_gain"/>
            <parameter key="maximal_depth" value="30"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="1.0E-7"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.0"/>
            <parameter key="minimal_leaf_size" value="1"/>
            <parameter key="minimal_size_for_split" value="1"/>
            <parameter key="number_of_prepruning_alternatives" value="0"/>
          </operator>
          <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.9.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance" compatibility="9.9.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve 2" width="90" x="45" y="238">
        <parameter key="repository_entry" value="../data/2"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role (2)" width="90" x="447" y="238">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="model_simulator:model_simulator" compatibility="9.9.000" expanded="true" height="103" name="Model Simulator" width="90" x="648" y="34"/>
      <connect from_op="Retrieve 1" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Generate Batch" to_port="example set"/>
      <connect from_op="Generate Batch" from_port="example set" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_op="Model Simulator" to_port="model"/>
      <connect from_op="Cross Validation" from_port="example set" to_op="Model Simulator" to_port="training data"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>
      <connect from_op="Retrieve 2" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Model Simulator" to_port="test data"/>
      <connect from_op="Model Simulator" from_port="simulator output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
-
Before we go deeper: are you sure that your test and training sets stem from the same distribution?
Best,
Martin
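One quick way to probe this question, sketched in plain Python: compare the class proportions of the label in the two sets and summarize the gap as a total variation distance (0 means identical proportions, 1 means no overlap). The function names and the example labels are hypothetical, and this only looks at the label; the attributes a1-a6 would need a separate check.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class among the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def total_variation_distance(train_labels, test_labels):
    """Half the summed absolute difference between the class proportions."""
    p = class_proportions(train_labels)
    q = class_proportions(test_labels)
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical labels: the class balance is reversed between the two sets.
train_labels = ["a", "a", "a", "b"]
test_labels = ["a", "b", "b", "b"]
shift = total_variation_distance(train_labels, test_labels)
```

A value close to 0 does not prove that the sets come from the same distribution, but a large value would be a clear sign that they do not.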
-
Yes, they definitely come from the same distribution.
Regards
Simon