How can I compare a decision tree and linear regression using Cross-Validation (X-Validation)?

mylane3
mylane3 New Altair Community Member
edited November 2024 in Community Q&A

Attached are some relevant pictures of my setup and stats on the target variable:

1) The setup of my .rmp process, 2) Histogram of the target variable, 3) Plot of CO2 (the target variable) vs. the primary principal component

 

I want to figure out how best to compare models with Cross Validation, especially between categorical models like decision trees and numerical models like linear regression.  I have been learning about cross validation in my RapidMiner class, but I am not 100% sure what exactly accuracy, precision, and recall mean for the classification and regression performance operators.  For example, I would like to use precision and class recall to compare models, but I don't know what they would be for regression, because the confusion matrix is based on a nominal label, not a numerical label.

 

So how can I compare a decision tree and linear regression using Cross-Validation (X-Validation)?  What statistic or metric could I use?
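(For context, here is a minimal scikit-learn sketch, outside RapidMiner, of the two families of metrics I am asking about. The file and column names are made up; the point is only that RMSE/MAE are computed against the numeric label, while accuracy, precision and recall require a binned, nominal copy of it.)

```python
# Minimal sketch (scikit-learn, not RapidMiner). The file name "emissions.csv"
# and the column name "CO2" are placeholders for the real data set.
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("emissions.csv")                       # hypothetical file
X = df.drop(columns=["CO2"]).select_dtypes("number")    # numeric features only
y_numeric = df["CO2"]                                   # numeric label -> regression metrics

# A binned copy of the same label: the only way to get a confusion matrix
# (and therefore accuracy / precision / recall) out of a numeric target.
y_binned = pd.qcut(y_numeric, q=4,
                   labels=["low", "mid-low", "mid-high", "high"]).astype(str)

# Regression: cross-validated RMSE / MAE (sklearn reports them negated).
reg = cross_validate(LinearRegression(), X, y_numeric, cv=10,
                     scoring=["neg_root_mean_squared_error",
                              "neg_mean_absolute_error"])
print("RMSE:", -reg["test_neg_root_mean_squared_error"].mean())

# Classification: accuracy / precision / recall exist only for the binned label.
clf = cross_validate(DecisionTreeClassifier(), X, y_binned, cv=10,
                     scoring=["accuracy", "precision_macro", "recall_macro"])
print("accuracy:", clf["test_accuracy"].mean())
print("macro recall:", clf["test_recall_macro"].mean())
```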

 

Below are the stats output from my results:

Target Variable Stats
CO2 Emissions Average: 87,405.93 Deviation: 628,363.80
Performance of Linear Regression:
root_mean_squared_error: 34,017.261 +/- 5548.473 (mikro: 34467.846 +/- 0.000)
normalized_absolute_error: 0.151 +/- 0.040 (mikro: 0.140)
Performance of Decision Tree:
accuracy: 90.65% +/- 4.13% (mikro: 90.64%)
root_mean_squared_error: 0.282 +/- 0.063 (mikro: 0.289 +/- 0.000)
normalized_absolute_error: 3.143 +/- 5.800 (mikro: 1.209)
Avg. Class Precision: 62.1%
Avg. Class Recall: 68%

Performance of Random Forest:
accuracy: 82.14% +/- 1.92% (mikro: 82.14%)
root_mean_squared_error: 0.392 +/- 0.025 (mikro: 0.393 +/- 0.000)
normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)
Avg. Class Precision: 86.3%
Avg. Class Recall: 40%
Performance of Neural Network:
root_mean_squared_error: 23,815.976 +/- 4305.543 (mikro: 24211.353 +/- 0.00)
normalized_absolute_error: 0.126 +/- 0.037 (mikro: 0.122)
Performance of Generalized Linear Model (default values):
root_mean_squared_error: 219,027.497 +/- 45537.878
normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)

Best Answer

  • JEdward
    JEdward New Altair Community Member
    Answer ✓

    What you might want to do is transform your regression performance into a classification result by discretizing the Label & Prediction variables with the same rules you applied for your domain-expert-defined bins.

     

    Then you are comparing like for like. 
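    Roughly, the idea looks like the sketch below (plain Python/scikit-learn rather than the Discretize operator; the cut points and values are placeholders, not your actual bins).

    ```python
    # Rough illustration of "comparing like for like": bin the numeric label and
    # the regression prediction with the SAME cut points, then score the result
    # as a classification. Cut points and values here are made up.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import confusion_matrix, accuracy_score

    bins = [-np.inf, 10_000, 100_000, 1_000_000, np.inf]   # same cuts for both columns
    names = ["very low", "low", "high", "very high"]

    # Toy values standing in for the numeric label and the regression prediction.
    y_true = pd.Series([5_200, 87_000, 640_000, 1_900_000, 72_000])
    y_pred = pd.Series([9_800, 120_000, 580_000, 2_400_000, 95_000])

    true_binned = pd.cut(y_true, bins=bins, labels=names).astype(str)
    pred_binned = pd.cut(y_pred, bins=bins, labels=names).astype(str)

    print(confusion_matrix(true_binned, pred_binned, labels=names))
    print("accuracy:", accuracy_score(true_binned, pred_binned))
    ```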

    However, one caution I would give on your classification prediction is to think about how your classification model is measured against misclassifications.

     

    Imagine you have a numerical label with values 1 to 10. 

    After binning your label has the following nominal values.

    Value 1: 1-3

    Value 2: 4-6

    Value 3: 7-9

    Value 4: 10

     

    Now if your classification model takes something with an original numeric value of 3 and predicts that it is in group 'Value 2: 4-6', then although this is a misclassification, it is actually more accurate than if it had predicted 'Value 4: 10'.  However, looking purely at Accuracy, Precision & Recall won't reflect this.  Both misclassifications, as 'Value 4: 10' and as 'Value 2: 4-6', have the same performance value of 0... which is just not correct.

     

    I would recommend that you use the Performance (Costs) operator and create a misclassification cost matrix.  That way you can reflect that misclassifications into nearby groups are 'less costly' than those into more distant groups.
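    For what it's worth, the arithmetic behind such a cost matrix looks like this (the costs and counts below are made up; in RapidMiner you would simply fill in the matrix in the Performance (Costs) operator).

    ```python
    # Sketch of an average misclassification cost for ordered bins (made-up numbers).
    import numpy as np

    # Cost grows with the distance between the ordered bins, so confusing
    # 'Value 1' with 'Value 2' (cost 1) is cheaper than with 'Value 4' (cost 3).
    cost = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
    # [[0 1 2 3]
    #  [1 0 1 2]
    #  [2 1 0 1]
    #  [3 2 1 0]]

    # Toy confusion matrix: rows = true bin, columns = predicted bin.
    confusion = np.array([[50,  8,  1,  0],
                          [ 6, 40,  5,  1],
                          [ 0,  7, 35,  4],
                          [ 0,  1,  6, 30]])

    avg_cost = (cost * confusion).sum() / confusion.sum()
    print("average misclassification cost:", round(avg_cost, 3))
    ```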

     

     

Answers

  • mylane3
    mylane3 New Altair Community Member

    Thanks, I eventually realized that it really is like trying to compare fruit and vegetables.
