How can I compare a decision tree and linear regression using Cross-Validation (X-Validation)?

mylane3
mylane3 New Altair Community Member
edited November 2024 in Community Q&A

Attached are some relevant pictures of my setup and stats on the target variable:

1) The setup of my .rmp process, 2) Histogram of the target variable, 3) Plot of CO2 (the target variable) vs. the primary principal component

 

I want to figure out how best to compare models with Cross Validation, especially between categorical models like decision trees and numerical models like linear regression.  I have been learning about cross validation in my RapidMiner class, but I am not 100% sure what exactly accuracy, precision, and recall mean for the classification and regression performance operators.  For example, I would like to use precision and class recall to compare models, but I don't know what they would be for regression, because the confusion matrix is based on a nominal label, not a numerical label.

 

So how can I compare a decision tree and linear regression using Cross-Validation (X-Validation)?  What statistic or metric could I use?
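(For context, here is a minimal scikit-learn sketch, outside RapidMiner, of the two families of metrics I am asking about. The file and column names are made up; the point is only that RMSE/MAE are computed against the numeric label, while accuracy, precision and recall require a binned, nominal copy of it.)

```python
# Minimal sketch (scikit-learn, not RapidMiner). The file name "emissions.csv"
# and the column name "CO2" are placeholders for the real data set.
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("emissions.csv")                       # hypothetical file
X = df.drop(columns=["CO2"]).select_dtypes("number")    # numeric features only
y_numeric = df["CO2"]                                   # numeric label -> regression metrics

# A binned copy of the same label: the only way to get a confusion matrix
# (and therefore accuracy / precision / recall) out of a numeric target.
y_binned = pd.qcut(y_numeric, q=4,
                   labels=["low", "mid-low", "mid-high", "high"]).astype(str)

# Regression: cross-validated RMSE / MAE (sklearn reports them negated).
reg = cross_validate(LinearRegression(), X, y_numeric, cv=10,
                     scoring=["neg_root_mean_squared_error",
                              "neg_mean_absolute_error"])
print("RMSE:", -reg["test_neg_root_mean_squared_error"].mean())

# Classification: accuracy / precision / recall exist only for the binned label.
clf = cross_validate(DecisionTreeClassifier(), X, y_binned, cv=10,
                     scoring=["accuracy", "precision_macro", "recall_macro"])
print("accuracy:", clf["test_accuracy"].mean())
print("macro recall:", clf["test_recall_macro"].mean())
```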

 

Below are the stats output from my results:

Target Variable Stats
CO2 Emissions Average: 87,405.93 Deviation: 628,363.80
Performance of Linear Regression:
root_mean_squared_error: 34,017.261 +/- 5548.473 (mikro: 34467.846 +/- 0.000)
normalized_absolute_error: 0.151 +/- 0.040 (mikro: 0.140)
Performance of Decision Tree:
accuracy: 90.65% +/- 4.13% (mikro: 90.64%)
root_mean_squared_error: 0.282 +/- 0.063 (mikro: 0.289 +/- 0.000)
normalized_absolute_error: 3.143 +/- 5.800 (mikro: 1.209)
Avg. Class Precision: 62.1%
Avg. Class Recall: 68%

Performance of Random Forest:
accuracy: 82.14% +/- 1.92% (mikro: 82.14%)
root_mean_squared_error: 0.392 +/- 0.025 (mikro: 0.393 +/- 0.000)
normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)
Avg. Class Precision: 86.3%
Avg. Class Recall: 40%
Performance of Neural Network:
root_mean_squared_error: 23,815.976 +/- 4305.543 (mikro: 24211.353 +/- 0.00)
normalized_absolute_error: 0.126 +/- 0.037 (mikro: 0.122)
Performance of Generalized Linear Model (default values):
root_mean_squared_error: 219,027.497 +/- 45537.878
normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)

Best Answer

  • JEdward
    JEdward New Altair Community Member
    Answer ✓

    What you might want to do is transform your regression performance into a classification result by discretizing the Label & Prediction variables with the same rules you applied for your domain-expert-defined bins.

     

    Then you are comparing like for like. 
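    Roughly, the idea looks like the sketch below (plain Python/scikit-learn rather than the Discretize operator; the cut points and values are placeholders, not your actual bins).

    ```python
    # Rough illustration of "comparing like for like": bin the numeric label and
    # the regression prediction with the SAME cut points, then score the result
    # as a classification. Cut points and values here are made up.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import confusion_matrix, accuracy_score

    bins = [-np.inf, 10_000, 100_000, 1_000_000, np.inf]   # same cuts for both columns
    names = ["very low", "low", "high", "very high"]

    # Toy values standing in for the numeric label and the regression prediction.
    y_true = pd.Series([5_200, 87_000, 640_000, 1_900_000, 72_000])
    y_pred = pd.Series([9_800, 120_000, 580_000, 2_400_000, 95_000])

    true_binned = pd.cut(y_true, bins=bins, labels=names).astype(str)
    pred_binned = pd.cut(y_pred, bins=bins, labels=names).astype(str)

    print(confusion_matrix(true_binned, pred_binned, labels=names))
    print("accuracy:", accuracy_score(true_binned, pred_binned))
    ```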

    However, one caution I would give on your classification prediction is to think about how your classification model is measured against misclassifications.

     

    Imagine you have a numerical label with values 1 to 10. 

    After binning your label has the following nominal values.

    Value 1: 1-3

    Value 2: 4-6

    Value 3: 7-9

    Value 4: 10

     

    Now if your classification model takes something with an original numeric value of 3 and predicts that it is in group 'Value 2: 4-6', then although this is a misclassification, it is actually more accurate than if it had predicted 'Value 4: 10'.  However, looking purely at Accuracy, Precision & Recall won't reflect this.  Both misclassifications, as 'Value 4: 10' and as 'Value 2: 4-6', have the same performance value of 0... which is just not correct.

     

    I would recommend that you use the Performance (Costs) operator and create a misclassification cost matrix.  That way you can reflect that misclassifications into nearby groups are 'less costly' than those into more distant groups.
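    For what it's worth, the arithmetic behind such a cost matrix looks like this (the costs and counts below are made up; in RapidMiner you would simply fill in the matrix in the Performance (Costs) operator).

    ```python
    # Sketch of an average misclassification cost for ordered bins (made-up numbers).
    import numpy as np

    # Cost grows with the distance between the ordered bins, so confusing
    # 'Value 1' with 'Value 2' (cost 1) is cheaper than with 'Value 4' (cost 3).
    cost = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
    # [[0 1 2 3]
    #  [1 0 1 2]
    #  [2 1 0 1]
    #  [3 2 1 0]]

    # Toy confusion matrix: rows = true bin, columns = predicted bin.
    confusion = np.array([[50,  8,  1,  0],
                          [ 6, 40,  5,  1],
                          [ 0,  7, 35,  4],
                          [ 0,  1,  6, 30]])

    avg_cost = (cost * confusion).sum() / confusion.sum()
    print("average misclassification cost:", round(avg_cost, 3))
    ```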

     

     

Answers

  • mylane3
    mylane3 New Altair Community Member

    Thanks, I eventually realized that it really is like trying to compare fruit and vegetables.
