How do I get better predictions?

LeMarc
New Altair Community Member
Hi,
I'm using a regression model to predict sales values (the label attribute) in order to select those data points where the sales values could potentially be wrong. "Wrong" is defined by the deviation between the predicted value and the original value.
However, some predictions are quite close to the original values, while for others the error rate is above 50% within the same (artificial) data set. Using a forecasting model (e.g. ARIMA) does not make sense to me, since I'm not trying to forecast future values for another example set, but rather trying to check whether sales values are right or wrong and flag them as potentially wrong.
So I was wondering: could the predictions of the sales values differ this much because the data is not based on real data?
Does anyone have a suggestion on how else to recheck sales values with supervised learning methods?
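For illustration, the flagging logic I have in mind looks roughly like this in Python (a minimal sketch with made-up column names, not my actual RapidMiner process):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("sales.csv")                  # hypothetical file: predictors plus a "sales" label
X, y = df.drop(columns="sales"), df["sales"]

# fit a regressor and predict the label back on the same rows
model = RandomForestRegressor(random_state=42).fit(X, y)
pred = model.predict(X)

# relative deviation between predicted and original value
rel_error = (pred - y).abs() / y.abs()         # assumes no zero sales values
df["flag_suspicious"] = rel_error > 0.5        # flag deviations above 50 %
print(df[df["flag_suspicious"]])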
Thank you!
Best Answer
-
You need to stick with one validation type when you compare models. I see that in (1) you used 80:20 and in (4) you used 90:10. Trying to improve or study model performance based on split ratios is not recommended: you get different results with different split ratios because different data samples end up in the test sets of (1) and (4). Also, use a random seed when you use split operators, so that you get the same results every time you run the process.
Second, I strongly recommend validating your models with cross-validation instead of random splits. This is not about getting a highly accurate model, but about getting stable and reliable performance estimates.
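In scikit-learn terms, the same idea looks like this (a rough sketch; X and y stand for your predictors and label):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

model = GradientBoostingRegressor(random_state=42)

# 10-fold cross-validation: every row is used for testing exactly once,
# so the estimate is much more stable than a single 80:20 or 90:10 split
scores = -cross_val_score(model, X, y, cv=10, scoring="neg_root_mean_squared_error")
print("RMSE: %.3f +/- %.3f" % (scores.mean(), scores.std()))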
RMSE is an error measure (the lower the error, the better the model) and it is not a percentage. RMSE is based on the residuals (the deviations between predicted and true values), so it is on the same scale as your label; the units are the same. For example, if you are trying to predict the number of packages sold and the RMSE is 15, it means the quadratic mean of the errors is 15 packages.
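A tiny worked example with made-up numbers:

import numpy as np

y_true = np.array([100, 120, 80])   # packages actually sold
y_pred = np.array([110, 105, 95])   # model predictions

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)  # ~13.5 -> the typical error is about 13.5 packages, not 13.5 %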
Feature generation is a concept where you generate multiple new features from existing attributes. You can use the "Automatic Feature Engineering" operator, or you can do it yourself based on domain knowledge using Generate Attributes. The link below gives you an idea.
https://rapidminer.com/blog/data-prep-feature-generation-selection/
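As a sketch of what generated features can look like (hypothetical attribute names; in RapidMiner you would build the same expressions with Generate Attributes):

import numpy as np
import pandas as pd

df = pd.read_csv("sales.csv")                          # hypothetical data set

# derive new candidate features from existing columns
df["price_per_unit"] = df["revenue"] / df["units"]     # ratio of two attributes
df["log_units"] = np.log1p(df["units"])                # dampen skewed values
df["is_december"] = (df["month"] == 12).astype(int)    # simple seasonality flag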
Try implementing feature selection and feature generation (if possible), plus dimensionality reduction (if needed, based on the number of attributes and samples), optimize your models by tuning their hyperparameters (Optimize Parameters (Grid)), and cross-validate.
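Again as a scikit-learn sketch (assumed X and y as above; the grid values are only examples):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# grid search over hyperparameters, each candidate scored with 10-fold CV
# (the rough equivalent of Optimize Parameters (Grid) wrapped around Cross Validation)
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=10, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)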
Answers
-
I used a regression model with an example set of sales data from the internet. The predictions there are quite close to the actual values. The time frame included several years, though. In my case I'm just checking the sales value for a single month.
-
Hi @LeMarc,
this boils down to a general "how do I get better predictions" question... What model did you use?
Best,
Martin
-
Thanks @mschmitz! I changed my question according to your suggestion. I tried several different prediction models available in RapidMiner, e.g. DT, RF, GBT, DL etc., just experimenting, without optimizing parameters though.
Edit: Optimizing parameters & stacking does not improve performance.
Decision Tree seems to be the best so far. However, since it is a task for management accounting, the predicted values and the actual values should be quite close if there is no mistake in the actual values.
-
It is very surprising that a DT is better than a GBT. If a DT works, an RF is usually better. That makes me suspicious...
-
"Edit: Optimizing Parameters & Stacking does not improve performance" - does not or did not? Generally, optimization improves performance, unless the default parameters of the operators already happen to be the best for this data.
Also, how did you build your models? Did you use any feature selection or generation?
Did you check correlations between the predictors and outcomes? We can get some idea based on that as well.
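For example, a quick check in Python (a sketch, assuming your data in a pandas DataFrame with a "sales" label):

import pandas as pd

df = pd.read_csv("sales.csv")   # hypothetical file with predictors and the "sales" label

# correlation of every numeric predictor with the label
print(df.corr(numeric_only=True)["sales"].sort_values())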
-
Unfortunately I can't share my data set, but I can show my models.
(1) Parameters are set to default, ratio 0.8/0.2 (100 examples). The RMSE values are as follows:
DT = 18788.462 +/- 0.000
GBT = 15756.644 +/- 0.000
RF = 12021.061 +/- 0.000
@mschmitz you are right, it does make sense that an RF should be better than a simple decision tree, as the data above shows.
(2) With the last model using the Loop Parameters setup, the RMSE looks like this (settings are the same as in (1)):
DT = 7930.069
GBT = 11496.235
RF = 12440.348
I don't understand why DT has the lowest RMSE now.
(3) I also tried Auto Model, and the RMSE KPI looks like this:
GLM = 6994.636 +/- 2003.916 (micro average: 7203.218 +/- 0.000)
DT = 10789.033 +/- 4282.052 (micro average: 11512.035 +/- 0.000)
RF = 8101.997 +/- 2561.472 (micro average: 8427.034 +/- 0.000)
So basically the result is similar to the first one with regard to which model has the lower RMSE.
(4) The ratio was changed to 0.9/0.1:
DT = 1550.679 +/- 0.000
GBT = 9779.131 +/- 0.000
RF = 6380.126 +/- 0.000
Now DT has the lowest RMSE. But why?
@varunm1 & @mschmitz It did not work. I did not use any feature selection. The correlation matrix didn't show any interesting result, since there is no real pattern behind the sales values due to the artificial data set. What do you mean by generation?
The model with the lowest RMSE should be chosen, right? And if the RMSE is e.g. 1550.679, is that 15.5 %? I'm a little bit confused about how to read these numbers.
Something more I don't understand: when using Deep Learning to predict, the performance changes every time the "start execute" button is pressed, even though nothing else changes.
Thank you for the help!