High deviation?
Legacy User
New Altair Community Member
For many learners (supervised learning), I get a standard deviation of about 40-50% on my learning set (at an accuracy of 70-80%). What can I infer from this value? Is this deviation high, i.e. are the computed classifiers too weak? Or are these standard values? If not, what could I do to reduce the standard deviation?
I'd be happy to get a detailed answer since I'm a newbie to data mining and your cool tool RapidMiner. :-)
Thx.
klaus
Answers
-
Hi Klaus,
if you use cross-validation, the performance is estimated by averaging the performance over a number of disjoint training and testing runs. Since you then have, for example, ten performance values, you can calculate the standard deviation from them.
A high standard deviation then indicates that the performance in some of the runs was much better than the average and in others much worse.
This points to a very unstable classification result or to the use of a very small training set. With small training sets (for example gene expression data with hundreds of thousands of attributes but only a small number of examples), a single misclassified example more or less in a run already causes a standard deviation of several percent.
If your example set contains enough examples, you should try another learning algorithm or tune the parameters in order to obtain more reliable results.
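To make that concrete, here is a minimal sketch of how these per-fold values and their standard deviation arise. It uses plain Python with scikit-learn and a synthetic data set as stand-ins, not RapidMiner or your actual data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small, noisy stand-in data set (500 examples), not your actual data.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           flip_y=0.2, random_state=0)

# Placeholder learner; any classifier could be substituted here.
clf = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold, like the ten runs of a cross-validation.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("per-fold accuracies:", scores)
print("mean accuracy: %.3f" % scores.mean())
print("standard deviation: %.3f" % scores.std())

If the printed per-fold accuracies jump around a lot, the standard deviation will be large even though the mean can still look respectable.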
Greetings,
Sebastian
-
Hello Sebastian,
You wrote: "A high standard deviation then indicates that the performance in some of the runs was much better than the average and in others much worse." Do you mean by "average" the value that one gets for the accuracy?
Your assumption is right, the training set is relatively small, about 500 examples. I've tried a couple of algorithms and also played around with their parameters, but in most cases I still get these high deviations. Do you have any suggestions as to which algorithms/parameters or pre-/post-processing steps might be promising for reducing the standard deviation?
And a general question: are such high deviations acceptable when the accuracy, as in my case, is relatively high (or is 80% not that good)? Or is it better to sacrifice some of a model's accuracy in exchange for a smaller standard deviation?
klaus
-
Hi,
500 examples is not that big, but one or two misclassified examples more or less won't cause the accuracy to deviate that much: with ten folds, each test fold holds about 50 examples, so a single misclassification shifts that fold's accuracy by only about 2%.
The standard deviation is an indicator of how reliable the performance estimate from cross-validation is. A deviation of 80% says: don't know. Your final model trained on all the data might be really bad or really good, but you don't know, and you cannot test it (since you have already used all your training data).
Usually a deviation of around 4% is tolerable, sometimes more, sometimes less, depending on the size of the training data. So you should indeed look for another learning algorithm or do some preprocessing. In many cases the real trick lies in the correct preprocessing...
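As a hedged illustration of that last point (again plain Python with scikit-learn and placeholder data rather than RapidMiner): comparing the fold-to-fold standard deviation of a few candidate learners, one of them wrapped with a simple preprocessing step such as scaling, shows how both the algorithm and the preprocessing affect the spread:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data set, standing in for the real 500-example set.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           flip_y=0.2, random_state=0)

# A few candidate learners; the scaled SVM applies its preprocessing inside
# the cross-validation, so the scaler is refit on each training fold.
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "scaler + SVM": make_pipeline(StandardScaler(), SVC()),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print("%-15s mean=%.3f  std=%.3f" % (name, scores.mean(), scores.std()))

In a comparison like this, a learner with a small standard deviation at a comparable mean accuracy gives you a more reliable estimate, which is exactly the trade-off asked about above.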
Greetings,
Sebastian