"LearningCurve edit: Training ratio bugged?"

Question

Dear All,

How can I make a smooth learning curve, one that shows the result averaged over many runs?

edit: The training ratio, i.e. the ratio that shall be maximally used for training, doesn't seem to work. When looking at the results, the max fraction is 0.95 although training_ratio was set to 0.2. When I change the training ratio to 0.6, nothing changes!

edit: I found 07_meta\04_LearningCurve.xml. I modified this XML as follows:

But this is not giving the results I want. The learning curve is way too chaotic! It seems the results do not get averaged over different runs: http://student.science.uva.nl/~wluijben/learning_curve_in_need_of_smoothing.jpg

Old question: How can I make a learning curve?

Let's say I have a dataset of 100 examples. I wish to split this data into 10 folds. In normal cross-validation there will be 10 runs, each training on 9 folds and testing on 1, which results in 1 average + standard deviation.

Now I wish to do an extra iteration inside each run, which varies the number of folds used for training. So this should result in N averages + N standard deviations, one pair for each number of folds used. (Preferably it should output the amount of training data used, not the number of folds.)

Regards,
Wessel
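For readers who want to see the procedure from the old question outside RapidMiner, here is a minimal Python sketch. It is not the LearningCurve operator and not the XML from this thread: the toy data, the nearest-mean classifier and all function names (make_data, learning_curve, etc.) are made up for illustration. It repeats 10-fold cross-validation with different shuffles, trains on 1 to 9 of the training folds in each run, and averages per number of training examples, which is what should smooth the curve.

```python
import random
import statistics


def make_data(n=100, seed=42):
    """Toy 1-D two-class data (an assumption, not the poster's dataset)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def nearest_mean_fit(train):
    """Toy learner: remember the mean feature value of each class."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    return {label: statistics.fmean(xs) for label, xs in by_label.items()}


def nearest_mean_accuracy(model, test):
    """Predict the class whose mean is closest and return the accuracy."""
    correct = 0
    for x, y in test:
        predicted = min(model, key=lambda label: abs(x - model[label]))
        correct += predicted == y
    return correct / len(test)


def learning_curve(data, n_folds=10, n_repeats=20, seed=0):
    """Return {number of training examples: (mean accuracy, std dev)}."""
    rng = random.Random(seed)
    scores = {}  # number of training examples -> list of accuracies
    for _ in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for test_idx, test in enumerate(folds):
            train_folds = [f for i, f in enumerate(folds) if i != test_idx]
            # Extra inner iteration: train on the first k training folds only.
            for k in range(1, len(train_folds) + 1):
                train = [ex for fold in train_folds[:k] for ex in fold]
                model = nearest_mean_fit(train)
                acc = nearest_mean_accuracy(model, test)
                scores.setdefault(len(train), []).append(acc)
    return {
        n: (statistics.fmean(accs), statistics.stdev(accs))
        for n, accs in sorted(scores.items())
    }
```

Calling learning_curve(make_data()) returns one (mean, standard deviation) pair per training-set size (10, 20, ..., 90 for 100 examples). Raising n_repeats averages each point over more runs.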

wessel · Answer

Thank you so much.
This is completely what I wanted.

There is a strange anomaly though, which is hidden from your screenshot, because you use example filter: fraction >= 0.1.
At fraction 0.05, using only 4 training examples, performance is better than when using 100 training examples!
Why is the performance of fraction 0.05 so good?

Regards,

Wessel

IngoRM · Answer

Hi,

this is exactly what happens. First the data is divided into two parts according to the parameter "training_ratio" of the learning curve operator. Then the different fractions are taken from the training part for training, while the test data is kept constant. Just try adding some additional logging as in the process below, or work with breakpoints, and you can see exactly what happens:

Cheers,
Ingo
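To make the description above concrete, here is a small Python sketch of that splitting scheme. This is not the RapidMiner process referred to in the post and not the operator's source code. The function names, the rounding, and the choice to take each fraction relative to the training part are assumptions for illustration only.

```python
import random


def fixed_test_splits(data, training_ratio=0.6,
                      fractions=(0.05, 0.25, 0.5, 1.0), seed=0):
    """Split once by training_ratio, then take growing fractions of the
    training part. The test part is the same for every fraction."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_pool = round(training_ratio * len(shuffled))
    train_pool, test = shuffled[:n_pool], shuffled[n_pool:]
    splits = []
    for fraction in fractions:
        # Assumption: the fraction is relative to the training part,
        # and at least one training example is always kept.
        n_train = max(1, round(fraction * len(train_pool)))
        splits.append((fraction, train_pool[:n_train], test))
    return splits


def describe(splits):
    """Return (fraction, number of training examples, number of test examples)."""
    return [(fraction, len(train), len(test)) for fraction, train, test in splits]
```

Calling describe(fixed_test_splits(range(100))) lists, for each fraction, how many training and test examples would be used under these assumptions. This is the kind of information the extra logging or a breakpoint would reveal in the real process.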

wessel · Answer

Woa, very nice!
Thanks so much!

Is there any way to reconstruct how many training examples and testing examples were used inside an iteration?

I ask this question because I fear "Using the rest" for testing is not fair.
You first want to split the data into a training set and a test set,
then, for each fraction, train on only that fraction of the training set, thus keeping the test set constant.

Where can I see the java code for LearningCurve?

edit: By setting a breakpoint inside the model applier I can see the number of training / test examples used.
The training set looks constant.
The operator information on LearningCurve is really confusing!

Regards,

Wessel