Compare Test and Training error of SVM, decision tree and neural network
Hello everybody,
I found this old thread (https://rapid-i.com/rapidforum/index.php/topic,721.0.html) that illustrates how to visualize the training and test error for an SVM as a function of the C parameter. Is it also possible to log the errors for decision trees and neural networks with the same Log operator, so that all of them can be displayed in one chart? Which criterion would you log to make them comparable?
By the way, do you know if there is an option to calculate the bias and variance of the dataset in RapidMiner without an R script?
Thanks in advance and kind regards,
Christopher
Answers
-
Hi,
I think you can do it with one Log operator. You can also use Log to Data and join/append the two logs, which is a bit easier.
For bias vs. variance: can you define how you would calculate variance and bias?
~Martin
-
Hi Martin,
thank you for the fast reply. I will try this operator, but which attributes would you log? The depth of the tree, the C of the SVM and the number of iterations of the neural network? I am not sure about that.
That's exactly my problem. I have read about the bias-variance trade-off a few times and wanted to know whether RapidMiner can calculate variance and bias automatically.
Edit: I am so interested in this topic because I think my model is completely overfitting, and I want to find the cause of the overfitting.
Best regards,
Christopher
-
Hi Christopher,
each model complexity setting gives you exactly one point on the trade-off curve: you fix the complexity and check the predictive quality using a cross validation. Take a look here:
http://web.stanford.edu/~hastie/ElemStatLearnII/figures7.pdf
You see two curves there. One is the training (in-sample) error, which you get by applying the trained model to the data set it was trained on. The other is the out-of-sample error, which you get by applying the model to a test set. The best and most reliable way to measure the latter is the Cross Validation operator.
Now you can run your model at various degrees of complexity. What influences the complexity depends on the algorithm itself: the C parameter in a linear SVM, the number of neurons in a neural net, a smaller k in k-NN, the depth of the tree in a decision tree, etc.
Vary it using Optimize Parameters, log all the varied parameters together with the resulting cross-validation performance, and you get the trade-off graph and can select the optimal point (or let Optimize Parameters pick it for you).
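To make the idea concrete outside of RapidMiner, here is a minimal sketch in plain Python. It is not a RapidMiner process: the synthetic data, the tiny k-NN classifier and all function names are assumptions made up for illustration. It varies one complexity parameter (k), and for each value logs the training error and the cross-validated error as one row, which is the same table you would collect with the Log operator.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (use odd k)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0

def error_rate(train, test, k):
    """Fraction of test points that a k-NN model built on train gets wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)

def cross_val_error(data, k, folds=5):
    """Average out-of-sample error over `folds` interleaved folds."""
    errors = []
    for i in range(folds):
        test = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        errors.append(error_rate(train, test, k))
    return sum(errors) / folds

def make_data(n=200, noise=0.2, seed=0):
    """Hypothetical 1-D data: label is 1 if x > 0.5, flipped with prob. noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

data = make_data()
log = []  # one row per complexity setting, like a Log operator table
for k in (1, 3, 5, 9, 15, 31, 61):
    log.append({
        "k": k,
        "train_error": error_rate(data, data, k),  # in-sample
        "cv_error": cross_val_error(data, k),      # out-of-sample
    })

for row in log:
    print(row)
```

Plotting train_error and cv_error against k gives the two curves from the figure. To compare several learners in one chart, the same kind of row can be logged for each of them, with the error measure as the common criterion.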
Greetings,
Sebastian
-
I echo Sebastian's comments that the best way to avoid overfitting is to use robust cross-validation techniques. Any model that is simply built on all the data from a single development dataset is quite likely to overfit. Some modeling approaches also include tuning parameters that are specifically designed to reduce overfitting, such as automatic pruning steps in decision trees.
Since variance and bias are conceptual categories that can be calculated differently depending on the type of problem, it isn't something built into RapidMiner. As you suggest, you could write your own functions to calculate your desired bias and variance, either in R or in RapidMiner.
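As one possible definition, for squared loss you can estimate both quantities by simulation: train the same learner on many independent training sets, then take the squared gap between the average prediction and the true value (bias squared) and the spread of the predictions around their average (variance). The sketch below is plain Python, not RapidMiner, and it assumes a known true function, which you only have for synthetic data. The sine target, the noise level and all names are illustrative assumptions.

```python
import math
import random

def true_f(x):
    """Assumed ground-truth function (known only because data is synthetic)."""
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=100, sigma=0.3):
    """Draw n noisy observations of true_f on [0, 1]."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, sigma)) for x in xs]

def knn_regress(train, x, k):
    """Predict the mean target of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def bias_variance(k, rounds=200, seed=0):
    """Estimate average bias^2 and variance of k-NN over fixed test points."""
    rng = random.Random(seed)
    test_xs = [i / 10 for i in range(1, 10)]
    preds = [[] for _ in test_xs]
    for _ in range(rounds):
        train = sample_training_set(rng)  # a fresh training set each round
        for i, x in enumerate(test_xs):
            preds[i].append(knn_regress(train, x, k))
    bias_sq = 0.0
    variance = 0.0
    for x, p in zip(test_xs, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_f(x)) ** 2
        variance += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(test_xs), variance / len(test_xs)

results = {k: bias_variance(k) for k in (1, 25)}
for k, (b2, var) in results.items():
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

On a real dataset the true function is unknown, so a common workaround is to resample the training data (for example by bootstrapping) and compare against the observed labels. In that case the noise can no longer be separated from the bias term.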
You can find more about this topic in this helpful Wikipedia article: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff