
Model performance estimation

User: "npapan69"
New Altair Community Member
Updated by Jocelyn
Dear All,
I have a relatively small dataset with 130 samples and 2150 attributes, and I want to build a classifier to predict 2 classes. Apparently, I need to reduce the number of attributes to avoid overfitting, so I could use e.g. RFE-SVM to bring the number of attributes down to one tenth of the number of samples, which is 13. I'm using a Logistic Regression model, and I need to do some fine-tuning of parameters like lambda and alpha. After reading the very informative blog from Ingo, I would like some help with the practical implementation. May I kindly ask a more experienced member to check the following workflow? Can I trust this implementation and, in particular, the performance estimates? Is it good practice to compare the performance from CV with that from a single hold-out set? And if so, should these numbers be more or less the same?



Many thanks in advance,

npapan69

    User: "Telcontar120"
    New Altair Community Member
    Accepted Answer
    Cross-validation is generally considered a more reliable estimate than a simple split validation. Split validation measures performance on only one random sample of the data, whereas cross-validation uses all the data for validation. Think about it this way: the hold-out from a split validation is simply one of the k folds of a cross-validation. It is inherently inferior to taking multiple hold-outs and averaging their performance, which provides not only a point estimate but also a sense of the variance of the model's performance.

    It's different if you have a totally separate dataset (sometimes called an "out of sample" validation, perhaps from a different set of users, or different time period, etc.) that you want to test your model on after the initial construction.  In that case your separate holdout might provide additional insight into your expected model performance on new data.  But in a straight comparison between split and cross validation, you should prefer cross validation.
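    To make this concrete, a minimal scikit-learn sketch on synthetic data (not the actual RapidMiner process discussed here) contrasts the single point estimate from a split validation with the mean and spread reported by a 10-fold cross-validation:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    # Synthetic stand-in for a small, wide dataset (not the poster's actual data)
    X, y = make_classification(n_samples=130, n_features=200, n_informative=10,
                               random_state=42)
    model = LogisticRegression(max_iter=2000)

    # Split validation: a single point estimate from one random hold-out
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
    split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

    # Cross-validation: every row is held out exactly once, giving a mean
    # performance plus a sense of its variance across folds
    cv_scores = cross_val_score(model, X, y, cv=10)

    print(f"split accuracy: {split_acc:.3f}")
    print(f"10-fold CV:     {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")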
    User: "rfuentealba"
    New Altair Community Member
    Accepted Answer
    Hi @npapan69,

    (My poor old Apple MacBook Air is showing signs of age, hence it took me a massive amount of time to check your process without RapidMiner hanging up, so sorry for the delay!)

    Now, quoting your last response:
    In the -omics field that I'm working in, it's very common to have few samples and far too many attributes, therefore feature selection methods are very important to reduce overfitting.
    Yes, in oceanic research I have a similar situation: models with 240 samples, each with 75 attributes, and I struggle to find the smallest set of features. If you have more attributes than rows, the number of attribute combinations you would have to analyze is far larger than the number of samples you have, so mathematically your data only accounts for a fraction of the truth.
    In my feature selection approach (as you will see in my process) I start by removing useless and highly correlated features... 
    Great, but in your process you are doing it inside a cross validation that is inside an Optimize operator that is inside another cross validation. I have moved these steps to the beginning of the process to gain a bit of speed; you don't need to repeat them on every loop. To illustrate, here is some pseudocode:
    // The outer cross validation operator executes everything inside it once
    // per fold of the data.
    for each fold of the dataset as i:
        read the training part of fold i
        pass it to the optimization operator
        // The optimization operator executes everything inside it once
        // per combination of parameters to try.
        for each value in min-svmrfe .. max-svmrfe as j:
            for each value in min-logreg-alpha .. max-logreg-alpha as k:
                for each value in min-logreg-lambda .. max-logreg-lambda as l:
                    // another (inner) cross validation
                    for each fold of the training part as m:
                        split fold m into a training chunk n and a test chunk o
                        model = train(j, k, l, n)
                        performance_data = test(model, o)
                        save(performance_data)
    The problem with your approach is that the process has a lot of nesting that chops up the data, plus a number of data preparation steps that don't really contribute to the validation itself; you get better value by running those steps once, before you start the optimization and cross validation.
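    As a rough sketch of that idea in scikit-learn terms (the helper and its correlation threshold below are illustrative, not part of the attached process), the unsupervised clean-up can run once, up front:

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    def drop_useless_and_correlated(df: pd.DataFrame, corr_threshold: float = 0.95):
        """Unsupervised clean-up, done once before any optimization or CV."""
        # Remove constant ("useless") attributes
        keep = df.columns[VarianceThreshold(0.0).fit(df).get_support()]
        df = df[keep]
        # Remove one attribute from every highly correlated pair
        corr = df.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
        return df.drop(columns=to_drop)

    Because these steps never look at the label, running them before the nested loops changes only the runtime, not the validity of the performance estimate; supervised steps such as RFE-SVM still belong inside the cross validation.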
    ...and then apply RFE-SVM. As a rule of thumb, the maximum number of features that finally comprise the model (signature) should not exceed 1/10 of the total number of samples used to train the model.
    I'm not aware of the specifics of your project, so we'll go ahead with this.
    Now the question is whether my approach is correct: using a nested cross validation operator to select features, train and fine-tune the model on 75% of the samples, while testing performance on the 25% hold-out set.
    It is.
    (Impostor-syndrome aside: as the only unicorn on Earth who doesn't know how to do data science properly, I do the same exercise with the golden ratio, 75/25, and the 80/20 Pareto rule.)
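    For readers who want to see the same nested scheme in code, a rough scikit-learn sketch follows (X and y stand for an already-loaded data table and label; the elastic-net l1_ratio and C are only loose analogues of the alpha and lambda tuned in the RapidMiner operator):

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                         train_test_split)
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    # 75/25 hold-out, as in the process under discussion (X, y assumed loaded)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              stratify=y, random_state=0)

    # RFE and the classifier share a pipeline, so each CV fold repeats the
    # feature selection on its own training portion only.
    pipe = Pipeline([
        ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=13, step=0.2)),
        ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                                   max_iter=5000)),
    ])
    param_grid = {
        "clf__l1_ratio": [0.1, 0.5, 0.9],  # loose analogue of alpha
        "clf__C": [0.01, 0.1, 1.0, 10.0],  # loose analogue of 1/lambda
    }

    # Inner CV tunes the parameters, outer CV estimates performance.
    inner = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
    nested_auc = cross_val_score(inner, X_tr, y_tr, cv=5, scoring="roc_auc")

    # Refit the tuned pipeline on all training data, then check the hold-out.
    holdout_auc = inner.fit(X_tr, y_tr).score(X_te, y_te)
    print("nested CV AUC:", nested_auc.mean(), "+/-", nested_auc.std())
    print("hold-out AUC:", holdout_auc)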
    And if so, should the difference in my performance metrics (accuracy, AUC, etc.) between the CV output and my test-data output be minimal? If not, is that a sign of overfitting?
    It can be overfitting or underfitting. Overfitting is when your model has learned the training data too closely; underfitting is when it has not learned enough. To tell which one it is, examine your data first. Remember that I recommended you use x-Means to evaluate how your data is spread? That is why: it helps you figure out how different your training and testing datasets are.
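    Outside RapidMiner, one quick way to run that kind of check (a sketch using plain k-means as a stand-in for x-Means, with an arbitrary k) is to cluster training and test rows together and compare how each set spreads across the clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    def compare_spread(X_train, X_test, k=4, seed=0):
        """Cluster train+test jointly and report each set's cluster proportions."""
        X_all = np.vstack([X_train, X_test])
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X_all)
        train_labels, test_labels = labels[:len(X_train)], labels[len(X_train):]
        for c in range(k):
            print(f"cluster {c}: {np.mean(train_labels == c):.0%} of train "
                  f"vs {np.mean(test_labels == c):.0%} of test")

    If the proportions differ a lot, the hold-out simply does not look like the training data, and a gap between the CV estimate and the hold-out score may reflect that rather than over- or underfitting.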
    Should I trust one or the other?
    Use the second one to evaluate the first one, go back, retune, retest. Rinse and spin.
    Should I verify the absence of overfitting by comparing the 2 outputs?
    Yes. However, notice that the details are specific to each business case, and it's up to you to decide whether your model is good enough. If you are recommending medicines, you want your model to be as close to perfect as you can get it. If you are detecting fraud, it is OK to flag outliers and then check them manually once you've done the calculations.

    Regardless of the specifics, you are doing an excellent job. I made some corrections for you; please find the process attached.

    All the best,

    Rodrigo.
    User: "rfuentealba"
    New Altair Community Member
    Accepted Answer
    Updated by rfuentealba
    On another note:

    Now that my sensei @Telcontar120 mentions it, you have two files: one is filename75.csv and the other is filename25.csv, right? (say yes even if it's not the same name).

    If you did that because you want to replace the filename25.csv file with data coming from elsewhere, the process you wrote (and hence the process I sent you) is fine. If you did the split because your goal is to build the model and then perform a split validation on top of the cross validation, that's not really required. It's safe to treat Cross Validation as a better-designed Split Validation (until science says otherwise, which hasn't happened yet). In that case, your question:
    Should I trust one or the other?
    Be safe trusting the Cross Validation.

    In the case I sent, I assume that your testing data is new data that comes from outside your sample. A good case to do that is what happened to me in my oceanic research project:
    • Trained my model with a portion of valid data from 2015 and 2016.
    • Tested my model with a portion of valid data from 2015 and 2016, but a different chunk of it.
    • Then I have data between 2009 and 2014 that is outside of my sample and I want to score it.
    My question is: should I use a new performance validator? 
    • If what I want to validate is how my algorithm behaves, then no, one validation is enough.
    • If what I want to validate is how historical data has been scored, then yes, you may want to see whether your algorithm holds up against older data: one validator for the model and another for the old data after the model has been applied.
    • Everything else, no.
    So, rule of thumb: if what's important is the model, go with Cross Validation. If it's historical data that also gets scored, perform that validation yourself. If it's new data, don't validate anything: predictions on new data are just predictions, not known truth, and validation ALWAYS requires data you already know.
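    In code form, that rule of thumb might look like this sketch (a purely illustrative helper; the argument names are placeholders, not anything from the attached process):

    from sklearn.model_selection import cross_val_score

    def evaluate(model, X_sample, y_sample, X_old=None, y_old=None, X_new=None):
        """Apply the rule of thumb above (illustrative placeholder arguments)."""
        # The model itself: cross-validate on the labeled sample.
        print("model:", cross_val_score(model, X_sample, y_sample, cv=10).mean())
        model.fit(X_sample, y_sample)
        # Labeled historical data: a separate validation is possible.
        if X_old is not None:
            print("history:", model.score(X_old, y_old))
        # Genuinely new data: it can only be scored, never validated.
        if X_new is not None:
            return model.predict(X_new)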

    Hope this helps.