Model performance estimation

npapan69
npapan69 New Altair Community Member
edited November 2024 in Community Q&A
Dear All,
I have a relatively small dataset with 130 samples and 2150 attributes, and I want to build a classifier to predict 2 classes. Apparently I need to reduce the number of attributes to avoid overfitting, so I could use e.g. RFE-SVM to bring the number of attributes down to one tenth of my sample count, which is 13. I'm using a Logistic Regression model, and I need to do some fine-tuning of parameters like lambda and alpha. After reading the very informative blog post from Ingo, I would like some help with the practical implementation. May I kindly ask a more experienced member to check the following workflow? Can I trust this implementation, and in particular the performance estimates? Is it good practice to compare the performance from CV with that from a single hold-out set? And if so, should these numbers be more or less the same?
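To make the intended structure explicit, here is a rough sketch in Python/scikit-learn rather than my actual RapidMiner process; the grid values, fold counts and elastic-net mapping (C ~ 1/lambda, l1_ratio ~ alpha) are placeholder assumptions. The point is that feature selection and parameter tuning both sit inside the cross-validation, so the outer performance estimate is not contaminated by them.

```python
# Rough sketch (placeholder values): nested cross-validation with RFE-SVM
# feature selection and an elastic-net logistic regression.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in for the real data: 130 samples, 2150 attributes, 2 classes.
X, y = make_classification(n_samples=130, n_features=2150, n_informative=20,
                           random_state=42)

# Everything that learns from the data lives inside the pipeline, so the outer
# CV never sees information leaked by the feature selection or the tuning.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFE(LinearSVC(C=1.0, dual=False), n_features_to_select=13, step=0.2)),
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, max_iter=5000)),
])

# Inner CV tunes the elastic-net parameters (C ~ 1/lambda, l1_ratio ~ alpha).
param_grid = {"clf__C": [0.01, 0.1, 1.0], "clf__l1_ratio": [0.1, 0.5, 0.9]}
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, param_grid, cv=inner, scoring="roc_auc")

# Outer CV produces the performance estimate that can actually be reported.
# (Slow with 2150 attributes; shrink the grid or folds for a quick test.)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```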



Many thanks in advance,

npapan69


Answers

  • rfuentealba
    rfuentealba New Altair Community Member
    Hello, @npapan69

    I got a bit lost reading your description, so let me explain it back as if I were 5.

    You have 130 samples or horizontal rows. Each sample has 2150 attributes or vertical columns. And you want to classify them into 2 classes on a label. Am I right?

    In that case, holy moly... your data is EXTREMELY prone to overfitting and I would run in circles before doing something like that again (someday I'll tell you my story with @grafikbg). There is a massive number of possible combinations to fit them into your classes, and it is very unlikely that none of these 2150 attributes is correlated with another. If you want to continue, the first thing you should do is either remove the correlated attributes or select the most important ones.

    What has me confused is that you later explained that you can use SVM-RFE to reduce the attributes to a tenth of the sample count, so 13. Am I right? Or could the story be that you have 2150 samples or horizontal rows, and each sample has 130 attributes or vertical columns? I would still do the same: remove the correlated attributes and only then apply SVM-RFE, as you said. In fact, SVM-RFE doesn't behave well when there are too many correlated attributes, and 130 is still too large a number, so there may be correlations that are not identifiable at first sight with the naked eye.
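    If it helps, a rough Python/scikit-learn sketch of that two-step idea (drop near-zero-variance and highly correlated attributes first, then run RFE with a linear SVM) could look like the following; the thresholds and parameter values are only assumptions, not a recommendation.

```python
# Sketch (assumed thresholds): variance and correlation filtering, then SVM-RFE.
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.svm import LinearSVC

def filter_then_rfe(X: pd.DataFrame, y, n_keep=13, corr_cutoff=0.9):
    # 1) Drop near-zero-variance ("useless") attributes.
    vt = VarianceThreshold(threshold=1e-8)
    X_v = pd.DataFrame(vt.fit_transform(X), columns=X.columns[vt.get_support()])

    # 2) Drop one attribute of every highly correlated pair.
    corr = X_v.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_cutoff).any()]
    X_c = X_v.drop(columns=to_drop)

    # 3) Recursive feature elimination with a linear SVM on what remains.
    rfe = RFE(LinearSVC(C=1.0, dual=False), n_features_to_select=n_keep, step=0.1)
    rfe.fit(X_c, y)
    return list(X_c.columns[rfe.support_])
```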

    I would save the results of this operation in the repository before continuing with the logistic regression and whatever you want, but at least the data preparation phase would be ready at this point, and you can take advantage of the Optimize Parameters super-operator to do your fine tuning. Regarding your questions:

    Q: May I kindly ask a more experienced member to check the following workflow?
    A: I can't fire up RapidMiner Studio right now, but I promise I will take a look as soon as I finish with my massive thing (it's almost midnight here in Chile).

    Q: Can I trust this implementation and in particular the performance estimates?
    A: What you are planning to do seems correct, but I would still remove the correlated attributes before saying so for sure.

    Q: Is it good practice to compare the performance from CV with that from a single hold-out set? And if so, should these numbers be more or less the same?
    A: My level of English isn't that good, so let's see if I win the lottery with this explanation: can a Cross-Validation be trusted? Yes, but the amount of data needed to make it trustworthy depends on how variable your data is. Take your data after preparation and run a few X-Means clusterings to get a good grasp of your data's variability (or is it variety? I'm sleepy).
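    (Scikit-learn has no X-Means, so as a rough Python stand-in the sketch below simply tries a few k values for ordinary k-means and reports silhouette scores; the range of k is an arbitrary assumption.)

```python
# Rough stand-in for X-Means: try several k for plain k-means and compare
# silhouette scores to get a feel for how heterogeneous the prepared data is.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_overview(X, k_values=range(2, 7)):
    Xs = StandardScaler().fit_transform(X)
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
        print(f"k={k}: silhouette={silhouette_score(Xs, labels):.3f}")
```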

    I am keeping my promise of checking the process.

    Hope this helps,

    Rodrigo.
  • Maerkli
    Maerkli New Altair Community Member
    Rodrigo, it is brilliant!
    Maerkli
  • sgenzer
    sgenzer
    Altair Employee
    @Maerkli if you like pls use new "reaction" tags: Promote, Insightful, Like, Vote Up, Awesome, LOL :smile:
  • npapan69
    npapan69 New Altair Community Member
    Dear Rodrigo,

    Thank you for taking the time to respond to my post in such detail. Let me clarify: in the -omics sector (in which I'm working) it is very common to have far fewer samples (horizontal entries) than attributes or features (vertical entries). Therefore various methods are used to narrow things down to the few most informative features that will comprise the -omics signature. In the XML file you will see that, apart from RFE, I'm removing highly correlated features as well as features with zero or near-zero variance (useless features). As a rule of thumb, there should be at least 10 samples for every feature that finally contributes to the model. So given the 130 samples available, I'm not supposed to exceed 13 features after the feature reduction techniques are applied. Actually, after watching Ingo's webinar, I will try the evolutionary feature selection techniques, keeping the maximum number of features at 13. Now the most important part for me is how to validate the model. In our field external validation is considered the most reliable technique; however, it's not easy to get external data. So if I don't have external data, is it correct to start with a data split before doing anything else, keep 25% of the data as a hold-out test set, train and save my model, and afterwards test it with the hold-out set? Or should I forget about splitting and report (and trust) the CV results? Is there a way to do repeated cross-validation (100 times, for example)?

    Again many thanks for your time and greetings from Lisbon to the beautiful Chile.

    Nikos 
  • npapan69
    npapan69 New Altair Community Member
    Many thanks, Rodrigo, for taking the time to answer my post in such detail. In the -omics field in which I'm working it's very common to have few samples and far too many attributes, so feature selection methods are very important to reduce overfitting. In my feature selection approach (as you will see in my process) I start by removing useless and highly correlated features and then apply RFE-SVM. As a rule of thumb, the number of features that finally comprise the model (the signature) should not exceed 1/10 of the number of samples used to train it. Now the question is whether my approach, using a nested cross-validation operator to select features, train and fine-tune the model on 75% of the samples while testing the performance on the 25% hold-out set, is correct. And if so, should the difference in my performance metrics (accuracy, AUC, etc.) between the CV output and the test-set output be minimal? If not, is that a sign of overfitting? Should I trust one or the other? Should I verify the absence of overfitting by comparing the two outputs?

    Nikos
  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    Cross-validation is generally believed to be more accurate than a simple split validation. Split validation measures performance on only one random sample of the data, whereas cross-validation uses all the data for validation. Think about it this way: the hold-out from a split validation is simply one of the k folds of a cross-validation. It is inherently inferior to taking multiple holdouts and averaging their performance, which provides not only a point estimate but also a sense of the variance of the model's performance.

    It's different if you have a totally separate dataset (sometimes called an "out-of-sample" validation, perhaps from a different set of users, a different time period, etc.) that you want to test your model on after the initial construction. In that case your separate holdout may provide additional insight into your model's expected performance on new data. But in a straight comparison between split and cross-validation, you should prefer cross-validation.
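    To make that concrete (and to touch on the earlier question about repeating the cross-validation many times), here is a minimal Python/scikit-learn sketch rather than a RapidMiner process; the data, model and fold counts are placeholder assumptions. The hold-out yields a single number, while repeated cross-validation yields a mean plus a sense of the spread.

```python
# Sketch: one 75/25 hold-out estimate vs. repeated 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=130, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000)

# Split validation: a single random hold-out, i.e. one fold's worth of evidence.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# Repeated CV: every sample is held out once per repeat, so we get a mean and a
# spread (n_repeats=1 is plain 10-fold CV, n_repeats=10 repeats it ten times).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"hold-out accuracy:        {holdout_acc:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```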
  • rfuentealba
    rfuentealba New Altair Community Member
    edited November 2018 Answer ✓
    On another note:

    Now that my sensei @Telcontar120 mentions it, you have two files: one is filename75.csv and the other is filename25.csv, right? (say yes even if it's not the same name).

    If you did that because you want to replace the filename25.csv file with data coming from elsewhere, the process you wrote (and the process I sent you) is fine. If you did the split because your goal is to build that model and perform a split validation after a cross-validation, that's not really required. It's safe to treat Cross Validation as a better-thought-out Split Validation (until science says otherwise, but that hasn't happened). In that case, to your question:
    Should I trust one or the other?
    Be safe trusting the Cross Validation.

    In the case I sent, I assume that your testing data is new data that comes from outside your sample. A good example of when to do that is what happened to me in my oceanic research project:
    • Trained my model with a portion of valid data from 2015 and 2016.
    • Tested my model with a portion of valid data from 2015 and 2016, but different chunk of it.
    • Then I have data between 2009 and 2014 that is outside of my sample and I want to score it.
    My question is: should I use a new performance validator? 
    • If what I want to validate is how my algorithm behaves, then no, one validation is enough.
    • If what I want to validate is the way historical data has been scored, then yes, you might check whether your algorithm holds up against older data: one validator for the model and another for the old data after the model has been applied.
    • Everything else, no.
    So, rule of thumb: if what's important is the model, go with Cross Validation. If it's historical data that is also being scored, perform that validation yourself. If it's new data, don't validate anything: predictions on new data cannot be checked because you don't yet know the true labels, and validation ALWAYS comes from data you already know.

    Hope this helps.
  • npapan69
    npapan69 New Altair Community Member
    edited November 2018
    Again many thanks, Rodrigo, for your enlightening answer and the time devoted to correcting my process.
     
    Best wishes,
    Nikos
  • npapan69
    npapan69 New Altair Community Member
    Dear Rodrigo,
    I must admit that I couldn't find a way to evaluate the variance of the training and test data with X-Means. This is probably very basic, and I apologise for that, but the X-Means operator accepts only a single file as input, and I guess I would have to provide two files as inputs (75% training, 25% testing). Any workarounds?

    Many thanks
    Nikos
  • rfuentealba
    rfuentealba New Altair Community Member
    Hi @npapan69

    Sure, just use the Append operator to merge both files into a single one. Make sure that the two files have the same column names, and that's it.
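    (A rough Python/pandas equivalent of that Append step, assuming the filename75.csv / filename25.csv names mentioned earlier in the thread:)

```python
# Sketch: stack the 75% and 25% files into one table before clustering.
import pandas as pd

train = pd.read_csv("filename75.csv")
test = pd.read_csv("filename25.csv")

# pd.concat silently fills NaN for any column that exists in only one file,
# so check that the column sets match before appending.
assert set(train.columns) == set(test.columns), "column names differ"
combined = pd.concat([train, test], ignore_index=True)
combined.to_csv("combined.csv", index=False)
```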

    All the best,

    Rodrigo.