Split-Validation Issue

amitd New Altair Community Member
edited November 5 in Community Q&A
@sgenzer, I believe there may be some issue with the split-validation operator. The model output through the entire split-validation process does not correspond to the model with which the validation performance metrics are computed. 
I have attached an Excel spreadsheet to show the computations with a formula. The RMSE computed for the validation dataset (using the Performance operator) corresponds to the "ValidModel and ApplyModel" (in the Excel worksheet), which is one of the models output by the process when dissected with Remember/Recall operators and breakpoints. However, the RapidMiner process outputs a LinearRegression model that is the same as the "TrainModel" (in the Excel worksheet), whose RMSE does not match the one given by the Performance (Regression) operator. Why the discrepancy? Which is the correct model here?

I have reproduced this issue with multiple datasets and have documented it in a process using the sample Polynomial dataset. Any ideas on what may be going on here?

Best Answer

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    edited November 2020 Answer ✓
    Hi @avd,

    I will try to explain this difference (RM staff, please correct me if I'm wrong):
    There are indeed two models built in this process:

    The first one is built with 60% of the data and then tested on the remaining 40%: you called this model "ValidModel and ApplyModel". It is used to calculate the performance of this first model on unseen data, the performance given by the Performance (Regression) operator.

    However, the model delivered at the mod output port of the Split Validation operator (called "TrainModel" in your case) is built with 100% of the input example set: it is thus a different model from the first one described above. You can check this by reading the help section of the Split Validation operator:

    "Output Model :
    The training subprocess must return a model, which is trained on the input ExampleSet. Please note that the model built on the complete input ExampleSet is delivered from this port"

    To sum up, the "ValidModel and ApplyModel" is built with 60% of the input example set and the "TrainModel" is built with 100% of the input example set. This second model therefore has a different performance from the first, because it is a different model.

    Hope this helps,

    Regards,

    Lionel 

    NB: what you called "TrainModel" is sometimes called the "production model" (built with 100% of the data).
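The distinction above can be sketched outside RapidMiner. The following is a minimal, hypothetical Python illustration (the data and helper functions are made up, not from the original process): the reported RMSE describes a model fit on the 60% split, while the model delivered at the output port is refit on all rows and so generally has different coefficients.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def rmse(model, xs, ys):
    """Root mean squared error of a fitted line on the given points."""
    a, b = model
    return (sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

# Toy dataset: y = 2x + 1 with small alternating noise
data = [(x, 2 * x + 1 + (0.3 if x % 2 else -0.3)) for x in range(10)]
split = int(0.6 * len(data))            # 60/40 split, as in the process
train, test = data[:split], data[split:]

# "ValidModel and ApplyModel": fit on 60%, evaluated on the held-out 40%.
# This is the model behind the Performance (Regression) RMSE.
valid_model = fit_line([x for x, _ in train], [y for _, y in train])
validation_rmse = rmse(valid_model, [x for x, _ in test], [y for _, y in test])

# "TrainModel" (production model): refit on 100% of the data.
# This is what the Split Validation operator delivers at its output port.
production_model = fit_line([x for x, _ in data], [y for _, y in data])

# The two models have different coefficients, so the reported validation
# RMSE does not describe the delivered production model.
print(valid_model != production_model)  # True: the coefficients differ
```

The design choice mirrors common practice (e.g., refitting on all data after validation): the validation RMSE is an estimate of how a model trained this way performs on unseen data, and the production model then uses every available row.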

Answers

  • amitd
    amitd New Altair Community Member
    Thank you, that makes sense. Ideally, it would have been better to get direct access to the model fit on the training partition, since that is the model actually evaluated on the validation partition.