Auto Model feedback: a debate about how the models are trained

lionelderkrikor New Altair Community Member
edited November 5 in Altair RapidMiner
Dear all,

I would like to humbly, and in a friendly spirit, open a debate about how models are trained in RapidMiner's Auto Model.
Indeed, from what I understand of data science methodology, after the "best" model has been evaluated and selected, it should be (re)trained on the whole initial dataset before going into production.
This principle is also applied by the Split Validation operator: the model delivered by RapidMiner is trained on the whole input dataset (independently of the split ratio).
BUT this is not the case in Auto Model: the model(s) made available by RapidMiner's Auto Model are trained on only 60 % of the input dataset.
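As an illustration of this principle, here is a minimal sketch (using scikit-learn purely as a stand-in, NOT how Auto Model is implemented internally): the candidate model is evaluated on a hold-out split, and only once it has been selected is it refit on all rows for deployment.

# Minimal sketch of the "retrain on the full data before production" practice.
# scikit-learn is used only as an illustration; Auto Model's internals differ.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1) Evaluate on a hold-out split (Auto Model currently uses roughly 60/40).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)
candidate = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, candidate.predict(X_test)))

# 2) Once the model type and parameters are chosen, retrain on ALL rows
#    before deployment, so no labelled data is wasted.
production_model = RandomForestClassifier(random_state=42).fit(X, y)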
My first question is: is it always relevant to (re)train the selected model on the whole input dataset?
If yes, and if it is feasible, it may be a good idea to implement this principle in Auto Model. (I am thinking of users (non-data-scientists / beginners) who do not want to ask questions and who just want a model to put into production...)
But maybe, because of a computation-time constraint (or another technical reason), it is not feasible to (re)train all the models on the whole initial dataset?
In that case (not feasible), it may be a good idea to advise the user in Auto Model (in the documentation and/or in the overview of the results and/or in the "Model" menus of the different models) to (re)train the model manually, by generating the process of the selected model, before it goes into production...

To conclude, I hope this helps advance the debate, and I look forward to your opinions on these topics.

Have a nice day,

Regards,

Lionel

Comments

  • varunm1 (New Altair Community Member)
    Hello @lionelderkrikor

    Thanks for starting this discussion. I do have a question regarding it:

    1. Auto Model performs heavy (model-based) feature engineering before training the model. Right now this happens on only 60 % of the dataset, since the remaining 40 % is reserved for testing. My question: if we train the model again on the complete data, wouldn't the selected features be impacted by that extra 40 % of data, changing the model's dynamics?
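    For illustration, a small hypothetical sketch of this concern (scikit-learn, not Auto Model's actual feature engineering): a model-based feature selector fitted on 60 % of the rows can pick a different feature set than one fitted on all rows.

    # Hypothetical illustration: feature selection on 60 % of the data vs. on all of it.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X60, _, y60, _ = train_test_split(X, y, train_size=0.6, stratify=y, random_state=7)

    sel_60   = SelectFromModel(RandomForestClassifier(random_state=7)).fit(X60, y60)
    sel_full = SelectFromModel(RandomForestClassifier(random_state=7)).fit(X, y)

    # The two selected feature sets are usually not identical.
    print("selected on 60 %: ", set(sel_60.get_support(indices=True)))
    print("selected on 100 %:", set(sel_full.get_support(indices=True)))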

    Thanks
    Varun
  • sgenzer (Altair Employee)
    Answer ✓
    Thank you @lionelderkrikor for this. All really good points. I am, of course, passing this along to the Auto Model-er himself, @IngoRM, as he is the one best placed to participate in this discussion from our side. :smile:

    Scott

  • IngoRM (New Altair Community Member)
    edited June 2019 Answer ✓
    Not much to add here.  The practice of retraining the model on the complete data set is most useful for small data sets; for larger ones it typically matters less (a consequence of the learning curve, i.e. the fact that models sooner or later reach a plateau where more data no longer helps that much to get better models).
    But I do have a question: let's say AM would automatically generate scoring processes for you - which model would you actually prefer to be used?  One which is retrained on the complete data, but likely behaves and even looks different from the one shown in Auto Model, or the exact one you have seen in AM?
    The reason I ask is that I have been working on something like that and, following this best practice, actually retrained the model on the complete data.  But the first time I noticed that the model looks different (different coefficients for a GLM, for example, or, even more obvious, different structures for decision trees), I was no longer sure this is a good approach.  Would that not confuse many users?  So I actually ended up using the one shown by Auto Model, so that we do not run into the problem of users saying "why is the prediction X - according to the model you show in AM it should be Y?"
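    To make that concrete, here is a small hypothetical illustration (scikit-learn, not Auto Model's actual code) of how a model refit on the full data can end up structurally different from the one validated on 60 % of it:

    # Two decision trees with identical settings: one fit on the 60 % "shown" split,
    # one refit on 100 % of the rows. Their structures usually differ.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, _, y_train, _ = train_test_split(X, y, train_size=0.6, random_state=0)

    shown_model      = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # 60 %
    production_model = DecisionTreeClassifier(random_state=0).fit(X, y)              # 100 %

    # Different splits and thresholds, even though it is "the same" model configuration.
    print(export_text(shown_model))
    print(export_text(production_model))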
    You could argue: why not show the complete model in AM instead?  Because then the shown model and the predictions on the validation set would no longer match.
    We could show both, but is that really better?
    You see, it is not really as straightforward as I originally thought, so I would really appreciate your input on this...
    Thanks,
    Ingo
  • IngoRM (New Altair Community Member)
    edited June 2019
    Thanks, your thoughts are much appreciated!  I think we will probably agree that for putting a model into production, one which was trained on the full data would be best.  However, I don't think I did a good job when I posted my earlier question, since...

    If you used cross validation, this difference would go away.

    ...is actually not the case (more below).  The problem of potential user confusion would be the same.  In fact, my belief that many users (and please keep in mind that many / most users have much less experience than you and I) will be confused comes exactly from the fact that many people ask things like "which model is produced by cross validation".

    So let me try to explain better what problem I want to solve.  And please let's not drag ourselves into a "pro/contra" cross validation discussion - this is an independent topic.  In that spirit, I will try to create a better explanation by actually using a cross validation example for the issue ;)

    Let's say we take a sample of 100 rows from Titanic Training (in the Samples folder).  We then perform a cross-validation and deliver the model created on all data.  Here is the model:

    [Image: the resulting decision tree model, with one branch highlighted]

    I have highlighted one particular branch in the model.  If I now check the data set with all predictions, I get the following (sorted by gender and age):

    [Image: the data set with all predictions, sorted by gender and age, with the corresponding rows highlighted]

    If you compare the highlighted data points with the highlighted branch in the model, the predictions are "wrong".  We of course do understand why that is the case, so that's not my point / the problem I want to solve.

    I am just looking for a good way to help less experienced users understand the difference and where it is coming from - hence my question.  In the split validation as it is done by Auto Model, that confusion does not exist, since the created predictions and the model have a 1:1 relationship.  But with cross validation, or in general with any delivered model that is trained on a different / full data set, that will happen.
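    In scikit-learn terms (just an analogue of the RapidMiner process below, not a replacement for it), the mismatch looks like this: predictions collected fold by fold during cross-validation need not agree with the single model trained on all rows.

    # Fold-wise cross-validation predictions vs. predictions of the one delivered model.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)

    cv_preds   = cross_val_predict(tree, X, y, cv=10)  # each row predicted by a fold model
    full_preds = tree.fit(X, y).predict(X)             # the single model trained on all rows

    # Some rows get different labels from the two sources - exactly the
    # "prediction is X, but the shown model says Y" confusion described above.
    print("rows where the two disagree:", int(np.sum(cv_preds != full_preds)))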

    One direction I am exploring right now is to actually show TWO models in the AM results: the one which is used to calculate the error rates and predictions (like the one we have today in AM) and then a second one called "Production Model" or something like this.  Then at least the difference is explicit and can be documented.  I would hate to have a UI paradigm with implicit assumptions which most users would not understand - that is a surefire recipe for a bad user experience.

    Hope I did a better job this time explaining the potential confusion I see with the "full model" idea and please let me know if any of you have any additional thoughts...

    Best,
    Ingo

    P.S.: Here is the process I have used to create the model and predictions above:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-SNAPSHOT">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.4.000-SNAPSHOT" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.4.000-SNAPSHOT" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
          </operator>
          <operator activated="true" class="sample" compatibility="9.4.000-SNAPSHOT" expanded="true" height="82" name="Sample" width="90" x="179" y="34">
            <parameter key="sample" value="absolute"/>
            <parameter key="balance_data" value="false"/>
            <parameter key="sample_size" value="100"/>
            <parameter key="sample_ratio" value="0.1"/>
            <parameter key="sample_probability" value="0.1"/>
            <list key="sample_size_per_class"/>
            <list key="sample_ratio_per_class"/>
            <list key="sample_probability_per_class"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.4.000-SNAPSHOT" expanded="true" height="145" name="Validation" width="90" x="313" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="stratified sampling"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.4.000-SNAPSHOT" expanded="true" height="103" name="Decision Tree" width="90" x="45" y="34">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
              <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="158">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.4.000-SNAPSHOT" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance" compatibility="9.4.000-SNAPSHOT" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <connect from_op="Performance" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
              <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="158">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
            </process>
            <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
          </operator>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_op="Validation" to_port="example set"/>
          <connect from_op="Validation" from_port="model" to_port="result 1"/>
          <connect from_op="Validation" from_port="test result set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • Telcontar120 (New Altair Community Member)
    @IngoRM Thanks for the additional explanation.  I was focused on the difference between the delivered models in split vs cross validation output, not on the reported predictions from the test set.  And indeed you are perfectly correct (as you of course already know, but I state it for the benefit of anyone else reading this post, so it is clear that we are in agreement) that the predictions delivered from the "test" output of cross-validation would not be consistent with those generated from the full model delivered by that same operator.
    So, having re-focused the issue in the way you have described, I concur that the best solution is probably to present two models and their associated output in AM, one for validation purposes and one for production purposes.

  • sgenzer (Altair Employee)
    As the moderator, FYI: I'm just changing this to a 'discussion' instead of a 'question' for organizational purposes :smile:
  • IngoRM (New Altair Community Member)
    edited June 2019
    Thanks for the feedback.  I am also becoming more and more convinced that two models in the results are the way to go.  So I am happy to confirm that we WILL make the following changes in future releases of Auto Model (most likely starting with 9.4):
    1. We will also build the ready-to-deploy production model on the full data set.
    2. We will show two results: Model (which is the validated one) and Production Model (which is rebuilt on the complete data).  This also allows inspection and comparison of the two to identify potential robustness problems (see the sketch after this list).
    3. The processes will also be completely redesigned (see the sneak peek below), which should help with the understandability and management of those processes.
    4. Additional scoring processes using the production model and all necessary preprocessing steps will be automatically created for you if you save the results.
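    To make the distinction between the two delivered artifacts concrete, here is a hypothetical sketch of points 1 and 2 (scikit-learn, with made-up names such as AutoModelResult - not the actual Auto Model implementation):

    # Fit once for validation, then rebuild on the full data for production,
    # and return both so they can be inspected and compared.
    from dataclasses import dataclass
    from sklearn.base import clone
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    @dataclass
    class AutoModelResult:          # illustrative name only
        validated_model: object     # the model behind the shown predictions and error rates
        holdout_accuracy: float
        production_model: object    # rebuilt on all rows, the one to deploy

    def fit_with_production_model(estimator, X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.6, stratify=y, random_state=42)
        validated = clone(estimator).fit(X_tr, y_tr)
        holdout_acc = accuracy_score(y_te, validated.predict(X_te))
        production = clone(estimator).fit(X, y)   # point 1: built on the full data set
        return AutoModelResult(validated, holdout_acc, production)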
    We will, however, NOT change the validation method. I know that this is disappointing for some, but please see my comments on the rationale here: https://community.rapidminer.com/discussion/55220/cross-validation-or-split-validation

    The production model is independent of the validation, and the results of the hybrid approach (3) in the discussion linked above are absolutely comparable with those of cross-validation in almost all situations.  But the additional runtimes - and, potentially even more important, the lack of maintainability of processes of that complexity - make such a change infeasible.

    I gave this a lot of thought and experimented a lot, but I am 100 % convinced that this is the best way forward.  I hope you folks understand.

    Best,
    Ingo