"How to set Weights on Iris Data Set?"

geb_hart
geb_hart New Altair Community Member
edited November 2024 in Community Q&A
I tried some different algorithms on the Iris sample data set and got around 96% accuracy.
However, Auto Model gets 100%, and I think this comes from the use of weights!?
Unfortunately, I'm not able to reproduce this from the open process!

Can someone show me how to implement "Weight by Correlation" for polynominal data?

Thx,
Sebastian

Best Answer

  • IngoRM
    IngoRM New Altair Community Member
    edited December 2018 Answer ✓
    Hi,
Please note that the outer validation (including everything from model building, parameter optimization, feature engineering, etc.) is NOT a full k-fold cross-validation.  That would be prohibitive in terms of runtimes (it would blow up all runtimes by a factor of 5x to 10x, and our research has shown that users are not willing to wait for this).
Instead, in 9.1 we introduced a multiple hold-out set approach plus a robust average calculation (removing the outliers before building the average value).  While this is not as perfect as a full-blown cross-validation, it gets close and keeps runtimes at an acceptable level.  But you can still get lucky with some of the splits; this is, by the way, also true for cross-validation.  Specifically for Iris, however, the problem is that some of the data points with different classes actually overlap, which means that with a full cross-validation you will never reach 100%, while with a random split of 40% or so for the validation set you may actually end up where this overlap is not problematic.
If you want to learn more about the validation topic please also check out this white paper here:
We recently updated it a bit to better explain that, while cross-validation is great where feasible, the core aspect of correct validation is actually to validate ALL model optimizations.  We use the multiple hold-out set approach described above for this.
    Hope this helps,
    Ingo
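Ingo's description above (multiple hold-out splits plus a robust average that drops outliers before averaging) can be sketched in a few lines. This is only an illustration under assumptions: the exact outlier rule is not documented in this thread, so a simple trimmed mean (dropping the single lowest and highest score) stands in for it, and the seven accuracy values are made up.

```python
def robust_average(scores):
    """Average hold-out scores after dropping the extreme values.

    Assumption: the 'robust average' is approximated by a trimmed mean
    that removes the lowest and highest score; RapidMiner's actual
    outlier rule is not documented in this thread.
    """
    if len(scores) < 3:
        return sum(scores) / len(scores)
    trimmed = sorted(scores)[1:-1]  # drop min and max (the "outliers")
    return sum(trimmed) / len(trimmed)

# Seven hypothetical accuracies from seven random hold-out splits:
holdout_scores = [0.93, 0.97, 1.00, 0.95, 0.96, 0.80, 0.94]
print(robust_average(holdout_scores))  # the lucky 1.00 and unlucky 0.80 are ignored
```

The point of the trimming step is exactly what the answer describes: one lucky (or unlucky) split no longer dominates the reported performance.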

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @geb_hart,

I have a different hypothesis:
This 100% accuracy is due to "luck", from my point of view. Indeed, by default the Auto Model tool
performs a split validation with a training/test ratio of 0.8/0.2, so the performance is calculated on 20% of the dataset (30 examples for the Iris dataset). If the sampling is "lucky", all the test examples are correctly classified, which explains this performance.
To convince you, you can:
 - set another "local random seed" for the sampling of the training/test partition. For example, here are the results with local random seed = 1991:


 - decrease the training/test ratio in the Split Data (split of a validation set) operator. In this case, there are more test examples and there is less chance of having all of them correctly classified. Here are the results with a train/test ratio of 0.7/0.3 (and local random seed = 1992):


As a beta tester, I was surprised that Auto Model doesn't perform a cross-validation (instead of a split validation).
A priori, with a cross-validation, this kind of "perfect result" is impossible...
So is there any reason to perform a split validation instead of a cross-validation in this tool (maybe computation time..?).

And to conclude, the moral of this story is that "...in data science (and maybe more generally in life), there are those who are lucky and... the others..."

    I hope it helps,

    Regards,

    Lionel
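Lionel's "luck" argument can be quantified with simple arithmetic. Assuming (hypothetically) a model whose true accuracy is 96% and whose errors are independent across examples, the chance of a perfect score on a held-out set shrinks fast as the set grows; that is exactly why his two suggestions (a 0.7/0.3 split, or a cross-validation that scores every example) make a 100% result far less likely than the default 0.8/0.2 split on Iris.

```python
# Chance that a model with true accuracy p classifies ALL n held-out
# examples correctly, assuming independent errors (a simplification).
def p_perfect(p, n):
    return p ** n

p = 0.96  # hypothetical true accuracy, matching the ~96% from the thread

# 0.8/0.2 split on Iris (150 examples) -> 30 test examples
print(p_perfect(p, 30))   # roughly 0.29: a perfect score is quite likely

# 0.7/0.3 split -> 45 test examples
print(p_perfect(p, 45))   # roughly 0.16

# 10-fold cross-validation scores every one of the 150 examples
print(p_perfect(p, 150))  # roughly 0.002: a perfect score is very unlikely
```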
  • Telcontar120
    Telcontar120 New Altair Community Member
    I agree with everything @lionelderkrikor says about cross validation above.
    As far as your other question goes, there is no (sensible) way to use Weight by Correlation for polynominal data.  You could either look at another weighting approach (such as Weight by Information Gain) or you would have to transform all your data into binominal 0/1 flags and then calculate numerical correlations.  But in neither case will using Weight... operators improve your model performance to 100%!
  • sgenzer
    sgenzer
    Altair Employee
    cc'ing @IngoRM about Split vs Cross Validation in Auto Model.
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    RM Staff,

I just updated RapidMiner to the 9.1 official release and quickly tested the Auto Model tool:
I wanted to warmly welcome the introduction of cross-validation inside Auto Model, and I must admit that there is impressive work in this release.

    Regards,

    Lionel

  • geb_hart
    geb_hart New Altair Community Member
I also tested Auto Model on the Iris data in the 9.1 release... and still get 100% with 3 of seven models,

and I still believe that weights play a role in them, but it is not reproducible for me.

Please try it for yourself, and if you could rebuild the process for GLM or SVM, I would like to see it :)

    Thx for your Comments!!

  • M_Martin
    M_Martin New Altair Community Member
Colleagues: a very interesting conversation. I find it particularly interesting (and also somewhat worrisome) that RapidMiner's marketing experience seems to indicate that users have a low patience threshold - a bottle-of-wine conversation topic in and of itself.  Best wishes, Michael Martin