FeatureSelect/RemoveUselessFeatures and Holdout Sets

stereotaxon (New Altair Community Member)
edited November 5 in Community Q&A
Hi,

I have a large number of attributes and I want to run them through various feature selection routines.

My experiment is set up to do cross-validation on a training set and then I read in my holdout set and apply the model to those items.

This doesn't work when I run the feature selection algorithms, though. It appears that the holdout dataset must have the same structure as the training dataset. Since the variables are being deleted automatically, I don't know how to get my datasets to match.

Am I doing this wrong?

Is there a way to read in a train UNION holdout dataset, filter the holdout cases, do my model fitting, then filter the training cases and apply my model?
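
Something like this sketch, written in Python/pandas just to illustrate the idea (the file name and the is_holdout flag column are hypothetical, not part of my actual data):

    import pandas as pd

    # Combined dataset; a hypothetical flag column marks the holdout rows.
    combined = pd.read_csv("train_union_holdout.csv")

    # Filter out the holdout cases and fit on the training cases ...
    train = combined[~combined["is_holdout"]].drop(columns="is_holdout")

    # ... then filter out the training cases and apply the model.
    holdout = combined[combined["is_holdout"]].drop(columns="is_holdout")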

Thanks for your help.
Mike

Answers

  • TobiasMalbrecht (New Altair Community Member)
    Hi Mike,

    Unfortunately, I did not quite understand what exactly you want to do, or in which order you want to do the steps you mentioned. Do you want to run a cross-validation on the training set and then a validation on the holdout set? If so, why run the cross-validation at all? Or do you want to incorporate a cross-validation inside the feature selection to determine the best features, learn a model, and then test that model on the holdout set? Perhaps you can clear up my confusion by posting a sample process XML.

    Anyway, regarding the mismatch in data structure between the training set and the holdout set when feature selection is involved during training: every feature selection scheme outputs an AttributeWeights object, which records which attributes were selected and which were deselected. You may store this AttributeWeights object, load it afterwards, and use the AttributeWeightsSelection operator to select the features of the holdout set according to the specification in the AttributeWeights object. That way both sets end up with the same structure.
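
    The same idea expressed as a small Python/pandas sketch, purely as an analogy to the RapidMiner operators (the file names and the selected attribute list here are made up):

        import pandas as pd

        train = pd.read_csv("train.csv")      # hypothetical file names
        holdout = pd.read_csv("holdout.csv")

        # Suppose feature selection on the training set kept these attributes;
        # this list plays the role of the stored AttributeWeights object.
        selected = ["Var3", "label"]

        # Apply the *same* selection, by attribute name, to both sets so
        # they end up with an identical structure before model application.
        train_reduced = train[selected]
        holdout_reduced = holdout[selected]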

    Hope that helps. If not, please try to explain your procedure in a little more detail or post your process XML.
    Regards,
    Tobias
  • Legacy User (New Altair Community Member)
    Hi Tobias,

    Thanks for your help. My goal in all of this is that I (now) just want to run feature selection, fit a model to the reduced dataset, apply that model to a holdout set, and write the predictions to a file. It's not working, though. The problem I'm having is that RapidMiner is using the wrong variables when applying a model after FeatureSelection. I suspect it's applying attributes by position rather than by name.

    For example, after feature selection I have three variables. If I apply a linear regression model that uses all of them, the model applier works:

             Var1     Var2     Var3   Intercept
    Value   0.319    0.406   19.104
    Coeff   0.868   -0.824    0.722
    V*C     0.277   -0.335   13.787     -9.924  =  3.805  <-- prediction

    However, when I use a learner such as W-SimpleLinearRegression, which produces a model with only one variable, my predictions are incorrect.

    For example, the input values are the same, but the coefficient (now on Var3) and the intercept have changed, so I should get a prediction of 6.517:

             Var1     Var2     Var3   Intercept
    Value   0.319    0.406   19.104
    Coeff                     0.930
    V*C                      17.767    -11.250  =  6.517  <-- correct prediction


    but that's not what I'm getting. It seems that RM is using Var1's value of 0.319 instead of Var3's value of 19.104 when applying the model:
             Var1     Var2     Var3   Intercept
    Value   0.319    0.406   19.104
    Coeff   0.930
    V*C     0.297                     -11.250  = -10.953  <-- what I'm getting

    So, to summarize: FeatureSelection seems to be confusing the model applier, and me. It doesn't seem to be using the right variable when making predictions.
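
    To make the suspected bug concrete, here is a tiny Python sketch (not RapidMiner code) that reproduces both numbers above, using the one-variable model coeff = 0.930 on Var3 with intercept -11.250:

        # One-variable model from W-SimpleLinearRegression: 0.930 * Var3 - 11.250
        coeff, intercept = 0.930, -11.250
        row = {"Var1": 0.319, "Var2": 0.406, "Var3": 19.104}

        # Applying the model by attribute name uses Var3, as intended:
        by_name = coeff * row["Var3"] + intercept
        print(round(by_name, 3))      # 6.517   <-- the prediction I expect

        # Applying it by position grabs the first attribute, Var1:
        by_position = coeff * list(row.values())[0] + intercept
        print(round(by_position, 3))  # -10.953 <-- the prediction I get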

    What am I doing wrong?

    Thanks,
    Mike
  • IngoRM (New Altair Community Member)
    Does this also occur with the LinearRegression learner, i.e. the non-Weka version?

    Cheers,
    Ingo