How do I partition my data to create testing and training sets?

georgebezerra83
georgebezerra83 New Altair Community Member
edited November 5 in Community Q&A

Hi RapidMiner,

 

I'm unsure which operators I should be using, but I was thinking it would be Split Data, Split, or maybe Wrapper-X-Validation.

 

Thanks!


Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

The simplest way is Cross Validation: it takes the entire data set and partitions it into training and testing sets automatically. A more manual way (and not a very good one) is to use the Split operator. There you can set the percentage of training and testing data from a single data source.
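Outside of RapidMiner, the partitioning that Cross Validation performs can be sketched in plain Python. This is a minimal illustration of k-fold splitting, not the operator's actual implementation; the function name and defaults are made up for the example:

```python
import random

def k_fold_indices(n_rows, k=10, seed=42):
    """Partition row indices into k folds, the way a cross-validation
    operator would: each fold serves once as the test set while the
    remaining k-1 folds form the training set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Every row appears in exactly one test fold across the k iterations,
# so each row is scored exactly once on a model that never saw it.
for train, test in k_fold_indices(100, k=10):
    assert len(train) + len(test) == 100
```

This is why cross-validation gives a more stable performance estimate than a single split: every row contributes to both training and testing across the k iterations.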

  • georgebezerra83
    georgebezerra83 New Altair Community Member
    Answer ✓

Thank you for the response, Thomas. I tried using the Cross Validation operator, but my data doesn't have a label attribute since the values are integers. Would the Split operator be more useful in this case to get the training and testing sets? I need to split the data into random, equal-sized train and test sets.

     

    Best,

    George Bezerra
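The random, equal-sized split George is asking for is conceptually just a shuffle followed by cutting the data in half. As a plain-Python sketch of the idea (the helper name is hypothetical, not a RapidMiner API):

```python
import random

def equal_split(rows, seed=42):
    """Randomly split rows into two equal-sized halves, e.g. a 50/50
    train/test partition (what a split with ratio 0.5 would produce)."""
    shuffled = rows[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

train, test = equal_split(list(range(10)))
# Both halves have 5 rows and together contain every original row.
assert len(train) == len(test) == 5
assert sorted(train + test) == list(range(10))
```

Fixing the seed makes the partition reproducible, which is useful when you want to compare models on the exact same test set.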

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

You are going to need to create a label at some point if you intend to do modeling in RapidMiner, which seems likely since you say that you want a train and test set. You use the "Set Role" operator for that. It doesn't matter that the label is an integer, since many algorithms can predict numerical labels.

Once the label is set, you can use Cross Validation. You could also use Split Validation, although as @Thomas_Ott already said, Cross Validation is superior for many reasons. The Split operator literally just splits your dataset into multiple chunks; it does not directly have anything to do with training or testing.

  • georgebezerra83
    georgebezerra83 New Altair Community Member

    Thank you Brian, got it to work!

  • nfmohamm
    nfmohamm New Altair Community Member

We use a sample dataset and do all the necessary work before sending the data to cross-validation. In cross-validation, the data is split by RapidMiner itself, and the final result is produced.

     

My confusion is more about where we can plug new data into this model once it is ready, and how. Should I use a new Read CSV operator? I think so, as I have a new dataset just for testing. I am unclear where to plug in this data so that it will go into the model I created and show the predictions.

  • Telcontar120
    Telcontar120 New Altair Community Member

    First you need to save the model (use the Store operator) in your repository, and then you can simply use Apply Model to score a new set of data in the future.  Remember that all the same data prep would need to be done on any new records to be scored as was done for the original model development sample!
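The Store / Apply Model workflow described above follows the general train-once, score-later pattern. A plain-Python sketch of that pattern (pickle stands in for the RapidMiner repository here, and the "model" is a made-up toy that just predicts the training mean):

```python
import os
import pickle
import tempfile

# 'Store': persist the fitted parameters (a toy model: the training mean).
training_values = [1.0, 2.0, 3.0]
model = {"mean": sum(training_values) / len(training_values)}
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# 'Apply Model': later, load the stored model and score new rows.
# The new rows must get exactly the same preparation the training data got.
with open(path, "rb") as f:
    loaded = pickle.load(f)
new_rows = [[10], [20]]
predictions = [loaded["mean"] for _ in new_rows]
print(predictions)  # -> [2.0, 2.0]
```

The key point carries over to RapidMiner: the scoring process reads new data (e.g. with a second Read CSV), applies the same preprocessing, retrieves the stored model, and feeds both into Apply Model.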