Error when applying a trained model to a new unlabeled data set

Stann
Stann New Altair Community Member
edited November 5 in Community Q&A
I want to apply a Naive Bayes model to a new (unlabeled) data set. The model has already been trained and tested via cross-validation. However, when I try to apply the model to a brand new data set, I get an error message.

Here is an overview of my process and the error I get:


The "Retrieve aggregate" is the new (unlabeled) data set, which I want to predict using my trained model.

"Process Documents from Data" contains a "Tokenize" operator.

The subprocesses within the Cross Validation operator are:


I am new to RapidMiner and I have no clue as to why I get this error :(
I would greatly appreciate your help as I need to carry on with my research :)

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @Stann,

    The attributes have to be exactly the same in your training set and in your unlabeled test set.
    So you have to apply exactly the same preprocessing steps to your unlabeled test set (that is, the
    Nominal to Text and Process Documents from Data operators). Currently you are feeding the raw test set to your model...

    Hope this helps,

    Regards,

    Lionel 
  • Caperez
    Caperez Altair Community Member
    Hi @Stann,

    It seems that the names of the attributes (columns) in your training dataset and your test dataset aren't the same.
    Please verify the attribute names and types in your test dataset.

    Best
  • Stann
    Stann New Altair Community Member
    @lionelderkrikor, @ceaperez thank you for your quick response.

    Having the exact same attributes would be impossible, as each attribute is a token (word) that appeared in the initial text documents. Since the new (unlabeled) data set contains different text documents than the training set, the attributes would always differ, because the text documents in the new data set consist of "new" tokens.

    Having said that, is there still a way to apply the model to a new (unlabeled) set?
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓
    @Stann,

    Yes, it is possible:

    As said, apply the same preprocessing steps in your test set "branch",

    and connect the word list output (wor) of the Process Documents from Data operator in your training "branch" to the word list input (wor) of the Process Documents from Data operator in your test set branch. That way the test documents are turned into the same attributes as the training documents.

    Regards,

    Lionel
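
The idea behind the accepted answer can be sketched outside RapidMiner. The following is a minimal Python illustration, not RapidMiner's implementation: the function names (`build_word_list`, `vectorize`, `train_naive_bayes`, `predict`) and the toy documents are hypothetical and invented for this example. It assumes a simple bag-of-words count representation. The word list is built once from the training documents and then reused to vectorize the new document, so tokens that never appeared in training are dropped and the new data ends up with exactly the attributes the model was trained on.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters (a rough stand-in for a Tokenize step)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_word_list(docs):
    """Collect the sorted vocabulary of the TRAINING documents only."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def vectorize(doc, word_list):
    """Count each word-list term in doc; tokens outside the word list are ignored."""
    counts = Counter(tokenize(doc))
    return [counts[w] for w in word_list]

def train_naive_bayes(vectors, labels):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    n_features = len(vectors[0])
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        totals = [sum(col) for col in zip(*rows)]   # per-term counts in this class
        denom = sum(totals) + n_features
        log_prior = math.log(len(rows) / len(vectors))
        log_likelihoods = [math.log((t + 1) / denom) for t in totals]
        model[label] = (log_prior, log_likelihoods)
    return model

def predict(model, vector):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_likelihoods = model[label]
        return log_prior + sum(c * ll for c, ll in zip(vector, log_likelihoods))
    return max(model, key=score)

# Toy labeled training data (invented for illustration).
train_docs = [
    "great product works well",
    "terrible product broke fast",
    "works great love it",
    "broke after a day terrible",
]
train_labels = ["pos", "neg", "pos", "neg"]

# Build the word list from the training branch and train the model on it.
word_list = build_word_list(train_docs)
model = train_naive_bayes([vectorize(d, word_list) for d in train_docs], train_labels)

# The new document contains unseen tokens ("absolutely", "perfectly").
# Reusing the training word list keeps the attribute set identical.
new_doc = "absolutely great, works perfectly"
new_vector = vectorize(new_doc, word_list)
prediction = predict(model, new_vector)

print(len(new_vector) == len(word_list))  # same number of attributes as in training
print(prediction)
```

If `vectorize` were instead given a word list built from the new document, the resulting vector would have different columns from the ones the model was trained on, which is the kind of attribute mismatch discussed in this thread.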