Error when applying a trained model to a new unlabeled data set
Stann
New Altair Community Member
I want to apply a Naive Bayes model to a new (unlabeled) data set. The model has already been trained and tested via cross-validation. However, when I try to apply the model to a brand-new data set, I get an error message.
Here is an overview of my process and the error I get:
The "Retrieve aggregate" is the new (unlabeled) data set, which I want to predict using my trained model.
"Process Documents from Data" contains a "Tokenize" operator.
The subprocesses within the Cross Validation operator are:
I am new to RapidMiner and I have no clue as to why I get this error.
I would greatly appreciate your help, as I need to carry on with my research.
Best Answer
@Stann,
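Yes, it is possible:
As I said, apply the same preprocessing steps in your test set "branch", and connect the word list output (wor) of the Process Documents from Data operator in your training "branch" to the word list input (wor) of the Process Documents from Data operator in your test set "branch". The test set is then vectorized with the training word list, so it gets exactly the attributes the model was trained on; tokens that never occurred in the training data are simply ignored.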
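The idea of sharing the word list can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions, not what the operators do internally: the documents, labels, and helper names (`tokenize`, `fit_word_list`, `vectorize`, `train_nb`, `predict`) are made up, and the classifier is a bare-bones multinomial Naive Bayes with Laplace smoothing. The point is that the word list is built once from the training documents and reused for the new document, so both have the same attributes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens (hypothetical stand-in for the Tokenize operator)."""
    return re.findall(r"[a-z]+", text.lower())

def fit_word_list(docs):
    """Build the word list from the TRAINING documents only."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def vectorize(doc, word_list):
    """Count tokens against a fixed word list; tokens outside it are ignored."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in word_list]

def train_nb(vectors, labels):
    """Multinomial Naive Bayes: per class, a log prior and smoothed log likelihoods."""
    n_features = len(vectors[0])
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_features  # Laplace (add-one) smoothing
        model[label] = (
            math.log(len(rows) / len(labels)),
            [math.log((t + 1) / denom) for t in totals],
        )
    return model

def predict(model, vector):
    """Return the class with the highest log posterior score."""
    def score(label):
        prior, log_probs = model[label]
        return prior + sum(c * lp for c, lp in zip(vector, log_probs))
    return max(model, key=score)

# Made-up training data for illustration.
train_docs = ["great product", "really great service",
              "terrible product", "terrible slow service"]
train_labels = ["pos", "pos", "neg", "neg"]

word_list = fit_word_list(train_docs)
model = train_nb([vectorize(d, word_list) for d in train_docs], train_labels)

# The new document contains "support", which is not in the word list.
new_vector = vectorize("great support", word_list)
prediction = predict(model, new_vector)
print(len(new_vector) == len(word_list), prediction)
```

The new document's vector has one entry per training word, and the unseen token "support" contributes nothing, so the model can score it without any attribute mismatch.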
Regards,
Lionel
Answers
Hi @Stann,
The attributes have to be exactly the same in your training set and in your unlabeled test set.
Thus you have to apply exactly the same preprocessing steps to your unlabeled test set (that is, the Nominal to Text and Process Documents from Data operators). Currently you are applying the raw test set to your model...
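To make the mismatch concrete, here is a small Python sketch (not RapidMiner code; the documents and the helpers `tokenize` and `build_attributes` are invented for illustration). It derives the word attributes of a training set and of a new set independently and compares them, which is roughly the situation when each data set gets its own vocabulary.

```python
import re

def tokenize(text):
    """Lowercase word tokens (hypothetical stand-in for the Tokenize operator)."""
    return re.findall(r"[a-z]+", text.lower())

def build_attributes(docs):
    """Attribute names a document set would yield if vectorized on its own."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

# Made-up documents for illustration.
train_docs = ["great product", "terrible service"]
new_docs = ["great support", "slow delivery"]

train_attrs = build_attributes(train_docs)
new_attrs = build_attributes(new_docs)

# Attributes the model expects but the new set lacks, and the reverse.
missing_in_new = sorted(set(train_attrs) - set(new_attrs))
unknown_to_model = sorted(set(new_attrs) - set(train_attrs))

print("expected by model, missing in new set:", missing_in_new)
print("present in new set, unknown to model:", unknown_to_model)
```

Only "great" is shared here; every other attribute exists on one side only, so a model trained on the first set cannot be applied directly to the second.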
Hope this helps,
Regards,
Lionel
@lionelderkrikor, @ceaperez thank you for your quick response.
Having the exact same attributes would be impossible, as each attribute is a token (word) that appeared in the initial text documents. Since the new (unlabeled) data set contains different text documents than the training set, the attributes would always differ, because the text documents in the new data set consist of "new" tokens.
Having said that, is there still a way to apply the model to a new (unlabeled) set?