Apply Model: Testing & Training Sets Differ

User: "Hyram"
New Altair Community Member
Updated by Jocelyn
Hi
I am using Sentiment140 as my training and testing data, which is already split into two sets. I run training, cross-validation and testing separately: training and cross-validation on the training set, testing on the test set. The problem is that after text preprocessing, the features in the test set don't align with those of the training set, so I can't apply the trained model. The end product of my text preprocessing is a matrix in which each text is an example and each feature is a term frequency, so the terms (and therefore the columns) differ between the training and test sets.
Do I somehow merge both sets so that the features are aligned, with TF = 0 for terms missing from one set?
Thanks

    User: "Telcontar120"
    New Altair Community Member
    Accepted Answer
    The word list elements will be constrained but the TF-IDF values will be recalculated on the new sample in Process Documents.
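To illustrate the idea outside RapidMiner, here is a minimal Python sketch. It assumes a whitespace tokenizer and a simplified TF-IDF formula (not necessarily the one Process Documents uses), and the function names `build_word_list` and `tfidf_matrix` are hypothetical. The word list is built on the training texts only; the test texts are then vectorized against that same list, so the columns line up while the TF-IDF values are recomputed on the test sample.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: simple lowercase + whitespace tokenization.
    return text.lower().split()

def build_word_list(docs):
    """Sorted vocabulary of every token seen in the given documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def tfidf_matrix(docs, word_list):
    """TF-IDF rows restricted to word_list; IDF is computed on `docs` itself."""
    n_docs = len(docs)
    counts = [Counter(tokenize(doc)) for doc in docs]
    # Document frequency of each word-list term within this sample.
    df = {w: sum(1 for c in counts if w in c) for w in word_list}
    rows = []
    for c in counts:
        total = sum(c.values()) or 1
        row = []
        for w in word_list:
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append((c[w] / total) * idf)  # Counter gives 0 for absent terms
        rows.append(row)
    return rows

train_docs = ["good movie", "bad movie", "good day"]
test_docs = ["good film", "bad bad day"]

word_list = build_word_list(train_docs)          # built on training only
train_matrix = tfidf_matrix(train_docs, word_list)
test_matrix = tfidf_matrix(test_docs, word_list)  # same columns, new values
```

Terms that appear only in the test texts (here "film") are dropped, and training terms absent from the test texts (here "movie") get a value of 0, which answers the original "TF = 0" question without merging the two sets.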
    User: "jacobcybulski"
    New Altair Community Member
    Accepted Answer
    Be careful here: if your text processing in training uses pruning, then in testing you must not only use your saved word list to constrain the terms in the TF-IDF vector, as suggested by @Telcontar120, but also switch off pruning. Otherwise the word list may be shrunk by the pruning step, making the two sets incompatible when you apply the model to the test data.
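The effect of pruning on the test side can be sketched in a few lines of Python. This is only an illustration under assumptions: pruning is modeled as "keep terms that occur in at least `min_docs` documents", and `prune_word_list` is a hypothetical helper, not a RapidMiner function.

```python
from collections import Counter

def prune_word_list(docs, min_docs):
    """Keep tokens that appear in at least min_docs documents."""
    df = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    return sorted(w for w, n in df.items() if n >= min_docs)

train_docs = ["good movie", "bad movie", "good day", "bad day"]
test_docs = ["good film", "bad day"]

# Training branch: pruning is applied and the resulting word list is saved.
train_words = prune_word_list(train_docs, min_docs=2)

# Test branch with pruning left on: terms are pruned again on the smaller sample.
test_words_pruned = [w for w in prune_word_list(test_docs, min_docs=2)
                     if w in train_words]

# Test branch with pruning off: the saved word list is used unchanged.
test_words = list(train_words)

# Attributes the model expects but the pruned test set no longer has.
missing = sorted(set(train_words) - set(test_words_pruned))
```

In this toy case every term occurs only once in the test texts, so pruning removes all of them and the model would find none of the attributes it was trained on.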
    User: "jacobcybulski"
    New Altair Community Member
    Accepted Answer
    I have now noticed that you reduce dimensionality with a weight-and-select method. In that case, pass the weights from training to your testing branch. There you do not need the weighting operator; just apply the selection using the weights from training.
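A rough Python sketch of that pattern follows. The weighting scheme here (absolute difference of class means) is a stand-in chosen for brevity, not the method used in the original process, and `weight_attributes` and `select_by_weights` are hypothetical names. The point is only that weights are computed once, on the training data, and then reused as-is on the test data.

```python
def weight_attributes(rows, labels, names):
    """Stand-in weighting: absolute difference of class means per attribute."""
    weights = {}
    for j, name in enumerate(names):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        weights[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return weights

def select_by_weights(rows, names, weights, k):
    """Keep the k highest-weighted attributes, using weights given by the caller."""
    keep = sorted(names, key=lambda n: weights[n], reverse=True)[:k]
    idx = [names.index(n) for n in keep]
    return keep, [[r[i] for i in idx] for r in rows]

names = ["bad", "day", "good", "movie"]
train_rows = [[0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
train_labels = [1, 0, 1, 0]
test_rows = [[0, 0, 1, 0], [2, 1, 0, 0]]

# Training branch: compute the weights, then select.
weights = weight_attributes(train_rows, train_labels, names)
train_keep, train_selected = select_by_weights(train_rows, names, weights, k=2)

# Testing branch: no weighting step, only selection with the training weights.
test_keep, test_selected = select_by_weights(test_rows, names, weights, k=2)
```

Because the test branch never recomputes weights, it always ends up with the same attributes as the training branch, regardless of what the test labels or values look like.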