"[SOLVED] Using tf/idf for text stream classification"
Hi gurus,
I am doing text stream classification and I have noticed an issue that confuses me.
In classic classification problems, I usually proceed as follows:
1- First, I divide the data into a training set and a test set, apply tf/idf weighting to the training set, and train a model on the weighted training set.
2- Then I apply tf/idf weighting again, this time to the test set, and apply the model to the weighted test set.
But right here, in the second step, a problem arises from the tf/idf weighting:
In the first experiment, I applied tf/idf to a test set of 10 instances, and in the second experiment I applied it to a test set of 20 instances. As you might guess, the resulting weights differ significantly, because the idf part depends on which documents are in the set. Consequently, applying the model to the first test set leads to different results than applying it to the second!
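A minimal sketch of the effect, in plain Python. The `tfidf` helper, its simple tf * log(N/df) weighting, and the sample documents are assumptions made up for illustration, not the exact setup from my experiments. It weights the same document once as part of a small batch and once as part of a larger one:

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Hypothetical helper: IDF is computed from `docs` alone, which is
    what happens when tf/idf is re-applied to each test batch.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Made-up sample data: the larger batch contains the smaller one.
small_batch = ["cheap flights today", "stock market news"]
large_batch = small_batch + ["cheap hotel deals", "cheap car rental"]

# Weights of the SAME first document, computed in two different batches.
w_small = tfidf(small_batch)[0]
w_large = tfidf(large_batch)[0]

for term in w_small:
    print(term, round(w_small[term], 4), round(w_large[term], 4))
```

In the larger batch "cheap" appears in three of four documents, so its weight drops, while the weights of "flights" and "today" rise. The document itself has not changed, yet the model would see a different feature vector for it.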
Could anyone please explain what the right solution is? In other words, when dealing with data streams, should I use tf/idf at all? If so, how many instances should each of my test sets contain? (With text streams, the test data are not all available at once, so I have to feed them to the model gradually.)