[SOLVED] Apply the IDF of the training set to the test set

Question

Hi,

I am trying to use RapidMiner (RM) to solve a document classification problem. I use two separate Process Documents from Files operators: one for the training documents and one for the test documents. The problem is that each operator computes TF-IDF from its own document set only. In text classification, the TF-IDF of the test documents should be computed using the IDF from the training documents.

For instance, if we only want to classify one document (using the same structure), the TF-IDF for that document should be based on the term occurrences in the document and on the IDF previously computed from the training collection. In the same example, if the IDF is based on the test document alone, all the features become 0: every term appears in every document (there is only one) of the test collection, so each IDF is log(1/1) = 0.
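To make the one-document case concrete, here is a minimal sketch in plain Python. It assumes the textbook definition IDF = log(N / df); RapidMiner's exact weighting and normalisation may differ, and the toy documents and the helper name `idf` are made up for illustration.

```python
import math
from collections import Counter

def idf(docs):
    """Textbook IDF, log(N / df), for every term in a tokenized collection."""
    n_docs = len(docs)
    # Count each term once per document to get its document frequency.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical toy collections (already tokenized).
train_docs = [["red", "car"], ["blue", "car"], ["blue", "sky"]]
test_docs = [["blue", "car"]]

idf_train = idf(train_docs)
idf_test = idf(test_docs)

# With a single test document, every term has df = N = 1, so IDF = log(1) = 0.
print(idf_test)
# The training collection gives non-zero weights for the same terms.
print(idf_train)
```

Any TF value multiplied by the test-only IDF is 0, which is why the features vanish.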

The only option I can think of is to store the IDF values of the training terms and then multiply them by the TF of the test documents, but that sounds a bit like a hack. Is there an operator or a parameter I am missing?
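For reference, the workaround described above could be sketched like this outside RapidMiner. It is only an illustration under assumptions: IDF = log(N / df), TF as relative frequency within the document, and test terms unseen in training simply dropped. The function names `fit_idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Compute and store IDF = log(N / df) from the training collection only."""
    n_docs = len(train_docs)
    df = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc, idf):
    """Weight one document: relative TF from the document, IDF from training.

    Terms that never occurred in training have no IDF and are dropped.
    """
    counts = Counter(doc)
    total = len(doc)
    return {
        term: (count / total) * idf[term]
        for term, count in counts.items()
        if term in idf
    }

# Hypothetical toy data (already tokenized).
train_docs = [["red", "car"], ["blue", "car"], ["blue", "sky"]]
idf_train = fit_idf(train_docs)

# A single test document is weighted with the stored training IDF.
vector = tfidf(["blue", "car", "boat"], idf_train)
print(vector)
```

The stored `idf_train` plays the role of the word list passed from the training step to the test step.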

Regards,

miguel · Answer

I apply the feature selection to both sets, based on the chi-squared values of the training collection, as is usually done in text classification (TC). However, I see that I can simplify this.
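As a rough sketch of what I mean (not the RapidMiner process itself): the chi-squared scores are computed on the training documents only, and the resulting term set is then used to filter both collections. The example assumes binary term presence, a two-class problem, and made-up data; `chi_squared`, `select_terms` and `project` are hypothetical helper names.

```python
def chi_squared(docs, labels, term, positive):
    """Chi-squared statistic of a 2x2 table: term presence vs. class membership."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_class = label == positive
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A degenerate table (an empty row or column) carries no information.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(docs, labels, positive, k):
    """Keep the k terms with the highest chi-squared score on the training data."""
    vocab = sorted({term for doc in docs for term in doc})
    ranked = sorted(
        vocab,
        key=lambda term: chi_squared(docs, labels, term, positive),
        reverse=True,
    )
    return set(ranked[:k])

def project(doc, selected):
    """Filter any document (training or test) down to the selected terms."""
    return [term for term in doc if term in selected]

# Hypothetical toy data.
train_docs = [
    ["goal", "match", "team"],
    ["goal", "team", "win"],
    ["vote", "party", "win"],
    ["vote", "law", "party"],
]
train_labels = ["sport", "sport", "politics", "politics"]
test_docs = [["team", "vote", "weather"]]

selected = select_terms(train_docs, train_labels, "sport", k=4)
train_filtered = [project(doc, selected) for doc in train_docs]
test_filtered = [project(doc, selected) for doc in test_docs]
print(selected, test_filtered)
```

The test documents never influence which terms are kept; they are only filtered with the set chosen on the training side.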

About the example: it clearly shows that the training IDF is taken into account when the word lists are connected. I tried the same experiment with my data a couple of days ago, but all the features had a value of zero. Clearly the mistake was somewhere else; I should have been more careful.

Thanks for the help :)

MariusHelf · Answer

Hi, why do you apply the feature selection to the test set and not to the training set? The TF-IDF calculation on the test set takes the word vector of the training set into account if you connect the wor outputs. Consider this process, especially the value of blu with wor connected or disconnected:

miguel · Answer

That will specify the terms used in the training set and filter the test terms in advance. In my case I do feature selection based on chi-squared afterwards, so it is not needed at this stage. However, a follow-up question: if we connect the training word list, will the test set use it only as a set of words (for filtering), or will it also know which terms occurred in which (or at least in how many) documents?

In the experiments I am running at the moment, even when the word list is plugged in, it seems to be used only as a filter. Therefore, the IDF is still computed from the test set. Good point though :)

Thanks a lot for the rapid response,