"[SOLVED] Using tf/idf for text stream classification"

Hi gurus,

I am doing text stream classification and have noticed an issue that confuses me.
In classic classification problems, I usually proceed as follows:

1- First, I divide the data into a train set and a test set, apply tf/idf weighting to the train set, and train a model on it.

2- Then I apply tf/idf again to the test set and apply the model to the weighted test set.

But in the second step, using tf/idf weighting raises an issue:
In the first experiment, I applied tf/idf to a test set of 10 instances, and in the second experiment I applied it to a test set of 20 instances. As you might guess, the results differ significantly. Consequently, applying the model to the first test set leads to different results than applying it to the second!

Would anyone please explain the right solution? In other words, when dealing with data streams, should I use tf/idf at all? If so, how many instances should each of my test sets contain? (With text streams, the test data are not all available at once, so I have to feed them to the model gradually.)


siamak_want

Any idea please?

MariusHelf

Hey, I hope I got your question right.

To calculate the TF/IDF of the test set, you have to use the same wordlist that was created on the train set. To do this, connect the wor output of the Process Documents operator in training to the wor input of the Process Documents operator in testing.

If the two operators are in different processes you can Store and Retrieve the wor output in the repository.
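
Outside RapidMiner, the same idea can be sketched in plain Python. This is only an illustration under assumptions: the function names `fit_idf` and `transform` are hypothetical, the term frequency is a raw count, and the weight is log(N / df). It is not the formula or API used by the Text Extension. The point is that the word list and its weights are learned once on the train set and then reused unchanged for every test document.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the word list and its IDF weights from tokenized training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def transform(tokens, idf):
    """Weight one document with the stored training IDF; unseen terms are dropped."""
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in idf if term in tf}

# Hypothetical toy data, already tokenized.
train = [
    ["cheap", "pills", "now"],
    ["meeting", "now"],
    ["cheap", "flights"],
    ["project", "meeting"],
]
idf = fit_idf(train)  # plays the role of the stored word list

doc = ["cheap", "meeting", "unknown"]
alone = transform(doc, idf)                      # document arrives by itself
batch = [doc, ["now", "now"], ["pills"]]
in_batch = [transform(d, idf) for d in batch][0]  # same document inside a batch
```

Because `transform` only reads the stored weights, `alone` and `in_batch` are identical, so a document gets the same vector whether it arrives alone or in a batch of any size.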

Best regards,
Marius

siamak_want

Dear Marius,
Thanks for your answer, and my apologies for the delay.

You are exactly right about the consistency of the features. But I think the results of the tf/idf operator depend on the number of instances too. I have 20 examples in my test set. Please consider the following two scenarios:

1) I split my test set into two sets of 10 examples each and apply the tf/idf operator to each set separately.

2) I apply the tf/idf operator to all 20 examples in my test set.

The vector space models (i.e., ExampleSets) differ significantly between these scenarios. So sending the resulting example set to a classifier may produce different results. What is the right strategy?

Any help would be appreciated.

siamak_want

Any idea please?

MariusHelf

Hi,

Did you really connect the wordlist output of the training process to the wordlist input of the application process?

Best regards,
Marius

siamak_want

Hi Marius,

Yes, and I have no problem with that. My train and test attributes are exactly consistent with each other. My problem is that I get different results when I run the two scenarios above. This is certainly because the size of my test set differs between the scenarios, so the tf/idf ExampleSet differs too.

As you know, tf/idf weight for a specific attribute can be calculated like this:

Normalized value = TF value * weight

Weight = log ( NumberOfDocuments / NumberOfDocumentsWhichContainTheAttribute)

So the value of the weight depends on the size of my test set. Now I don't know how many examples I should have in my test sets: 10? 20? Just 1? Maybe I should not use tf/idf weighting for text stream classification at all. Please help me.
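
To make the dependence concrete, here is a minimal Python sketch of the weight formula above, recomputed on each test batch. The function name `idf_weight` and the toy documents are hypothetical, and this is not necessarily what RapidMiner computes internally.

```python
import math

def idf_weight(docs, term):
    """Weight = log(NumberOfDocuments / NumberOfDocumentsWhichContainTheAttribute).

    Assumes the term occurs in at least one document of `docs`.
    """
    n_containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / n_containing)

# Hypothetical test set of 20 tokenized documents. "spam" occurs in 6 of them:
# 5 in the first half and 1 in the second half.
test_set = (
    [["spam", "offer"]] * 5
    + [["hello", "world"]] * 5
    + [["spam", "news"]] * 1
    + [["hello", "again"]] * 9
)
first_half, second_half = test_set[:10], test_set[10:]

w_first = idf_weight(first_half, "spam")    # log(10 / 5)
w_second = idf_weight(second_half, "spam")  # log(10 / 1)
w_whole = idf_weight(test_set, "spam")      # log(20 / 6)
```

The same attribute gets three different weights depending on which batch it is computed on, which is exactly the difference between scenario 1 and scenario 2. A weight that is computed once on the train set and then kept fixed would not change with the size of the test batch.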

Thanks in advance.

siamak_want

Any help please?

MariusHelf

Hi,

I can't reproduce this behavior. Can you please post your process setup and post the versions of RapidMiner and the Text Extension?

Best regards,
Marius

siamak_want

Dear Marius,

Thank you very much for your explanations. You were right, and it works now. Marius, do you know the tf/idf formula used in RM?

Again, thanks for your valuable explanations.
Siamak