"[SOLVED] Using tf/idf for text stream classification"
siamak_want
Hi gurus,
I am doing text stream classification and I have noticed an issue that confuses me:
In classic classification problems, I usually do the following:
1- First divide the data into a train set and a test set, then use tf/idf weighting on the train set and train a model on it.
2- Subsequently, I use tf/idf again on the test set and apply the model to the weighted test set.
But right here, in the second step, an issue emerges because of the tf/idf weighting:
In the first experiment, I applied tf/idf to a test set consisting of 10 instances, and in the second experiment I applied tf/idf to a test set containing 20 instances. As you might guess, the results differ significantly. Consequently, applying the model to the first test set leads to different results than applying it to the second!
Would anyone please explain what the right solution is? In other words, when dealing with data streams, should I use tf/idf at all? If so, how many documents should each of my test sets contain? (With text streams, the test data are not all available at once, so I have to feed them to the model gradually.)
siamak_want
Any idea please?
MariusHelf
Hey, I hope I got your question right.
To calculate the TF/IDF of the test set you have to use the same wordlist that was created on the train set. To do this, connect the wor output of the Process Documents operator in training to the Process Documents operator in testing.
If the two operators are in different processes you can Store and Retrieve the wor output in the repository.
Best regards,
Marius
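The idea in the answer above can be illustrated outside RapidMiner with a minimal Python sketch. It assumes the textbook weight log(N / df) and whitespace tokenization; the function names `fit_idf` and `transform` and the sample documents are hypothetical, and the formula RapidMiner actually uses may differ. The point is only that once the word list and its weights come from the training data, each test document gets the same vector no matter how the test data is batched.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the word list and IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    # Textbook IDF, log(N / df); an assumption, not necessarily RapidMiner's variant.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def transform(docs, idf):
    """Weight documents with the stored IDF; terms outside the word list are ignored."""
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc.lower().split() if t in idf)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors

# Hypothetical toy data.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project meeting notes",
]
test_docs = ["buy pills now", "monday project agenda", "cheap meeting", "notes for monday"]

idf = fit_idf(train_docs)  # plays the role of the stored word list
all_at_once = transform(test_docs, idf)
in_two_batches = transform(test_docs[:2], idf) + transform(test_docs[2:], idf)
one_by_one = [transform([doc], idf)[0] for doc in test_docs]

# The vectors do not depend on how the test documents were batched.
print(all_at_once == in_two_batches == one_by_one)
```

In a streaming setting this means documents can be weighted one at a time as they arrive, because nothing is re-estimated from the test data.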
siamak_want
Dear Marius,
Thanks for your answer, and my apologies for the delay.
You are exactly right about the consistency of the features. But I think the results of the tf/idf operator depend on the number of instances too. I have 20 examples in my test set. Please consider the following two scenarios:
1) I split my test set into two sets of 10 examples each and apply the tf/idf operator to each set separately.
2) I apply the tf/idf operator to all 20 examples in my test set.
The vector space models (i.e. ExampleSets) differ significantly between these scenarios. So sending the resulting example set to a classifier may produce different results. What is the right strategy?
Any help would be appreciated.
siamak_want
Any idea please?
MariusHelf
Hi,
Did you really connect the wordlist output of the training process to the wordlist input of the application process?
Best regards,
Marius
siamak_want
Hi Marius,
Yes, and I have no problem with that. My train and test attributes are exactly consistent with each other. My problem is that I get different results when I run the two scenarios above. This is certainly because the size of my test set differs between the scenarios, so the tf/idf example set differs too.
As you know, the tf/idf weight for a specific attribute can be calculated like this:
Normalized value = TF value * weight
Weight = log ( NumberOfDocuments / NumberOfDocumentsWhichContainTheAttribute)
So the value of the weight depends on the size of my test set. Now, I don't know how much data I should have in my test sets: 10? 20? Just 1? Maybe I should not use tf/idf weighting for text stream classification at all. Please help me.
Thanks in advance.
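The dependence described in this post can be checked with a few lines of Python. This is only a sketch of the formula as quoted above, with hypothetical counts (a term that occurs in 2 documents of each batch); it does not claim to reproduce RapidMiner's implementation.

```python
import math

def idf_weight(n_docs, n_docs_with_term):
    """Weight = log(NumberOfDocuments / NumberOfDocumentsWhichContainTheAttribute)."""
    return math.log(n_docs / n_docs_with_term)

# Hypothetical counts: the term occurs in 2 documents of each batch.
weight_batch_of_10 = idf_weight(10, 2)
weight_batch_of_20 = idf_weight(20, 2)
# With a single document that contains the term, the weight is log(1/1).
weight_single_doc = idf_weight(1, 1)

tf_value = 3  # the same raw term frequency in one document
print(tf_value * weight_batch_of_10)
print(tf_value * weight_batch_of_20)
print(tf_value * weight_single_doc)
```

If the weight is re-estimated on each test batch, the same document therefore receives different values depending on the batch, and a batch of one document gives a weight of zero for every term it contains. Taking the weight from the training data avoids this.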
siamak_want
Any help please?
MariusHelf
Hi,
I can't reproduce this behavior. Can you please post your process setup and post the versions of RapidMiner and the Text Extension?
Best regards,
Marius
siamak_want
Dear Marius,
Thank you very much for your explanations. You were right and it works now. Marius, do you know the tf/idf formula used in RM?
Again, thanks for your valuable explanations.
Siamak