TFIDF Output
Flixport
Hello RM family,
I have a question about the output of the TF-IDF vector: why does it give me words with values below 0.34 (see screen 1)? That shouldn't happen, should it?
screen1
screen2
BR
Accepted answers
Telcontar120
The values of TF-IDF are not term frequency values in terms of percentages, nor are they across all documents. That is why they are not directly comparable to the pruning input parameters, which is what
@sgenzer
said above. The pruning parameter is the percent of all documents that you want the term to appear in (either min or max). It should typically be entered as a whole number from 1-100.
@Prentice
as for whether there is a bug when contradictory values are entered, that may be the case, but it would be a separate issue. Can you post an example process with data to show it?
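To make the distinction concrete, here is a minimal Python sketch, not RapidMiner's implementation. It computes the percent of documents each term appears in (the quantity the pruning parameters refer to) and a textbook TF-IDF weight with L2-normalized document vectors. The toy corpus, the function names, and the exact weighting formula are assumptions made for illustration only.

```python
import math

# Toy corpus, already tokenized. All data here is made up for illustration.
docs = [
    ["data", "mining", "text"],
    ["data", "text", "text"],
    ["data", "model"],
    ["data", "mining"],
]

def document_frequency_percent(docs):
    """Percent of documents (0-100) in which each term occurs at least once."""
    n = len(docs)
    counts = {}
    for doc in docs:
        for term in set(doc):
            counts[term] = counts.get(term, 0) + 1
    return {term: 100.0 * c / n for term, c in counts.items()}

def tfidf(docs):
    """Textbook TF-IDF; each document vector is L2-normalized (an assumption)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        # term frequency within the document times log inverse document frequency
        raw = {t: (doc.count(t) / len(doc)) * math.log(n / df[t]) for t in set(doc)}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors

df_pct = document_frequency_percent(docs)  # what the pruning thresholds are compared to
weights = tfidf(docs)                      # what the output table shows
```

In this toy corpus "data" occurs in 100 percent of the documents, yet its TF-IDF weight is 0 in every document, because log(4/4) = 0. The two numbers live on different scales, so a pruning threshold says nothing about how small a TF-IDF value in the output can be.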
All comments
sgenzer
Hi
@Flixport
so those numbers are not at all the same thing. I wrote a rather lengthy KB article explaining how TF-IDF is done - you can find it here:
https://community.rapidminer.com/discussion/52861/term-frequencies-and-tf-idf-how-are-these-calculated
Flixport
Hi
@sgenzer
I already read it, but the description of the prune method states that values below X should be ignored, which should apply when I select the custom setting. This means the output should leave out the values according to the prune method.
Prentice
Hi,
I could be wrong, but I don't think your post (
@sgenzer)
solves the question.
First of all, you selected a percentage, meaning you should express it in percentages: 34 and 80.
However, if I try this, it doesn't work for me either. To make it even stranger, if I set the prune below percent higher than the prune above percent, I still get some values, even though that should not be possible.
Very peculiar.
Maybe it's a bug?
Prentice
Telcontar120
, I actually think that I already see what's happening here.
As far as I know, it prunes the terms first and only afterwards calculates the TF-IDF for the remaining terms.
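That order of operations can be sketched in a few lines of Python. This is only an illustration of "prune first, weight afterwards", not the actual operator: the function names, the inclusive thresholds, and the unnormalized TF-IDF formula are all assumptions.

```python
import math

def document_counts(docs):
    """Number of documents in which each term occurs at least once."""
    counts = {}
    for doc in docs:
        for term in set(doc):
            counts[term] = counts.get(term, 0) + 1
    return counts

def prune_then_tfidf(docs, below_pct, above_pct):
    """Drop terms by document-frequency percent, then weight what is left."""
    n = len(docs)
    counts = document_counts(docs)
    # Step 1: pruning looks only at the percent of documents containing the term.
    keep = {t for t, c in counts.items() if below_pct <= 100.0 * c / n <= above_pct}
    # Step 2: TF-IDF is computed for the surviving terms; no weight is ever
    # compared against the pruning thresholds.
    vectors = []
    for doc in docs:
        kept = [t for t in doc if t in keep]
        vectors.append({
            t: (kept.count(t) / len(kept)) * math.log(n / counts[t])
            for t in set(kept)
        })
    return keep, vectors

# Made-up toy corpus, already tokenized.
docs = [
    ["data", "mining", "text"],
    ["data", "text", "text"],
    ["data", "model"],
    ["data", "mining"],
]

kept_terms, vectors = prune_then_tfidf(docs, below_pct=34, above_pct=80)
none_kept, empty_vectors = prune_then_tfidf(docs, below_pct=80, above_pct=34)
```

With thresholds 34 and 80, only "mining" and "text" (each in 50 percent of the documents) survive, and their weights are whatever the TF-IDF formula produces. In this sketch, contradictory thresholds (below 80, above 34) leave no terms at all, which differs from the behavior I described above where some values still came back, so the real operator evidently handles that case differently.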