Term Occurrences and Frequency - I have to be missing something
btibert
New Altair Community Member
I am following along with this post to check that my intuition is correct, because I was seeing results that didn't make sense, to me anyway.
https://community.rapidminer.com/discussion/46333/term-frequencies-and-tf-idf-how-are-these-calculated
The only difference that I see in my process to start is that I am reading in my data from Excel and not creating it by hand.
Here are the term occurrences after making the text lower case, removing stop words, tokenizing, and counting the tokens.
Just like the post, I am using very simple sentences to keep the vocabulary small.
Now, here is the exact same data; the only difference is that I am now using term frequency within the Process Documents operator.
Of course there is a very good chance that I am missing a setting along the way, but why is the value .577 for each of the three words in the first example, when the basic, unprocessed sentence was "I like turtles"?
Thanks in advance.
Best Answer
-
Hi @btibert,
To retrieve the results displayed by RapidMiner (according to the thread you shared):
From the results of Term Occurrences:
You calculate the "classic Term Frequency" (for that, see the process in the attached file):
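As a quick sketch of that step (the occurrence counts and the 7-term vocabulary below are assumptions standing in for the actual Term Occurrences output, not taken from the attached process):

```python
# Hypothetical occurrence counts for one document over a 7-term vocabulary,
# standing in for RapidMiner's Term Occurrences output.
doc_counts = [0, 0, 0, 1, 1, 1, 0]

# Classic term frequency: each count divided by the total tokens in the document.
total_tokens = sum(doc_counts)
tf = [count / total_tokens for count in doc_counts]
# Each of the three terms present gets 1/3, i.e. roughly 0.333.
```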
Then the term frequency word vectors that are shown in RapidMiner are normalized vectors. This is exactly the same as the unit-vector normalization you may have seen in physics classes. In broad brush strokes, the norm of a (Euclidean) vector is its length or size. For a 1x2 vector, you can find the norm with the Pythagorean Theorem; for a 1x7 vector like each document above, you apply the Pythagorean Theorem in 7-dimensional space.
Hence the norm of the first document's term frequency vector is:
SQRT [ (0)^2 + (0)^2 + (0)^2 + (0.333)^2 + (0.333)^2 + (0.333)^2 + (0)^2 ] = 0.577
the norm of the second document's term frequency vector is the same as the first document's,
and the norm of the third document's term frequency vector is:
SQRT [ (0.25)^2 + (0.25)^2 + (0.25)^2 + (0)^2 + (0)^2 + (0.25)^2 + (0)^2 ] = 0.5
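The same arithmetic in a short sketch (the vector values are the rounded term frequencies from above, so the first norm comes out approximately, not exactly, 0.577):

```python
import math

tf_doc1 = [0, 0, 0, 0.333, 0.333, 0.333, 0]  # first (and second) document
tf_doc3 = [0.25, 0.25, 0.25, 0, 0, 0.25, 0]  # third document

# Euclidean norm: the square root of the sum of squared components.
norm_doc1 = math.sqrt(sum(x * x for x in tf_doc1))
norm_doc3 = math.sqrt(sum(x * x for x in tf_doc3))
# norm_doc1 is approximately 0.577; norm_doc3 is exactly 0.5.
```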
In order to treat all the documents equally, we want all the document vectors to have the same length. So we divide each document's term frequency vector by its respective norm to get a "document term frequency unit vector", also called a normalized term frequency vector:
0 / 0.577 = 0
0.333 / 0.577 = 0.577
0.25 / 0.5 = 0.5
So we obtain :
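Putting the two steps together, the division can be sketched as follows (again using the rounded values above, so results are approximate):

```python
import math

tf = [0, 0, 0, 0.333, 0.333, 0.333, 0]

# Divide each component by the vector's Euclidean norm.
norm = math.sqrt(sum(x * x for x in tf))
normalized = [x / norm for x in tf]
# Every non-zero entry becomes approximately 0.577, matching the
# Process Documents output, and the resulting vector has length 1.
```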
Hope this helps,
Regards,
Lionel
Answers
-
Hi,
Are you sure you aren't using TF-IDF?
BR,
Martin
-
See below, and my dataset/process attached. Entirely possible I am missing something obvious, just not sure what it could be.
-
-
Absolutely fantastic, thanks! I completely missed (as the title suggested) the normalization part; I just saw the output I expected and stopped reading like a dummy. Many thanks for the example process as well. I haven't had a chance to wrap my head around looping the way you did it, but it appears straightforward enough.