"Undoing the cosine normalization in 'Process Documents' operator"
unit01
New Altair Community Member
Hello,
I have noticed that 'Process Documents' does not output raw term frequencies when the corresponding mode is selected. As stated in http://rapid-i.com/rapidforum/index.php/topic,3728.msg13943.html#msg13943, cosine normalization is applied to the raw frequencies before outputting the result.
Is there a way to get raw term frequencies for each document, without normalization?
Answers
-
Hello
Yes, there is - I had to do the same thing. http://rapidminernotes.blogspot.co.uk/2011/11/normalizing-rows.html
Basically, get the term occurrences, then normalise the rows using the proportion transformation option.
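To make the difference concrete, here is a minimal sketch in plain Python (not RapidMiner itself) of the two row normalizations discussed in this thread. The function names and the example counts are made up for illustration; the cosine variant assumes the behaviour described in the linked forum post, i.e. each row of raw counts is scaled to unit Euclidean length.

```python
import math

def cosine_normalize(counts):
    """Scale a row of raw counts to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)

def proportion_normalize(counts):
    """Divide each count by the row total, so the row sums to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

# Hypothetical term occurrences for one document (three terms).
occurrences = [3, 4, 0]

cosine = cosine_normalize(occurrences)
proportion = proportion_normalize(occurrences)

print("raw counts:", occurrences)
print("cosine    :", cosine)
print("proportion:", proportion)
```

The raw counts are what the term occurrences mode gives you; the proportion row is each count divided by the document's total, which is what you usually want if you need relative frequencies.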
regards
Andrew
Hello Andrew!
Long story short - you have saved the day once again! To help other RapidMiner newbies, here is a more detailed description of what happened:
1. I had tried using 'Term Occurrences' before, but concluded that it was not the 'number of times a specific term occurs in the doc'. The reason: when I manually counted the tokens in a document and compared that number with the sum of the term vector's counts, the two did not match;
2. At the same time, when using Andrew's sample process, the sum of the term vector's components was correct ???
The problem turned out to be trivial. My RapidMiner process applied term pruning: any terms that occurred fewer than two times in the corpus were removed. However, the token list output by RapidMiner still included the removed terms, which is why the results seemed unexpected. The example provided by Andrew did not apply pruning, so its results were consistent.
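For anyone who wants to see the effect outside RapidMiner, here is a small plain-Python sketch of the mismatch. The toy documents, the `MIN_CORPUS_COUNT` threshold and the helper name are all made up for illustration, and the pruning rule (drop terms with fewer than two occurrences in the whole corpus) is an assumption based on my description above, not RapidMiner's actual implementation.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data).
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
]

# Assumed pruning rule: keep terms occurring at least this often in the corpus.
MIN_CORPUS_COUNT = 2

corpus_counts = Counter(tok for doc in docs for tok in doc)
kept_terms = {t for t, n in corpus_counts.items() if n >= MIN_CORPUS_COUNT}

def term_occurrences(doc, vocabulary):
    """Raw count of each kept term in one document."""
    counts = Counter(doc)
    return {term: counts[term] for term in sorted(vocabulary)}

vectors = [term_occurrences(doc, kept_terms) for doc in docs]
token_totals = [len(doc) for doc in docs]
vector_totals = [sum(vec.values()) for vec in vectors]

for i, (n_tokens, n_vector) in enumerate(zip(token_totals, vector_totals)):
    print(f"doc {i}: {n_tokens} tokens, term vector sums to {n_vector}")
```

The second document contains a term that gets pruned, so its token count is larger than the sum of its term vector, which is exactly the discrepancy I ran into.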
Hope this helps someone. As a side note, I would recommend the RapidMiner team make the 'Process Documents' operator generate term vectors consistently with the token list in order to avoid confusing dumb users like me :P
Thanks, Andrew!