"Undoing the cosine normalization in 'Process Documents' operator"

unit01 · September 2012

Hello,

I have noticed that 'Process Documents' does not output raw term frequencies, even when the corresponding mode is selected. As stated in http://rapid-i.com/rapidforum/index.php/topic,3728.msg13943.html#msg13943, cosine normalization is applied to the raw frequencies before the result is output.

Is there a way to get raw term frequencies for each document, without normalization?

Andrew2 · September 2012

Hello

Yes there is - I had to do the same thing.

http://rapidminernotes.blogspot.co.uk/2011/11/normalizing-rows.html

Basically, get the term occurrences then normalise the rows using the proportion transformation option.
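
For readers who want to see the arithmetic outside RapidMiner, here is a minimal Python sketch of the three quantities discussed in this thread: raw term occurrences, the row-wise proportion transformation, and cosine (unit-length) normalization. The function names and the sample sentence are made up for illustration; this is not how RapidMiner implements the operators, only an assumption about what each mode computes.

```python
import math
from collections import Counter


def term_occurrences(tokens):
    """Raw count of each term in a single document."""
    return Counter(tokens)


def proportion_normalize(counts):
    """Divide each count by the row total, so the row sums to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: c / total for term, c in counts.items()}


def cosine_normalize(counts):
    """Divide each count by the Euclidean length, so the row has length 1."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


# Hypothetical sample document, already tokenized.
tokens = "the cat sat on the mat".split()

counts = term_occurrences(tokens)   # raw occurrences
tf = proportion_normalize(counts)   # proportions of the row total
unit = cosine_normalize(counts)     # cosine-normalized values

print(dict(counts))
print(tf)
print(unit)
```

The proportion values sum to 1 per document, while the cosine-normalized values have a sum of squares equal to 1, which is why the two outputs differ even though both start from the same counts.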

regards

Andrew

unit01 · September 2012

Hello Andrew!

Long story short - you have saved the day once again! To help other RapidMiner newbies, here is a more detailed description of what happened:

1. I had tried 'Term Occurrences' before, but concluded that it was not the number of times a specific term occurs in the document. The reason: when I manually counted the tokens in a document and compared that number with the sum of the components of its term vector, the two did not match;

2. At the same time, with Andrew's sample process, the sum of the term vector components was correct, which puzzled me.

The problem turned out to be trivial. My RapidMiner process applied term pruning - any term that occurred fewer than two times in the corpus was removed. However, the token list output by RapidMiner still included the removed terms - that is why the results seemed unexpected. The example provided by Andrew did not apply pruning, so its results were consistent.
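
The mismatch can be reproduced with a small Python sketch. It assumes pruning works as described above (drop any term with fewer than two occurrences in the whole corpus); the documents, the `min_occurrences` threshold name, and the variable names are made up for illustration, and RapidMiner's actual pruning options may behave differently.

```python
from collections import Counter

# Hypothetical corpus of two already-tokenized documents.
docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
]

# Count every term across the whole corpus.
corpus_counts = Counter(term for doc in docs for term in doc)

# Assumed pruning rule: keep only terms seen at least twice in the corpus.
min_occurrences = 2
vocabulary = {t for t, c in corpus_counts.items() if c >= min_occurrences}

token_counts = []  # what you get by counting tokens by hand
vector_sums = []   # what the pruned term vector adds up to

for doc in docs:
    vector = Counter(t for t in doc if t in vocabulary)
    token_counts.append(len(doc))
    vector_sums.append(sum(vector.values()))

for i, (n_tokens, n_vector) in enumerate(zip(token_counts, vector_sums)):
    print(f"doc {i}: tokens={n_tokens}, vector sum={n_vector}")
```

In this sketch 'cherry' occurs only once in the corpus, so it is pruned from the vocabulary but is still counted when the second document's tokens are tallied by hand; the two numbers agree for the first document and differ for the second.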

Hope this helps someone

As a side note, I would recommend the RapidMiner team make the 'Process Documents' operator generate term vectors consistently with the token list in order to avoid confusing dumb users like me :P

Thanks, Andrew!
