"Undoing the cosine normalization in 'Process Documents' operator"

unit01 New Altair Community Member
edited November 5 in Community Q&A
Hello,

I have noticed that 'Process Documents' does not output raw term frequencies when the corresponding mode is selected. As stated in http://rapid-i.com/rapidforum/index.php/topic,3728.msg13943.html#msg13943, cosine normalization is applied to the raw frequencies before the result is output.

Is there a way to get raw term frequencies for each document, without normalization?
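
To make the difference concrete, here is a minimal plain-Python sketch (not RapidMiner code) of raw term counts next to two row normalizations: proportion (divide by the row sum) and cosine (divide by the Euclidean length of the row). The function names and the sample document are made up for illustration, and the cosine step is my reading of the linked thread, not a confirmed description of what the operator does internally.

```python
import math
from collections import Counter


def term_occurrences(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)


def proportion_normalize(counts):
    """Divide each count by the row sum, so the values sum to 1."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def cosine_normalize(counts):
    """Divide each count by the row's Euclidean length, giving a unit vector."""
    length = math.sqrt(sum(n * n for n in counts.values()))
    return {term: n / length for term, n in counts.items()}


# Hypothetical four-token document.
doc = ["data", "mining", "data", "text"]

counts = term_occurrences(doc)           # raw counts per term
relative = proportion_normalize(counts)  # relative frequencies, sum to 1
unit = cosine_normalize(counts)          # squared values sum to 1
```

Both normalizations assume a non-empty document; an empty token list would cause a division by zero. Note that once only the cosine-normalized values are available, the raw counts cannot be recovered without knowing each row's original length.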

Answers

  • Andrew2 New Altair Community Member
    Hello

    Yes there is - I had to do the same thing.
    http://rapidminernotes.blogspot.co.uk/2011/11/normalizing-rows.html
    Basically, get the term occurrences, then normalise the rows using the proportion transformation option.

    regards

    Andrew
  • unit01 New Altair Community Member
    Hello Andrew!

    Long story short - you have saved the day once again! To help other RapidMiner newbies, here is a more detailed description of what happened:

    1. I had tried 'Term Occurrences' before, but concluded that it was not the number of times a specific term occurs in the document. The reason: when I manually counted the tokens in a document and compared that number with the sum of the counts in the term vector, the two did not match;

    2. At the same time, when I used Andrew's sample process, the sum of the term vector components was correct.

    The problem turned out to be trivial. My RapidMiner process applied term pruning: any term that occurred fewer than two times in the corpus was removed from the term vectors. However, the token list output by RapidMiner still included the removed terms, which is why the results seemed unexpected. The example provided by Andrew did not apply pruning, so its results were consistent.

    Hope this helps someone :) As a side note, I would recommend that the RapidMiner team make the 'Process Documents' operator generate term vectors that are consistent with the token list, to avoid confusing dumb users like me :P

    Thanks, Andrew!
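
For anyone who wants to see the pruning mismatch in isolation, here is a small plain-Python sketch (again, not RapidMiner code). It assumes pruning by total corpus count with a threshold of two, as described in the post above; the function name, the threshold parameter, and the two sample documents are hypothetical, and RapidMiner's actual pruning criterion may be defined differently.

```python
from collections import Counter


def prune_rare_terms(docs, min_corpus_count=2):
    """Build per-document count vectors, dropping terms that are rare in the corpus.

    A term is kept only if it occurs at least `min_corpus_count` times
    across all documents combined (an assumption taken from the post).
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    kept = {term for term, n in corpus_counts.items() if n >= min_corpus_count}
    return [Counter(tok for tok in doc if tok in kept) for doc in docs]


# Hypothetical two-document corpus.
docs = [
    ["data", "mining", "data", "text"],
    ["text", "data", "rapid"],
]

vectors = prune_rare_terms(docs)

# The token list of the first document still holds all four tokens ...
token_total = len(docs[0])
# ... but its pruned vector no longer counts "mining", so the sum is smaller.
vector_total = sum(vectors[0].values())
```

Here "mining" and "rapid" each occur once in the corpus and are pruned, so the first document has four tokens but its vector sums to three. With `min_corpus_count=1` nothing is pruned and the two totals agree, which matches the behaviour seen with the unpruned sample process.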