Term Occurrences and Frequency - I have to be missing something

btibert
btibert New Altair Community Member
edited November 2024 in Community Q&A
I am following along with this post to check my intuition, because I was seeing results that didn't make sense, to me anyway.
https://community.rapidminer.com/discussion/46333/term-frequencies-and-tf-idf-how-are-these-calculated

The only difference that I see in my process to start is that I am reading in my data from Excel and not creating it by hand.

Here are the term occurrences after converting to lower case, removing stop words, tokenizing, and counting the tokens.



Just like the post, I am using very simple sentences to keep the vocabulary small.

Now, here is the exact same data. The only difference is that I am now using term frequency within the Process Documents operator.



Of course there is a very good chance that I am missing a setting along the way, but why is the first example 0.577 for each of the three words, when the basic sentence, unprocessed, was "I like turtles"?

Thanks in advance.

Best Answer

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓
    Hi @btibert,

    To reproduce the results displayed by RapidMiner (according to the thread you shared):

    Start from the results of Term Occurrences:



    From these you calculate the "classic Term Frequency" (see the process in the attached file):


    However, the term frequency word vectors shown in RapidMiner are normalized vectors. This is exactly the same as the unit vector normalization that you may have seen in physics classes. In broad brush strokes, the norm of a (Euclidean) vector is its length or size. If you have a 1x2 vector, you can find the norm with the Pythagorean theorem. For a 1x7 vector like each document above, you apply the Pythagorean theorem in 7-dimensional space.
    Hence the norm of the first document's term frequency vector is:

    SQRT [ (0)^2 + (0)^2 + (0)^2 + (0.333)^2 + (0.333)^2 + (0.333)^2 + (0)^2 ] = 0.577

    The norm of the second document's term frequency vector is the same as for the first document,

    and the norm of the third document's term frequency vector is:
    SQRT [ (0.25)^2 + (0.25)^2 + (0.25)^2 + (0)^2 + (0)^2 + (0.25)^2 + (0)^2 ] = 0.5


    In order to look at all the documents equally, we want all the document vectors to have the same length. So we divide each document term frequency vector by its respective norm to get a "document term frequency unit vector" – also called a normalized term frequency vector : 

     0 / 0.577 = 0              0.333/0.577 = 0.577      0.25/0.5 = 0.5
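The arithmetic above can be checked with a short Python sketch. It is only an illustration, not RapidMiner's actual implementation: the function names are made up, and the raw counts are an assumption inferred from the term frequencies quoted above (three terms at 0.333 implies a count of 1 each, four terms at 0.25 likewise).

```python
import math

def term_frequency(counts):
    """Classic term frequency: each count divided by the document's total count."""
    total = sum(counts)
    return [c / total for c in counts]

def unit_normalize(vector):
    """Divide a vector by its Euclidean norm so that its length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Assumed term occurrences over the 7-term vocabulary, in the same
# column order as the norm calculations above.
doc1_counts = [0, 0, 0, 1, 1, 1, 0]
doc3_counts = [1, 1, 1, 0, 0, 1, 0]

doc1 = unit_normalize(term_frequency(doc1_counts))
doc3 = unit_normalize(term_frequency(doc3_counts))

print([round(x, 3) for x in doc1])  # [0.0, 0.0, 0.0, 0.577, 0.577, 0.577, 0.0]
print([round(x, 3) for x in doc3])  # [0.5, 0.5, 0.5, 0.0, 0.0, 0.5, 0.0]
```

After normalization the squared entries of each vector sum to 1, which is why a three-word document shows 0.577 (1/sqrt(3)) rather than 0.333 for each word.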

    So we obtain : 

        
    Hope this helps,

    Regards,

    Lionel

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,
    are you sure you aren't using TF/IDF?

    BR,
    Martin
  • btibert
    btibert New Altair Community Member
    See below, and my dataset/process attached. Entirely possible I am missing something obvious, just not sure what it could be.


  • btibert
    btibert New Altair Community Member
    Absolutely fantastic, thanks! I completely missed (as the title suggested) the normalized part; I just saw the output I expected and stopped reading like a dummy. Many thanks for the example process as well. I haven't had a chance to wrap my head around looping the way you did it, but it appears straightforward enough.
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    You're welcome, @btibert

    Regards,

    Lionel