Term Frequencies greater than 1

spok2 New Altair Community Member
edited November 2024 in Community Q&A
Dear all,

I use "Text Processing - Process Documents From Files" to calculate word vectors for documents.

As I read here: http://rapid-i.com/rapidforum/index.php?PHPSESSID=0aba344304fbb94614ad24f236d974e4&;topic=3728.0
term frequencies are normalized (as I expected).

To me, this means that term frequencies should always have values below 1.
In my case I use TF-IDF for vector creation, as proposed there, and get some values in the range of 1E+10 to 1E+11.
The documents in question look "normal" to me.

Any ideas why this happens? What am I not understanding?
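
For reference, here is a minimal sketch of why normalized values would be expected to stay at or below 1. It assumes that term counts are weighted by idf = log(N / df) and that each document vector is then scaled to unit Euclidean length. This is a common textbook scheme and not necessarily the exact formula the operator uses. The function name `tfidf_unit_vectors` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_unit_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length.

    Assumed scheme (not confirmed for any particular tool):
    weight = count * log(N / df), then divide by the vector's Euclidean norm.
    """
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        raw = {t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # A document made only of terms that occur everywhere has norm 0.
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors

docs = ["the cat sat", "the dog sat down", "the bird flew away"]
vectors = tfidf_unit_vectors(docs)
max_weight = max(w for v in vectors for w in v.values())
print(max_weight)
```

Under these assumptions no component can exceed 1, so values around 1E+10 would point to a numerical or configuration problem rather than to unusual documents.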



Answers

  • spok2
    spok2 New Altair Community Member
    Does no one have an idea?
    Am I thinking about this the wrong way?
    Can term frequencies be greater than 1?
    Are there circumstances where it is better to use a different method for vector creation?
    Which method for vector creation is most appropriate under which circumstances?

    Many thanks in advance for any hint ...
  • MartinLiebig
    MartinLiebig
    Altair Employee
    1E11 sounds like an error, because you need to divide by log(something). If that something is close to zero, precision problems with doubles might occur.
  • mohammadreza
    mohammadreza New Altair Community Member
    As Martin mentioned, 1E+11 is far too large for a normalized term frequency. I suggest trying "binary term occurrences" for vector creation first, to see whether you get the expected results. If those results look acceptable and there is nothing unusual in your process, the cause may be a numerical problem, which mostly arises from extremely low term frequencies. I believe this can be mitigated with the "prune" property: define a lower bound and an upper bound for the occurrences to avoid extremely low values.
  • spok2
    spok2 New Altair Community Member
    Dear all,

    thanks a lot for your hints ...

    I'll try and see.

    BR
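
The two suggestions in the answers above can be illustrated with a small sketch. The weighting in `naive_weight` is invented purely to show how a division by a logarithm that is nearly zero produces values around 1E+11. It is not the formula RapidMiner computes. Likewise, `prune_by_document_frequency` and its `min_share` / `max_share` parameters are hypothetical stand-ins for the idea of setting a lower and an upper bound, not the operator's actual settings.

```python
import math

def naive_weight(count, ratio):
    """Hypothetical weighting that divides a count by log(ratio)."""
    return count / math.log(ratio)

# With an ordinary ratio the result is small.
ordinary = naive_weight(3, 2.0)
# With a ratio barely above 1 the logarithm is about 1e-11,
# so the quotient jumps to the order of 1e11.
extreme = naive_weight(3, 1.0 + 1e-11)

def prune_by_document_frequency(doc_freq, n_docs, min_share, max_share):
    """Keep terms whose share of documents lies within [min_share, max_share]."""
    return {
        term: df
        for term, df in doc_freq.items()
        if min_share <= df / n_docs <= max_share
    }

# Toy document frequencies for a collection of 100 documents.
doc_freq = {"the": 100, "rapidminer": 12, "xqzt": 1}
kept = prune_by_document_frequency(doc_freq, n_docs=100, min_share=0.03, max_share=0.9)
print(ordinary, extreme, kept)
```

In this toy data, pruning drops both the term that occurs in every document and the term that occurs only once, which are the cases most likely to produce degenerate weights.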