Term Frequencies greater than 1
spok2
New Altair Community Member
Dear all,
I use "Text Processing - Process Documents From Files" to calculate word vectors for documents.
As I read here: http://rapid-i.com/rapidforum/index.php?PHPSESSID=0aba344304fbb94614ad24f236d974e4&;topic=3728.0
term frequencies are normalized (as I expected).
For me this means that term frequencies always have values < 1.
In my case I use TF-IDF for vector creation as proposed, and get some term frequencies in the range of 1E+10 or 1E+11.
Looking at the related documents they appear to be "normal".
Any ideas why this happens? What am I not understanding?
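For context, here is a minimal sketch of what "normalized term frequency" usually means (this is an illustration of the general idea, not necessarily RapidMiner's exact implementation): each term's count is divided by the total number of tokens in the document, so every value ends up at most 1.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term count divided by total token count, so each value is <= 1."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

doc = "the cat sat on the mat".split()
tf = normalized_tf(doc)
# every normalized frequency lies in (0, 1]
assert all(0 < v <= 1 for v in tf.values())
```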
Answers
-
No one with an idea?
Is my thinking wrong?
Can term frequencies be greater than 1?
Are there circumstances where one method for vector creation is better than another? Under which circumstances is which method most appropriate?
Many thanks in advance for any hint ...
1E+11 sounds like some error, because you need to divide by log(something). If that something is close to zero, precision problems can occur with doubles.
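The point above can be sketched with the textbook TF-IDF formula (a simplified illustration; RapidMiner's exact weighting may differ): the TF part is normalized, but the IDF factor log(N/df) multiplies it and can push the result well above 1, and degenerate document frequencies can produce extreme values.

```python
import math

def tf_idf(tf, doc_freq, n_docs):
    """Textbook TF-IDF: normalized tf times log(N / df).

    Even with tf < 1 the product can exceed 1, because the IDF factor
    grows with the corpus size. Degenerate inputs (df near 0 relative
    to N) make the factor very large.
    """
    return tf * math.log(n_docs / doc_freq)

# a normalized tf of 0.5 for a term seen in 1 of 100 documents:
weight = tf_idf(0.5, 1, 100)
assert weight > 1  # TF-IDF values above 1 are normal
```

So a TF-IDF value above 1 is not by itself an error; it is the 1E+11 magnitude that points to a numerical problem.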
-
As Martin mentioned, 1E+11 is far too large for a normalized term frequency. I suggest trying "binary term occurrences" for vector creation first, to see whether you get the anticipated results. If the result is acceptable and there is nothing strange in your process, there may be a numerical problem, which mostly happens because of extremely low term frequencies. I believe this can be mitigated by using the "prune" property: just define a lower bound and an upper bound for the occurrences to avoid extremely low values.
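The pruning idea can be sketched as follows (a minimal illustration of the concept; the function name and the percentage-based bounds are assumptions, not RapidMiner's actual "prune" parameters):

```python
def prune(doc_freqs, n_docs, lower=0.05, upper=0.95):
    """Keep only terms whose document-frequency ratio lies in [lower, upper].

    Terms occurring in very few documents (potential numerical trouble)
    or in almost all documents (little discriminative value) are dropped.
    """
    return {term for term, df in doc_freqs.items()
            if lower <= df / n_docs <= upper}

# term -> number of documents containing it, out of 100 documents
doc_freqs = {"rare": 1, "mid": 50, "common": 95}
assert prune(doc_freqs, 100) == {"mid", "common"}
```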
-
Dear all,
thanks a lot for your hints ...
I'll try and see.
BR