entropy
rafeena
New Altair Community Member
If I would like to calculate the entropy for each word, what should I set my word vector to during preprocessing? It would not be advisable to set it to TFIDF, right?
Answers
-
Can you clarify what you mean by calculating the entropy of each word? Vectorization is simple, unsupervised preprocessing of text, whereas entropy is usually computed with respect to a label. So there is no built-in vector metric that would supply anything like a conventional entropy measure. If you are asking which vector you should use if you want to calculate entropy later, then I would think simple term occurrences would be the appropriate choice, since that is merely a count of all instances of a given token in a given document.
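To make "entropy with respect to a label" concrete, here is a minimal Python sketch (not RapidMiner code) that starts from raw term occurrences. It assumes whitespace tokenization and a made-up four-document corpus with `pos`/`neg` labels; the function name `term_label_entropy` is hypothetical. For each token it computes the Shannon entropy, in bits, of the label distribution over that token's occurrences.

```python
import math
from collections import Counter, defaultdict

def term_label_entropy(docs, labels):
    """Entropy (bits) of the label distribution over each token's occurrences."""
    counts = defaultdict(Counter)  # token -> Counter(label -> occurrence count)
    for text, label in zip(docs, labels):
        for token in text.lower().split():  # assumption: whitespace tokenization
            counts[token][label] += 1

    entropies = {}
    for token, by_label in counts.items():
        total = sum(by_label.values())
        # sum of p * log2(1/p) over labels, with p = c / total
        entropies[token] = sum(
            (c / total) * math.log2(total / c) for c in by_label.values()
        )
    return entropies

# Toy data, invented purely for illustration.
docs = ["good good movie", "bad movie", "good plot", "bad bad plot"]
labels = ["pos", "neg", "pos", "neg"]
ent = term_label_entropy(docs, labels)

for token in sorted(ent):
    print(token, round(ent[token], 3))
```

A token that appears under only one label gets entropy 0 (it is a clean indicator of that label), while a token split evenly between two labels gets entropy 1 bit.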
-
I would like to use entropy and TFIDF as my feature selection methods. Will it affect the entropy result if I set the word vector to TFIDF?
-
In that case, yes, it will affect entropy because the calculation of TFIDF is not simply a linear transformation of frequency. It is impossible to say in advance which would give you better results. As I mentioned before, I would probably start with term occurrences first since that is more representative of the data in its raw form. RapidMiner will allow you to easily do it both ways and compare the results!
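As a rough illustration of why the weighting matters, here is a small Python sketch (not RapidMiner's implementation). It assumes the textbook form `tf * log(N / df)` for TFIDF, which may differ from the normalization RapidMiner applies, and it uses an invented three-document corpus. It compares the entropy of the term distribution within one document under term occurrences and under TFIDF.

```python
import math
from collections import Counter

def entropy(weights):
    """Shannon entropy (bits) of non-negative weights, normalized to sum to 1."""
    weights = list(weights)
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum((w / total) * math.log2(total / w) for w in weights if w > 0)

# Toy tokenized corpus, invented purely for illustration.
docs = [
    ["data", "data", "mining", "text"],
    ["data", "text", "text"],
    ["data", "entropy"],
]
n_docs = len(docs)
# Document frequency: number of documents containing each token.
doc_freq = Counter(token for doc in docs for token in set(doc))

def weights_for(doc, scheme):
    counts = Counter(doc)
    if scheme == "occurrences":
        return dict(counts)
    # Assumed textbook TFIDF: tf * log(N / df); other variants exist.
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

occ = weights_for(docs[0], "occurrences")
tfidf = weights_for(docs[0], "tfidf")
h_occ = entropy(occ.values())
h_tfidf = entropy(tfidf.values())

print("term occurrences:", occ, "entropy:", round(h_occ, 3))
print("TFIDF:", tfidf, "entropy:", round(h_tfidf, 3))
```

Because each term is scaled by its own document-frequency factor, the relative weights change (here `data`, which occurs in every document, drops to zero), and so does any entropy computed from them. Which version selects better features is still something to test empirically on your own data.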