TF-IDF calculation

Question

Hi.

It seems that RapidMiner's TextInput operator calculates TF-IDF over the whole document corpus it reads. In text classification, however, corpus-based keyword selection (based on TF-IDF) favors the prevailing classes and penalizes classes with a small number of training documents. Class-based keyword selection, on the other hand, gives equal weight to each class. So my question is: how can one calculate TF-IDF for each label separately, i.e. treating each label in the body of the TextInput operator as a separate corpus?

Thanks in advance for your help.

land

Answer

Hi,
it seems to me that it will be difficult to apply such a model in the testing phase, when you can't know the actual label. Which class's frequencies are you going to use if you don't have any information about the class?

If you are just doing exploratory analysis, you might use the Loop Values operator to get each label value as a macro and filter the example set accordingly, so that each filtered subset is weighted as its own corpus.
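
Outside RapidMiner, the same idea (split the documents by label, then weight each subset as its own corpus) could be sketched roughly as below. This is only an illustration in plain Python, not what TextInput does internally: the function name tfidf_per_label, the toy documents, and the weighting variant (relative term frequency times log(N/df)) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tfidf_per_label(docs):
    """Compute TF-IDF separately for each label.

    docs: list of (label, tokens) pairs, where tokens is a list of strings.
    Returns {label: [ {term: weight}, ... ]}, one dict per document.
    Assumed weighting: (count / doc length) * log(N_label / df_label).
    """
    # Group the documents by label; each group acts as its own corpus.
    by_label = defaultdict(list)
    for label, tokens in docs:
        by_label[label].append(tokens)

    result = {}
    for label, corpus in by_label.items():
        n_docs = len(corpus)
        # Document frequency counted only within this label's documents.
        df = Counter(term for tokens in corpus for term in set(tokens))
        weights = []
        for tokens in corpus:
            tf = Counter(tokens)
            total = len(tokens) or 1  # guard against empty documents
            weights.append(
                {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
            )
        result[label] = weights
    return result


# Hypothetical toy data: two labels of different sizes.
docs = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["team", "coach"]),
    ("finance", ["stock", "market"]),
    ("finance", ["market", "bank", "stock"]),
    ("finance", ["bank", "loan"]),
]
scores = tfidf_per_label(docs)
```

Note that a term occurring in every document of a label (here "team" within "sports") gets weight zero inside that label, whatever its frequency in the other labels. The weights also depend on knowing the label, which is exactly why they cannot be computed the same way for unlabeled test documents.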

Greetings,
  Sebastian