"Text Scoring Via Word Tagging"

soonnicholas New Altair Community Member
edited November 5 in Community Q&A
Dear Rapid Miner Admin,

I was wondering whether the text plugin could be configured to score documents using an external list of word tags. In this specific case, the list is the Harvard-IV dictionary used by the General Inquirer program, which is perhaps one of the oldest text sentiment extraction tools. The Java web version of that program can be found here: http://www.webuse.umd.edu:9090/

Example:

The Harvard-IV dictionary contains several categories. Each text document gets one score per category, based on the term frequency of the document's words that appear in that category's word list.

Positive: 1045 positive words. Words such as good, rose, happy count towards the Positive score.
Negative: 1160 negative words. Words such as bad, fell, sad count towards the Negative score.
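
For illustration, this kind of category scoring could be sketched in plain Python as below. This is only a sketch under stated assumptions, not RapidMiner and not General Inquirer's actual implementation: the word sets hold just the example words above rather than the full Harvard-IV lists, and `tokenize` and `score_document` are hypothetical helper names with a deliberately simple tokenizer.

```python
import re
from collections import Counter

# Stand-in category lists: only the example words from this post,
# NOT the full Harvard-IV dictionary.
CATEGORIES = {
    "Positive": {"good", "rose", "happy"},
    "Negative": {"bad", "fell", "sad"},
}


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def score_document(text, categories=CATEGORIES):
    """Return, per category, how many tokens of the text are in that category's list."""
    counts = Counter(tokenize(text))
    return {
        name: sum(counts[word] for word in words)
        for name, words in categories.items()
    }


scores = score_document("Shares rose on good news, then fell. Good grief, a sad day.")
print(scores)  # {'Positive': 3, 'Negative': 2}
```

Any difference in tokenization or stemming between two tools changes these raw counts, so scores are only comparable when both tools prepare the text the same way.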

Is there a way to get RapidMiner to handle this? I have used the dictionary stemmer, but it seems inaccurate: the results differ significantly from the General Inquirer scores for the same text.

Best regards

Nicholas

Answers

  • land
    land New Altair Community Member
    Hi Nicholas,
    if I understood you correctly, you need some sort of positive list (a whitelist), so that only the 2205 words in the two lists are counted when the term frequency vector is created? If that's the case, you might try building your own input_word_list. But you have to think about how to count an occurrence of a word. Will you use binary counting?

    The Dictionary stemmer doesn't work this way, because it only replaces tokens; it does not remove unknown tokens.

    Greetings,
      Sebastian
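
The difference between the two approaches in this answer can be sketched in plain Python. This is an illustration only, not RapidMiner operator code: `term_vector` and `replace_only` are hypothetical names, the four-word `word_list` is a stand-in for the 2205-word list, and the `binary` flag is an assumed way to switch between occurrence counts and binary counting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text, word_list, binary=False):
    """Build a term vector restricted to word_list; every other token is dropped."""
    counts = Counter(t for t in tokenize(text) if t in word_list)
    return {
        word: (min(counts[word], 1) if binary else counts[word])
        for word in sorted(word_list)
    }


def replace_only(tokens, mapping):
    """Replacement-only step: mapped tokens change, unknown tokens pass through."""
    return [mapping.get(t, t) for t in tokens]


word_list = {"good", "bad", "happy", "sad"}  # stand-in for the full word list
text = "Good, good news makes nobody sad"

print(term_vector(text, word_list))
# {'bad': 0, 'good': 2, 'happy': 0, 'sad': 1}
print(term_vector(text, word_list, binary=True))
# {'bad': 0, 'good': 1, 'happy': 0, 'sad': 1}
print(replace_only(tokenize(text), {"makes": "make"}))
# ['good', 'good', 'news', 'make', 'nobody', 'sad']
```

With the word list, tokens such as "news" and "nobody" never reach the vector. With replacement only, they stay in the token stream and still affect any frequency computed over the document.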