"Sentiment Analysis - Numerical Labels, and the search for the right Process"

andk
andk New Altair Community Member
edited November 5 in Community Q&A
I have another question that may be easy to answer for those of you who have already played around with the sentiment analysis capabilities of RapidMiner.

On the one hand, I have a collection of thousands of documents from which I extracted the information I need and compiled a matrix of the corresponding TF-IDF scores for the expressions appearing in the documents. On the other hand, I have a matrix of words in which each word is assigned a sentiment score between 0 and 1.

The question is how to bring these two strands together to measure the sentiment reflected in the documents over time. The idea is to match the TF-IDF matrix against the word/sentiment score matrix. More precisely, I want to see which expressions from the sentiment matrix also appear in the respective documents and weight them by their IDF values. Is there a process that does this?

I tried to follow the example described here http://rapid-i.com/rapidforum/index.php/topic,2993.0.html and the classification approach presented in the Vancouver Data Blog Video Tutorial 5, but the problem seems to hinge on the fact that the learning processes don't accept numerical labels. Could somebody give me a hint? I would really appreciate it!

Best regards,

André     

Answers

  • land
    land New Altair Community Member
    Hi Andre,
    this is a very unusual approach. Normally you want to avoid building this sentiment/word matrix yourself and let the program do it! You normally assign a sentiment to each of your documents and then apply a learning scheme to derive the effects.
    If you have assigned these factors manually, you have done the data mining by hand and derived some sort of linear model. What you have to do is put them into a model so that you are able to apply them. There's no recommended way to do this because, well, as I said: nobody normally wants to do this.
    The only thing I can imagine is exporting a linear regression model as XML, manually editing this file, and then reimporting it...

    Greetings,
      Sebastian
  • andk
    andk New Altair Community Member
    Sebastian, thanks for your reply. I was disconnected for the last two days. I think there is a misunderstanding, or I explained my problem in a slightly complicated way. I have a word vector with a sentiment value assigned to each word. It comes from SentiWordNet; I just calculated a measure useful for my purpose from the given values. Additionally, I have a word list and an IDF matrix, both obtained by ordinary text processing of a pretty huge number of documents. My idea now is to create a word list from each of the two data processing processes and match them against each other. That is, I want to see which of the expressions for which I have a sentiment value appear in the word list extracted from the documents. I tried to do this with the Cross Distances operator. But the word list results from the document processing step, and in order to select the right attribute I have to convert it with the WordList to Data operator. It turns out that the WordList to Data operator converts the expressions in my word list to polynominal form, and it seems that the Cross Distances operator can't handle this. Which parameter of the Cross Distances operator would be the right one to match nominal expressions?
    Guys, I hope I am not annoying you too much. As far as I can, I will also contribute on the helping side in this forum!

    best regards, andre 





  • land
    land New Altair Community Member
    Hi,
    I think you explained what you are doing in an understandable way, but I don't understand WHY you would do this. What would the result mean?

    Greetings,
      Sebastian
  • andk
    andk New Altair Community Member
    It should simply let me estimate the sentiment of an article, since I already have sentiment values for several of its tokens. So I actually have two word lists, one from my articles and one with assigned sentiments, and I have to link these two parts. In other words, I have to find out which of my sentiment tokens appear in which article. Unfortunately I lack the technical skills to use the Cross Distances operator correctly, because I think it should actually be the right operator for me. Anyway, thank you, Sebastian, for your effort! If you come across this topic again and have an idea, it would be very helpful if you shared it with me.

    best regards, andre
  • andk
    andk New Altair Community Member
    Is there really nobody who could help me? To clarify what I want to do, and to make it more attractive ^^ to help me, I have created tables that show what I would like to do.
    Sentiment word list (created from a CSV file) (Tab1)
    ID  Word       Sentimentscore
        able        0.7
        cat         0
        competent   0.6
        corrupt    -0.6
        house       0.1
        ...        ...
    The word list obtained by processing the documents (Tab2)
    ID  Word
        able
        cow
        house
        competent
        computer
        ...
    Now I want to see whether and where there are matches between the word columns of Tab1 and Tab2. Ideally I would get a vector of distance or similarity measures for all combinations of words from Tab1 and Tab2. The meta information, Sentimentscore, should not be lost in the process. Is there something that could help me with this task? The result could look like this:
    Tab1       Tab2       Distance  Sentimentscore
    able       able                 0.7
    able       cow                  0.7
    able       house                0.7
    ...        ...        ...       ...
    competent  competent            0.6
    ...        ...        ...       ...
    I want to stress that this is purely for academic purposes and my own interest. I am exploring what I could do in my thesis and playing around a little with RM. I am looking forward to your comments!

    Best regards,

    André
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    are the words in Tab2 unique (I guess they are at least in Tab1)? If yes, a simple "Join" would be sufficient with the word columns as IDs if you are interested in "full match" (distance 0) vs. "no match" (distance 1) only.

    Otherwise a more complex process has to be created which would definitely also be possible.

    Cheers,
    Ingo
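
For readers who want to see the exact-match idea outside RapidMiner, here is a minimal sketch in plain Python. It is not the RapidMiner "Join" operator itself, only an illustration of a left join on the word column. The data mirrors the sample rows of Tab1 and Tab2 from the post above; the names `sentiment_scores`, `document_words` and `left_join` are made up for this example.

```python
# Illustrative data taken from the sample rows of Tab1 and Tab2 in the thread.
sentiment_scores = {
    "able": 0.7,
    "cat": 0.0,
    "competent": 0.6,
    "corrupt": -0.6,
    "house": 0.1,
}
document_words = ["able", "cow", "house", "competent", "computer"]


def left_join(words, scores):
    """Keep every document word and attach its sentiment score if one exists.

    Distance follows the convention from the reply above:
    0 for a full match, 1 for no match.
    """
    rows = []
    for word in words:
        matched = word in scores
        distance = 0 if matched else 1
        score = scores[word] if matched else None  # None marks a missing score
        rows.append((word, distance, score))
    return rows


joined = left_join(document_words, sentiment_scores)
for row in joined:
    print(row)
```

With a left join every document word is kept, so unmatched words such as "cow" simply carry no score. An inner join would drop them instead.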
  • andk
    andk New Altair Community Member
    Ingo, thanks a lot! Ahhhh :) OK, this is an approach I will test as soon as I am on my Windows RM machine again. What would such a more complex process with distances look like? I don't need details, just a hint or a sketch of which operators might work and how the role of the word attribute would have to be set. Thanks for your help!
    merci!

    andré
  • IngoRM
    IngoRM New Altair Community Member
    Hi again,

    actually, even if the words in Tab2 are not unique, the join approach should work pretty well. You will end up (depending on using a left or a right join) with a data set Tab2 with an additional column containing the corresponding sentiment scores from Tab1. A simple aggregation (average or sum) will then deliver the final, aggregated score for the document encoded in Tab2.
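
The aggregation step can be sketched in plain Python as follows. This is only an illustration under assumptions, not a RapidMiner process: the sentiment scores come from the sample Tab1 in the thread, while the TF-IDF weights in `doc_tfidf` are invented values for a single hypothetical document. The weighted mean is one possible reading of the original question about weighting by IDF values; the plain sum and average correspond to the simple aggregation described above.

```python
# Sentiment scores from the sample Tab1 in the thread.
sentiment_scores = {
    "able": 0.7,
    "cat": 0.0,
    "competent": 0.6,
    "corrupt": -0.6,
    "house": 0.1,
}

# Hypothetical TF-IDF weights for one document (values invented for illustration).
doc_tfidf = {
    "able": 0.5,
    "cow": 0.2,
    "house": 0.25,
    "competent": 0.25,
    "computer": 0.3,
}

# Join: keep only the words that have a sentiment score.
matched = [
    (word, sentiment_scores[word], weight)
    for word, weight in doc_tfidf.items()
    if word in sentiment_scores
]

# Simple aggregations over the matched words.
total = sum(score for _, score, _ in matched)
mean = total / len(matched) if matched else 0.0

# Assumed variant: weight each score by the word's TF-IDF value.
weight_sum = sum(weight for _, _, weight in matched)
weighted_mean = (
    sum(score * weight for _, score, weight in matched) / weight_sum
    if weight_sum
    else 0.0
)

print(total, mean, weighted_mean)
```

Running this once per document, and grouping the results by publication date, would give the sentiment-over-time series the original question asks about.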

    Well, if you want to calculate text-based similarities, I would have a look at the Text Extension of RapidMiner and use the preprocessing operators it provides. You could, for example, transform the words into their stems, use character n-grams, or use other approaches for calculating the distances between the terms in both tables. Of course it would also be possible to loop through both tables and perform any type of distance measure you can build with operators inside. Finally, you could of course write your own distance measure and use it within RapidMiner. There are probably hundreds of options. Have fun trying them!  :D

    Cheers,
    Ingo
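
As a rough illustration of the character n-gram idea, the sketch below computes a distance for every combination of words from Tab1 and Tab2 and carries the sentiment score along, similar to the table requested earlier in the thread. The choice of Jaccard distance over padded character bigrams is an assumption made for this example; it is one of many possible measures and not a specific RapidMiner operator. The function names are hypothetical.

```python
# Sample data from Tab1 and Tab2 in the thread.
sentiment_scores = {
    "able": 0.7,
    "cat": 0.0,
    "competent": 0.6,
    "corrupt": -0.6,
    "house": 0.1,
}
document_words = ["able", "cow", "house", "competent", "computer"]


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '_'."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_distance(a, b, n=2):
    """Jaccard distance between the n-gram sets: 0 = identical, 1 = disjoint."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:  # both words too short to produce any n-gram
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


# All combinations of Tab1 and Tab2 words, keeping the sentiment score.
pairs = [
    (s_word, d_word, ngram_distance(s_word, d_word), score)
    for s_word, score in sentiment_scores.items()
    for d_word in document_words
]

for s_word, d_word, distance, score in pairs:
    print(f"{s_word:10} {d_word:10} {distance:.3f} {score}")
```

Unlike the exact join, this gives near matches such as "competent" and "computer" a distance strictly between 0 and 1, so a threshold is needed to decide which pairs count as matches.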
  • andk
    andk New Altair Community Member
    Ingo, you are a hero! Thank you very much! I will try your advice and report back!

    Have a nice weekend!