How to analyze row by row from MySQL?

Question

I'm planning on using RapidMiner for my thesis, for which I need to analyze the sentiment of tweets.

What I have:
I have a MySQL database with a table called "tweets", with columns [tweet_id, tweet_datetime, tweet_text, tweet_sentiment]. The table contains 1.3 million rows. The tweets are preprocessed (stopwords removed, words stemmed, etc.).

What I want:
* Read the "tweets" database into RapidMiner - done. I have a "Read Database" box in my process overview and it is connected properly
* Read the database row-by-row
* From each row, read tweet_text
* Analyze the sentiment of tweet_text using Bayesian filter
* The Bayesian filter should categorize into positive, neutral and negative
* Store the sentiment into the database in column tweet_sentiment

What my questions are:
* How can I feed the Naive Bayes operator one row at a time? I get only errors when connecting the Read Database box to the Naive Bayes box. There needs to be something in between I assume. Perhaps the Process Documents from Data operator?
* A Bayesian filter needs to be defined (define the categories) and manually trained. Can I train the Bayes filter in RapidMiner?
* I have downloaded a wordlist with positive and negative English words. I would like to feed these to the Bayes filter. Is this possible and how?
* How can I let RapidMiner store the outcome of the Bayes filter into the tweets database?

I don't expect all my questions to be answered. I am more than willing to figure things out for myself, but have a hard time starting. And most RapidMiner tutorials use data from sources other than databases. I would very much appreciate some pointers to get started here. :)

IngoRM · Answer

Hi there,

good start: first post and directly a double one  ;)   Please don't do that. This will clutter up the forum and if two people answer you in two different places and both persons happen to work for Rapid-I and one of those could have been doing something really productive, then believe me: you are banned from this forum quicker than you can spell "tweet_sentiment"  ;D

Ok, back to the "less productive" part  ;)  :

Let me first assure you that everything you plan is possible with RapidMiner. From my experience, I would, however, not recommend all the steps you have in mind. But feel free to gather your own experience. By the way: sentiment analysis on Twitter is particularly hard (lots of abbreviations, short texts, many misspellings etc.), so the first recommendation would be: don't use predefined positive and negative word lists, at least not for anything other than labeling texts in cases where (almost) only positive or only negative terms are used and make up at least 70% or so of the texts. But after this labeling, use the text processing offered by RapidMiner together with a data mining approach, which will be much more flexible and adaptive and - probably most important - more accurate when created properly than any word list based approach.
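To make the labeling idea above a bit more concrete, here is a minimal Python sketch (outside RapidMiner, for illustration only). It assumes one possible reading of the 70% rule: a tweet gets a label only if it contains no terms of the opposite polarity and the matching terms make up at least 70% of its tokens. The function name, the `min_share` parameter and the whitespace tokenization are all hypothetical choices, not anything RapidMiner provides.

```python
def label_with_wordlists(text, positive_words, negative_words, min_share=0.7):
    """Return 'positive', 'negative', or None when the tweet is not clear-cut.

    Assumption: the tweet is already preprocessed, so splitting on
    whitespace is enough to get the tokens.
    """
    tokens = text.lower().split()
    if not tokens:
        return None
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    # Label only when one polarity is absent and the other dominates the text.
    if neg == 0 and pos / len(tokens) >= min_share:
        return "positive"
    if pos == 0 and neg / len(tokens) >= min_share:
        return "negative"
    return None  # leave ambiguous tweets unlabeled
```

Tweets that come back as None would simply stay out of the training set; the labeled rest is what the learner is then trained on.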

Another recommendation, which probably covers most of your questions: watch the excellent series about text mining with RapidMiner by Neil:

http://www.youtube.com/user/VancouverData/

And here are some more details:
How can I feed the Naive Bayes operator one row at a time? I get only errors when connecting the Read Database box to the Naive Bayes box. There needs to be something in between I assume. Perhaps the Process Documents from Data operator?

One row at a time is possible with some databases and drivers but hard to configure. It is much easier to work on batches which fit into memory. For training, this is in many cases not a problem, since you will probably not manually label (or create with the help of word lists as described above) millions of texts, so the training data in most cases fits into main memory anyway. For scoring / model application, you could use a loop where an iterator (or an iteration macro, you will understand this later if you delve deeper into RapidMiner) is increased and used for selecting and loading only batches of data into memory, scoring them there, and writing the scored values (for example together with the tweet_ids) into a new table (the operator Write Database supports a mode for appending rows).
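The batch loop described above can be sketched in plain Python as follows. This is not a RapidMiner process, only an illustration of the same logic under some assumptions: sqlite3 from the standard library stands in for MySQL, `tweet_id` is assumed to be a non-negative integer key, and the `tweet_predictions` table name and the `predict` callback are hypothetical.

```python
import sqlite3

def score_in_batches(conn, predict, batch_size=1000):
    """Score tweets batch by batch and append predictions to a separate table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweet_predictions ("
        "tweet_id INTEGER PRIMARY KEY, tweet_sentiment TEXT)"
    )
    last_id = -1  # assumes tweet_id values are non-negative integers
    total = 0
    while True:
        # Load only the next batch, continuing after the last id seen.
        rows = conn.execute(
            "SELECT tweet_id, tweet_text FROM tweets "
            "WHERE tweet_id > ? ORDER BY tweet_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        scored = [(tweet_id, predict(text)) for tweet_id, text in rows]
        # Append the scored batch together with the tweet ids.
        conn.executemany(
            "INSERT INTO tweet_predictions (tweet_id, tweet_sentiment) "
            "VALUES (?, ?)",
            scored,
        )
        conn.commit()
        last_id = rows[-1][0]
        total += len(rows)
    return total
```

Selecting by "tweet_id greater than the last one seen" keeps each query cheap even late in the table, which an OFFSET-based loop would not. With a MySQL driver the placeholder syntax would differ from the `?` used by sqlite3.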
A Bayesian filter needs to be defined (define the categories) and manually trained. Can I train the Bayes filter in RapidMiner?

Yes. The operator is called "Naive Bayes" but I would also definitely recommend to check out support vector machines for this task.
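To show what "training" means here, below is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing, written in plain Python. It is not the implementation behind the RapidMiner operator; the function names, the dict-based model and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_texts):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        class_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocabulary.add(token)
    return {"class_counts": class_counts,
            "word_counts": word_counts,
            "vocabulary": vocabulary}

def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # class prior
        for token in text.lower().split():
            if token not in model["vocabulary"]:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The categories (positive, neutral, negative) are not defined up front; they are simply whatever labels appear in the training data.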
I have downloaded a wordlist with positive and negative english words. I would like to feed these to the Bayes filter. Is this possible and how?

I would not use word lists in this way (refer to some thoughts above). Although you could create a merged word list and a matching example set with two examples (one for the positive words with 1 as attribute value for the positive terms and one for the negative words with...) as input for Naive Bayes, this would lead to a far too simple model neglecting the true power of data mining.
How can I let RapidMiner store the outcome of the Bayes filter into the tweets database?

As described above: loop over batches, create a new table and append the predictions with "Write Database".

Frankly: this is a big project which should be solved by you in months of work. Although most people at Rapid-I would probably be able to do this in a couple of days, it still would need days or weeks for experienced analysts. So please understand that the best I can do (and you probably can hope for) are some general remarks like those above. All these things are possible with RapidMiner and I am sure you will figure out how during your thesis.

Good luck!

Cheers,
Ingo