[solved] Processing txt create sentiment analysis from mined twitter msgs

S_Green84 New Altair Community Member
edited November 5 in Altair RapidMiner
Hi,
I am working on a project to create a (simple) sentiment analysis from several large text files (between 2 and 20 GB) of mined Twitter messages.
I have no computer science background and just found RapidMiner the other day. Now I am curious whether it will be possible to use it for my purpose.

All tweets are stored in a simple text file in the following format:
T 2009-06-07 02:07:41
U http://twitter.com/cyberplumber
W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
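To illustrate, one record consists of three marker lines (T = timestamp, U = user URL, W = the tweet text), each separated from its value by a tab. A minimal Python sketch of how such a record could be split up (the function and field names are my own, just for illustration):

```python
def parse_record(block):
    """Split one T/U/W record into a dict with date, user, and text."""
    record = {}
    for line in block.splitlines():
        if line.startswith("T\t"):
            record["date"] = line[2:]
        elif line.startswith("U\t"):
            # keep only the username from the profile URL
            record["user"] = line[2:].rsplit("/", 1)[-1]
        elif line.startswith("W\t"):
            record["text"] = line[2:]
    return record

sample = (
    "T\t2009-06-07 02:07:41\n"
    "U\thttp://twitter.com/cyberplumber\n"
    "W\tSPC Severe Thunderstorm Watch 339: ...\n"
)
print(parse_record(sample)["user"])  # cyberplumber
```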

I would like to create a sentiment index (positive / negative) for each single day.
The basis for the sentiment index shall be as simple as possible. Therefore I thought to just define a few adjectives for each end of the spectrum. Additionally / as a separate index, I would like to count the positive / negative smileys in each tweet.

As the dataset is 70 GB in total, I will probably have to create a (PostgreSQL) database first? I am currently trying to find a way to get the text files into a proper SQL table (a first for me!). Since my source is not a CSV (instead of commas they used the letters T/U/W plus a tab as separators), I am also not quite sure how to do this.

So my general question:
Is it possible to use RapidMiner to perform this kind of sentiment analysis?
Is there maybe a viable way to use RapidMiner directly on those large text files and circumvent creating an SQL table (which has the difficulty of requiring the text files to be parsed first)?
Which tutorials / articles can you recommend? (I found the Vancouver Data ones and they seem good.)

If somebody here is willing to "coach" me for a couple of hours to get me on track for my project in return for a small compensation ($20/hr), I would very much appreciate it. Just send me a message to exchange Skype details.

Thank you for reading!


Edit:

OK, I used the following Python script to import the tweets into PostgreSQL:

#!/usr/bin/python

import sys
import psycopg2

db = psycopg2.connect(host="localhost", port=12345, database="db", user="postgres", password="pw")

class Tweet:
    def __init__(self):
        self.date = None
        self.user = None
        self.text = None

def insert_into_db(tweet):
    print "insert ", tweet.date, tweet.user, tweet.text
    try:
        cursor = db.cursor()
        cursor.execute("""INSERT INTO tweets (timestamp, userid, tweet) VALUES (%s, %s, %s)""", (tweet.date, tweet.user, tweet.text))
        db.commit()
    except Exception as e:
        # a failed statement aborts the transaction, so roll back
        print "ERROR", e
        db.rollback()

current = Tweet()

def process_line(line):
    global current
    if line.startswith("T\t"):
        current.date = line[2:]
    elif line.startswith("U\t"):
        current.user = line[2 + len("http://twitter.com/"):]
    elif line.startswith("W\t"):
        current.text = line[2:]
        insert_into_db(current)
        current = Tweet()  # start a fresh record after each complete tweet

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# A line can be split across two chunks, so carry the trailing
# partial line over and prepend it to the next chunk.
f = open(sys.argv[1])
leftover = ""
for piece in read_in_chunks(f):
    piece = leftover + piece
    lines = piece.split("\n")
    leftover = lines.pop()
    for line in lines:
        process_line(line)
if leftover:
    process_line(leftover)
f.close()



And I use the following process structure in RapidMiner (taken from the bi cortex example):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database" width="90" x="45" y="120">
        <parameter key="connection" value="test_data"/>
        <parameter key="query" value="(SELECT &quot;tweet&quot;, &quot;id_sent&quot;, &quot;sentiment_boo&quot;&#10;FROM &quot;public&quot;.&quot;sentiment&quot;&#10;WHERE &quot;sentiment_boo&quot; = 't'&#10;limit 10000)&#10;union&#10;(SELECT &quot;tweet&quot;, &quot;id_sent&quot;, &quot;sentiment_boo&quot;&#10;FROM &quot;public&quot;.&quot;sentiment&quot;&#10;WHERE &quot;sentiment_boo&quot; = 'f'&#10;limit 10000)&#10;"/>
        <parameter key="table_name" value="Sample_Feeds"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (3)" width="90" x="179" y="120">
        <parameter key="attribute_name" value="id_sent"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="tweet"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="120">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role" width="90" x="447" y="120">
        <parameter key="attribute_name" value="sentiment_boo"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.3.007" expanded="true" height="112" name="Validation" width="90" x="581" y="120">
        <parameter key="number_of_validations" value="5"/>
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="5.3.007" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="30">
            <parameter key="attribute_filter_type" value="no_missing_values"/>
            <parameter key="attribute" value="text"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" compatibility="5.3.007" expanded="true" height="94" name="Nominal to Binominal" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="sentiment_boo"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.007" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210"/>
          <connect from_port="training" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="SVM (Linear)" to_port="training set"/>
          <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model" width="90" x="45" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.007" expanded="true" height="76" name="Performance" width="90" x="179" y="120"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="210">
        <parameter key="connection" value="test_data"/>
        <parameter key="query" value="SELECT &quot;feed&quot;, &quot;id_serial&quot;&#10;FROM &quot;public&quot;.&quot;test_data&quot;&#10;limit 100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (4)" width="90" x="179" y="210">
        <parameter key="attribute_name" value="id_serial"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="246" y="345">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="feed"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="112" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases (2)" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="380" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="514" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="210">
        <parameter key="attribute_name" value="text"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model (2)" width="90" x="648" y="345">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Set Role (4)" to_port="example set input"/>
      <connect from_op="Set Role (4)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>