[solved] Processing txt create sentiment analysis from mined twitter msgs

S_Green84 New Altair Community Member
edited November 5 in Altair RapidMiner
Hi,
I am working on a project to create a (simple) sentiment analysis from several large text files (between 2 and 20 GB) of mined Twitter messages.
I have no computer science background and just found RapidMiner the other day. Now I am curious whether it will be possible to use it for my purpose.

All tweets are stored in a simple text file in the following format:
T 2009-06-07 02:07:41
U http://twitter.com/cyberplumber
W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
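To illustrate, one record consists of three marker lines (T = timestamp, U = user URL, W = the tweet text), each separated from its value by a tab. A minimal Python sketch of how such a record could be split up (the function and field names are my own, just for illustration):

```python
def parse_record(block):
    """Split one T/U/W record into a dict with date, user, and text."""
    record = {}
    for line in block.splitlines():
        if line.startswith("T\t"):
            record["date"] = line[2:]
        elif line.startswith("U\t"):
            # keep only the username from the profile URL
            record["user"] = line[2:].rsplit("/", 1)[-1]
        elif line.startswith("W\t"):
            record["text"] = line[2:]
    return record

sample = (
    "T\t2009-06-07 02:07:41\n"
    "U\thttp://twitter.com/cyberplumber\n"
    "W\tSPC Severe Thunderstorm Watch 339: ...\n"
)
print(parse_record(sample)["user"])  # cyberplumber
```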

I would like to create a sentiment index (positive / negative) for each single day.
The basis for the sentiment index shall be as simple as possible. Therefore I thought to just define a few adjectives for each end of the spectrum. Additionally / as a separate index, I would like to count the positive / negative smileys in each tweet.

As the dataset is 70 GB in total, I will probably have to create a (PostgreSQL) database first? I am currently trying to find a way to get the text files into a proper SQL table (a first for me!). Since my source is not a CSV (instead of commas they used the letters T/U/W plus a tab as separators), I am also not quite sure how to do this.

So my general question:
Is it possible to use RapidMiner to perform this kind of sentiment analysis?
Is there maybe a viable way to use RapidMiner directly on those large text files and circumvent creating an SQL table (which has the difficulty of requiring the text files to be parsed first)?
Which tutorials / articles can you recommend? (I found the Vancouver Data ones and they seem good.)

If somebody here is willing to "coach" me for a couple of hours to get me on track for my project in return for a small compensation ($20/hr), I would very much appreciate it. Just send me a message to exchange Skype details.

Thank you for reading!


Edit:

OK, I used the following Python script to import the tweets into PostgreSQL:

#!/usr/bin/python

import sys
import psycopg2

db = psycopg2.connect(host="localhost", port=12345, database="db", user="postgres", password="pw")

class Tweet:
    def __init__(self):
        self.date = None
        self.user = None
        self.text = None

def insert_into_db(tweet):
    print "insert ", tweet.date, tweet.user, tweet.text
    try:
        cursor = db.cursor()
        cursor.execute("""INSERT INTO tweets (timestamp, userid, tweet) VALUES (%s, %s, %s)""", (tweet.date, tweet.user, tweet.text))
        db.commit()
    except Exception as e:
        # a failed statement aborts the transaction, so roll back
        print "ERROR", e
        db.rollback()

current = Tweet()

def process_line(line):
    global current
    if line.startswith("T\t"):
        current.date = line[2:]
    elif line.startswith("U\t"):
        current.user = line[2 + len("http://twitter.com/"):]
    elif line.startswith("W\t"):
        current.text = line[2:]
        insert_into_db(current)
        current = Tweet()  # start a fresh record after each complete tweet

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# A line can be split across two chunks, so carry the trailing
# partial line over and prepend it to the next chunk.
f = open(sys.argv[1])
leftover = ""
for piece in read_in_chunks(f):
    piece = leftover + piece
    lines = piece.split("\n")
    leftover = lines.pop()
    for line in lines:
        process_line(line)
if leftover:
    process_line(leftover)
f.close()



And I use the following process structure in RapidMiner (taken from the bi cortex example):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database" width="90" x="45" y="120">
        <parameter key="connection" value="test_data"/>
        <parameter key="query" value="(SELECT &quot;tweet&quot;, &quot;id_sent&quot;, &quot;sentiment_boo&quot;&#10;FROM &quot;public&quot;.&quot;sentiment&quot;&#10;WHERE &quot;sentiment_boo&quot; = 't'&#10;limit 10000)&#10;union&#10;(SELECT &quot;tweet&quot;, &quot;id_sent&quot;, &quot;sentiment_boo&quot;&#10;FROM &quot;public&quot;.&quot;sentiment&quot;&#10;WHERE &quot;sentiment_boo&quot; = 'f'&#10;limit 10000)&#10;"/>
        <parameter key="table_name" value="Sample_Feeds"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (3)" width="90" x="179" y="120">
        <parameter key="attribute_name" value="id_sent"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="tweet"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="120">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role" width="90" x="447" y="120">
        <parameter key="attribute_name" value="sentiment_boo"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.3.007" expanded="true" height="112" name="Validation" width="90" x="581" y="120">
        <parameter key="number_of_validations" value="5"/>
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="5.3.007" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="30">
            <parameter key="attribute_filter_type" value="no_missing_values"/>
            <parameter key="attribute" value="text"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" compatibility="5.3.007" expanded="true" height="94" name="Nominal to Binominal" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="sentiment_boo"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.007" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210"/>
          <connect from_port="training" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="SVM (Linear)" to_port="training set"/>
          <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model" width="90" x="45" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.007" expanded="true" height="76" name="Performance" width="90" x="179" y="120"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="210">
        <parameter key="connection" value="test_data"/>
        <parameter key="query" value="SELECT &quot;feed&quot;, &quot;id_serial&quot;&#10;FROM &quot;public&quot;.&quot;test_data&quot;&#10;limit 100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (4)" width="90" x="179" y="210">
        <parameter key="attribute_name" value="id_serial"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="246" y="345">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="feed"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="112" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases (2)" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="380" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="514" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="210">
        <parameter key="attribute_name" value="text"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model (2)" width="90" x="648" y="345">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Set Role (4)" to_port="example set input"/>
      <connect from_op="Set Role (4)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>