Text Mining / Clustering / Label Prediction
Hello there,
I am playing around with some text processing. I have a collection of about 1000 sports news articles (mostly soccer/football) collected from different RSS feeds.
To start from a good basis, I categorized them all manually into 7 categories. That leads to the following distribution (the labels are in German: Teamnews = team news, Rest = other, Transfers = transfers, Skandal = scandal, Verletzung = injury, Management = management, Liganews = league news):
label | count | % |
Teamnews | 430 | 37.01 |
Rest | 166 | 14.29 |
Transfers | 143 | 12.31 |
Skandal | 141 | 12.13 |
Verletzung | 124 | 10.67 |
Management | 99 | 8.52 |
Liganews | 59 | 5.08 |
Total | 1162 | 100 |
My aim now is to set up a prediction model that will categorize future articles on its own.
That's where I am a bit stuck. Basically, I do the following text processing:
<?xml version="1.0" encoding="UTF-8"?>
<process version="7.2.001">
  <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="TF-IDF"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
      <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34">
        <parameter key="mode" value="non letters"/>
        <parameter key="characters" value=".:"/>
        <parameter key="language" value="German"/>
        <parameter key="max_token_length" value="3"/>
      </operator>
      <operator activated="true" class="text:filter_by_length" compatibility="7.2.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="246" y="34">
        <parameter key="min_chars" value="4"/>
        <parameter key="max_chars" value="25"/>
      </operator>
      <operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases (3)" width="90" x="380" y="34">
        <parameter key="transform_to" value="lower case"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_german" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="514" y="34">
        <parameter key="stop_word_list" value="Standard"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (3)" width="90" x="648" y="34"/>
      <operator activated="true" class="open_file" compatibility="7.2.001" expanded="true" height="68" name="Open File (2)" width="90" x="715" y="136">
        <parameter key="resource_type" value="file"/>
        <parameter key="filename" value="C:\Master\RSS\stoplist_manuell.txt"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.2.000" expanded="true" height="82" name="Filter Stopwords (4)" width="90" x="849" y="34">
        <parameter key="case_sensitive" value="false"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <operator activated="true" class="open_file" compatibility="7.2.001" expanded="true" height="68" name="Open File (3)" width="90" x="916" y="136">
        <parameter key="resource_type" value="file"/>
        <parameter key="filename" value="C:\Master\RSS\stoplist_manuell_begriffe_aller_kategorien.txt"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.2.000" expanded="true" height="82" name="Filter Stopwords (5)" width="90" x="1050" y="34">
        <parameter key="case_sensitive" value="false"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
      <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
      <connect from_op="Filter Tokens (2)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
      <connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
      <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
      <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Filter Stopwords (4)" to_port="document"/>
      <connect from_op="Open File (2)" from_port="file" to_op="Filter Stopwords (4)" to_port="file"/>
      <connect from_op="Filter Stopwords (4)" from_port="document" to_op="Filter Stopwords (5)" to_port="document"/>
      <connect from_op="Open File (3)" from_port="file" to_op="Filter Stopwords (5)" to_port="file"/>
      <connect from_op="Filter Stopwords (5)" from_port="document" to_port="document 1"/>
      <portSpacing port="source_document" spacing="0"/>
      <portSpacing port="sink_document 1" spacing="0"/>
      <portSpacing port="sink_document 2" spacing="0"/>
    </process>
  </operator>
</process>
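For readers who do not use RapidMiner, the steps in this process (tokenize on non-letters, keep tokens of 4 to 25 characters, lowercase, remove stopwords, build TF-IDF vectors) can be approximated by the following Python sketch. The function names, the tiny stopword set and the two sample sentences are assumptions made for illustration, and the TF-IDF formula is one common variant that may differ from RapidMiner's exact weighting.

```python
import math
import re
from collections import Counter

# Placeholder stopwords (assumption); the real process uses RapidMiner's
# German and English lists plus two custom dictionary files.
STOPWORDS = {"einer", "eine", "nicht", "with", "this"}

def preprocess(text, min_chars=4, max_chars=25, stopwords=STOPWORDS):
    """Tokenize on non-letters, filter by length, lowercase, drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    tokens = [t for t in tokens if min_chars <= len(t) <= max_chars]
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]

def tfidf(docs_tokens):
    """Return one L2-normalized {term: weight} dict per document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vec = {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

# Made-up sample sentences, not taken from the real article collection.
docs = [
    "Der Stürmer fällt mit einer Verletzung am Knie aus.",
    "Der Verein bestätigt den Transfer des Stürmers.",
]
tokenized = [preprocess(d) for d in docs]
vectors = tfidf(tokenized)
print(tokenized[0])
print(vectors[0])
```

Note that without stemming, "stürmer" and "stürmers" end up as two separate attributes, which is also true of the process above.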
In another process I filtered by label and checked the generated word lists, and I was satisfied with the results: the process recognized the most "important" words for every label.
I stored them in a MySQL database. I also created a top-50 word list containing the 50 most frequent words of each label. I am not using either list right now, though.
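As a rough illustration of how such a per-label top-N list can be built, here is a small Python sketch. The helper name `top_words_per_label` and the sample tokens are hypothetical; the real lists were produced inside RapidMiner and stored in the database.

```python
from collections import Counter, defaultdict

def top_words_per_label(labeled_tokens, n=50):
    """labeled_tokens: iterable of (label, token_list) pairs.

    Returns {label: list of the n most frequent tokens for that label}.
    """
    counts = defaultdict(Counter)
    for label, tokens in labeled_tokens:
        counts[label].update(tokens)
    return {label: [word for word, _ in counter.most_common(n)]
            for label, counter in counts.items()}

# Made-up, already preprocessed sample documents.
sample = [
    ("Verletzung", ["knie", "verletzung", "pause"]),
    ("Verletzung", ["verletzung", "operation"]),
    ("Transfers", ["wechsel", "vertrag", "wechsel"]),
]
top = top_words_per_label(sample, n=2)
print(top)
```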
But back to my current problem. To create a model I chose the X-Validation operator and tried different classification learners (Naive Bayes, k-NN, ID3 and Decision Tree).
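Cross-validation splits the labeled articles into folds, trains on all folds but one and tests on the held-out fold. The sketch below shows only the splitting step, assuming 10 folds and stratified sampling; both are assumptions about the operator settings, which are not shown in this post, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Return k lists of example indices with similar label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each label round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Label counts taken from the distribution table above.
counts = {"Teamnews": 430, "Rest": 166, "Transfers": 143, "Skandal": 141,
          "Verletzung": 124, "Management": 99, "Liganews": 59}
labels = [label for label, n in counts.items() for _ in range(n)]
folds = stratified_folds(labels)
for i, fold in enumerate(folds):
    teamnews = sum(1 for idx in fold if labels[idx] == "Teamnews")
    print(f"fold {i}: {len(fold)} articles, {teamnews} Teamnews")
```

Each fold then contains 43 Teamnews articles, so every test fold has roughly the same class mix as the full data set.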
Because the results of the Performance operator were disappointing in all cases, I also used the Optimize Parameters operator, unfortunately without success.
For example, I got an accuracy of 12.48% with my k-NN prediction model.
Here is an example output:
accuracy: 12.48% +/- 0.59% (mikro: 12.48%)
| true Skandal | true Management | true Transfers | true Verletzung | true Teamnews | true Rest | true Liganews | class precision |
pred. Skandal | 141 | 98 | 142 | 124 | 430 | 161 | 58 | 12.22% |
pred. Management | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00% |
pred. Transfers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.00% |
pred. Verletzung | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00% |
pred. Teamnews | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00% |
pred. Rest | 0 | 0 | 0 | 0 | 0 | 4 | 1 | 80.00% |
pred. Liganews | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.00% |
class recall | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% | 2.41% | 0.00% |
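The matrix can be checked against a trivial baseline. The following Python sketch recomputes the accuracy from the numbers above and compares it with a model that always predicts the largest class (Teamnews); the variable names are only illustrative.

```python
labels = ["Skandal", "Management", "Transfers", "Verletzung",
          "Teamnews", "Rest", "Liganews"]
# Rows are predictions, columns are true labels, as in the table above.
matrix = [
    [141, 98, 142, 124, 430, 161, 58],  # pred. Skandal
    [0, 0, 1, 0, 0, 0, 0],              # pred. Management
    [0, 1, 0, 0, 0, 0, 0],              # pred. Transfers
    [0, 0, 0, 0, 0, 0, 0],              # pred. Verletzung
    [0, 0, 0, 0, 0, 0, 0],              # pred. Teamnews
    [0, 0, 0, 0, 0, 4, 1],              # pred. Rest
    [0, 0, 0, 0, 0, 1, 0],              # pred. Liganews
]
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Column sums give the number of articles per true label.
true_counts = [sum(row[j] for row in matrix) for j in range(len(labels))]
majority_baseline = max(true_counts) / total
share_predicted_skandal = sum(matrix[0]) / total

print(f"accuracy:               {accuracy:.2%}")                 # 12.48%
print(f"majority-class baseline: {majority_baseline:.2%}")       # 37.01%
print(f"predicted as Skandal:    {share_predicted_skandal:.2%}")  # 99.31%
```

So 1154 of the 1162 articles are predicted as Skandal, and the model scores well below the 37.01% that always answering Teamnews would reach.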
Reducing the number of articles in the label "Teamnews" to 150 to get a more balanced distribution did not help either.
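The downsampling step amounts to something like the following Python sketch. `downsample` is a hypothetical helper, the article texts are stand-ins, and in RapidMiner this was done with operators rather than code.

```python
import random

def downsample(examples, label, max_count, seed=42):
    """Keep at most max_count examples of `label`; leave other labels untouched.

    examples: list of (label, text) pairs.
    """
    rng = random.Random(seed)
    target = [ex for ex in examples if ex[0] == label]
    others = [ex for ex in examples if ex[0] != label]
    if len(target) > max_count:
        target = rng.sample(target, max_count)
    return others + target

# Stand-in data with the label counts from the distribution table.
counts = {"Teamnews": 430, "Rest": 166, "Transfers": 143, "Skandal": 141,
          "Verletzung": 124, "Management": 99, "Liganews": 59}
examples = [(label, f"article {i}") for label, n in counts.items() for i in range(n)]

reduced = downsample(examples, "Teamnews", 150)
n_teamnews = sum(1 for label, _ in reduced if label == "Teamnews")
print(len(reduced), n_teamnews)
```

With these counts the reduced set has 882 articles, of which 150 are Teamnews (about 17%).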
So is there any hint or tip on how I can increase my accuracy to something above 70%?
Is there a mistake in the earlier text processing steps?
Should I use my stored word lists for each category instead of the whole articles?
Or is this completely the wrong way of doing it?
If you need any more information, please let me know.
Thanks.
Best,
David
Hi,
Quick thought: have you tried a Linear SVM inside a Polynominal by Binominal Classification operator?
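The idea, roughly: an SVM is a binary learner, so the wrapper trains one "this label vs. everything else" model per label and picks the label whose model scores highest. Below is a minimal Python sketch of that, assuming the one-against-all strategy and using plain subgradient descent on the hinge loss in place of RapidMiner's SVM implementation. All function names and the toy documents are hypothetical, and there is no bias term.

```python
def train_binary_svm(X, y, epochs=20, lr=0.1, lam=0.01):
    """Linear SVM objective (L2-regularized hinge loss), trained by
    subgradient descent. X: list of {feature: value} dicts, y: +1 / -1."""
    w = {}
    for _ in range(epochs):
        for x, target in zip(X, y):
            score = sum(w.get(f, 0.0) * v for f, v in x.items())
            # L2 regularization shrinks all weights a little on every step.
            for f in w:
                w[f] *= 1.0 - lr * lam
            # Hinge loss: only examples inside the margin move the weights.
            if target * score < 1.0:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * target * v
    return w

def train_one_vs_rest(X, labels):
    """Train one binary model per label (that label = +1, all others = -1)."""
    return {c: train_binary_svm(X, [1 if l == c else -1 for l in labels])
            for c in sorted(set(labels))}

def predict(models, x):
    """Return the label whose binary model gives the highest score."""
    def score(w):
        return sum(w.get(f, 0.0) * v for f, v in x.items())
    return max(models, key=lambda c: score(models[c]))

# Toy word vectors (made up); real input would be the TF-IDF vectors.
X = [
    {"knie": 1.0, "verletzung": 1.0}, {"verletzung": 1.0, "pause": 1.0},
    {"wechsel": 1.0, "vertrag": 1.0}, {"vertrag": 1.0, "ablöse": 1.0},
    {"skandal": 1.0, "strafe": 1.0}, {"strafe": 1.0, "sperre": 1.0},
]
y = ["Verletzung", "Verletzung", "Transfers", "Transfers", "Skandal", "Skandal"]

models = train_one_vs_rest(X, y)
print(predict(models, {"verletzung": 1.0, "knie": 1.0}))
print(predict(models, {"vertrag": 1.0}))
```

Linear models tend to cope well with the wide, sparse vectors that come out of text processing, which is why this is usually worth a try before tuning k-NN or trees.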
~Martin