Text Classification using Text Plugin

New Altair Community Member

Oct 21, 2008

Updated Nov 5, 2024 by Jocelyn

Hi,

I am trying to classify texts stored in a database. I'd like to describe some of the problems I experienced and questions that came up. Since they address different topics, I decided to split the post into three parts. In this one I ask for your opinion: how would you design an experiment for text classification with RapidMiner? If anyone has built a similar experiment, I would be very grateful if they could describe the setup they used.

The setup I have in mind at the moment is something like this:

<operator name="Root" class="Process" expanded="yes">
  <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
    <parameter key="database_url" value="www.example.net"/>
    <parameter key="username" value="example"/>
  </operator>
  <operator name="StringTextInput" class="StringTextInput" expanded="no">
    <parameter key="default_content_encoding" value="UTF-8"/>
    <parameter key="default_content_type" value="html"/>
    <parameter key="filter_nominal_attributes" value="true"/>
    <parameter key="input_word_list" value="example.wordlist"/>
    <list key="namespaces">
    </list>
    <parameter key="prune_above" value="5%"/>
    <parameter key="prune_below" value="3"/>
    <parameter key="remove_original_attributes" value="true"/>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
    <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
    </operator>
    <operator name="TokenLengthFilter" class="TokenLengthFilter">
      <parameter key="max_chars" value="25"/>
      <parameter key="min_chars" value="3"/>
    </operator>
    <operator name="GermanStemmer" class="GermanStemmer">
    </operator>
  </operator>
  <operator name="XValidation" class="XValidation" expanded="yes">
    <parameter key="create_complete_model" value="true"/>
    <parameter key="number_of_validations" value="5"/>
    <operator name="W-NaiveBayesMultinomialUpdateable" class="W-NaiveBayesMultinomialUpdateable">
    </operator>
    <operator name="Testing" class="OperatorChain" expanded="yes">
      <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
        <parameter key="keep_model" value="true"/>
      </operator>
      <operator name="ClassificationPerformance" class="ClassificationPerformance">
        <list key="class_weights">
        </list>
        <parameter key="classification_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="keep_example_set" value="true"/>
      </operator>
    </operator>
  </operator>
</operator>

This is only the part that learns the model. Normally a part that applies the model to unlabeled data would follow. Later on I'd like to create the wordlist from the database entries (at the moment I work with a given wordlist) and use the UpdateModel operator to update the model incrementally with new labeled data. More about this in my other posts in "Problems and Support".
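To make the incremental-update idea more concrete, here is a minimal sketch in plain Python rather than RapidMiner. It is only an illustration under my own assumptions: `tokenize` and `IncrementalMultinomialNB` are made-up names, the length bounds mirror the TokenLengthFilter settings above (3 to 25 characters), stopword filtering and stemming are left out, and the classifier is a textbook multinomial Naive Bayes with Laplace smoothing, not the Weka implementation or the UpdateModel operator.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text, min_chars=3, max_chars=25):
    """Lowercase, split on non-letters, keep tokens within the length bounds."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_chars <= len(t) <= max_chars]


class IncrementalMultinomialNB:
    """Multinomial Naive Bayes that stores only counts, so it can be updated."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.class_docs = Counter()               # documents seen per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()

    def update(self, texts, labels):
        """Add new labeled texts to the counts; may be called repeatedly."""
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.class_docs[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the class with the highest log posterior for one text."""
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this example only.
model = IncrementalMultinomialNB()
model.update(["Der Hund bellt laut", "Die Katze schläft"], ["hund", "katze"])
model.update(["Ein Hund jagt den Ball"], ["hund"])  # a later labeled batch
prediction = model.predict("Der Hund bellt")
print(prediction)
```

Because the model keeps only per-class counts, a later batch of labeled texts can be folded in by calling `update` again without retraining on the old data, which is the behaviour I am hoping to get from an updateable learner.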