Text Classification using Text Plugin

pser New Altair Community Member
edited November 5 in Community Q&A

Hi,

I am trying to classify texts stored in a database. I'd like to describe some of the problems I experienced and the questions that came up. Since they address different topics, I decided to split the post into three parts. In this one I ask for your opinion: how would you design an experiment for text classification with RapidMiner? If anyone has built a similar experiment, I would be very grateful if they could describe the setup they used.

The setup I have in mind at the moment is something like this:

<operator name="Root" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url" value="www.example.net"/>
        <parameter key="username" value="example"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="no">
        <parameter key="default_content_encoding" value="UTF-8"/>
        <parameter key="default_content_type" value="html"/>
        <parameter key="filter_nominal_attributes" value="true"/>
        <parameter key="input_word_list" value="example.wordlist"/>
        <list key="namespaces">
        </list>
        <parameter key="prune_above" value="5%"/>
        <parameter key="prune_below" value="3"/>
        <parameter key="remove_original_attributes" value="true"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="max_chars" value="25"/>
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="GermanStemmer" class="GermanStemmer">
        </operator>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="create_complete_model" value="true"/>
        <parameter key="number_of_validations" value="5"/>
        <operator name="W-NaiveBayesMultinomialUpdateable" class="W-NaiveBayesMultinomialUpdateable">
        </operator>
        <operator name="Testing" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
                <parameter key="keep_model" value="true"/>
            </operator>
            <operator name="ClassificationPerformance" class="ClassificationPerformance">
                <list key="class_weights">
                </list>
                <parameter key="classification_error" value="true"/>
                <parameter key="correlation" value="true"/>
                <parameter key="keep_example_set" value="true"/>
            </operator>
        </operator>
    </operator>
</operator>
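
To make clear what I expect the preprocessing chain to do, here is a rough Python sketch of the same steps outside RapidMiner: tokenizing, stopword filtering, token length filtering (3 to 25 characters) and pruning of the wordlist. It is only an illustration, not the Text Plugin's implementation. The stopword set is a tiny hypothetical stand-in, stemming is left out, and I am assuming that prune_below is an absolute document count while prune_above is a fraction of all documents. The function names are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword set; a real German stopword list is much longer.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "nicht"}


def tokenize(text, min_chars=3, max_chars=25):
    """Lowercase, split on non-letters, drop stopwords and tokens outside the length range."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens
            if min_chars <= len(t) <= max_chars and t not in GERMAN_STOPWORDS]


def build_wordlist(documents, prune_below=3, prune_above=0.05):
    """Keep terms whose document frequency lies within the pruning bounds.

    Assumption: prune_below is an absolute number of documents and
    prune_above is a fraction of all documents.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document.
        doc_freq.update(set(tokenize(text)))
    upper = prune_above * len(documents)
    return sorted(t for t, df in doc_freq.items() if prune_below <= df <= upper)


def vectorize(text, wordlist):
    """Turn a text into a vector of term occurrences over the wordlist."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in wordlist]
```

Under the assumption stated above, the combination prune_below = 3 and prune_above = 5% leaves no terms at all unless there are at least 60 documents, because a term would have to occur in at least 3 documents and in at most 5% of them.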

The process above covers only learning the model. Normally, a part that applies the model to unlabeled data would follow. Later on I'd like to create the wordlist from the database entries (at the moment I work with a given wordlist) and use the UpdateModel operator to update the model incrementally with new labeled data. More about this in my other posts in "Problems and Support".
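
To illustrate what I mean by updating the model incrementally, here is a minimal Python sketch of a multinomial Naive Bayes classifier that stores raw counts, so new labeled texts can be added without retraining from scratch. This is not the code behind W-NaiveBayesMultinomialUpdateable or the UpdateModel operator. The class name, the method names and the Laplace smoothing constant are my own assumptions for the example.

```python
import math
from collections import Counter, defaultdict


class IncrementalMultinomialNB:
    """Multinomial Naive Bayes that keeps raw counts so it can be updated in place."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.class_docs = Counter()              # number of documents seen per class
        self.term_counts = defaultdict(Counter)  # term occurrences per class
        self.vocabulary = set()

    def update(self, tokens, label):
        """Add one labeled, already tokenized document to the model."""
        self.class_docs[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the most probable label, or None if the model is empty."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t not in self.vocabulary:
                    continue  # terms never seen in training are ignored
                score += math.log((counts[t] + self.alpha)
                                  / (total_terms + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A new labeled text would then simply be passed to update(), for example model.update(["boerse", "kurs"], "finanz"), and the next call to predict() already takes it into account.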

Answers