"Multi-label text classification problem"

Coren New Altair Community Member
edited November 5 in Community Q&A
I'm trying to set up a multi-label (not just multi-class!) text classification experiment. To give you an idea: I have a data set of text documents, and each document can belong to one or more classes, like blog posts with multiple topic tags. I would like to train and evaluate a machine learner on this data set.


My documents are stored in directories named after all applicable labels, much like below:
sports_events
> article1.txt
> article2.txt
politics_events
> article3.txt
politics
> article4.txt
...
So far, I've managed to turn my input documents into word vectors using "Process Documents from Files" and a combination of tokenization, stemming and filtering. But I have several questions:

1. How do I make sure RapidMiner understands that the labels I enter in the "text directories" list (in the "Process Documents from Files" block) are multiple labels, and not just one big agglutinated label? The "sports, events" label should become "sports" AND "events". Just using commas in the class name apparently doesn't work.
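To make the goal of question 1 concrete, here is a minimal Python sketch (outside RapidMiner, and not a RapidMiner feature) of the transformation I'm after: one combined label string per document becomes one 0/1 column per individual label. The function names `split_labels` and `to_indicator_rows` are made up for illustration, and the separator is an assumption (a comma here; it would be "_" for directory names like sports_events).

```python
def split_labels(raw_label, sep=","):
    """Split a combined label such as "sports, events" into its parts."""
    return [part.strip() for part in raw_label.split(sep) if part.strip()]


def to_indicator_rows(raw_labels, sep=","):
    """Return (sorted label names, one 0/1 row per document)."""
    label_sets = [set(split_labels(raw, sep)) for raw in raw_labels]
    names = sorted(set().union(*label_sets)) if label_sets else []
    rows = [[1 if name in labels else 0 for name in names] for labels in label_sets]
    return names, rows


# Illustrative input: one combined label string per document.
names, rows = to_indicator_rows(["sports, events", "politics, events", "politics"])
for row in rows:
    print(dict(zip(names, row)))
```

Each resulting column can then be treated as its own binary target.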

Setting this problem aside for now, I also tried exporting the generated feature vectors in a sparse format that I can feed to libSVM externally. Which brings me to question 2:

2. With the "Write Special" block, I'm trying to write sparse vectors using the following format:
$l  $s[ ][:]
However, the label in the output is the nominal label, not the integer mapping that libSVM would require. How do I write the integer instead of the nominal label?
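For reference, this is roughly the output I would expect, sketched in Python as a post-processing step rather than as anything "Write Special" does itself. The helper names `build_label_map` and `to_libsvm_line` are hypothetical, and numbering the labels in order of first appearance is just one possible convention.

```python
def build_label_map(labels):
    """Assign each distinct nominal label an integer, in order of first appearance."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return mapping


def to_libsvm_line(label_id, features):
    """Format one example; `features` maps a 1-based feature index to its value."""
    pairs = " ".join(
        f"{idx}:{val:g}" for idx, val in sorted(features.items()) if val != 0
    )
    return f"{label_id} {pairs}".rstrip()


# Illustrative data only: nominal labels and sparse feature vectors.
nominal = ["politics", "sports", "politics"]
vectors = [{3: 0.031, 1: 0.0012}, {2: 0.5}, {2: 0.0008, 3: 0.002}]

label_map = build_label_map(nominal)
lines = [to_libsvm_line(label_map[lab], vec) for lab, vec in zip(nominal, vectors)]
print("\n".join(lines))
```

Indices are written in ascending order and zero values are skipped, which is what the sparse libSVM input format expects.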

And finally:

3. I would like to write the wordlist resulting from all the tokenization, stemming and filtering etc. to a file. This file should include at least the feature index and the matching realization. So for instance:
1: germany
2: bankers
3: a
...
Even better would be to write a kind of extended sparse feature vector, where each index:value pair is preceded by its realization in the text:
politics,events  germany 1:0.0012 a 3:0.0310 ...
politics  germany bankers 2:0.0008 a 3:0.0020 ...
Is it possible to do this? If so, how? The only way I've been able to store the wordlist is with the "Write" block, which produces an unwieldy XML file...
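To pin down the two output formats, here is a small Python sketch of what I mean. It assumes the word list is already available as an ordered list of terms (how to get that out of RapidMiner is exactly my question), and the helper names `wordlist_lines` and `extended_line` are made up.

```python
def wordlist_lines(terms):
    """One "index: term" line per word, with 1-based indices."""
    return [f"{i}: {term}" for i, term in enumerate(terms, start=1)]


def extended_line(label, features, terms):
    """Prefix each index:value pair with the term that the index stands for."""
    parts = [
        f"{terms[idx - 1]} {idx}:{val:.4f}" for idx, val in sorted(features.items())
    ]
    return f"{label}  " + " ".join(parts)


# Illustrative word list and one sparse vector (1-based index -> weight).
terms = ["germany", "bankers", "a"]
print("\n".join(wordlist_lines(terms)))
print(extended_line("politics,events", {1: 0.0012, 3: 0.0310}, terms))
```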


Any help from more experienced RapidMiners would be greatly appreciated!

Answers

  • Coren
    Coren New Altair Community Member
    Okay, here's an update:

    I managed to set multiple labels by using the "Split" block to split the label attributes on the comma, and by then setting the role of each of the new label columns to label1, label2, etc. So my data set is pretty much ready.

    Now I'm trying to set up the classification. Basically, what I want is to train an SVM classifier for each label in the training set, and each instance in the test set should be evaluated against each SVM model (of course using the appropriate label).
    In a final phase, I want to set a threshold on the output probability of each label so I can determine which labels should be included in the final output.
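The thresholding step I have in mind looks roughly like this in Python. It is only a sketch: `select_labels` is a made-up name, the per-label thresholds are placeholders, and the default of 0.5 is an assumption.

```python
def select_labels(probabilities, thresholds, default=0.5):
    """Keep every label whose probability reaches its own threshold."""
    return sorted(
        label
        for label, prob in probabilities.items()
        if prob >= thresholds.get(label, default)
    )


# Illustrative per-label probabilities for one test document.
scores = {"sports": 0.8, "events": 0.55, "politics": 0.1}
chosen = select_labels(scores, thresholds={"events": 0.6})
print(chosen)
```

Labels without an explicit threshold fall back to the default, so thresholds can be tuned one label at a time.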

    I'm already stuck at that first step: I've been able to build models for each label using the "Loop Labels" block containing a "Discretization" and a "libSVM" block. This returns a collection of models.
    I can also make a collection of test example sets using "Loop Labels".
    My question now is: how do I evaluate each test set on its corresponding model? ExampleSet_Collection[0] should be run through Model_Collection[0], ExampleSet_Collection[1] through Model_Collection[1], etc. (Kind of like the zip() function in Python, if anyone's familiar with it.)
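In Python terms, the pairing I'm after would look something like the sketch below. The models are stand-in callables rather than real SVMs, the test sets are tiny made-up lists of (features, true label) pairs, and `evaluate_pairwise` is a hypothetical name.

```python
def evaluate_pairwise(models, test_sets):
    """Score test_sets[i] with models[i] and return one accuracy per pair."""
    if len(models) != len(test_sets):
        raise ValueError("need exactly one test set per model")
    accuracies = []
    for model, examples in zip(models, test_sets):
        correct = sum(1 for features, truth in examples if model(features) == truth)
        accuracies.append(correct / len(examples) if examples else 0.0)
    return accuracies


# Stand-in "models": each predicts True if a given feature is present.
models = [
    lambda f: f.get("goal", 0) > 0,  # pretend classifier for the first label
    lambda f: f.get("vote", 0) > 0,  # pretend classifier for the second label
]
# One test set per label: (features, true value of that label).
test_sets = [
    [({"goal": 1}, True), ({"vote": 1}, False)],
    [({"vote": 1}, True), ({"goal": 1}, True)],
]
accuracies = evaluate_pairwise(models, test_sets)
print(accuracies)
```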

    Here's my unfinished setup as it is. I'd be grateful if someone could help me complete it:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
        <process expanded="true" height="695" width="1058">
          <operator activated="true" class="retrieve" compatibility="5.2.006" expanded="true" height="60" name="TRAIN" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Repo/OSS2011.fold-01.tokenized-stemmed-filtered.train"/>
          </operator>
          <operator activated="true" class="loop_labels" compatibility="5.2.006" expanded="true" height="76" name="Loop Labels" width="90" x="246" y="30">
            <process expanded="true" height="695" width="1058">
              <operator activated="true" class="discretize_by_user_specification" compatibility="5.2.006" expanded="true" height="94" name="Discretize" width="90" x="111" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="label_0"/>
                <parameter key="regular_expression" value="label_0"/>
                <parameter key="include_special_attributes" value="true"/>
                <list key="classes">
                  <parameter key="false" value="0.5"/>
                  <parameter key="true" value="Infinity"/>
                </list>
              </operator>
              <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.006" expanded="true" height="76" name="SVM" width="90" x="313" y="30">
                <list key="class_weights"/>
              </operator>
              <connect from_port="example set" to_op="Discretize" to_port="example set input"/>
              <connect from_op="Discretize" from_port="example set output" to_op="SVM" to_port="training set"/>
              <connect from_op="SVM" from_port="model" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="retrieve" compatibility="5.2.006" expanded="true" height="60" name="TEST" width="90" x="45" y="255">
            <parameter key="repository_entry" value="//Repo/OSS2011.fold-01.tokenized-stemmed-filtered.test"/>
          </operator>
          <operator activated="true" class="loop_labels" compatibility="5.2.006" expanded="true" height="76" name="Loop Labels (2)" width="90" x="246" y="255">
            <process expanded="true" height="695" width="1058">
              <connect from_port="example set" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="TRAIN" from_port="output" to_op="Loop Labels" to_port="example set"/>
          <connect from_op="TEST" from_port="output" to_op="Loop Labels (2)" to_port="example set"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>