I'm attempting to set up a multi-label (not just multi-class!) text classification experiment. To give you an idea: I have a data set of text documents, and each document can belong to one or more classes. Think blog posts with multiple topic tags. I would like to train and evaluate a classifier on this data set.
My documents are stored in directories named after all applicable labels, much like below:
sports_events
> article1.txt
> article2.txt
politics_events
> article3.txt
politics
> article4.txt
...
So far, I've managed to turn my input documents into word vectors using "Process Documents from Files" and a combination of tokenization, stemming and filtering. But I have several questions:
1. How do I make sure RapidMiner understands that the labels I enter in the "text directories" list (in the "Process Documents from Files" block) are multiple labels, and not just one big agglutinated label? The "sports, events" label should become "sports" AND "events". Just using commas in the class name apparently doesn't work.
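To make the intended behaviour concrete, here is a minimal Python sketch (outside RapidMiner, so not a solution in itself) of the label splitting I'm after. The function names are hypothetical, and I'm assuming the underscore in my directory names is the separator between labels:

```python
def split_labels(directory_name, separator="_"):
    """Split a combined directory name such as 'sports_events' into its labels."""
    return [part.strip() for part in directory_name.split(separator) if part.strip()]


def binary_indicators(directory_names, separator="_"):
    """Return the sorted label vocabulary and one 0/1 row per directory name."""
    label_sets = [set(split_labels(name, separator)) for name in directory_names]
    vocabulary = sorted(set().union(*label_sets))
    rows = [[1 if label in labels else 0 for label in vocabulary] for labels in label_sets]
    return vocabulary, rows


# Directory names taken from the listing above.
vocabulary, rows = binary_indicators(["sports_events", "politics_events", "politics"])
print(vocabulary)
for row in rows:
    print(row)
```

In other words, each document should end up with one binary attribute per individual label rather than a single nominal attribute holding the combined name.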
Disregarding this problem for a while, I also tried exporting the generated feature vectors into a sparse format I can feed to libSVM externally. Which brings me to question 2:
2. With the "Write Special" block, I'm trying to write sparse vectors using the following format:
$l $s[ ][:]
However, the label in the output is the nominal label, not the integer mapping that libSVM would require. How do I write the integer instead of the nominal label?
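As an illustration of the output I'd like to get, this is roughly what I would otherwise have to do by hand in a post-processing step. It is only a sketch in Python: the function name is hypothetical, the feature values are made up for the example, and the integer ids are simply assigned in order of first appearance, which may not match whatever internal mapping RapidMiner uses:

```python
def to_libsvm_lines(examples):
    """Convert (nominal_label, {feature_index: value}) pairs to libSVM-style lines."""
    label_ids = {}  # nominal label -> integer, assigned in order of first appearance
    lines = []
    for label, features in examples:
        label_id = label_ids.setdefault(label, len(label_ids) + 1)
        # Sparse format: only non-zero features, sorted by ascending index.
        pairs = " ".join(
            f"{index}:{value}" for index, value in sorted(features.items()) if value != 0
        )
        lines.append(f"{label_id} {pairs}".rstrip())
    return lines, label_ids


# Illustrative examples only; the values are not real output.
examples = [
    ("politics_events", {1: 0.0012, 3: 0.031}),
    ("politics", {2: 0.0008, 3: 0.002}),
    ("politics_events", {2: 0.5}),
]
lines, label_ids = to_libsvm_lines(examples)
print("\n".join(lines))
print(label_ids)
```

Ideally "Write Special" would emit the integer directly so that this extra step isn't needed.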
And finally:
3. I would like to write the wordlist resulting from all the tokenization, stemming and filtering etc. to a file. This file should include at least the feature index and the matching realization. So for instance:
1: germany
2: bankers
3: a
...
Even better would be to write a kind of extended sparse feature vector, where each index:value pair is preceded by its realization in the text:
politics,events germany 1:0.0012 a 3:0.0310 ...
politics germany bankers 2:0.0008 a 3:0.0020 ...
Is it possible to do this? If so, how? The only way I've been able to store the wordlist is with the "Write" block, which produces an unwieldy XML file...
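To pin down the two output formats I have in mind, here is a small Python sketch. It assumes the wordlist is available as a plain mapping from feature index to token, which is exactly the part I don't know how to get out of RapidMiner; the function names are hypothetical:

```python
def wordlist_lines(wordlist):
    """Format {feature_index: token} as 'index: token' lines, ordered by index."""
    return [f"{index}: {token}" for index, token in sorted(wordlist.items())]


def extended_vector_line(labels, features, wordlist):
    """Format one document as 'labels token index:value token index:value ...'."""
    parts = [",".join(labels)]
    for index, value in sorted(features.items()):
        parts.append(f"{wordlist[index]} {index}:{value:.4f}")
    return " ".join(parts)


# Wordlist and values copied from the examples above.
wordlist = {1: "germany", 2: "bankers", 3: "a"}
print("\n".join(wordlist_lines(wordlist)))
print(extended_vector_line(["politics", "events"], {1: 0.0012, 3: 0.0310}, wordlist))
```

If RapidMiner can export the index-to-token mapping in any plain-text form, I could do the rest myself along these lines.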
Any help from more experienced RapidMiners would be greatly appreciated!