Classification of short texts

I want to use the text plugin and a support vector machine for text classification. My data consists mainly of short texts, and I have found that I cannot prune too aggressively or I will end up with no matching tokens left in the text I want to classify. Furthermore, once I have trained my model I want to apply it to a single text at a time.

My problem is the performance of the setup. I end up with a model with a large number of tokens (50,000+), and typically I want to classify a text with about 10 matching tokens. I use libSVM for the model and SingleTextInput for the application process. The problem is that it takes a long time to classify a single short text. Loading the model is fast as long as binary serialization is used, but the tokenization is painfully slow, which is probably due to RapidMiner filling zeros into the vector.

My question is therefore: Is there a way around this? Is it possible to use a sparse format which will work with SingleTextInput and an SVM?

fischer

Hi,

If the problem is the non-sparse data representation, then you can try StringTextInput with an ExampleSet created using datamanagement=sparse_array. However, I believe the real cause is an inefficiency in the way example sets access Attributes in 4.4. This is already fixed, and the fix will be included in 4.5, which is coming soon.

Cheers,
Simon

kochan

OK, I will look forward to the 4.5 release and see if it solves my problem.

Thanks for the quick response.

Andreas

kochan

I have tried my process on a patched version of RapidMiner that includes the improved way of accessing attributes.

I can see that there has been a huge performance increase in RapidMiner in general. However, the fix does not seem to solve the fundamental problem I have. I get a lot of warnings like:

Kernel model: The given exampleset does not contain a regular attribute......

Basically, I have two sets of attributes: a small one containing the tokens from the short text I want to classify, and a large one containing all the tokens in the training material. Apparently, what RapidMiner does is look up every attribute from the large set in the small set. In my setup it would be much more efficient to look up the attributes present in the small set in the large set and then assume that the rest are zero.
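The lookup direction described above can be illustrated with a small sketch that is independent of RapidMiner. It assumes the training word list is available as a mapping from token to column index; the names `vocab_index` and `sparse_vector` are hypothetical and are not part of RapidMiner or the text plugin.

```python
from collections import Counter


def sparse_vector(tokens, vocab_index):
    """Map a short text's tokens to {column index: count}.

    Only the tokens of the short text are looked up in the (large)
    training vocabulary. Every column missing from the result is
    implicitly zero. Tokens unknown to the vocabulary are dropped.
    """
    counts = Counter(tokens)
    return {
        vocab_index[token]: count
        for token, count in counts.items()
        if token in vocab_index
    }


# Hypothetical vocabulary standing in for the 50,000+ training tokens.
vocab_index = {f"token{i}": i for i in range(50000)}
vocab_index.update({"svm": 50000, "text": 50001, "short": 50002})

# A short text with one repeated token and one token not seen in training.
vector = sparse_vector(["short", "text", "svm", "text", "unseen"], vocab_index)
```

With this direction the work grows with the number of tokens in the short text rather than with the size of the training vocabulary.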

Regards,

Andreas

fischer

Hi again,

First of all, note that whether or not you are using a sparse representation depends on how you generate your example set. From your post I cannot tell exactly how you are constructing it, since I am a bit confused about the two attribute sets you mention.

More importantly, since you are referring to two sets of attributes, I think there might be a general problem with your process setup. For example, if the set of attributes used for classification and the one used for training are disjoint, your results will probably be pretty meaningless. The strange warnings you are getting also point to a problem in the process setup. Maybe you can just post your process here.

Best,
Simon

kochan

Hi Simon,

I have investigated this problem a bit further and have come to the conclusion that you are right: there is a problem with my setup, and it is probably the source of my performance problems.

The essence of my setup is that I have a training process like this:

- DatabaseExampleSource
- StringTextInput
- - StringTokenizer
- LibSVMLearner
- ModelWriter

And an applier process like this:

- ModelLoader
- SingleTextInput
- - StringTokenizer
- ModelApplier

In the applier process the SingleTextInput operator produces an ExampleSet. However, to the best of my knowledge there is no way of specifying that this ExampleSet should have a sparse representation. Also, I have the word "id" in my training material, which causes a problem since SingleTextInput by default creates an Attribute named "id" and automatically renames the old one to "id_". The extra attribute unfortunately implies that the input vector to the ModelApplier operator is one element longer than what LibSVM expects. The only way I have been able to solve this problem is to manually edit the wordlist and change "id" into something else.
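The manual word list edit can be sketched as a small preprocessing step. This is only an illustration of the workaround, not a RapidMiner feature: the function `rename_reserved`, the `tok_` prefix, and the set of reserved names are all assumptions, and the same renaming would have to be applied consistently to the word list used for training and for application.

```python
# Assumption: names that the text input operator creates on its own.
RESERVED_NAMES = {"id"}


def rename_reserved(wordlist, reserved=RESERVED_NAMES, prefix="tok_"):
    """Return a copy of the word list with reserved names prefixed.

    Words that do not collide with a reserved attribute name are kept
    unchanged, so the length and order of the list are preserved.
    """
    return [prefix + word if word in reserved else word for word in wordlist]


# Hypothetical word list containing the problematic token "id".
wordlist = ["customer", "id", "invoice"]
safe_wordlist = rename_reserved(wordlist)
```

Because the list keeps its length and order, the number of token attributes stays the same as in the original word list.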

All in all I have been able to solve my problems with small workarounds.

Regards,

Andreas