"Text Mining / Too slow"

Question

Hello,

I have a process that turns text from data into TF-IDF vectors (pruning below 3% and above 40%). It involves tokenizing, upper-casing, stop-word filtering and stemming (Snowball).

My data is an example set of about 800 000 lines, and I'm processing 3 text attributes.
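For readers unfamiliar with this kind of pipeline, here is a rough plain-Python sketch of what percentage pruning plus TF-IDF amounts to. It is not RapidMiner's implementation: the stop-word list, the function names, the inclusive pruning bounds and the exact TF-IDF formula are all assumptions made for illustration, and stemming is left out because the standard library has no Snowball stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"the", "a", "of", "and", "is"}


def tokenize(text):
    """Split on non-letters, drop stop words, upper-case what remains."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.upper() for t in tokens if t.lower() not in STOP_WORDS]


def tfidf_with_pruning(docs, min_df=0.03, max_df=0.40):
    """Return (vocabulary, per-document TF-IDF dicts) after pruning.

    Terms are kept only if the fraction of documents containing them
    lies within [min_df, max_df] (inclusive bounds are an assumption).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocab = {t for t, c in df.items() if min_df <= c / n_docs <= max_df}

    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        total = sum(counts.values())
        # Term frequency times log inverse document frequency.
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vocab, vectors
```

The point of the sketch is that pruning shrinks the vocabulary before any vectors are built, so the width of the resulting example set depends heavily on the two thresholds.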

The attributes:
-	First one: has several words
-	Second one: has none or 2-3 words
-	Third one: has about 300 words

My machine has 15.5 GB of RAM, of which 12 GB are allocated to RapidMiner.

My process handled 20 000 lines in three and a half hours, so at that rate the full 800 000 lines would take about 140 hours, or roughly 6 days (which is not really acceptable).
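As a sanity check on that estimate, the measured throughput can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes the rate stays constant over the whole data set, which may not hold if memory pressure grows with the vocabulary. The function name is made up for illustration.

```python
def estimate_total_hours(rows_done, hours_spent, rows_total):
    """Linear extrapolation: total time at the observed rows-per-hour rate."""
    rate = rows_done / hours_spent  # rows per hour
    return rows_total / rate


hours = estimate_total_hours(20_000, 3.5, 800_000)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
# prints: 140 hours, about 5.8 days
```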

1.	Are there any ways to optimize a text processing process?
2.	Does it look like there is a problem in my process? (I followed the tutorials, so it doesn't do anything special.)
3.	Are there any benchmark studies on the speed of RapidMiner?

Thank you in advance,

Best regards,

ighyboo · Answer

The only other option I can think of is to run RM on an Amazon cloud machine with lots of cores. I'm not sure what your current config is or what's available on AWS at the moment.

If you search the forum there are some good tutorials on how to set up RM on AWS ;)

veve · Answer

Hello,

I'm answering a bit late, but yes, I used the parallelisation extension (thank you for your answer).

However, only the "process documents" operator can be parallelized. I would need more of the process to run in parallel. Are there any other solutions?

Alina

ighyboo · Answer

Have you tried the "Parallel processing extension"? Once installed, you will notice a flag in the "process documents" operator that allows parallelization of the vector creation. If you have a multi-core machine, that should speed things up quite a bit.

One thing I noticed, though, is that with big data sets it might saturate all your resources, and it did happen to me that the process crashed (it might just be because I have a very old machine with 2GB and 4 cores).
Anyhow, to avoid that I found it much more useful to work on data coming from a DB. So what I normally do is:

1.	Load my data into a DB table using RapidMiner
2.	Use the StreamDB operator to feed things into the parallelized "process documents"
3.	Write my results to another DB table
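Outside RapidMiner, the same idea (read from a table in batches, process each batch, write results to a second table) can be sketched in plain Python with SQLite. This is not what the StreamDB operator does internally. The table names `documents` and `term_counts`, the column names, and the simple word count standing in for the real text processing are all assumptions made for illustration. Parallelism is left out to keep the sketch short.

```python
import sqlite3
from collections import Counter


def stream_rows(conn, batch_size=1000):
    """Yield batches of (id, body) rows so the full table never sits in memory."""
    cur = conn.execute("SELECT id, body FROM documents ORDER BY id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def process_batch(batch):
    """Placeholder for the real text processing: upper-case and count words."""
    rows = []
    for doc_id, body in batch:
        counts = Counter(body.upper().split())
        rows.extend((doc_id, term, n) for term, n in counts.items())
    return rows


def run(src_conn, dst_conn, batch_size=1000):
    """Stream from `documents`, write per-document term counts to `term_counts`.

    src_conn and dst_conn may be the same connection or two different ones.
    """
    dst_conn.execute(
        "CREATE TABLE IF NOT EXISTS term_counts "
        "(doc_id INTEGER, term TEXT, n INTEGER)"
    )
    for batch in stream_rows(src_conn, batch_size):
        dst_conn.executemany(
            "INSERT INTO term_counts VALUES (?, ?, ?)", process_batch(batch)
        )
    dst_conn.commit()
```

Because only one batch is held in memory at a time, the batch size becomes the knob for trading memory against the number of round trips to the database.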

Hope this helps :)
Igor