"Text Mining / Too slow"
veve
Hello,
I have a process that builds TF-IDF vectors from text data (pruning below 3% and above 40%). The preprocessing includes tokenizing, converting to uppercase, stop word filtering and stemming (Snowball).
My data is an example set of about 800,000 lines, and I'm processing 3 text attributes.
The attributes:
- First one: has several words
- Second one: is empty or has 2-3 words
- Third one: has about 300 words
My machine has 15.5 GB of RAM, of which 12 GB is allocated to RapidMiner.
My process handled 20,000 lines in three and a half hours, so I estimate that the full run would take about six and a half days (which is not really acceptable).
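As a rough cross-check of that estimate, here is a minimal sketch of a straight linear extrapolation from the figures above. It assumes that runtime scales linearly with the number of lines, which may not hold in practice (for example, if the vocabulary keeps growing with more data). Under that assumption the result is about 140 hours, a little under six days, which is the same order of magnitude as the estimate in the post.

```python
# Linear extrapolation of the runtime, using only the figures from the post.
# Assumption: processing time grows linearly with the number of lines.
sample_rows = 20_000      # lines processed in the trial run
sample_hours = 3.5        # duration of the trial run
total_rows = 800_000      # size of the full example set

rows_per_hour = sample_rows / sample_hours
estimated_hours = total_rows / rows_per_hour
estimated_days = estimated_hours / 24

print(f"{rows_per_hour:.0f} rows/hour")
print(f"{estimated_hours:.0f} hours, about {estimated_days:.1f} days")
```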
1. Are there any ways to optimize a text processing process?
2. Does it seem to you that there is a problem in my process? (I basically followed the tutorials; there is nothing really special in it.)
3. Are there any benchmark studies on the speed of RapidMiner?
Thank you in advance,
Best regards,
ighyboo
Have you tried the "Parallel Processing" extension? Once it is installed, you will notice a flag in the "Process Documents" operator that allows parallelization of the vector creation. If you have a multi-core machine, that should speed things up quite a bit.
One thing I noticed, though, is that a big data set might saturate all your resources. It did happen to me that the process crashed (though that might just be because I have a very old machine with 2 GB of RAM and 4 cores).
Anyhow, to avoid that, I found it much more useful to work on data coming from a DB. So what I normally do is:
1. Load my data into a DB table using RapidMiner.
2. Use the StreamDB operator to feed the data into the parallelized "Process Documents" operator.
3. Write my results to another DB table.
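The same read-in-batches, process, write-back pattern can be sketched outside RapidMiner. The example below is only an illustration in Python with the standard `sqlite3` module, not RapidMiner's API. The table names `documents` and `results`, the functions `preprocess` and `stream_process`, and the tiny stop word list are all hypothetical, and the parallel step is left out to keep the sketch short. The point is that only one batch is held in memory at a time.

```python
import sqlite3

# Hypothetical, tiny stop word list used only for this illustration.
STOP_WORDS = {"THE", "A", "AN", "AND", "OF"}

def preprocess(text):
    """Uppercase the text, split on whitespace and drop stop words."""
    tokens = (text or "").upper().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def stream_process(conn, batch_size=1000):
    """Read source rows in batches, process them, write to a second table."""
    reader = conn.cursor()
    writer = conn.cursor()
    reader.execute("SELECT id, body FROM documents ORDER BY id")
    while True:
        batch = reader.fetchmany(batch_size)  # at most batch_size rows in memory
        if not batch:
            break
        writer.executemany(
            "INSERT INTO results (id, tokens) VALUES (?, ?)",
            [(doc_id, preprocess(body)) for doc_id, body in batch],
        )
    conn.commit()

# Demo on an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, tokens TEXT)")
conn.executemany(
    "INSERT INTO documents (id, body) VALUES (?, ?)",
    [(1, "the quick brown fox"), (2, None), (3, "a tale of text mining")],
)
stream_process(conn, batch_size=2)
rows = conn.execute("SELECT id, tokens FROM results ORDER BY id").fetchall()
print(rows)
```

A smaller `batch_size` lowers peak memory use at the cost of more round trips to the database.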
Hope this helps
Igor
veve
Hello,
I'm answering a bit late, but yes, I used the parallelization extension (thank you for your answer).
However, only the Process Documents operator can be parallelized, and I would need more parallelization than that. Are there any other solutions?
Alina
ighyboo
The only other option I can think of is to run RM on an Amazon cloud machine with lots of cores. I'm not sure what your current configuration is or what is available on AWS at the moment.
If you search the forum, there are some good tutorials on how to set up RM on AWS.