Text Mining - Document Similarity/Clustering

sangeet171188

I have 100K text records, each containing a problem description. I want to achieve the following:

1) Count the problem descriptions that look "similar". (How can I achieve this?)

2) The main roadblock is that the process takes a lifetime to run.

Steps:

Data import --> Select Attributes --> Process Documents from Data (tokenize, stopwords, n-grams) --> K-Means/DBSCAN

How can I optimize this to run faster?

Accepted answers

MartinLiebig

Hi,

what part of the analysis takes long? Clustering or Tokenizing? Do you use RM 7.2+?

~Martin

sangeet171188

I can make Process Documents from Data run faster, but the clustering takes more than a day to run.

I am using community version 7.5. Please suggest how I can decrease the run time. And will the enterprise version solve the problem?

MartinLiebig

Hi,

I think the way to go is to reduce the number of attributes, e.g. by pruning.

~Martin

sangeet171188

After TF-IDF we get a lot of attributes anyway; even if I remove stopwords by dictionary and prune, this takes forever to run.

I want to know how exactly to proceed with clustering a lot of text, and whether an enterprise version would solve my problem in any way.

MartinLiebig

Hi,

How many attributes do you have if you use percentual pruning with 5,50?

Best,

Martin

sangeet171188

900 approx. regular attributes

MartinLiebig

Hi,

That's simply still a lot for k-means: 900 x 100,000. Either you prune/stem/filter harder, or you can put PCA in front of k-means.
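To make the size argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a standard Lloyd-style k-means, where each iteration compares every example with every centroid across every attribute. The 100,000 x 900 shape comes from this thread; the cluster count `k = 20` and the reduced width of 50 components are purely illustrative assumptions, and the cost of running PCA itself is ignored.

```python
def kmeans_iteration_cost(n_examples, n_attributes, k):
    """Approximate number of per-attribute distance terms in one k-means pass."""
    return n_examples * n_attributes * k


n_examples = 100_000      # from the thread
full_width = 900          # attributes left after 5/50 percentual pruning
reduced_width = 50        # assumption: number of components kept after PCA
k = 20                    # assumption: number of clusters

full_cost = kmeans_iteration_cost(n_examples, full_width, k)
reduced_cost = kmeans_iteration_cost(n_examples, reduced_width, k)
speedup = full_cost / reduced_cost

print(f"full: {full_cost:,}  reduced: {reduced_cost:,}  ratio: {speedup:.0f}x")
```

The ratio depends only on the attribute counts (900 / 50), so whichever way the width is reduced (pruning, stemming, or PCA), the per-iteration work shrinks in proportion.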

~Martin

sangeet171188

But we are dealing with text documents here.

The attributes will all be text (words, n-grams). I can try to cut them down further by 10% at most. How will PCA help me here? Please enlighten me.

And what is the process for clustering large text documents?

Thomas_Ott

Yes, but depending on whether you're creating bi- or tri-grams, you're blowing up the size of your data set, and that affects training time.

When you have all those columns (aka attributes), you are creating a high-dimensional data set, and the clustering algorithm has to work hard to calculate the groupings. The fewer attributes you have, the faster it will be.

You could use a larger machine with more memory or a RapidMiner Server on a large server, but the best option is to do what @mschmitz said and try to reduce the number of attributes by PCA, pruning, or reducing the number of n-grams. It's a trade-off that you have to think about carefully.
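The effect of n-grams on the attribute count can be checked on a toy example. The sketch below is plain Python rather than a RapidMiner process; the function name `ngram_vocabulary` and the three sample texts are made up for illustration, and tokenizing on whitespace is an assumption.

```python
def ngram_vocabulary(texts, max_n):
    """Return the set of distinct 1..max_n-grams across all texts."""
    vocab = set()
    for text in texts:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                vocab.add("_".join(tokens[start:start + n]))
    return vocab


texts = [
    "printer shows paper jam",
    "printer shows toner low",
    "vpn shows timeout error",
]

# Vocabulary size with unigrams only, then adding bigrams, then trigrams.
sizes = [len(ngram_vocabulary(texts, max_n)) for max_n in (1, 2, 3)]
print(sizes)
```

Every distinct n-gram becomes one attribute, so the width of the data set grows with each extra n-gram level even though the number of documents stays the same.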

kayman

It all depends on what you want to achieve. If you want your cluster mechanisms to look at your 100K records and nicely group them into pretty detailed clusters, you are indeed in for a challenge.

What I therefore typically do is split this into 2 (or even 3) steps. The first is to get rid of as much as possible and use the result as a first filter (call it level 1 or so); then each of these groups goes through the process again, but now with less aggressive pruning. This way you analyse, for instance, 10 times 10K articles, each within a given high-level cluster, rather than working with the full set.
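The two-level idea can be sketched in Python as follows. This is only an outline under assumptions: `two_level_cluster`, the two vocabularies, and the sample texts are hypothetical, and `argmax_cluster` is a toy stand-in that exists solely to make the sketch runnable. In practice the `cluster` argument would be whatever clustering step you actually use (k-means, DBSCAN, ...).

```python
from collections import defaultdict


def two_level_cluster(docs, coarse_vocab, fine_vocab, cluster):
    """Cluster all docs on a small vocabulary, then re-cluster each group
    on a larger one. Returns {doc_index: (coarse_label, fine_label)}."""
    def vectorize(text, vocab):
        tokens = text.lower().split()
        return [tokens.count(term) for term in vocab]

    # Level 1: aggressive pruning, so every document is a short vector.
    coarse_labels = cluster([vectorize(d, coarse_vocab) for d in docs])

    groups = defaultdict(list)
    for idx, label in enumerate(coarse_labels):
        groups[label].append(idx)

    # Level 2: gentler pruning, but each run only sees one group.
    result = {}
    for label, members in groups.items():
        fine_labels = cluster([vectorize(docs[i], fine_vocab) for i in members])
        for i, fine in zip(members, fine_labels):
            result[i] = (label, fine)
    return result


def argmax_cluster(vectors):
    """Toy stand-in for a real clusterer: label = index of the largest count."""
    return [max(range(len(v)), key=v.__getitem__) for v in vectors]


docs = [
    "printer paper jam tray",
    "printer toner low",
    "network vpn timeout",
    "network wifi dropped",
]
labels = two_level_cluster(
    docs,
    coarse_vocab=["printer", "network"],
    fine_vocab=["jam", "toner", "vpn", "wifi"],
    cluster=argmax_cluster,
)
print(labels)
```

The point of the structure is that the expensive, wide vectors are only ever built for one level-1 group at a time, never for the full set.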

Also, do not limit yourself to the out-of-the-box stopword list; it is extremely limited and will therefore have no real impact on your attribute set.

The typical workflow I use is as follows (assuming the content is in a language you understand):

1) Look at the top keywords: just generating a wordlist before you start can give you an enormous amount of knowledge, so do not vectorize yet at this stage.

-> Convert everything to lower case (often overlooked, yet so important) -> tokenize on spaces (or whatever you prefer) -> apply additional steps such as filtering out numbers or words containing numbers (unless they add value) -> stemming. Porter is a good one; it reduces readability on your end, but the machine doesn't care, it's all bits and bytes in the end.

Then look at the top words of the resulting wordlist. You will notice that the vast majority do not add any value because they are very generic terms; all of these can be stored in a Filter Stopwords (Dictionary) operator. You would be surprised by the impact this has. The fewer generic terms there are, the easier it is for a machine to cluster. Humans work the same way, but we skip the generic terms automatically when analysing a text; machines have to learn to do so, and that takes an awful lot of time. So make it easier: remove upfront the generic terms it would otherwise have to learn to ignore.
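As a rough illustration of the wordlist step outside RapidMiner, the Python sketch below lowercases, tokenizes on whitespace, drops tokens containing digits, and counts what is left. The function name `build_wordlist`, the sample texts, and the custom stopword set are all made up for the example, and stemming is left out to keep it short.

```python
import re
from collections import Counter


def build_wordlist(texts, stopwords=frozenset()):
    """Count tokens after lowercasing and dropping numbers and stopwords."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            token = token.strip(".,;:!?()\"'")
            if not token or re.search(r"\d", token) or token in stopwords:
                continue
            counts[token] += 1
    return counts


texts = [
    "Please help: the printer shows error 4021",
    "The VPN drops, please help",
    "Please reset the password for user42",
]

# First pass: inspect the most frequent terms by eye.
wordlist = build_wordlist(texts)
print(wordlist.most_common(3))

# Second pass: the generic top terms go into a custom stopword dictionary.
custom_stopwords = {"please", "the", "help"}
cleaned = build_wordlist(texts, custom_stopwords)
print(sorted(cleaned))
```

The loop is meant to be repeated: look at the top of the list, move the generic terms into the dictionary, and rebuild until the remaining top terms are descriptive.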

Another thing you could do is use the POS (part of speech) operator in the first stage. Typically nouns give the most descriptive value, so removing everything that is not a noun is also an option for the first heavy clustering.

2) Repeat step 1) and look at the resulting wordlist again. Once you have the feeling that the remaining words are generally descriptive and most of the garbage is gone, you can start the actual clustering.

3) Now cluster using your improved process (based on your wordlist analysis), and try to get rid of as much as possible before you start the vector process. This limits the number of jobs done at the same time. You can achieve this by running one operator that does the cleaning and keeps the text, and using its output as the input for a second operator that does the vector generation. This way the second operator only needs to deal with splitting and modelling, as the pre-cleaning is already done. I'm not sure this has a big impact, but many small gains can add up to a big one.

Play with the prune values. You can use percentages, but typically I would use hard figures. Given that the top words left are indeed important (as verified using the wordlist and the extended stopword list), we should not really focus on the top side (these are just a few attributes that appear often anyway) but on the bottom side instead, as that is an awful lot of words that each appear only rarely. Since they are so rare, it is fair to assume they will not add much value for clustering, so you could for instance state that any word appearing fewer than 20 times can be skipped.
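Pruning by a hard figure can be sketched in Python like this. It is an illustration, not the RapidMiner operator: `prune_by_document_count` is a hypothetical helper, it counts the number of documents a term occurs in (one possible reading of "appearing fewer than 20 times"), and the toy threshold of 2 stands in for the 20 mentioned above because the sample is tiny.

```python
from collections import Counter


def prune_by_document_count(tokenized_docs, min_docs, max_docs=None):
    """Keep only terms that occur in at least min_docs (and at most max_docs)
    documents. Returns (pruned_docs, kept_terms)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document

    upper = max_docs if max_docs is not None else len(tokenized_docs)
    kept = {term for term, n in doc_freq.items() if min_docs <= n <= upper}
    pruned = [[t for t in tokens if t in kept] for tokens in tokenized_docs]
    return pruned, kept


tokenized_docs = [
    ["printer", "jam", "tray"],
    ["printer", "toner"],
    ["printer", "jam", "xk-17"],
    ["vpn", "timeout"],
    ["vpn", "jam"],
]

pruned_docs, kept_terms = prune_by_document_count(tokenized_docs, min_docs=2)
print(sorted(kept_terms))
print(pruned_docs)
```

Rare terms such as one-off part numbers disappear, while the frequent terms that survived the stopword review are left untouched because no upper limit is set.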