"Grouping Text Files"

Question

Hello,

Next challenge in my attempt to learn RM.

I have a collection of text documents. (Maybe 1,000)

IF possible, I would like to use RM for the following:

1) Automatically cluster them.  (I've seen great screenshots of the results, but have no idea how to do it.)

2) Do some kind of "best feature" extraction - use TFIDF or another algorithm to find significant 2-word and 3-word features.

3) Maybe do some kind of sentiment analysis.  (I read a great press release about how RM was used for looking at consumer opinion of a laundry detergent.  That is amazing.  How was this done?)
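
To make item 2 concrete, here is a minimal sketch in plain Python (not a RapidMiner process) of scoring 2-word and 3-word phrases with TFIDF. The function names `ngrams`, `tfidf` and `top_features` are made up for this illustration, and the weighting is the simplest textbook variant (relative term frequency times log of inverse document frequency); a real tool may tokenize and normalize differently.

```python
import math
import re
from collections import Counter


def ngrams(text, sizes=(2, 3)):
    """Return the lowercase word n-grams of the requested sizes."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in sizes
        for i in range(len(words) - n + 1)
    ]


def tfidf(docs, sizes=(2, 3)):
    """Return one {ngram: weight} dict per document (hypothetical helper)."""
    counts = [Counter(ngrams(d, sizes)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values()) or 1
        weights.append(
            {g: (k / total) * math.log(n_docs / df[g]) for g, k in c.items()}
        )
    return weights


def top_features(weight, k=3):
    """Return the k highest-weighted n-grams, ties broken alphabetically."""
    return sorted(weight, key=lambda g: (-weight[g], g))[:k]
```

Phrases that occur in every document get a weight of zero, so only phrases that distinguish a document from the rest rise to the top.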

Thanks!!

-N

noah977 · Answer

Sebastian,

Thank you for the explanation.

I guess I should explain more about my goals and why I can't specify the number of clusters in advance.

I am looking at a batch of documents.  Perhaps 1000-10000.  My goal is to use RM to find common "themes" amongst the documents.

For example:
1) Cluster of 125 documents all with highly weighted phrases of "litigation", "Product", "injury"
2) Cluster of 57 documents with features of "Announcement", "Earnings", "Friday"
3) Cluster of 357 documents with features of "Press Release", "merger with IBM", "stock price"

My HOPE was that I could use RM to generate good TFIDF weights for tokens in the documents and then group them accordingly.  The logic would be that it would form groups of documents with a similarity score > X  (X would be an adjustable variable.)

Is this possible?
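
For illustration only, the "similarity score > X" idea can be sketched in plain Python as follows. This is not a RapidMiner operator: it assumes each document is already a sparse `{term: weight}` dict (for example TFIDF weights), uses cosine similarity as the score, and puts two documents in the same group whenever a chain of pairwise similarities above X connects them. The names `cosine` and `group_by_threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def group_by_threshold(vectors, x):
    """Link documents whose similarity exceeds x; return groups of indices."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of similar documents.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > x:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The number of groups is not fixed in advance, but it is still controlled by the choice of X: a higher threshold gives more, smaller groups. The pairwise loop is quadratic, so 10,000 documents means roughly 50 million comparisons.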

land · Answer

Hi,
if you try to specify what "necessary" means, you will see that there cannot be any reasonable criterion for selecting the number of clusters automatically. All programs that do this just apply a heuristic, which may turn out well or badly depending on the circumstances and the problem. RapidMiner does not hide this problem and forces the user to think about the number of clusters needed.
If you want to learn about clustering, check the samples that come with RapidMiner itself...

Greetings, 
   Sebastian

noah977 · Answer

OK,

I took your advice and looked through the examples with the text plugin.  I understand how to implement the plugin, load pages, create vector models, etc.

My next question is about clustering.  If I choose "K-means", I must define the number of clusters in advance.  Ideally, I would like to feed a number of documents into RM and then get back as many clusters as necessary.  Is there some other tool that intelligently looks at the data (vectors from documents) and creates as many clusters as "necessary" to represent the data?

Secondly, what kind of output options do I have?  Again, the ideal would be a list of documents for each cluster along with the key features of the cluster.  Perhaps to a text file?
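
As a rough picture of that kind of output, here is a small plain-Python sketch (not a RapidMiner export operator). It assumes you already have, for each document, a file name, a cluster label and a `{term: weight}` dict; the function name `cluster_report` and the report layout are invented for this example. Key features are simply the terms with the highest summed weight inside each cluster.

```python
from collections import defaultdict


def cluster_report(names, labels, vectors, k=3):
    """Build a text report: per cluster, its top-k features and its documents."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    lines = []
    for label in sorted(members):
        docs = members[label]
        # Sum each term's weight over all documents in the cluster.
        totals = defaultdict(float)
        for idx in docs:
            for term, weight in vectors[idx].items():
                totals[term] += weight
        top = sorted(totals, key=lambda t: (-totals[t], t))[:k]
        lines.append(f"Cluster {label} (size {len(docs)})")
        lines.append("  key features: " + ", ".join(top))
        lines.extend(f"  {names[idx]}" for idx in docs)
    return "\n".join(lines) + "\n"


# Example of saving the report (the file name is only an assumption):
# with open("clusters.txt", "w", encoding="utf-8") as fh:
#     fh.write(cluster_report(names, labels, vectors))
```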

Thanks!!!!

-N