"Grouping Text Files"

noah977
noah977 New Altair Community Member
edited November 5 in Community Q&A
Hello,

Next challenge in my attempt to learn RM.

I have a collection of text documents. (Maybe 1,000)

IF possible, I would like to use RM for the following:

    1) Automatically cluster them.  (I've seen great screenshots of the results, but have no idea how to do it.)

    2) Do some kind of "best feature" extraction: use TFIDF or another algorithm to find significant 2-word and 3-word features

    3) Maybe do some kind of sentiment analysis.  (I read a great press release about how RM was used for looking at consumer opinion of a laundry detergent.  That is amazing.  How was this done?)

Thanks!!

-N
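
For readers who want to see what item 2 amounts to, here is a minimal sketch in plain Python (not RapidMiner, and not necessarily how the text plugin does it). It scores every 2-word and 3-word phrase in each document with a basic TF-IDF weight. The function names `ngrams` and `top_phrases` and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter

def ngrams(text, sizes=(2, 3)):
    """Return all 2-word and 3-word phrases of a text, lowercased."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in sizes
            for i in range(len(tokens) - n + 1)]

def top_phrases(docs, top_n=3):
    """Rank the phrases of each document by a plain TF-IDF score."""
    per_doc = [Counter(ngrams(d)) for d in docs]
    # Document frequency: in how many documents does each phrase occur?
    df = Counter(p for counts in per_doc for p in counts)
    n_docs = len(docs)
    results = []
    for counts in per_doc:
        total = sum(counts.values()) or 1
        # tf = relative frequency in this document, idf = log(N / df)
        scored = {p: (c / total) * math.log(n_docs / df[p])
                  for p, c in counts.items()}
        # Highest score first; ties are broken alphabetically
        ranked = sorted(scored, key=lambda p: (-scored[p], p))
        results.append(ranked[:top_n])
    return results

docs = [
    "merger with ibm announced; merger with ibm lifts stock price",
    "stock price falls after earnings report",
    "stock price steady before earnings report",
]
print(top_phrases(docs)[0])
```

A phrase such as "stock price" that occurs in every document gets an IDF of zero and drops to the bottom of the ranking, while phrases repeated within a single document rise to the top.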


Answers

  • TobiasMalbrecht
    TobiasMalbrecht New Altair Community Member
    Hi,

    well, all of the things you mentioned are (of course ;)) possible with RM. But explaining all this to you here in general would be tantamount to writing a tutorial or actually a book about text mining/sentiment analysis. I am sure you will understand that this is beyond the scope of this forum...

    If you are willing to learn how to set up such text mining and sentiment analysis processes with RM really fast, I would highly recommend attending one of the training courses we offer. If you are very spontaneous, it might be interesting for you to know that there is a training course on that topic at the beginning of December. As far as I remember, there is still a place available in the course. If you are interested, feel free to contact us. You may send an email to malbrecht@rapid-i.com and I will provide you with some more details.

    Otherwise there is of course the possibility to learn from the example processes which are shipped with the text plugin. However, learning things by yourself might of course take more of your time and might still leave you with open questions ...

    Regards,
    Tobias
  • noah977
    noah977 New Altair Community Member
    Tobias,

    Thank you for the information about your next class.  Unfortunately time and expense prohibit me from attending.  If you ever have a seminar again in California, I would be interested.

    Do you offer any kind of phone consulting?  It would be very helpful to buy one or two hours of time over the phone to discuss some basic project ideas.

  • noah977
    noah977 New Altair Community Member
    OK,

    I took your advice and looked through the examples with the text plugin.  I understand how to implement the plugin, load pages, create vector models, etc.

    My next question is about clustering.  If I choose "K-means", I must define the number of clusters in advance.  Ideally, I would like to feed a number of documents into RM and then get back as many clusters as necessary.  Is there some other tool that intelligently looks at the data (vectors from documents) and creates as many clusters as "necessary" to represent the data?

    Secondly, what kind of output options do I have?  Again, the ideal would be a list of documents for each cluster along with the key features of the cluster.  Perhaps to a text file?

    Thanks!!!!

    -N
  • land
    land New Altair Community Member
    Hi,
    if you try to specify what "necessary" means, you will see that there cannot be any reasonable criterion for selecting the number of clusters automatically. All programs that do so just use a heuristic, which might turn out to be good or bad depending on the circumstances and the problem. RapidMiner does not hide this problem and forces the user to think about the number of clusters needed.
    If you want to learn about clustering, check the samples that come with RapidMiner itself...

    Greetings,
      Sebastian
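
To make the point about heuristics concrete, here is a small sketch in plain Python (not a RapidMiner process) of one common heuristic: run k-means for several values of k and keep the k with the highest mean silhouette score. This is only one possible criterion and, as noted above, it can be good or bad depending on the data. The toy 2-D points stand in for document vectors, and the helper names `kmeans`, `silhouette`, and `pick_k` are made up for this illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Basic k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move the center to the mean of its members (keep it if empty)
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels

def silhouette(points, labels):
    """Mean silhouette score; singleton clusters contribute 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def pick_k(points, k_values):
    """Return the k whose k-means result has the best silhouette score."""
    return max(k_values, key=lambda k: silhouette(points, kmeans(points, k)))

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
print(pick_k(points, range(2, 7)))
```

On three clearly separated groups like these the heuristic picks 3, but on real document vectors the "best" k can change with the distance measure, the preprocessing, and the range of k values tried.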
  • noah977
    noah977 New Altair Community Member
    Sebastian,

    Thank you for the explanation.

    I guess I should explain more about my goals and why I can't specify the number of clusters in advance.

    I am looking at a batch of documents.  Perhaps 1000-10000.  My goal is to use RM to find common "themes" amongst the documents.

    For example:
    1) Cluster of 125 documents all with highly weighted phrases of "litigation", "Product", "injury"
    2) Cluster of 57 documents with features of "Announcement", "Earnings", "Friday"
    3) Cluster of 357  documents with features of "Press Release", "merger with IBM", "stock price"

    My HOPE was that I could use RM to generate good TFIDF weights for tokens in the documents and then group them accordingly.  The logic would be that it would form groups of documents with a similarity score > X  (X would be an adjustable variable.)

    Is this possible?
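
The grouping logic described in this post can be sketched in plain Python as follows. This is only an illustration of the idea, not a RapidMiner process, and it makes several assumptions: a smoothed TF-IDF weighting, cosine similarity as the similarity score, and "single-link" grouping, where two documents end up in the same group whenever a chain of pairs with similarity > X connects them. The function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so that no term gets a zero or undefined weight
        vectors.append({t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_by_similarity(docs, threshold):
    """Group document indices linked by pairs with similarity > threshold."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

docs = [
    "product injury litigation filed over defective product",
    "litigation over product injury claims continues",
    "earnings announcement scheduled for friday",
    "friday earnings announcement moved forward",
    "merger press release lifts stock price",
]
print(group_by_similarity(docs, threshold=0.2))
```

Here the number of groups is not fixed in advance; it follows from the threshold X. Two caveats: single-link grouping can chain loosely related documents into one large group, and comparing every pair of documents grows quadratically, so 10,000 documents would need a faster similarity search than this nested loop.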