Text Mining - Document Similarity & Clustering

AB9200
AB9200 New Altair Community Member
edited November 5 in Community Q&A

Hi everyone

I'm new to this forum, to RapidMiner, and to text mining, so I need your help:

I have a large number of documents (.txt), each containing a specific question about solving a problem and the corresponding answer.

My objective is, given a new question, to identify the closest existing ones (all the questions are in Italian) in order to suggest a possible solution based on the answers given to those similar questions.

I have downloaded the Text Mining Extension and I imagine I have to use the "Process Documents from Files" operator (Tokenize, Filter Stopwords (Italian), Transform Cases, Stem...) first and then probably use the "Document to Similarity" and "Clustering" operators.

Could you please give me some hints?

 

Thanks a lot!
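
For readers who want to see what the preprocessing chain named in the question (tokenize, filter stopwords, transform cases, stem) does to a single question, here is a minimal sketch in plain Python. It is not the RapidMiner implementation: the stopword list, the suffix list, and the function names are all assumptions made up for illustration, and the crude suffix stripper is only a stand-in for a real Italian stemmer.

```python
import re

# Tiny illustrative stopword list (assumption): a real run would use a full Italian list.
ITALIAN_STOPWORDS = {
    "il", "lo", "la", "i", "gli", "le", "un", "una", "di", "del", "della",
    "a", "da", "in", "con", "su", "per", "e", "che", "non", "come", "mi", "mio",
}

# Crude suffix list, checked in this order (assumption): a stand-in for a real stemmer.
SUFFIXES = ("azione", "mente", "are", "ere", "ire", "o", "a", "e", "i")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (accented letters kept)."""
    return re.findall(r"[a-zàèéìòù]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in ITALIAN_STOPWORDS]


tokens = preprocess("Come posso reimpostare la password del mio account?")
print(tokens)
```

The output of a chain like this is what gets turned into word vectors in the later steps, so the choice of stopword list and stemmer directly affects which questions end up looking similar.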


Answers

  • sgenzer
    Altair Employee

    hello @AB9200 welcome to the community.

     

    This sounds like a classic text classification problem. You want the algorithm to take a new question and classify it based on what it has "learned" from similar questions it has seen before. To do this, you need a "training set" of questions that have already been classified by other means (manually, if need be). Once you have a training set, you can use one of the "Process Documents" operators to generate TF-IDF word vectors and build your ML model on them. There are good resources in our Training section about how to build ML classification models, and many resources on text mining on our YouTube channel.

     

    Scott
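
To make the supervised route described above concrete, here is a rough sketch in plain Python: TF-IDF vectors are built from a small labelled training set and a new question is assigned to the category whose average vector (centroid) it is closest to. The training questions, the category labels, and the nearest-centroid classifier are all assumptions chosen for brevity; the reply does not say which model to use, and this is not what a RapidMiner process does internally.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training set: (question, category) pairs invented for illustration.
TRAINING = [
    ("come reimpostare la password", "account"),
    ("password dimenticata non riesco ad accedere", "account"),
    ("la stampante non stampa", "hardware"),
    ("stampante bloccata errore carta", "hardware"),
]


def tokens(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """TF-IDF vector as a dict; terms unseen in training are ignored."""
    tf = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def train(examples):
    """Return the IDF table and one averaged TF-IDF vector per category."""
    docs = [tokens(question) for question, _ in examples]
    idf = fit_idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for doc, (_, label) in zip(docs, examples):
        for term, weight in tfidf(doc, idf).items():
            sums[label][term] += weight
        counts[label] += 1
    centroids = {
        label: {term: weight / counts[label] for term, weight in vec.items()}
        for label, vec in sums.items()
    }
    return idf, centroids


def classify(question, idf, centroids):
    """Pick the category whose centroid has the highest cosine similarity."""
    vec = tfidf(tokens(question), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


idf, centroids = train(TRAINING)
print(classify("ho dimenticato la password", idf, centroids))
```

The point of the sketch is the dependency it exposes: without the category labels in `TRAINING` there is nothing to compute centroids from, which is why this route needs a training set.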

     

  • AB9200
    AB9200 New Altair Community Member

    Hi @sgenzer,

     

    thank you so much for the reply. That is exactly what I would like to do, but I am looking for a way to solve the problem without using a "training set". Is that possible? Maybe by doing some clustering, calculating document similarity, or using topic modeling... I don't know exactly.

     

    Thank you again.
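
The document-similarity idea raised here needs no labels at all, and a minimal sketch may help show why. Below, every archived question becomes a length-normalised TF-IDF vector, and a new question is compared against all of them by cosine similarity; the best matches are returned together with their stored answers. The archive contents and all function names are hypothetical, and plain whitespace tokenization is assumed in place of a full preprocessing chain.

```python
import math
from collections import Counter

# Hypothetical archive of (question, answer) pairs, invented for illustration.
ARCHIVE = [
    ("come reimpostare la password", "Usa il link 'password dimenticata'."),
    ("la stampante non stampa", "Controlla la carta e riavvia la stampante."),
    ("come cambiare la password di posta", "Apri le impostazioni dell'account di posta."),
]


def tokens(text):
    return text.lower().split()


def vectorize(doc, idf):
    """Length-normalised TF-IDF vector, so a dot product equals cosine similarity."""
    raw = {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


def build_index(questions):
    """Compute smoothed IDF weights and one vector per archived question."""
    docs = [tokens(q) for q in questions]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return idf, [vectorize(d, idf) for d in docs]


def most_similar(query, idf, vectors, k=2):
    """Return the k best (score, archive_index) pairs, highest score first."""
    q = vectorize(tokens(query), idf)
    scores = [
        (sum(w * v.get(t, 0.0) for t, w in q.items()), i)
        for i, v in enumerate(vectors)
    ]
    return sorted(scores, reverse=True)[:k]


idf, vectors = build_index([question for question, _ in ARCHIVE])
ranked = most_similar("password dimenticata come reimpostare", idf, vectors)
for score, i in ranked:
    print(f"{score:.2f}  {ARCHIVE[i][0]}  ->  {ARCHIVE[i][1]}")
```

Words in the new question that never occur in the archive are simply ignored, so a question written with entirely new vocabulary scores zero against everything. Stemming helps with that to some extent.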

     

     

  • sgenzer
    Altair Employee

    hello @AB9200 - so, to quote Euclid: "There is no royal road to geometry." In other words, sometimes you just need to roll up your sleeves and put in the time to get a good solution.

     

    If you want to look at an unsupervised approach, I would recommend watching my recent webinar on topic analysis using the new LDA operator. I walk you through how to do this step-by-step.


    Scott
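
As a companion to the LDA suggestion above, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is only a sketch of the general technique under assumed hyperparameters (`alpha`, `beta`, the iteration count) and is not the RapidMiner LDA operator or the process shown in the webinar; the example documents are invented. It returns, for each document, the estimated proportion of each topic.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # uses of word w in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        current = []
        for w in doc:
            k = rng.randrange(num_topics)
            current.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(current)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each document.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


# Hypothetical tokenized questions, invented for illustration.
docs = [
    ["password", "account", "accesso", "password"],
    ["stampante", "carta", "errore", "stampante"],
    ["account", "accesso", "password"],
    ["carta", "stampante", "errore"],
]
theta = lda_gibbs(docs, num_topics=2)
for proportions in theta:
    print([round(p, 2) for p in proportions])
```

Questions whose topic proportions look alike can then be treated as related, which gives a label-free grouping. The topics themselves are numbered arbitrarily and still need a human to interpret them.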

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi all,

     

    Euclid would have made a good... data scientist!

     

    Regards,

     

    Lionel