Text Mining - Document Similarity & Clustering
I'm new to this forum, to RapidMiner, and to text mining, so I need your help:
Answers
hello @AB9200 welcome to the community.
So this sounds like a classic text classification problem. You want the algorithm to take a new question and classify it based on what it has "learned" from similar questions it has seen before. To do this, you need a "training set" of questions that have already been classified by other means (manually is fine). Once you have a training set, you can use one of the "Process Documents" operators to generate TF-IDF word vectors and build your ML model on them. There are good resources in our Training section on building ML classification models, and many resources on text mining on our YouTube channel.
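For readers who want to see the idea outside of RapidMiner, here is a minimal Python sketch of the same workflow: build TF-IDF vectors from a small labeled training set, then assign a new question to the label whose averaged vector is closest by cosine similarity. This is only an illustration under stated assumptions, not what the "Process Documents" operators do internally. The example questions, the labels `account` and `billing`, and all function names are hypothetical, and the nearest-centroid rule is just one simple choice of classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def normalize(vec):
    """Scale a sparse vector (dict) to unit length; an all-zero vector becomes {}."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf(tokens, idf):
    """Unit-length TF-IDF vector; terms never seen in training are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return normalize({t: c * idf[t] for t, c in tf.items()})


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def train(labeled):
    """labeled is a list of (text, label) pairs. Returns (idf, centroids)."""
    docs = [tokenize(text) for text, _ in labeled]
    idf = fit_idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, (_, label) in zip(docs, labeled):
        for term, weight in tfidf(tokens, idf).items():
            sums[label][term] += weight
    centroids = {label: normalize(vec) for label, vec in sums.items()}
    return idf, centroids


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the new text."""
    vec = tfidf(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical, manually labeled training set (the "training set" from the post).
TRAINING = [
    ("How do I reset my password?", "account"),
    ("I forgot my password and cannot log in", "account"),
    ("Where can I download my invoice?", "billing"),
    ("My invoice shows the wrong amount", "billing"),
]

idf, centroids = train(TRAINING)
print(predict("password reset is not working", idf, centroids))
```

With a real data set you would want far more labeled questions per class, and a held-out set to check how well the model generalizes.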
Scott
Hi @sgenzer,
Thank you so much for the reply. That is exactly what I would like to do, but I am looking for a way to solve the problem without a "training set". Is that possible? Maybe by doing some clustering, calculating document similarity, or using topic modeling... I don't know exactly.
Thank you again.
hello @AB9200 - so, to quote Euclid: "There is no royal road to geometry." In other words, sometimes you just need to roll up your sleeves and put in the time to get a good solution.
If you want to look at an unsupervised approach, I would recommend watching my recent webinar on topic analysis using the new LDA operator. I walk you through how to do this step-by-step.
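As a small illustration of what "unsupervised" means here, the sketch below groups questions without any labels. To be clear, this is not LDA and not the operator from the webinar: it is the simpler document-similarity idea raised earlier in the thread, using pairwise cosine similarity over TF-IDF vectors and joining any two documents whose similarity reaches a threshold. The example questions, the function names, and the threshold value of 0.15 are all assumptions chosen for the illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Return one unit-length TF-IDF vector (as a dict) per input text."""
    docs = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in Counter(doc).items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def group_similar(texts, threshold=0.15):
    """Single-link grouping: documents at or above the threshold share a group.

    Returns a sorted list of groups, each a list of indices into texts.
    """
    vectors = tfidf_vectors(texts)
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the group's representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical unlabeled questions.
QUESTIONS = [
    "How do I reset my password?",
    "Password reset link does not work",
    "Where can we download the invoice?",
    "Invoice download fails with an error",
]

for group in group_similar(QUESTIONS):
    print([QUESTIONS[i] for i in group])
```

On real text the threshold needs tuning, and common words ("I", "my", "the") should be filtered out first, otherwise unrelated questions can end up linked through them.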
Scott
Hi all,
Euclid would have made a good... data scientist!!!
Regards,
Lionel