Topic extraction on RapidMiner

BadBoy20 · June 2015

Hello everyone. I am new to RapidMiner.

I've been googling but haven't found a way to do this yet. Is there a way for RapidMiner to detect the topic of a set of documents and extract it? Is there also a way to measure how well each document matches a specific keyword? If so, could someone explain how, or link me to a thread that covers it? Thanks.

DocMusher · June 2015

Hi,
Although I am no expert in text mining, your question can be solved by following the usual pattern, as described for instance at http://vancouverdata.blogspot.be/2010/11/text-analytics-with-rapidminer-loading.html. The topic of a document is related to its tags, if available, or to the keywords you quantify through text mining.
Cheers
Sven

BadBoy20 · June 2015

Is there a way to find out how closely a document relates to a certain topic? Let's say I have a document about Shanghai that mentions China a few times. I want to see whether that document relates to China, and how closely.

MartinLiebig · June 2015

Hi,

There are most likely solutions for you inside RapidMiner. I would say there are basically three ways to go:

- Supervised learning

If you have documents with a tag (e.g. China), you can go for supervised learning and build a model for each tag that detects the corresponding topic. If you have tagged data, I would go this way. The tutorial above should help you with this.

- Clustering
If you do not have tagged examples, you can go for clustering, which groups similar documents together. Most likely you want to use either K-Means or K-Medoids for this task. The problems here are: how many topics are we looking for? How do we interpret the results? And, as with tags, a text might belong to more than one topic (e.g. Hotel and China), while these algorithms assign each document to exactly one cluster.

- Simple similarity
You can calculate the similarity between two texts using the Cross Distances operator. This might be helpful in a lot of cases.
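
To make the similarity idea concrete, here is a minimal sketch in plain Python rather than a RapidMiner process. It assumes TF-IDF weighting and cosine similarity as the measure, and the function names and sample documents are made up for illustration. It scores each document against a keyword, which is the "how closely does this relate to China" question from above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only. A real process would
    # usually also remove stop words and stem the tokens.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    # Build one sparse TF-IDF vector (a dict) per text.
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (an assumption; other weighting schemes work too).
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_similarity(documents, keyword):
    # Treat the keyword as one extra tiny document and compare every
    # real document against it.
    vectors = tfidf_vectors(documents + [keyword])
    query = vectors[-1]
    return [cosine(v, query) for v in vectors[:-1]]


# Hypothetical sample documents, for illustration only.
documents = [
    "Shanghai is a port city in China and China's financial hub",
    "The hotel in Paris has a view of the river",
]
scores = keyword_similarity(documents, "china")
```

A document that never mentions the keyword scores 0, and a document that mentions it more often relative to its length scores higher. Comparing against a single keyword needs only one pass over the documents, unlike comparing every document with every other one.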

Cheers,
Martin

BadBoy20 · June 2015

Thank you for that reply. Supervised learning with tags is out of the question because there are no tags. Simple similarity would be way too slow. I think my best bet is clustering, which I already have experience with.

The analysis I am doing involves downloading thousands of documents from a database through a headline search. The thing is, just because a headline contains a certain word does not mean the document is about that topic, hence the topic search. My idea is to cluster the documents in RapidMiner with a suitable value of k and then take the cluster with the most documents as the most topical one. The reasoning is this: if 10000 documents in a database all have the word "china" in the title, then the largest group of closely related documents probably has something to do with the search term. The documents are financial. I wanted to ask, from your experience, whether this is a viable way to identify the topic of financial documents through clustering. Thank you for your advice.
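
To show what I mean, here is a rough sketch of the idea in plain Python, not a RapidMiner process. It is a k-means variant that compares documents by cosine similarity, and it starts from the first k documents as centroids so the result is repeatable (real implementations usually start from random or k-means++ seeds). The sample headlines and all function names are made up.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(vec):
    # Scale a sparse vector to unit length.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def vectorize(text):
    # Plain term counts, normalized to unit length.
    return normalize(Counter(re.findall(r"[a-z]+", text.lower())))


def dot(a, b):
    # Dot product of sparse vectors; equals cosine for unit vectors.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, max_iterations=20):
    # Assumption: deterministic start from the first k documents.
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assign each document to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the normalized sum of its members.
        for c in range(k):
            total = defaultdict(float)
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] += w
            if total:
                centroids[c] = normalize(total)
    return labels


# Hypothetical headlines that all contain the search term "china".
documents = [
    "china central bank cuts interest rates",
    "china hotel guide for tourists",
    "china bank raises interest rates again",
    "china central bank holds rates",
    "tourists book a hotel in china",
]
labels = kmeans([vectorize(d) for d in documents], k=2)

# Take the cluster with the most documents as the "most topical" one.
largest = Counter(labels).most_common(1)[0][0]
largest_members = [i for i, label in enumerate(labels) if label == largest]
```

On this toy data the largest cluster collects the banking headlines and leaves out the travel ones. Whether the largest cluster is really the on-topic one for real data is exactly the part I am unsure about.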

Cheers,
BadBoy20

MartinLiebig · June 2015

Hi again,

A small tip: it is often useful to add a supervised feature selection step after your clustering. It tells you which words make a cluster different from the others. I would use a one-vs-all strategy here.
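
As a rough illustration in plain Python (not a RapidMiner process): for one cluster versus all the others, score each word by how much more often it appears in documents inside the cluster than outside it. This document-frequency difference is a simple stand-in for a proper feature weighting such as information gain or chi-squared. The sample documents, the cluster labels and the function names are made up.

```python
import re


def doc_terms(text):
    # The set of distinct lowercase words in one document.
    return set(re.findall(r"[a-z]+", text.lower()))


def share(term, group):
    # Fraction of documents in the group that contain the term.
    return sum(term in doc for doc in group) / len(group) if group else 0.0


def distinguishing_words(documents, labels, cluster, top_n=3):
    # One-vs-all: documents in `cluster` against all other documents.
    inside = [doc_terms(d) for d, l in zip(documents, labels) if l == cluster]
    outside = [doc_terms(d) for d, l in zip(documents, labels) if l != cluster]
    vocabulary = set().union(*inside, *outside)
    scores = {t: share(t, inside) - share(t, outside) for t in vocabulary}
    # Highest score first; ties broken alphabetically for repeatability.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical documents with cluster labels from an earlier clustering step.
documents = [
    "china central bank cuts interest rates",
    "china hotel guide for tourists",
    "china bank raises interest rates again",
    "china central bank holds rates",
    "tourists book a hotel in china",
]
labels = [0, 1, 0, 0, 1]

finance_words = distinguishing_words(documents, labels, cluster=0)
travel_words = distinguishing_words(documents, labels, cluster=1)
```

Note that a word shared by every document, like the search term itself, scores zero and never shows up as distinguishing. That is the point of the one-vs-all comparison.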

cheers,
Martin
