Topic extraction in RapidMiner

BadBoy20
BadBoy20 New Altair Community Member
edited November 5 in Community Q&A
Hello everyone. I am new to RapidMiner.

I've been googling but haven't found a way to do this yet. Is there a way for RapidMiner to detect the topic of a set of documents and extract it? Is there also a way to measure how well each document matches a specific keyword? If so, could someone explain how, or link me to a relevant thread? Thanks.
Answers

  • DocMusher
    DocMusher New Altair Community Member
    Hi,
    Although I am no expert in text mining, your question can be solved by following the usual text-mining workflow, as described for instance at http://vancouverdata.blogspot.be/2010/11/text-analytics-with-rapidminer-loading.html. The topic of a document is related to its tags, if available, or to the keywords you quantify through text mining.
    Cheers
    Sven
  • BadBoy20
    BadBoy20 New Altair Community Member
    Is there a way to find how closely a document is related to a certain topic? Let's say I have a document about Shanghai that mentions China a few times. I want to see whether that document relates to China, and how closely.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,

    There are most likely solutions for you inside RapidMiner. I would say there are basically three ways to go:

    - Supervised learning

    If you have documents with a tag (e.g. China), you can go for supervised learning and build a model for each tag that detects the different topics. If you have tagged data, I would go this way. The tutorial above should help you with this.

    - Clustering
    If you do not have tagged examples, you can go for clustering, which groups similar documents together. Most likely you want to use either K-Means or K-Medoids for this task. The problems here are: how many topics are we searching for? How do we interpret the results? And, as with tags, a text might belong to more than one topic (e.g. Hotel and China).

    - Simple similarity
    You can calculate a similarity between two texts using cross distances. This might be helpful in a lot of cases.

    Cheers,
    Martin
  • BadBoy20
    BadBoy20 New Altair Community Member
    Thank you for that reply. Supervised learning with tags is out of the question because there are no tags. Simple similarity would be way too slow. I think my best bet is to use clustering, which I already have experience with.

    The data analysis I am doing involves downloading thousands of documents from a database via a headline (heading) search. The thing is, just because the heading contains a certain word does not mean the document is actually about that topic, hence the topic search. My idea is to cluster the documents in RapidMiner using a suitable value of k and then take the cluster with the most objects as the most topical one. The reasoning is that if, say, 10,000 documents in a database all have the word "china" in the title, then the cluster whose documents are most closely related to one another probably has something to do with the heading/search term. The documents are financial. I want to ask, from your experience, whether this is a viable way to interpret the topic of financial documents through clustering. Thank you for your advice.

    Cheers,
    BadBoy20
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi again,

    A small tip: it is often useful to add a supervised-learning feature selection step after your clustering. It answers the question: which words make this cluster different from the others? I would use a one-vs-all strategy here.

    cheers,
    Martin