"Mining standard text and then creating clusters?"

rtaank New Altair Community Member
edited November 5 in Community Q&A
Hi there,

I am relatively new to RapidMiner, and to data mining too for that matter.

I have just installed RapidMiner, been through the tutorial, and studied the various literature.

I wanted to know whether RapidMiner can be fed paragraphs of standard written English text (in a *.dat file), parse all the paragraphs, and identify patterns within the text (e.g. certain keywords or phrases that appear to be similar). RapidMiner should then decide that there should be x clusters as a result of the parsed text, and assign each paragraph in the *.dat file to a cluster.

I have heard that it is possible to do this using some form of unsupervised learning model?

Any ideas from the community on how this could be tackled?

I was also having difficulty importing text into RapidMiner using the ExampleSource IO operator, so any guidance here would be highly appreciated too.

Thanks for your time.

Ritesh

Answers

  • land
    land New Altair Community Member
    Hi Ritesh,
    this is possible, but you will need the text mining plugin available from our homepage. You can then translate a paragraph of text into something called a bag of words, which can be used as input to the usual learning algorithms, supervised or unsupervised.

    Greetings,
      Sebastian
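To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not the text mining plugin's implementation, only an illustration: the helper names `tokenize` and `bag_of_words` and the two sample paragraphs are made up for this example, and the tokenizer is a deliberately simple assumption (lowercase alphabetic words).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(paragraphs):
    """Return (vocabulary, rows) with one word-count row per paragraph."""
    counts = [Counter(tokenize(p)) for p in paragraphs]
    # The vocabulary is every distinct word seen in any paragraph, sorted.
    vocabulary = sorted(set().union(*counts))
    # Each row holds the count of every vocabulary word in one paragraph.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical sample data, one string per paragraph.
paragraphs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]
vocabulary, rows = bag_of_words(paragraphs)
print(vocabulary)
for row in rows:
    print(row)
```

Each paragraph becomes one row and each distinct word becomes one column, which is the table shape a learning algorithm expects as input.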
  • rtaank
    rtaank New Altair Community Member
    Hi, thanks for your response. It sounds encouraging and promising at the same time!

    Okay, I have managed to read in my text files using the Tokenizer operator, and I can see the word list in the results. All is good so far.

    The next challenge is the actual classification of these words into clusters by RapidMiner. Any guidance on how to proceed with this?

    Any help is much appreciated.

    Ritesh
  • land
    land New Altair Community Member
    Hi,
    that's quite simple: after you have used the TextInput operator with a tokenizer, you will have an ExampleSet containing each text as an example, with one attribute per word. This is a normal ExampleSet that you can handle with any clustering algorithm you like. So take a look at the example processes for clustering delivered with RapidMiner, replace the part that loads the data with the TextInput operator, run the process, and take a look at the results :)

    Greetings,
      Sebastian
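The workflow described above (one example per text, one attribute per word, then any clustering algorithm) can be sketched outside RapidMiner as follows. This is only an illustration under stated assumptions: it uses a plain k-means written in pure Python rather than any RapidMiner operator, the function names `vectorize`, `squared_distance`, and `kmeans` are hypothetical, the sample paragraphs are invented, and the number of clusters `k` is fixed by hand rather than chosen automatically.

```python
import re
from collections import Counter


def vectorize(paragraphs):
    """Turn each paragraph into a word-count vector (one attribute per word)."""
    counts = [Counter(re.findall(r"[a-z']+", p.lower())) for p in paragraphs]
    vocabulary = sorted(set().union(*counts))
    return [[float(c[w]) for w in vocabulary] for c in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids start at the first k vectors (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample data: two paragraphs about animals, two about markets.
paragraphs = [
    "The cat chased the mouse.",
    "Stocks fell as the market closed.",
    "The mouse hid from the cat.",
    "The market rallied and stocks rose.",
]
labels = kmeans(vectorize(paragraphs), k=2)
for label, paragraph in zip(labels, paragraphs):
    print(label, paragraph)
```

With this data the animal paragraphs share one label and the market paragraphs the other. Raw word counts let common words such as "the" weigh heavily, so real text usually benefits from stop-word removal or a weighting scheme such as TF-IDF before clustering.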