Plain Text Classification/Clustering

rtaank
rtaank New Altair Community Member
edited November 5 in Community Q&A
Hi all,

This is the scenario.

I have an input text file containing many thousands of paragraphs of comments made by different people in plain English. Each person's comment or statement is one paragraph, and paragraphs are separated by a newline (\n).

I want to read in this single file and have RapidMiner assign each paragraph in it to a particular cluster or topic. I am aware that RapidMiner will expect me to specify up front how many clusters or classes I want. This is fine, although ideally I would like RapidMiner to determine this for me based on the input file.

I have installed the text plugin for RapidMiner and am using TextInput to read the single input file. However, I am having difficulty getting RapidMiner to treat each paragraph in the file as one example. Any ideas on how this can be done?

Secondly, I would like to know which type of learning is most suitable for the problem above: unsupervised or supervised?

Finally, once the type of learning has been decided, can somebody suggest which algorithms are best suited to classifying natural-language English text?

My plan is to create a learner (model) that can then easily be applied to future comments as and when they occur.

Thanks in advance for your time.

Ritesh

Answers

  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    For tasks like this you can probably use the "Segmenter" operator, which is also part of the text plugin.

    Cheers,
    Ingo
  • rtaank
    rtaank New Altair Community Member
    Hi, thanks for that response.

    When you say 'segmentation', are you referring to the problem of reading in the text file itself, or to the actual learning?
  • IngoRM
    IngoRM New Altair Community Member
    I mean the reading. The Segmenter can be used to break the single text file down into many smaller texts, one for each paragraph. You can then apply the learned model to each of those texts.

    Cheers,
    Ingo
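
For readers who want to see the idea outside RapidMiner, here is a rough Python sketch of the two steps described in this thread: split the single file into one text per paragraph, then apply a model to each text. This is not the Segmenter operator or anything from the text plugin. The names split_paragraphs, apply_model and toy_model are made up for illustration, and toy_model is only a keyword rule standing in for whatever model you actually train.

```python
from typing import Callable, List


def split_paragraphs(text: str) -> List[str]:
    """Return one string per non-empty line, treating each line as a paragraph."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def apply_model(paragraphs: List[str], model: Callable[[str], str]) -> List[str]:
    """Apply a per-text model to every paragraph and collect the labels."""
    return [model(p) for p in paragraphs]


def toy_model(paragraph: str) -> str:
    # Hypothetical stand-in for a trained model: a keyword rule, not a real learner.
    return "pricing" if "price" in paragraph.lower() else "other"


# Assumed input format: one comment per line, blank lines ignored.
sample = "The price is too high.\n\nSupport was quick and friendly.\nGood price overall.\n"
paragraphs = split_paragraphs(sample)
labels = apply_model(paragraphs, toy_model)
for label, paragraph in zip(labels, paragraphs):
    print(label, "|", paragraph)
```

With a real file you would read its contents into a string first and pass that to split_paragraphs. Keeping the splitting separate from the model means the same trained model can be applied to new comments later without touching the reading step.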