Text mining and word counting problem

PatrickHou (New Altair Community Member)
edited November 5 in Community Q&A

Hi 

 

I'm new to RapidMiner and I'm now analyzing several txt documents. Let's say I have extracted the 20 most frequently appearing words and I want to know (and only know) how many times each of them shows up in each document. Can someone give me some ideas?

 

Also, I have a problem: "united", "states", and "united_states" all appear in my results, but I can't simply replace them because not all occurrences of "united" are related to "united states". How can I pull out "united_states" without having it counted toward "united" and "states"?

 

Thanks 

Patrick

Answers

  • Telcontar120 (New Altair Community Member)

    For your first question, use Process Documents with a specific wordlist supplied (your 20 words) and compute the word vector using Term Occurrences.

    For your second question, you can use Generate N-Grams after Tokenize (and any other text preprocessing), which will give you a token "united_states" that is separate from both "united" and "states". A rough sketch of both ideas follows below.

     
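    A minimal Python sketch of both ideas, outside RapidMiner, in case it helps to see the logic spelled out. The folder name "docs" and the three-word list are placeholders for your own 50 files and 20 words; the real Tokenize and n-gram operators are more configurable than this.

    import os
    import re
    from collections import Counter

    WORDLIST = {"united", "states", "congress"}   # stand-in for your 20 words
    TARGETS = WORDLIST | {"united_states"}

    def tokenize(text):
        # crude lowercase word tokenizer (stand-in for the Tokenize operator)
        return re.findall(r"[a-z]+", text.lower())

    def bigrams(tokens):
        # join neighbouring tokens with "_", e.g. "united states" -> "united_states"
        return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

    for filename in sorted(os.listdir("docs")):       # one row per document
        with open(os.path.join("docs", filename), encoding="utf-8") as f:
            tokens = tokenize(f.read())
        counts = Counter(tokens + bigrams(tokens))
        # per-document occurrence counts, restricted to the terms you care about
        print(filename, {term: counts[term] for term in sorted(TARGETS)})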

  • PatrickHou (New Altair Community Member)

    Thanks for the reply!

     

    I have already used Term Occurrences, but that gave me the overall occurrence count for each word; I want the occurrences of each word in each document (I have about 50 files).

     

    For the second question, does that mean the "united" and "states" counts are not related to "united_states"?

     

    Patrick

  • PatrickHou (New Altair Community Member)

    I looked into the documentation, and it seems that when I use the n-Gram operator all words get combined, whether or not they are related. That means I need a filter or pruning for those words, I think? But how?

  • Telcontar120 (New Altair Community Member)

    You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data.  Do you have the "create word vector" parameter checked?

    The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction.  So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".
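
    In plain Python terms, with hypothetical counts for one document:

    united_total = 10     # occurrences of the token "united"
    united_states = 6     # occurrences of the bigram "united_states"
    united_alone = united_total - united_states
    print(united_alone)   # 4 uses of "united" outside "united_states"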

  • PatrickHou (New Altair Community Member)

    I found that Filter Stopwords (Dictionary) can do the trick after all, by manually adding the words I don't need inside Process Documents; see the sketch below. For the small case I'm working on that's enough, but I'll keep looking for operators that can deal with this problem.

     

    Thank you.
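
    For reference, the same manual-stopword idea as a quick Python sketch; the two-word list is just an example of tokens to drop once the bigram has been generated:

    UNWANTED = {"united", "states"}   # hand-maintained list of tokens to discard

    def drop_unwanted(tokens):
        return [t for t in tokens if t not in UNWANTED]

    print(drop_unwanted(["united_states", "united", "nations", "states"]))
    # -> ['united_states', 'nations']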
