
Text mining and word-counting problem

User: "PatrickHou"
New Altair Community Member
Updated by Jocelyn

Hi 

 

I'm new to RapidMiner and I'm now running an analysis on several .txt documents. Say I have extracted the 20 most frequently appearing words and I want to know (and only know) how many times each of them shows up in each document. Can someone give me some ideas?

 

Also, I have a problem: I find that "united", "states", and "united_states" all appear in my results, but I can't simply replace them, because not every "united" is related to "united states". How can I pull out "united_states" without counting "united" and "states"?

 

Thanks 

Patrick

    User: "Telcontar120"
    New Altair Community Member

    For your first question: when you use Process Documents, supply a specific wordlist to use (your 20 words) and then compute the word vector using Term Occurrences.

    For your second question, you can use Generate n-Grams after you Tokenize (and do other text preprocessing), which will give you a separate token for "united_states", distinct from either "united" or "states".
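Outside RapidMiner, the two suggestions above can be sketched in a few lines of plain Python; the token list and wordlist below are invented for illustration, and the two helpers only mimic what Term Occurrences (restricted to a wordlist) and Generate n-Grams (n = 2) do:

```python
from collections import Counter

def count_terms(tokens, wordlist):
    # Per-document term occurrences, restricted to a fixed wordlist
    counts = Counter(tokens)
    return {word: counts[word] for word in wordlist}

def bigrams(tokens):
    # Join adjacent tokens with "_", mirroring Generate n-Grams with n = 2
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

doc = "the united states and the united kingdom".split()
print(count_terms(doc, ["united", "states"]))  # {'united': 2, 'states': 1}
print(bigrams(doc))  # ['the_united', 'united_states', 'states_and', 'and_the', 'the_united', 'united_kingdom']
```

Running `count_terms` once per document (here, 50 times) gives exactly the per-document counts asked for in the first question.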

     

    User: "PatrickHou"
    New Altair Community Member
    OP

    Thanks for the reply!

     

    I have already used Term Occurrences, but that gave me the overall occurrences for my words, and I want to know the occurrences of each word in each document (I have about 50 files).

     

    For the second question, does that mean those "united" and "states" tokens are not related to "united_states"?

     

    Patrick

    User: "PatrickHou"
    New Altair Community Member
    OP

    I looked into the documentation, and it seems that when I use the n-Grams operator, all words get combined regardless of whether they are related. That means I need a filter or prune for those words, I think? But how?

    User: "Telcontar120"
    New Altair Community Member

    You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data.  Do you have the "create word vector" parameter checked?

    The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction.  So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".
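As a sanity check of the arithmetic above, using the counts from the example:

```python
# Exclusive single-token count = total count minus the n-gram count
total_united = 10        # all occurrences of "united" (from the example above)
in_united_states = 6     # occurrences of the bigram "united_states"
exclusive_united = total_united - in_united_states
print(exclusive_united)  # 4 -> uses of "united" not part of "united_states"
```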

    User: "PatrickHou"
    New Altair Community Member
    OP

    I found that Stopwords (Dictionary) can do the trick after all, by manually adding the words I don't need inside Process Documents. For the small case I'm working on, it's enough, but I'll keep looking for operators that may deal with this problem.

     

    Thank you.
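The stopword-dictionary trick amounts to filtering out unwanted single tokens after n-gram generation; a minimal sketch, where the token list and custom stopword set are invented for illustration:

```python
# Hypothetical tokens after Generate n-Grams; drop the single tokens we don't need
tokens = ["united_states", "united", "kingdom", "states", "america"]
custom_stopwords = {"united", "states"}  # words added to the stopword dictionary
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)  # ['united_states', 'kingdom', 'america']
```

Note this drops every "united" and "states", including the ones not tied to "united_states"; the subtraction approach above is the way to recover those exclusive counts.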

    User: "sgenzer"
    Altair Employee