Text mining and word counting problem

I'm new to RapidMiner and I'm working on an analysis of several txt documents. Let's say I have the 20 most frequently appearing words and I want to know (and only know) how many times they show up in each document. Can someone give me some ideas?

I also have a problem: "united", "states" and "united_states" all appear in my result, but I can't just replace them because not every "united" is related to "united states". How can I pull out the "united_states" occurrences without also counting them under "united" and "states"?

Thanks

Patrick

Telcontar120

For your first question, use Process Documents, supply a specific wordlist (your 20 words), and then compute the word vector using Term Occurrences.
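To make the idea of per-document term occurrences concrete, here is a minimal sketch in plain Python. It is not RapidMiner code; the function name `term_occurrences`, the sample file names, and the simple letters-only tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

def term_occurrences(documents, wordlist):
    """Count how often each word of a fixed wordlist appears in each document.

    documents: dict mapping a document name to its text (hypothetical input).
    wordlist:  the words of interest, e.g. the 20 most frequent words.
    Returns a dict {document name: {word: count}}.
    """
    wanted = [w.lower() for w in wordlist]
    table = {}
    for name, text in documents.items():
        # Assumed tokenizer: lowercase the text and keep runs of letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        # One row per document, restricted to the supplied wordlist.
        table[name] = {w: counts[w] for w in wanted}
    return table

docs = {
    "doc1.txt": "The United States and the united front.",
    "doc2.txt": "States of matter; states of mind.",
}
table = term_occurrences(docs, ["united", "states"])
for name, row in table.items():
    print(name, row)
```

Each document gets its own row of counts, which is the shape a per-document word vector has, as opposed to one total over the whole collection.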

For your second question, you can use Generate N-Grams after you Tokenize (and do other text preprocessing), which will give you a token for "united_states" that is separate from both "united" and "states".
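As a rough illustration of what generating 2-grams after tokenizing does, here is a small Python sketch. It only mimics the behavior described in this thread (single tokens are kept and adjacent pairs are added, joined with an underscore); the helper names `tokenize` and `add_bigrams` are hypothetical, and sentence boundaries are ignored for simplicity.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def add_bigrams(tokens):
    """Return the original tokens plus every adjacent pair joined by '_'."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

tokens = add_bigrams(tokenize("The United States signed a united declaration"))
print(tokens)
```

Note that every adjacent pair becomes a token, not only meaningful ones, so "united_states" shows up next to pairs such as "the_united" and "states_signed".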

PatrickHou

Thanks for the reply!

I have already used Term Occurrences, but that gave me the overall occurrences of each word, and I want to know the word occurrences in each document (I have about 50 files).

For the second question, does that mean those "united" and "states" counts are not related to "united_states"?

Patrick

PatrickHou

I looked into the documents, and it seems that when I use the n-Gram operator, all adjacent words get combined, no matter whether they are related. I think that means I need a filter or prune step for those words, but how?

Telcontar120

You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data. Do you have the "create word vector" parameter checked?

The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction. So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".
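The subtraction can be sketched in a few lines of Python. The counts below reuse the numbers from the example above, plus an invented count for "states"; the function name `standalone_count` is hypothetical. The sketch assumes the listed n-grams do not overlap on the same occurrence of the word.

```python
def standalone_count(counts, word, ngrams):
    """Occurrences of `word` that are not part of any of the given n-grams.

    counts: dict of token -> number of occurrences in one document.
    ngrams: n-gram tokens of interest, with parts joined by '_'.
    """
    used = 0
    for gram in ngrams:
        # Each n-gram occurrence uses the word once per time it contains it.
        used += counts.get(gram, 0) * gram.split("_").count(word)
    return counts.get(word, 0) - used

# "united" and "united_states" use the numbers from the post above;
# the count for "states" is made up for the example.
counts = {"united": 10, "states": 8, "united_states": 6}
standalone_united = standalone_count(counts, "united", ["united_states"])
standalone_states = standalone_count(counts, "states", ["united_states"])
print(standalone_united, standalone_states)
```

Applied per document, this gives the number of "united" and "states" occurrences that are not part of "united_states".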

PatrickHou

I found that a stopword dictionary can do the trick: in Process Documents, I manually add the words I don't need. For the small case I'm working on that's enough, but I'll keep looking for operators that can deal with this problem.
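For anyone following along outside RapidMiner, the stopword-dictionary approach amounts to dropping every token that appears in a hand-made list. A minimal Python sketch, with a hypothetical function name `filter_stopwords` and made-up tokens:

```python
def filter_stopwords(tokens, stopwords):
    """Drop every token that appears in the manually maintained stopword list."""
    blocked = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in blocked]

# Made-up tokens, including unwanted 2-grams produced by n-gram generation.
tokens = ["the_united", "united_states", "states_signed", "united", "states"]
kept = filter_stopwords(tokens, ["the_united", "states_signed"])
print(kept)
```

This only scales to small cases, since every unwanted n-gram has to be listed by hand.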

Thank you.

sgenzer