Text mining and word counting problem
Hi
I'm new to RapidMiner and I'm working on an analysis of several txt documents. Say I have the 20 most frequent words, and I want to know (and only know) how many times each of them shows up in each document. Can someone give me some ideas?
I also have a problem: "united", "states" and "united_states" all appear in my result, but I can't just replace them, because not every "united" is related to "united states". How can I pull out the "united_states" occurrences without also counting them under "united" and "states"?
Thanks
Patrick
Answers
-
For your first question, use Process Documents, supply a specific wordlist (your 20 words), and compute the word vector using Term Occurrences.
For your second question, you can use Generate N-Grams after you Tokenize (and do other text preprocessing), which will give you a token for "united_states" that is separate from both "united" and "states".
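If it helps to see the idea outside RapidMiner, here is a minimal Python sketch of the same two steps: build bigram tokens joined with an underscore, then count a fixed wordlist per document. This is not how the RapidMiner operators are implemented; the function names, file names and sample texts are all made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters as tokens.
    return re.findall(r"[a-z]+", text.lower())

def with_bigrams(tokens):
    # Keep the single tokens and add adjacent pairs such as "united_states".
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def term_occurrences(docs, wordlist):
    # docs maps a document name to its text.
    # Returns one row of counts per document, restricted to the wordlist.
    result = {}
    for name, text in docs.items():
        counts = Counter(with_bigrams(tokenize(text)))
        result[name] = {word: counts[word] for word in wordlist}
    return result

# Hypothetical sample documents and wordlist.
docs = {
    "a.txt": "The United States and a united front.",
    "b.txt": "Several states signed; the United States did not.",
}
wordlist = ["united", "states", "united_states"]

table = term_occurrences(docs, wordlist)
for name, row in table.items():
    print(name, row)
```

Each document gets its own row, which is the per-document count you are after. Note that "united" and "states" are still counted wherever "united_states" appears.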
0 -
Thanks for the reply!
I have already used Term Occurrences, but that gave me the overall occurrence count for each word, and I want to know the word occurrences in each document (I have about 50 files).
For the second question, does that mean those "united" and "states" are not related to "united_states"?
Patrick
0 -
I looked into the documents, and it seems that when I use the n-Gram operator, all words get combined whether or not they are related. I think that means I need to filter or prune those words? But how?
0 -
You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data. Do you have the "create word vector" parameter checked?
The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction. So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".
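The subtraction can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a RapidMiner feature: the helper name is hypothetical, the "states" count of 8 is an invented number, and the sketch assumes the parts of the n-gram are distinct words.

```python
def exclusive_counts(counts, ngram, sep="_"):
    # For each part of the n-gram, subtract the n-gram's count
    # from that part's total count.
    # Assumes the parts are distinct words (e.g. not "new_new").
    parts = ngram.split(sep)
    return {part: counts[part] - counts[ngram] for part in parts}

# "united" and "united_states" use the numbers from the example above;
# the "states" count is made up.
counts = {"united": 10, "states": 8, "united_states": 6}

standalone = exclusive_counts(counts, "united_states")
print(standalone)  # {'united': 4, 'states': 2}
```

With per-document counts, you would apply the same subtraction to each document's row.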
1 -
I found that a stopwords (dictionary) filter can do the trick: I manually add the words I don't need after all in Process Documents. For the small case I'm working on that's enough, but I'll keep looking for operators that may deal with this problem.
Thank you.
0 -