Text analysis, word count
I am trying to count the number of specific words in PDF files, which works fine in general (the operators Create Document and Process Documents build the list of words I am looking for, and the operator Process Documents from Files reads in the PDF files). But I have two specific questions/problems:
1. What do I have to do if I want it to count not only the exact word but all words starting with the expression? For example: I want all words starting with "risk", so it should count not only "risk" but also "risks", "risky" and so on, all together.
2. What do I have to do if I want it to count two specific words in a row? For example: I want to count all occurrences of "liquidity risk", not "liquidity" or "risk" alone. These occurrences of "risk" should then not also be added to the first search for all words starting with "risk".
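(Outside RapidMiner, the counting logic I have in mind could be sketched roughly like this in Python; the sample text and word lists are made-up examples:)

```python
import re
from collections import Counter

text = "Liquidity risk and reputational risks are key risks. Risky assets add liquidity risk."
tokens = re.findall(r"[a-z]+", text.lower())

# Question 2: count the exact phrase "liquidity risk(s)" first,
# i.e. two specific words in a row.
bigrams = list(zip(tokens, tokens[1:]))
phrase_count = sum(1 for a, b in bigrams if a == "liquidity" and b.startswith("risk"))

# Question 1: count all words starting with "risk", but exclude those
# already counted as the second half of a "liquidity risk" occurrence.
risk_total = sum(1 for t in tokens if t.startswith("risk"))
risk_alone = risk_total - phrase_count

print(phrase_count, risk_alone)  # → 2 3
```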
Thank you so much in advance for your help!!
Thank you very much for your quick and helpful response!
Stem (Dictionary) really solved my first problem.
But I am still having trouble with searching for two words in a row, even with 2-grams. Here is a concrete example: I am looking for the term reputation.* risk.* (reputational risk, reputational risks, reputation risk, reputation risks). So I create a document with the words "reputation" and "risk" and process it (Tokenize, Transform Cases, Generate n-Grams). Then I process the PDF files (Tokenize, Transform Cases, Generate n-Grams, and Stem (Dictionary), so that everything starting with "reputation" is reduced to "reputation" and everything starting with "risk" is reduced to "risk"). But the problem now is that the output shows 0 counts for "reputation risk", although the counts for "reputation" and "risk" alone work.
Do you have any suggestions on how I can alter/fix the process so that it shows me the right number for "reputation.* risk.*"?
Thank you so much for your efforts!
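(If it helps to see the logic I am after: I believe the order of the steps matters, because n-grams generated before the dictionary stemming produce combined tokens such as "reputational_risks" that are never reduced to "reputation_risk". A minimal Python sketch of the stem-first order, with made-up tokens and RapidMiner's underscore n-gram notation:)

```python
from collections import Counter

tokens = ["the", "reputational", "risks", "and", "reputation", "risk"]

# Stem first: reduce every token starting with "reputation" or "risk"
# to its base form, as Stem (Dictionary) is meant to do.
def stem(token):
    for base in ("reputation", "risk"):
        if token.startswith(base):
            return base
    return token

stemmed = [stem(t) for t in tokens]

# Only then build the 2-grams, so every variant collapses to "reputation_risk".
bigrams = ["_".join(pair) for pair in zip(stemmed, stemmed[1:])]
counts = Counter(bigrams)
print(counts["reputation_risk"])  # → 2
```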
hello @rajbanokhan - can you please post your XML process so we can see? Please see the "READ BEFORE POSTING" pane on the right-hand side of your Reply window for instructions.
Scott
Hi,
I am doing text mining. I use the Process Documents from Files operator. When I run the process, it gives me a list of words, but I don't want the whole list of words. I just want to select my own words from the list. Suppose I want the words cat, dog, mouse, table, chair - how can I get only these words from the list?
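(For illustration, the filtering step amounts to keeping only the tokens that appear in a fixed word list; a rough Python equivalent with made-up data:)

```python
from collections import Counter

tokens = ["cat", "sat", "on", "the", "table", "near", "a", "dog", "and", "a", "mouse", "cat"]
wanted = {"cat", "dog", "mouse", "table", "chair"}

# Keep only the words of interest and count them.
counts = Counter(t for t in tokens if t in wanted)
print(counts)  # cat appears twice; chair not at all
```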
hello @rajbanokhan - both of those extensions can be found in the Marketplace. If you open RapidMiner Studio, you should see a menu at the top called "Extensions". Choose the first item, "Marketplace (Updates and Extensions)...". Then search for "Text Processing" and "Operator Toolbox".
Scott


<?xml version="1.0" encoding="UTF-8"?>
<process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
        <parameter key="text" value="hi how i find or count the total number of words in one document and then in second and then third and so on?"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
      <operator activated="true" class="text:extract_token_number" compatibility="8.1.000" expanded="true" height="68" name="Extract Token Number" width="90" x="313" y="34"/>
      <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Extract Token Number" to_port="document"/>
      <connect from_op="Extract Token Number" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Scott
for 1: Have a look at Stem (Dictionary) - that should help.
for 2: I guess a simple replace dictionary would do the trick? Otherwise I would recommend using 2-grams.
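(The replace-dictionary idea effectively rewrites the phrase into a single token before counting, so the merged token is no longer picked up by the prefix search either; a hypothetical Python sketch with made-up text:)

```python
import re

text = "liquidity risk rises when risky loans dominate; liquidity risk is monitored daily"

# Replace-dictionary step: merge the phrase into one token before tokenizing.
replacements = {"liquidity risk": "liquidity_risk"}
for phrase, merged in replacements.items():
    text = text.replace(phrase, merged)

tokens = re.findall(r"[a-z_]+", text)
phrase_count = sum(1 for t in tokens if t == "liquidity_risk")
risk_count = sum(1 for t in tokens if t.startswith("risk"))  # merged token no longer matches
print(phrase_count, risk_count)  # → 2 1
```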
~Martin