"Sentence analysis - word count for each sentence"
Hello everyone,
I started with RapidMiner a few weeks ago (a fine tool, btw) and I have the following situation: I have documents which I want to split into sentences. I then want to count the number of words in each sentence so that I can check whether the sentence meets a criterion (15 to 20 words), resulting in a positive or negative mark.
Later I will use a model and separate datasets (all documents vs. a new one) to check for the differences, i.e. compare the length of sentences against the training data. Here I thought of k-NN and automatic classification, but I'm not that far yet...
So far I have managed to split my documents into sentences and I have them in a repository - so far, so good. However, I can't find a way to tell RapidMiner to count the words in each row (each row contains a sentence) - I mostly end up with the word frequency over all documents, which is not what I want.
Does anybody know how this can be achieved?
Thanks in advance.
Oliver
Best Answer
-
Sure, now that you have each sentence as its own document, you can process them using one of the many "Process Documents" operators, depending on how they are stored. Inside that operator, you'll want to use "Tokenize" to split each sentence into words (typically tokenizing on non-letters) and then use the operator "Extract Token Number" to do exactly what you are looking for. That will get the number of tokens per document (sentence, in your case), which will be added to the data. You can even use "Aggregate Token Length" to find some other interesting token metadata, like average number of characters per token, token variance, min/max, etc.
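For anyone who wants to sanity-check the counts outside RapidMiner, here is a minimal Python sketch of the same idea: tokenize each sentence on non-letters, count the tokens, and flag sentences in the 15 to 20 word range. This is not RapidMiner code; the function names `count_words` and `meets_criterion` are made up for illustration, and the regex is only an approximation of non-letter tokenization, so counts may differ slightly from what the operators produce.

```python
import re

# Runs of Unicode letters; everything else (spaces, digits, punctuation)
# acts as a separator, approximating tokenization on non-letters.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def count_words(sentence: str) -> int:
    """Return the number of letter-only tokens in one sentence."""
    return len(LETTER_RUN.findall(sentence))


def meets_criterion(sentence: str, low: int = 15, high: int = 20) -> bool:
    """True if the token count lies in the inclusive range [low, high]."""
    return low <= count_words(sentence) <= high


if __name__ == "__main__":
    # Hypothetical input: one sentence per row, as in the question.
    sentences = [
        "This is a short sentence.",
        "Splitting on non-letters means that don't becomes two tokens.",
    ]
    for s in sentences:
        print(count_words(s), meets_criterion(s), s)
```

Note that splitting on non-letters breaks contractions and hyphenated words into several tokens ("don't" counts as two), which is worth keeping in mind when a sentence sits right at the 15 or 20 word boundary.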
0
Answers
-
Thank you - that was actually easier than I thought :-)
Sometimes you can't see the wood for the trees ;-)
However, is there a way to form an attribute of the counted tokens directly or do I have to do this at another place?
Oliver
1 -
I'm not sure I follow your question--"Extract Token Number" does create the attribute you are looking for.
0 -
Indeed - I just had to enable "include special attributes" in the following "Select Attributes" operator and it showed up...
Thank you very much.
Oliver
0