Calculate number of unique words in text and number of repeating paragraphs
How can I calculate the number of unique mentions of each word (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?
Best Answer
-
Hi,
You can simply use a Process Documents operator with binary occurrences and then use Generate Aggregation to get the sum of each row, which is the number of distinct tokens in that document.
~Martin
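
For readers outside RapidMiner, the same idea can be sketched in plain Python. This is only an illustration of the binary-occurrence-then-row-sum approach, not the operators' actual implementation; the `STOPWORDS` set, the regex tokenizer, and the function names are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "on", "in", "to"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def unique_word_counts(documents):
    """Return the number of distinct non-stopword tokens per document."""
    # Shared vocabulary: one column per term, as in a document-term matrix.
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    counts = []
    for doc in documents:
        present = set(tokenize(doc))
        # Binary occurrence row: 1 if the term appears at all, else 0.
        row = [1 if term in present else 0 for term in vocabulary]
        # Summing the binary row gives the number of unique words.
        counts.append(sum(row))
    return counts


docs = ["The cat sat on the mat and the cat slept.", "A dog is a dog."]
print(unique_word_counts(docs))  # [4, 1]
```

Building the full binary row mirrors the row-wise aggregation; in plain Python, `len(set(tokenize(doc)))` would give the same number directly.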
Answers
-
Thank you, I think that will work. And what about repeating sentences? I tried the similarity measure first, but my documents are too long, so it will not work.
-
Hi,
Simply tokenize on linguistic sentences instead of words and do the same trick as for words.
~Martin
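
As a rough illustration outside RapidMiner, the sentence-level version might look like the Python sketch below. It is a variation that counts occurrences rather than binary presence, so that repeats show up directly. The regex sentence splitter is a naive stand-in for linguistic sentence tokenization, and all function names are assumptions made for the example.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Lowercase and drop punctuation so trivial differences do not matter."""
    return " ".join(re.findall(r"\w+", sentence.lower()))


def repeated_sentences(text):
    """Map each sentence that occurs more than once to its occurrence count."""
    counts = Counter(normalize(s) for s in split_sentences(text))
    return {s: n for s, n in counts.items() if s and n > 1}


text = "Call us today. We ship fast. Call us today! Thanks."
repeats = repeated_sentences(text)
print(repeats)       # {'call us today': 2}
print(len(repeats))  # 1 distinct sentence is repeated
```

The same function works for paragraphs if the text is split on blank lines instead of sentence boundaries.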
-
Hi Martin,
Thank you for the answer. I have a follow-up question: if the sentences are not exactly the same but very similar (e.g. 2-3 words are changed), how could I find the repeating text parts then?
-
Hi ln777,
You are always allowed to ask questions, that's what we are here for. The only question is whether we can answer them.
I would create a similarity/synonym dictionary. I would go for WordList to Data, take the sentences as input for a second Process Documents operator, tokenize on words, and calculate a cross distance on the result. There I would require a high cosine similarity to define a "synonym". This dictionary can then be used to replace text in the original document.
~Martin
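
The core of this suggestion, pairwise cosine similarity between word-level sentence vectors, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the RapidMiner process itself: the `0.8` threshold, the regex tokenizer, and the function names are hypothetical choices for the example, and the dictionary-replacement step is left out.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(sentence):
    """Term-frequency vector of a sentence, stored as a Counter."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, score) for every sentence pair at or above the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


sentences = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox leaps over the lazy dog",
    "Completely unrelated sentence about text mining",
]
for i, j, score in near_duplicates(sentences):
    print(i, j, round(score, 3))  # 0 1 0.909
```

Comparing all pairs is quadratic in the number of sentences, so for very long documents it may be worth deduplicating exact repeats first and only then comparing the remaining distinct sentences.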