Calculate number of unique words in text and number of repeating paragraphs

In777 New Altair Community Member
edited November 5 in Community Q&A

How can I calculate the number of unique words (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?

Best Answer

  • MartinLiebig
    Altair Employee
    Answer ✓

    Hi,

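    You can simply use a Process Documents operator with binary term occurrences and then use Generate Aggregation to get the sum of each row. Since every token contributes either 0 or 1, the row sum is the number of distinct tokens in that document.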

    ~Martin
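
For readers who want to check the result outside RapidMiner, the same idea can be sketched in plain Python. This is only an illustration under assumptions: the tiny stopword list and the function names are invented for the example and are not RapidMiner operators. It builds a binary document-term matrix and sums each row.

```python
import re

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def binary_occurrences(docs):
    """Build a matrix with 1 if a word occurs in a document, else 0."""
    token_sets = [set(tokenize(d)) - STOPWORDS for d in docs]
    vocabulary = sorted(set().union(*token_sets))
    matrix = [[1 if w in s else 0 for w in vocabulary] for s in token_sets]
    return vocabulary, matrix

def unique_word_counts(docs):
    """Sum each row: the number of distinct non-stopword tokens per document."""
    _, matrix = binary_occurrences(docs)
    return [sum(row) for row in matrix]

docs = ["The cat sat on the mat and the cat slept.", "A dog barked."]
print(unique_word_counts(docs))  # [4, 2]
```

Summing the columns of the same matrix instead would give, for each word, the number of documents it appears in.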

Answers

  • In777
    In777 New Altair Community Member

    Thank you, I think that will work. And what about repeating sentences? I tried a similarity measure first, but my documents are too long, so that does not work.

  • MartinLiebig
    Altair Employee

    Hi,

    Simply tokenize on linguistic sentences and do the same trick as for words.

    ~Martin
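
As a rough illustration of the sentence-level version, here is a minimal Python sketch. It assumes a naive splitter (a break after `.`, `!` or `?` followed by whitespace), which is much cruder than a linguistic sentence tokenizer, and the function names are invented for the example. It counts how often each normalized sentence occurs and keeps those seen more than once.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def normalize(sentence):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(sentence.lower().split())

def repeated_sentences(text):
    """Return each sentence that occurs more than once, with its count."""
    counts = Counter(normalize(s) for s in split_sentences(text))
    return {s: n for s, n in counts.items() if n > 1}

text = ("Prices rose sharply. Demand stayed flat. "
        "Prices rose sharply. Nobody was surprised.")
print(repeated_sentences(text))  # {'prices rose sharply.': 2}
```

The same counting works for paragraphs if the text is split on blank lines instead of sentence boundaries.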

  • In777
    In777 New Altair Community Member

    Hi Martin,

    Thank you for the answer. I have a follow-up question: if the sentences are not exactly the same but very similar (e.g. 2-3 words are changed), how can I find the repeating text parts?

  • MartinLiebig
    Altair Employee

    Hi In777,

    you are always allowed to ask questions - that's what we are here for :). The only question is whether we can answer them.

    I would create a similarity/synonym dictionary. I would use WordList to Data, take the sentences as input for a second Process Documents operator, tokenize on words, and calculate Cross Distances on the result. There I would require a high cosine similarity to define a "synonym". This dictionary can then be used to replace texts in the original document.

    ~Martin
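
The cosine-similarity step can be illustrated with a short Python sketch. It is a simplification rather than the RapidMiner process described above: it compares plain word-count vectors of every sentence pair and reports the pairs above a cutoff. The 0.8 threshold and all function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations

def word_vector(sentence):
    """Bag-of-words count vector for one sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, score) for sentence pairs at or above the threshold."""
    vectors = [word_vector(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs

sentences = [
    "The company reported strong revenue growth in the third quarter.",
    "The company reported solid revenue growth in the third quarter.",
    "Weather conditions delayed the shipment.",
]
# Sentences 0 and 1 differ by a single word and are flagged as a pair.
print(near_duplicates(sentences))
```

Comparing all pairs grows quadratically with the number of sentences, so for very long documents it may help to drop exact duplicates first and compare only the remaining distinct sentences.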