Calculate number of unique words in text and number of repeating paragraphs

In777
In777 New Altair Community Member
edited November 2024 in Community Q&A

How can I calculate the number of unique words (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?

Best Answer

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓

    Hi,

     

    You can simply use a Process Documents operator with binary occurrences and then use Generate Aggregation to get the sum of each row. With binary occurrences, that row sum is the number of distinct words in the document.

     

    ~Martin
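
For readers who want to check the idea outside RapidMiner, here is a minimal Python sketch of the same approach: build a binary occurrence matrix (1 if a word appears in a document, 0 otherwise) and sum each row. It is not the RapidMiner process itself. The stopword list, the regex tokenizer, and the function names `tokenize`, `binary_occurrences`, and `unique_word_counts` are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def binary_occurrences(docs):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word j occurs in doc i."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*token_sets))
    matrix = [[1 if word in s else 0 for word in vocabulary] for s in token_sets]
    return vocabulary, matrix


def unique_word_counts(docs):
    """Sum each row of the binary matrix: one distinct-word count per document."""
    _, matrix = binary_occurrences(docs)
    return [sum(row) for row in matrix]


docs = ["The cat sat on the mat. The cat slept.", "A dog and a cat"]
print(unique_word_counts(docs))
```

Because each cell is capped at 1, a word that appears many times in a document still contributes only once to that document's row sum.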

Answers

  • In777
    In777 New Altair Community Member

    Thank you, I think that will work. And what about repeating sentences? I tried the similarity measure first, but my documents are too long for that to work.

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

     

    Simply tokenize on linguistic sentences and do the same trick as for words.

     

    ~Martin
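
As a rough illustration of treating sentences as tokens, the Python sketch below splits a text into sentences and counts how often each one occurs. It is one possible reading of the suggestion, not the RapidMiner operator chain. The regex sentence splitter is a naive assumption (it breaks on `.`, `!`, or `?` followed by whitespace and will mis-split abbreviations), and the names `split_sentences`, `normalize`, and `repeated_sentences` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Lowercase and strip punctuation so trivial differences do not matter."""
    return " ".join(re.findall(r"\w+", sentence.lower()))


def repeated_sentences(text):
    """Map each normalized sentence that occurs more than once to its count."""
    counts = Counter(normalize(s) for s in split_sentences(text))
    return {s: n for s, n in counts.items() if s and n > 1}


text = "Terms apply. See the notes. Terms apply! Done."
print(repeated_sentences(text))
```

The same function works for paragraphs if the splitter is swapped for one that breaks on blank lines.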

  • In777
    In777 New Altair Community Member

    Hi Martin,

     

    Thank you for the answer. I have a follow-up question: if the sentences are not exactly the same but very similar (e.g. 2-3 words are changed), how can I find the repeating text parts then?

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi In777,

    You are always allowed to ask questions - that's what we are here for :). The only question is whether we can answer them.

     

    I would create a similarity/synonym dictionary. I would go for WordList to Data, take the sentences as input for a second Process Documents, tokenize on words, and calculate a cross distance on the result. There I would use a high cosine similarity to define a "synonym". This dictionary can then be used to replace texts in the original document.

     

    ~Martin
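
The cosine-similarity step can be sketched in Python as follows. Note that this is a simplified variant, not the dictionary-building process described above: it compares whole sentences directly using word-count vectors and reports pairs whose cosine similarity reaches a threshold. The threshold of 0.8 and the names `word_vector`, `cosine_similarity`, and `similar_pairs` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_vector(sentence):
    """Bag-of-words vector: word -> number of occurrences in the sentence."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(sentences, threshold=0.8):
    """Return (i, j, score) for every sentence pair scoring at or above threshold."""
    vectors = [word_vector(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


sentences = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox leaps over the lazy dog",
    "Completely unrelated text here",
]
for i, j, score in similar_pairs(sentences):
    print(i, j, round(score, 3))
```

Comparing every pair is quadratic in the number of sentences, so for very long documents it may be worth removing exact duplicates first and comparing only the remaining distinct sentences.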
