Calculate number of unique words in text and number of repeating paragraphs

User: "In777"
New Altair Community Member
Updated by Jocelyn

How can I calculate the number of unique words (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the text mining extension?

    User: "MartinLiebig"
    Altair Employee
    Accepted Answer

    Hi,

     

    you can simply use a Process Documents operator with binary occurrences and then use Generate Aggregation to sum each row. The row sum is the number of distinct tokens in that document.

     

    ~Martin
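
For readers who want to check the result outside the operator workflow, the same idea (mark each token as present or absent, then sum per document) can be sketched in plain Python. This is only an illustration under assumptions: the `STOPWORDS` set, the regex tokenizer, and the sample `docs` are hypothetical stand-ins and not what the Process Documents operator uses internally.

```python
import re

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unique_word_count(text, stopwords=STOPWORDS):
    """Count distinct non-stopword tokens in one document.

    Building a set is the equivalent of binary occurrences:
    each token contributes at most 1, and len() is the row sum.
    """
    return len({tok for tok in tokenize(text) if tok not in stopwords})


# Hypothetical sample documents.
docs = ["The cat sat on the mat.", "The cat and the cat and the dog."]
counts = [unique_word_count(d) for d in docs]
print(counts)
```

Repeated tokens such as the second "cat" do not raise the count, which is the effect the binary occurrences setting has in the operator-based approach.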

    User: "In777"
    New Altair Community Member
    OP

    Thank you, I think that will work. And what about repeating sentences? I tried the similarity measure first, but my documents are too long, so it will not work.

    Hi,

     

    Simply tokenize on linguistic sentences and do the same trick as for words.

     

    ~Martin
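
As a rough sketch of the "sentences as tokens" trick, the snippet below splits a text into sentences and counts exact repeats. The splitter is a naive regex (an assumption, not the linguistic sentence tokenizer mentioned above), and `split_sentences`, `normalize`, and `repeated_sentences` are hypothetical helper names.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Lowercase and drop punctuation so trivial differences do not matter."""
    return " ".join(re.findall(r"\w+", sentence.lower()))


def repeated_sentences(text):
    """Return each normalized sentence that occurs more than once, with its count."""
    counts = Counter(normalize(s) for s in split_sentences(text))
    return {s: n for s, n in counts.items() if s and n > 1}


# Hypothetical sample text.
text = "We ship fast. Quality matters! We ship fast. Thanks. quality matters!"
repeats = repeated_sentences(text)
print(repeats)
```

The same function would work for paragraphs if the text were split on blank lines instead of sentence boundaries.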

    User: "In777"
    New Altair Community Member
    OP

    Hi Martin,

     

    Thank you for the answer. I have a follow-up question: if the sentences are not completely the same but very similar (e.g. 2-3 words are changed), how could I find the repeating text parts then?

    Hi In777,

    you are always allowed to ask questions - that's what we are here for :). The only question is if we can answer them.

     

    I would create a similarity/synonym dictionary. I would use WordList to Data, take the sentences as input for a second Process Documents operator, tokenize on words, and calculate a cross distance on the result. There I would require a high cosine similarity to define a "synonym". This dictionary can then be used to replace text in the original document.

     

    ~Martin
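
To make the cosine-similarity step concrete, here is a minimal Python sketch that treats each sentence as a bag of words and reports pairs above a threshold. It illustrates the general technique only, not the exact operator chain described above; the `threshold` of 0.8, the helper names, and the sample sentences are all assumptions.

```python
import math
import re
from collections import Counter
from itertools import combinations


def vectorize(sentence):
    """Turn a sentence into a bag-of-words term-count vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, score) for every sentence pair at or above the threshold."""
    vectors = [vectorize(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical sample sentences: the first two differ by one word.
sentences = [
    "The quarterly report shows strong revenue growth in Europe",
    "The quarterly report shows strong revenue growth in Asia",
    "Our office will be closed on Monday",
]
pairs = near_duplicates(sentences)
print(pairs)
```

Comparing every pair is quadratic in the number of sentences, so for very long documents it may help to compare only within a document or to deduplicate exact repeats first.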