Compare 2 PDF texts

c_sabine New Altair Community Member
edited November 5 in Community Q&A

Hello, 

I'm trying to create a process that compares two PDFs that are subtly different.

I process my documents (tokenize, filter stopwords, generate n-grams...) from two different files, merge them into one common example set with the "Append" operator, and then use the "Remove Duplicates" operator to see the differences between the PDFs. Please find my process attached. I have two questions:

1) Is it possible to convert my example set result into a wordlist, so that the table has one word per row rather than one word per column?

2) Something seems to have gone wrong: words that are in both files appear in the output, whereas it should only show words that are in one document and absent from the other.

 

Thanks !

 

Sabine

 

 

 

Answers

  • c_sabine
    c_sabine New Altair Community Member

    Please find attached a screenshot of my process. The second picture shows what is inside the two "Process Documents from Files" operators.

  • Telcontar120
    Telcontar120 New Altair Community Member

    When you generate the original wordlist from each PDF, you can use the "Wordlist to Data" operator to create example sets of the words and their counts. You could then add a source field (with Generate Attributes or via a macro) for each PDF, and then merge/join those two datasets. That should make it easy to see which words are common to both files and which ones are unique to one or the other.
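
The join logic suggested above can be sketched outside RapidMiner in plain Python. This is only an illustration of the idea, not the RapidMiner process itself: the function name `compare_word_counts`, the source labels `pdf_1`/`pdf_2`, and the sample token lists are all hypothetical, and the tokens are assumed to be already extracted and preprocessed.

```python
from collections import Counter


def compare_word_counts(tokens_a, tokens_b):
    """Outer-join two token lists on the word, keeping per-document counts."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    rows = []
    # The union of both vocabularies plays the role of an outer join.
    for word in sorted(set(counts_a) | set(counts_b)):
        in_a = counts_a[word]  # Counter returns 0 for a missing word
        in_b = counts_b[word]
        if in_a and in_b:
            source = "both"
        elif in_a:
            source = "pdf_1"
        else:
            source = "pdf_2"
        rows.append((word, in_a, in_b, source))
    return rows


# Hypothetical, already tokenized content of the two documents.
tokens_pdf_1 = ["contract", "price", "delivery", "price"]
tokens_pdf_2 = ["contract", "price", "penalty"]

rows = compare_word_counts(tokens_pdf_1, tokens_pdf_2)
for row in rows:
    print(row)
```

Each row holds one word with its count in each document and a source label, so filtering on the label gives the words unique to one file. The result is also one word per row, which is the layout asked for in question 1.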