Text Mining: analyse PDFs with a dictionary which has categories

nsmith New Altair Community Member
edited November 5 in Community Q&A

Hello,

I want to analyse a number of PDFs (35) using a kind of dictionary. The output of the analysis should be an Excel file that shows how often every single word of the dictionary appears in the PDFs. It may be important to know that the dictionary is not just a list of words: the words are classified into five categories. The analysis should therefore tell me how often each dictionary word is mentioned and which category is covered the most.

I have already read lots of questions here and watched tutorials, but I could not find exactly what I need. Trial and error hasn't worked so far either. I hope someone can help me.

Many thanks in advance,

Nina

Answers

  • MartinLiebig
    Altair Employee
    Hi,
    this really depends on the format of your PDF. Did you try to just read one of them using the Read Document operator?

    Best,
    Martin
  • nsmith
    nsmith New Altair Community Member
    edited October 2020
    Yes, I read the PDFs with the Read Document operator, and that works. The problem is the dictionary. I'm not able to filter the PDFs with my dictionary (which consists of words in an Excel file) so that I can see how often each word appears in the PDFs. Furthermore, I don't know how to take the categories in my dictionary into account: whether RapidMiner can recognize categories in a dictionary (for example, if each category is written in a separate tab of my Excel file), or whether I need an additional operator for that.

    Thanks for your help,
    Nina

  • Telcontar120
    Telcontar120 New Altair Community Member
    It sounds like you want to use a specific wordlist and then count the words based on that wordlist (which are further grouped into 5 categories).  You should be able to input your desired wordlist into the input port of the Process Documents operator.  You can then use the Wordlist to Data operator on the resulting wordlist to turn it into a normal dataset that you can then summarize or use your grouping to do the category analysis.
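
For readers who want to check the logic outside RapidMiner, the count-then-group idea described above can be sketched in plain Python. This is only an illustration, not the RapidMiner process: the `dictionary` contents, the category names, and the sample text are made up, and tokenization is a simple lowercase split on letters.

```python
import re
from collections import Counter

# Hypothetical dictionary: category -> words (made-up example data)
dictionary = {
    "technology": ["digital", "software"],
    "finance": ["revenue", "profit"],
}

# Reverse lookup so each counted word can be mapped back to its category
word_to_category = {
    word: category
    for category, words in dictionary.items()
    for word in words
}

def count_dictionary_words(text):
    """Count how often each dictionary word occurs in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in word_to_category)

def count_by_category(word_counts):
    """Sum the per-word counts into per-category totals."""
    totals = Counter()
    for word, n in word_counts.items():
        totals[word_to_category[word]] += n
    return totals

# Made-up stand-in for the text extracted from one PDF
sample = "Digital revenue grew. Our digital software drives profit and revenue."
word_counts = count_dictionary_words(sample)
category_counts = count_by_category(word_counts)
```

Here `word_counts` corresponds to the per-word table and `category_counts` to the per-category summary; keeping the category in a separate lookup means the counting step itself does not need to know about categories.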

  • nsmith
    nsmith New Altair Community Member
    Thanks for your answer @Telcontar120 !
    Yes, you're right. I have a word list with key words (which are categorized) and want to scan all my PDFs for these words. So I only want to see these words and their occurrences in the result view.
    I tried your proposal, but I couldn't feed the wordlist into the input port of the Process Documents operator, as an error occurred. Furthermore, I'm not sure where to add all the PDFs that should be analysed. Are both the wordlist and the PDFs set as inputs for the Process Documents operator?

    I hope my problem is not too confusing. Maybe it helps to have a look at the XML I posted before. 

  • Telcontar120
    Telcontar120 New Altair Community Member
    @mschmitz is there a way to import a wordlist from an external file to be used as input for Process Documents? Or a relevant converter that can be used? Upon looking at the operator more closely, it seems like it is requiring a wordlist already in RapidMiner format, which normally can be generated only from another Process Documents operator.  Of course it would be possible to work around this by putting the desired wordlist as text into one Process Documents operator merely to generate the wordlist to feed another Process Documents operator, but this seems somewhat inefficient and I am wondering if there is a more direct path.
    @nsmith see my comments above regarding the wordlist input.  It may be that you need to generate your wordlist first.  Regarding the pdfs, you can use Process Documents from Files and then set your parameters to read your pdf files from your hard drive.

  • MartinLiebig
    Altair Employee
    I think there is no way to generate a word list. Keep in mind that the wordlist also contains normalization factors for TF-IDF etc. But I think we can just build the full occurrence matrix here and filter the attributes afterwards for the ones we are interested in. Alternatively, you can just use Filter Tokens Using ExampleSet in Process Documents.

    Best,
    Martin
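
The "full occurrence matrix first, filter afterwards" idea can be illustrated with a small plain-Python sketch. It is not RapidMiner code; the document names, texts, and the `wanted` word set are hypothetical stand-ins for the extracted PDF text and the dictionary.

```python
import re
from collections import Counter

# Made-up stand-ins for the text extracted from each PDF
documents = {
    "report_a.pdf": "Digital products and digital services",
    "report_b.pdf": "Revenue from services increased",
}

# Hypothetical dictionary words we actually care about
wanted = {"digital", "revenue"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: full occurrence matrix, one row (Counter) per document
full_matrix = {name: Counter(tokenize(text)) for name, text in documents.items()}

# Step 2: keep only the columns for dictionary words.
# Counter returns 0 for words a document does not contain.
filtered = {
    name: {word: counts[word] for word in sorted(wanted)}
    for name, counts in full_matrix.items()
}
```

Filtering after counting keeps a zero entry for dictionary words that never occur, which can be convenient when the result is exported as a table with one column per word.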
  • Telcontar120
    Telcontar120 New Altair Community Member
    @mschmitz thanks, yes Filter Tokens Using ExampleSet should have the equivalent effect.

  • nsmith
    nsmith New Altair Community Member

    @mschmitz @Telcontar120 thank you very much for your answers, it's nearly working now! :)

    Unfortunately there is still one problem with the "Filter Tokens Using ExampleSet" operator. I want to filter with my word list, which has two kinds of words.

    1. Single words (like "digital")
    2. Terms with two or more words (like "digital products")

    In general it works, as I used the "Generate n-grams" operator before. Thus all stand-alone words and terms I specified are in the result list. The problem is that the operator also generates terms that I did not explicitly include in the word list. An example is "accelerating_digital". Even though I did not have this term in my word list, I want to have it in my result list as it contains the word "digital" (which is in my word list).

    Is there a way to solve this problem?
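
To make the matching question concrete, here is a minimal plain-Python sketch of generating n-grams and then keeping only terms that match the word list exactly. It is an illustration under assumptions, not a description of how the RapidMiner operators behave internally: the function name `ngrams_up_to` is hypothetical, and terms are joined with an underscore only because that is how "accelerating_digital" appears above.

```python
def ngrams_up_to(tokens, max_n=2):
    """Return all 1..max_n-grams of the token list, joined with '_'."""
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms

# Hypothetical word list with one single word and one two-word term
word_list = {"digital", "digital_products"}

# Made-up token sequence from a document
tokens = ["accelerating", "digital", "products"]

# All unigrams and bigrams, including ones not in the word list
all_terms = ngrams_up_to(tokens)

# Exact matching: a term is kept only if it is literally in the word list
kept = [term for term in all_terms if term in word_list]
```

With exact matching, "accelerating_digital" is generated but not kept, while "digital" and "digital_products" are. If a filter instead keeps every term that merely contains a listed word, terms such as "accelerating_digital" would pass as well.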

     


  • Telcontar120
    Telcontar120 New Altair Community Member
    If you change the order of your operators you should be able to resolve this. You may need to redo some work in that you would filter the text using your word list first, then generate the resulting word vector, then use the Generate n-grams operator to build the combinations after that.

  • nsmith
    nsmith New Altair Community Member
    edited October 2020
    Thank you so much for your fast answer @Telcontar120 ! I tried a few possibilities and rearranged the operators, but it doesn't really work. Either I get no results in the result list, or I get results but, on checking them, I realize that not every word that is in both the word list and the text is shown in the result list.
    I also tried placing the "Generate n-grams" operator at the end of the same "Process Documents" operator that contains the "Filter Tokens" operator. Nothing has really worked so far.
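
One possible explanation for the missing terms, not verified against the actual process, is the effect of operator order on multi-word terms. The plain-Python sketch below compares the two orders on made-up data; the `bigrams` helper and the word list are hypothetical, and RapidMiner's operators may behave differently.

```python
def bigrams(tokens):
    """Join each pair of adjacent tokens with '_'."""
    return ["_".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Hypothetical word list and made-up token sequence
word_list = {"digital", "digital_products"}
tokens = ["accelerating", "digital", "products"]

# Order A: filter single tokens first, then build n-grams.
# "products" is not in the word list on its own, so it is removed
# before "digital_products" can ever be formed.
filtered_first = [t for t in tokens if t in word_list]
order_a = filtered_first + bigrams(filtered_first)

# Order B: build n-grams first, then filter against the word list.
candidates = tokens + bigrams(tokens)
order_b = [t for t in candidates if t in word_list]
```

In this sketch, order A yields only "digital", because the second half of the two-word term was filtered out too early, whereas order B yields both "digital" and "digital_products".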