"[SOLVED]Classify PDF files using a set of wordlists"

sutlt
sutlt New Altair Community Member
edited November 5 in Community Q&A
Hi everyone,

My SPAD 7.4 license expired, and it is taking forever for my institute to negotiate a new one, so I decided to move to RapidMiner.
I am wondering how to use RM to achieve the following.

I have a large set (about 900) of equity reports in PDF format to analyze. Each report ranges from 1 to 30 pages, but only the sentences containing the word “quality” are relevant for my analysis. I have a list of negative words and a list of positive words, both constructed by someone else, that are used to describe “quality”. What process in RM can be used to analyze the sentences containing the word “quality” and then classify a PDF as (1) NEGATIVE VIEW if it describes the “quality” using any words from the negative word list, as (2) POSITIVE VIEW if it describes the “quality” using any words from the positive word list, and as (3) UNKNOWN if neither positive nor negative words are used?

Best,
Sutlt
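
For readers who want to see the three-way rule outside RM, here is a minimal Python sketch. It assumes the PDF text has already been extracted and the sentences mentioning "quality" have already been selected. The function name `classify` and the sample word lists are made up for illustration. The question does not say what should happen when a report matches both lists, so the sketch checks the negative list first, which is an arbitrary choice.

```python
import re

def classify(quality_sentences, positive_words, negative_words):
    """Label a report from its sentences that mention 'quality'."""
    # Word lists may be upper case, so compare everything in lower case.
    positive = {w.lower() for w in positive_words}
    negative = {w.lower() for w in negative_words}

    # Collect every word that occurs in the selected sentences.
    words = set()
    for sentence in quality_sentences:
        words.update(re.findall(r"[a-z']+", sentence.lower()))

    if words & negative:
        return "NEGATIVE VIEW"  # assumption: negative wins if both lists match
    if words & positive:
        return "POSITIVE VIEW"
    return "UNKNOWN"

# Hypothetical word lists, only for the demonstration.
print(classify(["Earnings quality is poor."], ["strong"], ["POOR", "BAD"]))
```

Matching on whole words means "bad" does not fire on "badge", but it also means inflected forms must be present in the list.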

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    in the post linked in my signature you will find a link to a tutorial site, which also covers text mining. You should have a look at those videos.
    Then, you can probably use Process Documents and, inside it, a Tokenize operator that splits on sentence boundaries (!?.: etc.). After that, use Filter Tokens to keep only the tokens (here, whole sentences) that contain the word "quality".
    The next steps depend a bit on how you define "describe quality". Please come back if you have any further questions, and describe in a bit more detail how the classification should work.

    Best,
    Marius
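
The tokenize-then-filter idea above can be sketched in plain Python as follows. This is not how the RM operators are implemented, only an approximation of the same two steps. The function name `quality_sentences` is hypothetical, and the split is deliberately naive: it breaks after `.`, `!`, `?` or `:` followed by whitespace, so abbreviations such as "Inc. reported" will be split in the wrong place.

```python
import re

def quality_sentences(text, keyword="quality"):
    """Split text into rough sentences and keep those containing the keyword."""
    # Step 1: split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?:])\s+", text.strip())
    # Step 2: keep only sentences with the keyword as a whole word.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

sample = "Revenue grew. Earnings quality is weak! Is quality improving? No idea."
print(quality_sentences(sample))
# ['Earnings quality is weak!', 'Is quality improving?']
```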
  • sutlt
    sutlt New Altair Community Member
    Hi Marius,

    Many thanks for your reply.
    I watched the videos on Rapid-i.com and also those on vancouverdata.blogspot, and I am now able to train RM on a few pre-classified reports so that it automatically classifies the unclassified documents.
    Awesome!

    I would like to follow up on how to incorporate the word list.

    Specifically, instead of using pre-classified reports to teach RM what negative means, would it be possible to train RM with a list of negative words?
    I tried simply replacing the pre-classified reports with the list, but the accuracy in the performance vector became 0.00%.

    The list, which consists of a few hundred words that are considered negative in English, is available here: http://www.nd.edu/~mcdonald/Data/Harvard%20IV_Negative%20Word%20List_Inf.txt

    It looks like:

    ABSURD
    ABSURDITY
    ABUSE
    ABUSED
    ABUSER
    ABUSERS
    ABUSES
    ABUSING
    ABUSIVE
    ABUSIVELY
    ABUSIVENESS
    ...
    BAD
    BADLY
    ...
    Besides, while each word in the list is negative, a "not" can change the meaning completely.
    For instance, "quality is bad" and "quality is not bad" both contain "bad", but the latter should not be classified as "negative" even though it contains a negative word. Are there any operators in RM that can deal with this situation?

    All the best,
    Sutlt.

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    I don't think that a simple word list will give you results as good as those from a trained model. But let's first deal with the technical issues:

    To overcome the negation problem, you can use the n-grams operator, which combines adjacent tokens into new tokens. For example, from "not" and "bad" it would create the new token "not_bad". Furthermore, your list contains a lot of variants of the same word. You can shorten the list by applying a Stemming operator to both the input data and the wordlist.
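
To make the n-gram idea concrete, here is a small Python sketch that imitates it. It only mimics the behaviour described above (adjacent tokens joined with an underscore) and is not the RM operator itself. The function names are hypothetical. As a simplification, a negative word is ignored whenever the sentence also contains its "not_" bigram, even if the word appears a second time without negation. Stemming is left out.

```python
import re

def with_bigrams(tokens):
    """Return the tokens plus every adjacent pair joined by '_'."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def is_negative(sentence, negative_words):
    """True if a listed word occurs and is not directly preceded by 'not'."""
    tokens = set(with_bigrams(re.findall(r"[a-z']+", sentence.lower())))
    for word in negative_words:
        w = word.lower()
        # "not_bad" in the token set means "bad" was negated in this sentence.
        if w in tokens and f"not_{w}" not in tokens:
            return True
    return False

print(with_bigrams(["quality", "is", "not", "bad"]))
print(is_negative("quality is bad", ["BAD"]))      # True
print(is_negative("quality is not bad", ["BAD"]))  # False
```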

    Anything beyond this depends on you: how do you want to use the wordlist? Is a document negative if it contains one element from the bad list? Or 10 elements? Or if more than 5% of its content is found on the bad list? And without any positive examples, you won't be able to correctly classify documents that are positive but nevertheless contain some words from the list.
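
The alternative decision rules just listed (one hit, ten hits, or a 5% share) can be compared with a sketch like the following. The function names and the parameters `min_hits` and `min_share` are made up for illustration; which thresholds make sense is exactly the open question above.

```python
import re

def negative_share(text, negative_words):
    """Return (number of hits, share of all tokens found on the list)."""
    negative = {w.lower() for w in negative_words}
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in negative)
    share = hits / len(tokens) if tokens else 0.0
    return hits, share

def document_is_negative(text, negative_words, min_hits=1, min_share=0.0):
    """Apply a count threshold and a share threshold together."""
    hits, share = negative_share(text, negative_words)
    return hits >= min_hits and share >= min_share

text = "bad quality and bad outlook but good margins"
print(negative_share(text, ["BAD"]))                        # (2, 0.25)
print(document_is_negative(text, ["BAD"]))                  # True: at least one hit
print(document_is_negative(text, ["BAD"], min_hits=10))     # False: fewer than 10 hits
print(document_is_negative(text, ["BAD"], min_share=0.05))  # True: 25% is above 5%
```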
  • sutlt
    sutlt New Altair Community Member
    Hi Marius,

    Thanks again for your help.
    I think I now understand a little bit more.

    Best,
    Sutlt