"Content analysis of annual corporate reports with text processing"

Nadine_12
edited November 5 in Community Q&A

Hi RM-Crew,

I have a question regarding a text mining project I want to do for my master's thesis.

I want to do a content analysis of corporate disclosures. To do so, I want to train a model on an example set (an Excel list of representative sentences, each classified into one of 6 topic categories). After that, I want to apply the model to several unseen annual reports (PDF format) to measure how much each company discloses about those 6 categories.

Now I am a little lost choosing the right transformation steps for the annual reports. I could tokenize the documents to get a full list of sentences, but I don't actually want every sentence to be categorized. I only want the model to measure how much of the content of each annual report refers to each of the 6 topics.

Do you have any ideas, or has anybody done a similar project?

Thanks and best regards,

Nadine


Answers

  • Telcontar120
    I think you are on the right track, but what you are describing is actually a somewhat complex text analysis project. It sounds like you don't want to structure the data quite the way you have it.
    What you are describing is very similar to LDA (Latent Dirichlet Allocation), a topic modeling approach for text data. Check out the operator and the tutorial sample included in RapidMiner (you'll need the Toolbox extension, which is free). However, LDA doesn't let you "train" a classifier with particular examples; instead, it looks for patterns in the data and comes up with its own topic groupings. To fit it, you feed it the entire document and typically tokenize at the word level rather than the sentence level, because that is much more granular and accurate.
    If you really have to train a model on your predefined categories, and the categories are not mutually exclusive, then you will likely have to build 6 separate predictive models, one for each topic. You would then run every document through those models and get a confidence score for each topic. In that case you probably want to tokenize every document at the word level again, so that the model has the raw material in its most flexible form for determining the classification labels (which you will provide for an initial sample).
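The "one model per topic" idea can be sketched outside RapidMiner as well. The following is only a minimal illustration in plain Python, assuming a small hand-written Naive Bayes classifier per topic; the topic names, the training sentences, and the helper names (`tokenize`, `BinaryNaiveBayes`, `topic_confidences`) are all hypothetical and not taken from the thread or from RapidMiner. Text extraction from the PDF reports is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-level tokens."""
    return re.findall(r"[a-z]+", text.lower())


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for a single topic: on-topic vs. off-topic."""

    def fit(self, texts, labels):
        # Assumes both classes (True and False) appear in the labels.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def confidence(self, text):
        """Return the model's confidence (0..1) that the text is on-topic."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                # Laplace smoothing so unseen words never give zero probability.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long texts.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])


# Hypothetical labelled sentences; a sentence may belong to several topics.
TRAINING = [
    ("We reduced carbon emissions and energy use.", {"environment"}),
    ("Emissions and waste targets are tied to executive pay set by the board.",
     {"environment", "governance"}),
    ("The board appointed two independent directors.", {"governance"}),
    ("Employee training and workplace safety improved.", {"employees"}),
    ("We hired new employees and expanded safety training.", {"employees"}),
]
TOPICS = ["environment", "governance", "employees"]

# One separate binary model per topic, all trained on the same sentences.
sentences = [sentence for sentence, _ in TRAINING]
MODELS = {
    topic: BinaryNaiveBayes().fit(
        sentences, [topic in labels for _, labels in TRAINING]
    )
    for topic in TOPICS
}


def topic_confidences(text):
    """Run one text through every topic model and collect the confidences."""
    return {topic: model.confidence(text) for topic, model in MODELS.items()}


scores = topic_confidences("Carbon emissions and waste fell.")
```

Because each topic has its own model, the confidences do not have to sum to 1, which is what allows a text to score highly on several topics at once. With so few training sentences the numbers themselves mean little; the sketch only shows the structure.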
  • Nadine_12

    @Telcontar120,

    Thank you very much for your detailed answer.

    Yes, unfortunately I have to train the model to recognize and measure predefined categories, so LDA will apparently not work for me.

    Do you know which predictive models are usually used for such text analysis projects? Unfortunately, the website for finding the right model, https://mod.rapidminer.com/, doesn't work, and neither does the link Ingo posted in his article, https://rapidminer.com/blog/doc-ingo-what-model-should-i-use/.

    I assume the confidence score that each topic model returns could be interpreted as the “amount of information” the unknown text contains about the specific topic, right?

    Thank you very, very much. You are really helping me a lot!

    Best regards,

    Nadine


  • Telcontar120
    Unfortunately there isn't just one "killer algorithm" when it comes to text classification, at least not in my experience. But there are a few that I would definitely try, including k-NN, Naive Bayes, SVM, and neural nets/deep learning. You can build a model for each topic with each of these algorithms and compare them, and also consider combining them in an ensemble.
    The resulting score is more accurately interpreted as the algorithm's confidence that a specific text relates to a specific topic. So you would likely want to rank the scores and establish some threshold cutoffs to decide which texts are related to or "about" each topic. Whether a document is really "about" a given topic is a somewhat individualized judgment, I think.
    By the way, if you are strictly looking to see whether a given document mentions a given set of words or phrases, you can build a rule-based model pretty easily using a binary term occurrences word vector and a given wordlist. That will simply tell you which documents contain those words or phrases. Sometimes that type of "score" is used in an audit context (e.g., it returns positive for every document that contains words of interest), but it doesn't necessarily tell you whether the document is "about" that topic.
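The rule-based wordlist check from the last answer can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the RapidMiner process itself: the wordlists, document names, and function names (`binary_occurrences`, `flag_documents`) are made up for illustration, and the terms are assumed to be lowercase words or phrases separated by single spaces.

```python
import re

# Hypothetical wordlists; real ones would come from the coding scheme.
WORDLISTS = {
    "environment": ["emissions", "renewable energy", "waste"],
    "governance": ["board of directors", "audit committee"],
}


def binary_occurrences(text, wordlists=WORDLISTS):
    """Return {topic: {term: 0 or 1}} marking which terms occur in the text."""
    # Lowercase, drop punctuation, and pad with spaces so that only whole
    # words and whole phrases can match.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    return {
        topic: {term: int(f" {term} " in padded) for term in terms}
        for topic, terms in wordlists.items()
    }


def flag_documents(documents, wordlists=WORDLISTS):
    """Map each document name to the topics with at least one matching term."""
    flags = {}
    for name, text in documents.items():
        hits = binary_occurrences(text, wordlists)
        flags[name] = [topic for topic, terms in hits.items() if any(terms.values())]
    return flags


# Hypothetical report snippets.
DOCUMENTS = {
    "report_a": "The Audit Committee reviewed emissions data.",
    "report_b": "Revenue grew by 5%.",
}
flags = flag_documents(DOCUMENTS)
```

As noted above, a positive flag only says that a term appears somewhere in the document; it says nothing about how much of the document deals with the topic.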