Hi RM-Crew,
I have a question regarding a text mining project I want to do for my master's
thesis.
I want to do a content analysis of corporate disclosure. To do that, I want to
train a model on an example set (an Excel list of representative sentences, each
classified into one of 6 topic categories). After that, I want to apply the model
to several unseen annual reports (PDF format) to measure how much each company
discloses about those 6 categories.
Now I am a little bit lost in choosing the right transformation steps for the
annual reports. I could tokenize the documents to get a full list of sentences,
but I don't actually want every sentence to be assigned to a category. I only
want the model to measure how much of the content of each annual report refers
to each of the 6 topics.
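
To make the intended measurement concrete, here is a rough sketch in Python
(not a RapidMiner process). Everything in it is an assumption for illustration:
the category names, the keyword lists, and the functions split_sentences,
classify and topic_shares are hypothetical, and the keyword lookup is only a
stand-in for the model trained on the Excel example set. Sentences that match no
category stay unassigned, so not every sentence is forced into a topic.

```python
import re
from collections import Counter

# Hypothetical stand-in for the trained model: keyword sets per category.
# In the real project this would be the classifier trained on the example set.
KEYWORDS = {
    "environment": {"emissions", "energy", "waste"},
    "employees": {"employees", "training", "staff"},
}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def classify(sentence):
    # Return the category with the most keyword hits, or None if nothing matches.
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    hits = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def topic_shares(text):
    # Share of sentences per category. Unassigned sentences count in the
    # denominator but are not attributed to any topic.
    sentences = split_sentences(text)
    if not sentences:
        return {}
    counts = Counter(c for c in map(classify, sentences) if c is not None)
    return {cat: counts.get(cat, 0) / len(sentences) for cat in KEYWORDS}

# Made-up report text, standing in for the text extracted from a PDF.
report = ("We reduced emissions and waste. Our staff received training. "
          "Revenue grew this year. Energy use fell.")
shares = topic_shares(report)
print(shares)
```

The result is one share per category and report, which could then be compared
across companies. Counting sentences is only one possible measure; counting
words per category would be another.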
Do you have an idea, or has somebody done a similar project?
Thanks and best regards,
Nadine