Topic Modeling for PDF files

User: "Karissa"
New Altair Community Member
Updated by Jocelyn
Hello everyone,

I want to read several PDF files (business reports) and analyze them. So far I have been using the Read Document operator, because I haven't found a better one yet.
I want to run topic modeling on the files to find the relevant topics. Pre-processing is done with the operators Tokenize, Transform Cases, Filter Stopwords, Filter Tokens by Length and Stem. For the topic modeling itself I have found two operators: Extract Topics from Documents (LDA) and Extract Topics from Data (LDA). Unfortunately, neither works properly for me.
Extract Topics from Documents (LDA) needs a collection as input, and I don't know how to get one.
Extract Topics from Data (LDA) needs a text attribute, and again I don't know how to get one.

Accordingly, I have these two questions:
1) Is there an operator I can use to read in multiple PDF files?
2) What is the best operator for Topic Modeling and how do I implement it?

I have created the process below. It runs, but I only get null values as results. Does anyone have a tip for me?

Many thanks for the help
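
For readers who want to see what the pre-processing chain named in the question does to a text, here is a minimal Python sketch of the Tokenize, Transform Cases, Filter Stopwords and Filter Tokens by Length steps. It is not the RapidMiner implementation: the function name `preprocess`, the tiny `STOPWORDS` set and the length limits are assumptions made for illustration, and the Stem step is left out because the Python standard library has no stemmer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def preprocess(text, min_len=3, max_len=25):
    """Tokenize, lowercase, drop stopwords, and filter tokens by length."""
    tokens = re.findall(r"[A-Za-z]+", text)             # Tokenize: keep runs of letters
    tokens = [t.lower() for t in tokens]                # Transform Cases: lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # Filter Stopwords
    # Filter Tokens by Length: keep tokens within [min_len, max_len]
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(preprocess("The revenue of the company increased in Q3."))
# prints ['revenue', 'company', 'increased']
```

If a document comes out of this step as an empty list, any topic model fed with it has nothing to work on, which is one way to end up with null results.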

    Hey,
    I think what you want to do is use Loop Files to loop over your files, with Read Document inside the loop. What you get back is a collection of documents, which you can then process as needed.

    Cheers,
    Martin
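
The pattern described in this reply (loop over the files, read each one, end up with a collection of documents) can be sketched outside RapidMiner in Python roughly as follows. This is only an analogy, not what Loop Files does internally. The names `collect_documents` and `read_plain_text` are hypothetical, and the default reader only handles plain-text files; extracting text from real PDFs would need a third-party PDF library passed in as `reader`.

```python
from pathlib import Path


def read_plain_text(path):
    """Placeholder reader (assumption): treats the file as UTF-8 text."""
    return path.read_text(encoding="utf-8", errors="replace")


def collect_documents(folder, pattern="*.txt", reader=read_plain_text):
    """Return {file name: text} for every file in `folder` matching `pattern`."""
    documents = {}
    for path in sorted(Path(folder).glob(pattern)):  # the "Loop Files" part
        if path.is_file():
            documents[path.name] = reader(path)      # the "Read Document" part
    return documents
```

Passing the reader in as a parameter keeps the looping logic independent of the file format, in the same way that Loop Files does not care which operator sits inside it.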
    User: "Karissa"
    New Altair Community Member
    OP
    Thank you @MartinLiebig. The Loop Files operator worked.
    The process now runs through, but all results are zero/null. What could be the reason for this?

    Many thanks

    User: "MartinLiebig"
    Altair Employee
    Accepted Answer
    Hi,
    most likely the texts are empty for some reason?

    BR,
    Martin
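
A quick way to test the "texts are empty" hypothesis is to check every document before it reaches the topic model. The sketch below assumes the documents are already available as a mapping from file name to extracted text; `find_empty_documents` and the file names are made up for illustration. The thread does not say what the actual cause was, so this only shows how such a check might look.

```python
def find_empty_documents(docs):
    """Return the names of documents whose text is empty or only whitespace."""
    return [name for name, text in docs.items() if not text or not text.strip()]


# Hypothetical example data: two of the three documents carry no usable text.
docs = {
    "report_2020.pdf": "Revenue grew by 4%.",
    "report_2021.pdf": "   \n",
    "scan.pdf": "",
}
print(find_empty_documents(docs))
# prints ['report_2021.pdf', 'scan.pdf']
```

Image-only PDFs (scans without a text layer) are a common reason for a document to come out empty like this.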
    User: "Karissa"
    New Altair Community Member
    OP
    I have changed the process and now I get a result. Many thanks.