How to split text from a PDF document into paragraphs and extract their information?

User: "SteliosManolis1995"
New Altair Community Member
Updated by Jocelyn

I am looking for a way to split laws, which come as PDF files, into their structural units: Section, Part, Chapter, Article, and Paragraph. A law does not have to contain all of them. For example, one law may contain only Section and Part, while another may contain all five. After splitting, the content of each Section, Part, Chapter, Article, and Paragraph must be kept. All of this information should be displayed in separate columns of a table, with as few errors as possible. The photo below shows all the possible ways in which a Greek law can be divided. Thanks in advance!


    User: "BalazsBaranyRM"
    New Altair Community Member
    Hi!

    Unfortunately, PDF files are highly unstructured. 

    Did you try to import example PDF files with RapidMiner? Is the text coming out correctly? Are there unique features in the extracted text that let you identify which part of the law each piece belongs to? If yes, try Extract Information or Generate Extract (depending on whether you want to work with documents or tables) with String Matching or Regular Expression or Region. 
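
To illustrate the regular-expression idea outside RapidMiner, here is a minimal Python sketch. It assumes the PDF text has already been extracted, that each heading sits on its own line, and that paragraphs start with a number and a period. The English keywords and both patterns are hypothetical placeholders; real Greek laws would need patterns matched to the actual extracted text.

```python
import re

# Hypothetical structure, ordered from outermost to innermost level.
LEVELS = ["Section", "Part", "Chapter", "Article", "Paragraph"]
# Assumed heading shape: keyword, whitespace, label, alone on a line.
HEADING = re.compile(r"^(Section|Part|Chapter|Article)\s+(\S+)\s*$")
# Assumed paragraph shape: "1. text ..."
PARAGRAPH = re.compile(r"^(\d+)\.\s+(.*)$")


def split_law(text):
    """Return one row (dict) per text block, with one column per level."""
    current = {level: None for level in LEVELS}
    rows = []
    open_row = None  # row that unnumbered continuation lines are appended to
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            level, label = heading.groups()
            current[level] = label
            # A new heading invalidates every deeper level.
            for deeper in LEVELS[LEVELS.index(level) + 1:]:
                current[deeper] = None
            open_row = None
            continue
        paragraph = PARAGRAPH.match(line)
        if paragraph:
            current["Paragraph"] = paragraph.group(1)
            open_row = {**current, "Text": paragraph.group(2)}
            rows.append(open_row)
        elif open_row is not None:
            # Continuation of the previous block.
            open_row["Text"] += " " + line
        else:
            # Unnumbered text directly under a heading.
            open_row = {**current, "Text": line}
            rows.append(open_row)
    return rows


sample = (
    "Part A\nChapter I\nArticle 1\n"
    "1. First paragraph\ncontinues here\n"
    "2. Second paragraph\n"
    "Article 2\n1. Another one\n"
)
rows = split_law(sample)
```

Levels that a law does not use simply stay empty (`None`) in every row, which fits the requirement that not every law contains all five levels. The list of dicts can be loaded into a table with one column per level.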

    Are the different parts uniquely formatted in the PDF? (e.g. Paragraph is in italics, Section has Roman numerals, ...) If yes, you could try converting the PDFs with external tools into XHTML and process that with XML tools until you get a structure that you can read with Read XML. But it won't be easy.
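
As a rough illustration of the formatting-based route, the Python sketch below classifies blocks of an XHTML document by their markup. The XHTML snippet is invented for the example, and the rule "bold means heading, italics means paragraph" is purely an assumption; the output of a real PDF-to-XHTML converter will look different and needs its own rules.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real converter's output will be structured differently.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><b>Chapter I</b></p>
<p><i>1. First paragraph</i></p>
<p>Plain text</p>
</body></html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}


def classify(xhtml):
    """Label each <p> block by the formatting element it contains."""
    root = ET.fromstring(xhtml)
    result = []
    for p in root.iterfind(".//x:p", NS):
        text = "".join(p.itertext()).strip()
        if p.find("x:b", NS) is not None:
            kind = "heading"      # assumed: bold marks a heading
        elif p.find("x:i", NS) is not None:
            kind = "paragraph"    # assumed: italics mark a paragraph
        else:
            kind = "other"
        result.append((kind, text))
    return result


blocks = classify(XHTML)
```

The labelled blocks could then be written back out as simpler XML so that each kind of block becomes its own element or attribute.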

    The best approach would be trying to get the originals in a structured form. 

    Regards,

    Balázs
    User: "SteliosManolis1995"
    New Altair Community Member
    OP
    Thanks for the answer!
    I’ll try your methods and I’ll be back soon with the results. 🙂

    Regards,

    Stelios