Create Document From Specific PDF Sections

Hi Everyone,

I am trying to create a single document from a PDF file, with a separate row for each of the repeated sections in that file.

Example: I have a PDF document with a lot of text, but it contains one repeated comments section that I would like to collect for each ID associated with the comments. I used the Process Documents from Files operator together with the Extract Information operator, with a start and an end expression to capture the comments in between. It works for the first section where the start and end expressions are found, but it doesn't capture the rest of the sections.

Please let me know if I need to explain this any further.

Thank you

Blah Blah Blah

ROWID

Start Section

Comments i need

End Section

Blah Blah Blah

ROWID

Start section

Comments i need

End Section

The final example set would be in this form:

ROWID - Comments

ROWID - Comments

Accepted answers

bhupendra_patil

This is not exactly your use case, but this knowledge base article does something similar: it cuts a document into pieces based on fullstop/and/but.

http://community.rapidminer.com/t5/Text-Analytics/Splitting-text-into-sentences/ta-p/31845

You can potentially apply the exact same process, but with different delimiters based on your criteria.

Let us know if this helps.

All comments

nickshel81

Thanks for the reply. I actually figured out how to do it using the Cut Document and Extract Information operators, and it worked great!
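For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of the "cut, then extract" approach. It is not the RapidMiner process itself. It assumes the PDF text has already been extracted into a string, that each ID sits on the last non-empty line before the start marker, and that the markers are literally "Start Section" and "End Section" (case-insensitive). The sample text, the IDs, and the function name `extract_sections` are made up for illustration. The key point is to iterate over every match rather than stop at the first one.

```python
import re

# Hypothetical sample text mirroring the layout in the question;
# the IDs "ID-001" and "ID-002" are invented for illustration.
SAMPLE_TEXT = """Intro text
ID-001
Start Section
First comment
End Section
More text
ID-002
Start section
Second comment,
spanning two lines
End Section
"""

# One match per repeated section:
#   rowid    = the line directly above the start marker (assumption)
#   comments = everything between the start and end markers
SECTION_PATTERN = re.compile(
    r"(?P<rowid>[^\n]+)\n\s*start section\s*\n(?P<comments>.*?)\n\s*end section",
    re.IGNORECASE | re.DOTALL,
)


def extract_sections(text):
    """Return a list of (rowid, comments) tuples, one per section found."""
    rows = []
    # finditer walks through ALL matches, not just the first one.
    for match in SECTION_PATTERN.finditer(text):
        rowid = match.group("rowid").strip()
        # Collapse line breaks and repeated spaces inside the comments.
        comments = " ".join(match.group("comments").split())
        rows.append((rowid, comments))
    return rows


if __name__ == "__main__":
    for rowid, comments in extract_sections(SAMPLE_TEXT):
        print(f"{rowid} - {comments}")
```

Each tuple corresponds to one row of the "ROWID - Comments" layout from the question. If the real markers or the position of the ID differ, only the pattern needs to change.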

On a different subject, I was wondering if you could help me, or point me to someone who can, with a post I submitted a while back that never got a reply: http://community.rapidminer.com/t5/RapidMiner-Studio/Emphasize-certain-tokens-for-classification/m-p/31650

Thanks again for your help

CraigBostonUSA

You can read PDF files and turn them into readable text with the PDF Table Extraction extension:

https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_pdf_table_extraction