Extract e-mail addresses out of a PDF

marcel_hanselma
marcel_hanselma New Altair Community Member
edited November 5 in Community Q&A
Hello dear RapidMiner community,
I have a PDF full of addresses (name, street, phone number, e-mail). I want to extract only the e-mail addresses and store them one per line in an Excel or CSV file. What is the best approach? (I am really a RapidMiner newbie.)
Greetings, Marcel

Best Answer

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓
    Hi @marcel_hanselma,

    Although your PDF is a scan and is not nicely formatted, it is workable: we can extract the e-mail addresses. I used the "Read Document" operator, as mentioned by Jacob. Here is the result:

    I used a Python script to search for, extract and display the e-mail addresses, because it is very easy in this language.
    (With RapidMiner's native operators, I was unable to extract ALL the occurrences: I could only find and extract the first one.)
    Thus, to run the process in the attached file, you will need:
     - to install Python on your machine (you can install it via Anaconda)
     - to install the Python Scripting extension from the Marketplace. Don't forget to set, in the RapidMiner settings, the path where your Python.exe file is installed.
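
For readers who do not have the attached process at hand, here is a minimal sketch of the kind of Python step described above. It is not the script from the attachment: the function names `extract_emails` and `save_emails` are hypothetical, and the regular expression is a simplified assumption that covers common addresses rather than every valid form. It finds all occurrences (not only the first one) in text that has already been read from the PDF, and removes repeats.

```python
import csv
import re

# Simplified e-mail pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return every e-mail address found in text, in order of first appearance."""
    seen = set()
    emails = []
    for address in EMAIL_RE.findall(text):
        key = address.lower()  # treat differently cased copies as duplicates
        if key not in seen:
            seen.add(key)
            emails.append(address)
    return emails


def save_emails(emails, path):
    """Write one address per row to a CSV file (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["email"])
        for address in emails:
            writer.writerow([address])
```

With the document text in a string `text`, calling `save_emails(extract_emails(text), "emails.csv")` would produce a file with one address per line that Excel can open. On a scanned PDF, text recognition errors may leave some addresses unmatched, so the result is worth a quick check.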

    Hope this helps,

    Regards,

    Lionel

    PS: Given that there are more than 1700 e-mail addresses in your document, the process is not instantaneous: you have to wait around 2 minutes...


Answers

  • jacobcybulski
    jacobcybulski New Altair Community Member
    When your PDF has a nicely formatted table, the PDF Table Extraction extension will do this with almost no time or effort. Otherwise you can use "Read Document" from the Text Processing extension and do a bit of gymnastics parsing the text.
    Jacob

  • marcel_hanselma
    marcel_hanselma New Altair Community Member
    That "bit of gymnastics" is what I am missing. I can read the document, but then I fail to extract all the e-mail addresses. The PDF is not nicely formatted.
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @marcel_hanselma,

    Can you provide your .pdf file so that we can see how to extract the e-mail addresses?

    You can send it via private message if it is not confidential...

    Regards,

    Lionel
  • marcel_hanselma
    marcel_hanselma New Altair Community Member
    Wow, thank you Lionel.
    It worked flawlessly. :-)