Extract e-mail addresses from a PDF
marcel_hanselma
New Altair Community Member
Hello dear RapidMiner community,
I have a PDF full of addresses (name, street, phone number, e-mail). I want to extract only the e-mail addresses and store them one per line in an Excel or CSV file. What is the best approach? (I am really a RapidMiner newbie.)
Greetings, Marcel
Best Answer
-
Hi @marcel_hanselma,
Although your PDF is a scan and is not nicely formatted, it is workable: the e-mail addresses can be extracted. I used the "Read Document" operator, as Jacob suggested.
I used a Python script to search for, extract and display the e-mail addresses, because this is very easy in Python.
(With RapidMiner's native operators I was unable to extract all the occurrences: I could only find and extract the first one.)
To run the process in the attached file, you will need:
- to install Python on your machine (you can install it via Anaconda)
- to install the Python Scripting extension from the Marketplace. Don't forget to set, in the RapidMiner settings, the path to your python.exe file.
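The attached process itself is not reproduced in this thread. As a rough sketch of the idea only, and assuming the text produced by "Read Document" is available as a plain Python string, the extraction could look something like the code below. The function names, the `email` column header and the regular expression are illustrative assumptions, not the contents of the attached script. The key point is `findall`, which returns every match rather than only the first one.

```python
import csv
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 validator):
# a local part, "@", a domain, and a top-level domain of two or more letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every e-mail address in text, in order of appearance.

    Duplicates are dropped, comparing addresses case-insensitively.
    """
    seen = set()
    emails = []
    # findall returns ALL non-overlapping matches, not just the first one.
    for match in EMAIL_PATTERN.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails


def write_emails_csv(emails, path):
    """Write one address per row to a CSV file with a single 'email' column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["email"])
        writer.writerows([email] for email in emails)
```

For example, `write_emails_csv(extract_emails(document_text), "emails.csv")` would produce a file that opens directly in Excel, where `document_text` is a hypothetical variable holding the extracted PDF text. Text from a scanned PDF may contain recognition errors, so some addresses can still be missed or mangled.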
Hope this helps,
Regards,
Lionel
PS: Since there are more than 1,700 e-mail addresses in your document, the process is not instantaneous: you have to wait around 2 minutes.
Answers
-
When your PDF has a nicely formatted table, the PDF Table Extraction extension will do this with almost no time or effort. Otherwise you can use "Read Document" from the Text Processing extension and do a bit of gymnastics parsing the text.
Jacob
-
That "bit of gymnastics" is what I am missing. I can read the document, but then I fail to extract all the e-mail addresses. The PDF is not nicely formatted.
-
Hi @marcel_hanselma,
Can you provide your PDF file so that we can see how to extract the e-mail addresses?
You can send it via private message if it is not confidential...
Regards,
Lionel
-
Wow, thank you Lionel.
It worked flawlessly. :-)