preprocessing: remove email signature
Joos
New Altair Community Member
Best Answer
-
Hi @Joos,
The problem is that messages have two parts: a header with a number of fields (such as From and To) and a body containing the text. You want to parse the "body".
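To make the header/body split concrete, here is a minimal sketch using Python's standard `email` package. It assumes each message is available as one raw string with the headers still attached; the sample message, the `Subject` line and the function name `split_message` are made up for illustration.

```python
from email import message_from_string
from email.policy import default
from email.utils import parseaddr

# Hypothetical raw message: headers, a blank line, then the body.
RAW_MESSAGE = """\
From: Rodrigo <rodrigo@example.com>
To: Joos <joos@example.org>
Subject: Example

Hello Joos,

This is an example message.
"""


def split_message(raw: str):
    """Return (sender address, plain-text body) for one raw e-mail."""
    msg = message_from_string(raw, policy=default)
    # parseaddr splits "Name <address>" into (name, address).
    sender = parseaddr(str(msg["From"]))[1]
    # get_body picks the text/plain part, also for multipart messages.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    return sender, body


sender, body = split_message(RAW_MESSAGE)
print(sender)
print(body)
```

Keeping the sender next to the body is useful later, because signatures are easier to spot when you compare messages from the same person.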
The first option can be demonstrated with the following two "e-mails":

From: Rodrigo <rodrigo@example.com>
To: Joos <joos@example.org>

Hello Joos,

This is an example message.

--
Rodrigo Fuentealba
Chile

From: Joos <joos@example.org>
To: Rodrigo <rodrigo@example.com>

Hi Rodrigo,

I see. Even though the footers are different, there is something many users do, which is putting a -- before their signature. Not everyone follows this, but a big part do.

--
Joos
Netherlands
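As a rough sketch of this first option, the snippet below cuts a message body at the last line that consists only of `--` (many mail clients write it as dash, dash, space). The function name `strip_delimited_signature` is made up for illustration, and the code assumes the headers were already removed.

```python
import re

# A line holding only "--", optionally followed by spaces, tabs or "\r".
SIG_DELIMITER = re.compile(r"^--[ \t\r]*$", re.MULTILINE)


def strip_delimited_signature(body: str) -> str:
    """Cut the body at the last "--" delimiter line, if there is one."""
    matches = list(SIG_DELIMITER.finditer(body))
    if not matches:
        # No delimiter found: leave the text alone (minus trailing whitespace).
        return body.rstrip()
    return body[: matches[-1].start()].rstrip()


message = (
    "Hello Joos,\n\n"
    "This is an example message.\n\n"
    "-- \n"
    "Rodrigo Fuentealba\n"
    "Chile\n"
)
cleaned = strip_delimited_signature(message)
print(cleaned)
```

A `--` in the middle of a sentence is not touched, because the pattern only matches when the dashes are alone on their line.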
Now, finding the last -- wouldn't work on all e-mails, because putting -- before the signature is a convention, not a rule. Let's think of another solution.
Let's say you have 1000 e-mails from me. If 300 of these e-mails end with the classic "Sent from my iPhone" as the last line, you can identify that line and remove it from those e-mails. But what about all the e-mails that I sent with my own signature? You may find that 600 e-mails from rodrigo@example.com always end with the "Rodrigo Fuentealba / Chile" signature, so that can be removed as well.
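The counting idea could look roughly like this in plain Python. It is only a sketch: it assumes you already have (sender, body) pairs, the function names are made up, and the `min_share` threshold of 25% is an arbitrary value you would have to tune on your own data.

```python
from collections import Counter, defaultdict


def last_line(body):
    """Return the last non-empty line of a body, stripped of whitespace."""
    lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def frequent_last_lines(messages, min_share=0.25):
    """Map each sender to the last lines that close at least `min_share`
    of that sender's messages (and occur at least twice)."""
    counts = defaultdict(Counter)
    totals = Counter()
    for sender, body in messages:
        totals[sender] += 1
        line = last_line(body)
        if line:
            counts[sender][line] += 1
    return {
        sender: {
            line
            for line, n in counter.items()
            if n >= 2 and n / totals[sender] >= min_share
        }
        for sender, counter in counts.items()
    }


def strip_last_line(body, candidates):
    """Drop the last line if it is one of the sender's frequent footers."""
    lines = body.rstrip().splitlines()
    if lines and lines[-1].strip() in candidates:
        lines.pop()
    return "\n".join(lines).rstrip()


messages = [
    ("rodrigo@example.com", "Hello,\nSee you tomorrow.\nSent from my iPhone"),
    ("rodrigo@example.com", "Thanks!\nSent from my iPhone"),
    ("rodrigo@example.com", "The report is attached.\nRodrigo Fuentealba / Chile"),
    ("rodrigo@example.com", "Meeting moved to 3pm.\nRodrigo Fuentealba / Chile"),
    ("joos@example.org", "I see."),
]
footers = frequent_last_lines(messages)
cleaned = [strip_last_line(body, footers[sender]) for sender, body in messages]
print(cleaned)
```

Nothing here depends on the language of the e-mails, since it only compares lines for equality, so it should behave the same on Dutch text.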
Answering your other questions:
- Yes, you can use Python code inside RapidMiner with the Python Scripting extension. However, the mail parser extension probably won't help you: this is a natural language processing (or pattern recognition) issue.
- Yes, you can do the pattern recognition in Dutch. I don't speak it, but I have done similar stuff in German.
Rodrigo.
Answers
-
Hello @Joos,
I can only recommend two ways. The first one is to remove everything from the last -- to the end of the message. The second one, if you know the sender of each e-mail, is to trim the messages and compare the last line of each e-mail, removing it while it is shared with other e-mails, until no last lines are the same.
Neither is battle-tested, and both involve some processing that I wouldn't have done with RapidMiner but much earlier, while retrieving the e-mails. So I'm afraid you are better off trying your luck with loading your data in Python and removing the e-mail signatures there.
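For what it's worth, the second way could be sketched in Python like this. It assumes `bodies` holds the header-free texts of messages from one sender; the function name and the `min_count` parameter are made up for illustration. Keep in mind that it will also strip a genuine closing line (say, "Thanks!") if several messages happen to end with it.

```python
from collections import Counter


def strip_shared_trailing_lines(bodies, min_count=2):
    """Repeatedly drop a message's last line while that same line
    also ends at least `min_count` messages in the batch."""
    remaining = [[ln.rstrip() for ln in b.strip().splitlines()] for b in bodies]
    while True:
        # Drop trailing blank lines so real text is always compared.
        for lines in remaining:
            while lines and not lines[-1].strip():
                lines.pop()
        tails = Counter(lines[-1].strip() for lines in remaining if lines)
        shared = {tail for tail, n in tails.items() if n >= min_count}
        if not shared:
            break  # no last line is shared any more
        for lines in remaining:
            if lines and lines[-1].strip() in shared:
                lines.pop()
    return ["\n".join(lines) for lines in remaining]


bodies = [
    "Hello Joos,\nThis is an example message.\nRodrigo Fuentealba\nChile",
    "Hi again,\nSee the attachment.\nRodrigo Fuentealba\nChile",
    "Quick note.\nRodrigo Fuentealba\nChile",
]
stripped = strip_shared_trailing_lines(bodies)
print(stripped)
```

Because the loop runs until no last line is shared, it also peels off signatures that span several lines, one line per pass.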
All the best,
Rodrigo.
-
Thank you for your answer. I am not sure I understand your first option, because all the footers are different, so I do not know how to recognize them. I did find Python code on GitHub (mailparser). Is it possible to include this as a Python script in the process? Could I include it in my loop going over the different mails and pass each mail to the Python parser as a document? Would the Python code need adjustment to get this working? Moreover, it would have to do the parsing in Dutch. Do you have experience with this?
-
Thanks Rodrigo... I kind of fixed the issue in Excel with formulas.