preprocessing: remove email signature

Joos
Joos New Altair Community Member
edited November 5 in Community Q&A
Hi
I am trying to apply LDA to emails. I have the mails in an Excel file. My model works, but I have to find a way to remove the email signatures. Does anyone have experience with this?

Thanks

Best Answer

  • rfuentealba
    rfuentealba New Altair Community Member
    Answer ✓
    Hi @Joos,

    The problem is that messages have two parts: a header with a number of address fields (such as From and To) and a body containing the text. You want to parse the "body".
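
A minimal sketch of that header/body split, using Python's standard `email` package. It assumes each mail is available as a raw, non-multipart plain-text message; `RAW_MESSAGE` and `split_header_and_body` are hypothetical names used only for illustration, and mails stored as plain cells in a spreadsheet may not have headers at all.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical raw message, modelled on the example e-mails in this thread.
RAW_MESSAGE = (
    "From: Rodrigo <rodrigo@example.com>\n"
    "To: Joos <joos@example.org>\n"
    "\n"
    "Hello Joos,\n"
    "\n"
    "This is an example message.\n"
    "\n"
    "--\n"
    "Rodrigo Fuentealba\n"
    "Chile\n"
)


def split_header_and_body(raw):
    """Return (sender_address, body) for a simple, non-multipart message."""
    message = message_from_string(raw)
    # parseaddr splits "Name <address>" into (name, address).
    _name, address = parseaddr(message["From"])
    # For a non-multipart message, get_payload() returns the body as a string.
    body = message.get_payload()
    return address, body


sender, body = split_header_and_body(RAW_MESSAGE)
print(sender)
print(body)
```

The body returned here still contains the signature; only the header has been separated out.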

    The first answer can be demonstrated with the following two "e-mails":

    From: Rodrigo <rodrigo@example.com>
    To: Joos <joos@example.org>
    
    Hello Joos,
    
    This is an example message.
    
    --
    Rodrigo Fuentealba
    Chile

    From: Joos <joos@example.org>
    To: Rodrigo <rodrigo@example.com>
    
    Hi Rodrigo,
    
    I see. Even though the footers are different, there is something many users do, which is putting a -- before their signature. Not everyone follows this, but a large share do.
    
    --
    Joos
    Netherlands

    Now, finding the last -- wouldn't work on all e-mails, because separating the signature from the rest with -- is a convention, not a rule. Let's think of another solution. Let's say you have 1000 e-mails from me. If 300 of these e-mails end with the classic "Sent from my iPhone" as the last line, you can identify that pattern and delete that line. But what about all the e-mails that I sent with my own signature? You may be able to identify that 600 e-mails from rodrigo@example.com always have the "Rodrigo Fuentealba / Chile" signature, so it can be removed too.
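
The frequency idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is a list of `(sender, body)` pairs, and the function name `strip_repeated_trailing_lines` and the thresholds `min_share` and `min_count` are hypothetical choices, not tested values. For each sender it repeatedly removes a last line that recurs across enough of that sender's e-mails, and stops once the last lines no longer repeat.

```python
from collections import Counter, defaultdict


def _trim(lines):
    """Drop trailing blank lines in place."""
    while lines and not lines[-1].strip():
        lines.pop()


def strip_repeated_trailing_lines(emails, min_share=0.3, min_count=2):
    """emails: list of (sender, body). Returns the bodies, in the same order."""
    # Group message indices by sender so signatures are compared per person.
    by_sender = defaultdict(list)
    for index, (sender, _body) in enumerate(emails):
        by_sender[sender].append(index)

    bodies = [body.splitlines() for _sender, body in emails]
    for lines in bodies:
        _trim(lines)

    for indices in by_sender.values():
        # A last line counts as "signature-like" if enough e-mails share it.
        threshold = max(min_count, min_share * len(indices))
        while True:
            last_lines = Counter(
                bodies[i][-1].strip() for i in indices if bodies[i]
            )
            repeated = {
                line for line, count in last_lines.items() if count >= threshold
            }
            if not repeated:
                break
            for i in indices:
                if bodies[i] and bodies[i][-1].strip() in repeated:
                    bodies[i].pop()
                    _trim(bodies[i])

    return ["\n".join(lines) for lines in bodies]


# Hypothetical sample data based on the examples in this thread.
emails = [
    ("rodrigo@example.com", "Hello Joos,\nFirst message.\n\nRodrigo Fuentealba\nChile"),
    ("rodrigo@example.com", "Hi again,\nSecond message.\n\nRodrigo Fuentealba\nChile"),
    ("rodrigo@example.com", "Quick note.\n\nSent from my iPhone"),
    ("joos@example.org", "Hi Rodrigo,\nI see.\n\nJoos\nNetherlands"),
]
cleaned = strip_repeated_trailing_lines(emails)
```

With so few messages, the single "Sent from my iPhone" line and Joos's only signature stay in place, because nothing repeats often enough to be detected. The opposite risk also exists: a closing phrase that many e-mails really share (for example "Thanks") would be removed as well, so the thresholds need checking against real data.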

    Answering your other questions:
    • Yes, you can use Python code inside RapidMiner with the Python Scripting extension. However, the mail parser extension probably won't help you, because this is a natural language processing (or pattern recognition) issue.
    • Yes, you can do the pattern recognition in Dutch. I don't speak it, but I have done similar work in German.
    All the best,

    Rodrigo.

Answers

  • rfuentealba
    rfuentealba New Altair Community Member
    Hello @Joos,

    I can only recommend two ways. The first is to remove everything from the last -- marker to the end of the message. The second, if you have the recipient of the e-mail, is to trim the message and compare the last line across e-mails, removing it until no last lines are the same.
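
The first option could look something like this in Python. It is a sketch, not a tested solution: it assumes the body is already available as plain text and treats a line consisting only of "--" (optionally followed by whitespace, as in the common "-- " form) as the signature marker. The function name `strip_signature_by_delimiter` is made up for this example.

```python
import re

# A line that is exactly "--", optionally followed by trailing whitespace.
SIGNATURE_DELIMITER = re.compile(r"--\s*")


def strip_signature_by_delimiter(body):
    """Remove everything from the last "--" line to the end of the body."""
    lines = body.splitlines()
    # Walk backwards so that the LAST delimiter line is the one used.
    for index in range(len(lines) - 1, -1, -1):
        if SIGNATURE_DELIMITER.fullmatch(lines[index]):
            return "\n".join(lines[:index]).rstrip()
    # No delimiter found: leave the text alone apart from trailing whitespace.
    return body.rstrip()


example = (
    "Hello Joos,\n"
    "\n"
    "This is an example message.\n"
    "\n"
    "--\n"
    "Rodrigo Fuentealba\n"
    "Chile\n"
)
print(strip_signature_by_delimiter(example))
```

E-mails without the marker pass through unchanged, so their signatures would still need another approach.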

    Neither is battle-tested, and both involve processing that I would not have done in RapidMiner but much earlier, while retrieving the e-mails. So I'm afraid you are better off trying your luck loading your data with Python to remove the e-mail signatures.

    All the best,

    Rodrigo.
  • Joos
    Joos New Altair Community Member
    Thank you for your answer. I'm not sure I understand your first option: all the footers are different, so I do not know how to recognize them. I did find Python code on GitHub (mailparser). Is it possible to include this as a Python script in the code? Could I include it in my loop over the different mails and pass each mail to the Python parser as a document? Would the Python code probably need adjustment to get this working? And would it have to do the parsing in Dutch? Do you have experience with this? Your 
  • Joos
    Joos New Altair Community Member
    Thanks, Rodrigo... I more or less fixed the issue in Excel with formulas.