preprocessing: remove email signature

Joos
Joos New Altair Community Member
edited November 2024 in Community Q&A
Hi
I am trying to apply LDA to emails. I have the mails in an Excel file. My model works, but I need a way to remove the email signatures. Does anyone have experience with this?

Thanks

Answers

  • rfuentealba
    rfuentealba New Altair Community Member
    Hello @Joos,

    I can only recommend two ways. The first one is to remove everything from the last -- line to the end of the message. The second, if you know who sent each e-mail, is to trim the messages and compare the last line across that sender's e-mails, removing it for as long as the last lines keep being the same.

    Neither is battle-tested, and both involve processing that I wouldn't do in RapidMiner but much earlier, while retrieving the e-mails. So I'm afraid you are better off loading your data with Python and removing the e-mail signatures there.

    All the best,

    Rodrigo.
  • Joos
    Joos New Altair Community Member
    Thank you for your answer. I'm not sure I understand your first option: all the footers are different, so I do not know how to recognize them. I did find Python code on GitHub (mailparser). Is it possible to include this as a Python script in the process? Could I call it in my loop over the different mails and pass each mail to the Python parser as a document? Would the Python code need adjusting to get this working? Moreover, it would have to do the parsing in Dutch. Do you have experience with this? Your 
  • rfuentealba
    rfuentealba New Altair Community Member
    Answer ✓
    Hi @Joos,

    The problem is that messages have two parts: a header with a number of fields (From, To and so on) and a body containing the text. You want to parse the body.
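A minimal sketch of that header/body split, using Python's standard `email` module. It assumes each cell in the Excel file holds the full raw message with its headers, which may not be the case for your export; `RAW` and `split_message` are made-up names for illustration.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical raw message, shaped like the examples in this thread.
RAW = (
    "From: Rodrigo <rodrigo@example.com>\n"
    "To: Joos <joos@example.org>\n"
    "\n"
    "Hello Joos,\n"
    "\n"
    "This is an example message.\n"
)

def split_message(raw):
    """Return (sender_address, body) for a plain-text, single-part message."""
    msg = message_from_string(raw)
    # parseaddr splits "Name <address>" into (name, address).
    _, address = parseaddr(msg["From"])
    # For a single-part message the payload is the text after the header block.
    return address, msg.get_payload()

sender, body = split_message(RAW)
print(sender)
print(body)
```

Multipart or HTML mails would need more work (walking the parts and picking the text one), so treat this only as a starting point.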

    The first option can be demonstrated with the following two "e-mails":
    From: Rodrigo <rodrigo@example.com>
    To: Joos <joos@example.org>
    
    Hello Joos,
    
    This is an example message.
    
    --
    Rodrigo Fuentealba
    Chile

    From: Joos <joos@example.org>
    To: Rodrigo <rodrigo@example.com>
    
    Hi Rodrigo,
    
    I see. Even though the footers are different, there is something many users do, which is putting a -- before their signature. Not everyone follows this but a big part do.
    
    --
    Joos
    Netherlands
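The "cut at the last --" idea could be sketched in Python roughly like this. The function name and the sample text are only for illustration, and the pattern also accepts "-- " with a trailing space, since many mail clients write the delimiter that way.

```python
import re

# A line holding only "--", optionally followed by spaces or tabs.
SIG_DELIMITER = re.compile(r"^--[ \t]*$")

def strip_after_delimiter(body):
    """Drop the last delimiter line and everything after it, if there is one."""
    lines = body.splitlines()
    # Walk backwards so that only the LAST delimiter counts.
    for i in range(len(lines) - 1, -1, -1):
        if SIG_DELIMITER.match(lines[i]):
            return "\n".join(lines[:i]).rstrip()
    return body.rstrip()  # no delimiter found: keep the text as it is

example = (
    "Hello Joos,\n\n"
    "This is an example message.\n\n"
    "--\n"
    "Rodrigo Fuentealba\n"
    "Chile\n"
)
print(strip_after_delimiter(example))
```

A -- that appears inside a sentence is left alone, because the pattern only matches a line that contains nothing else.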
    Now, cutting at the last -- wouldn't work on all e-mails, because putting -- before the signature is a convention, not a rule. Let's think of another solution.

    Let's say you have 1000 e-mails from me. If 300 of these e-mails end with the classic "Sent from my iPhone" as the last line, you can identify that pattern and delete it from those e-mails. But what about all the e-mails that I sent with my own signature? You may be able to identify that 600 e-mails from rodrigo@example.com always end with the "Rodrigo Fuentealba / Chile" signature, so it can be removed as well.
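One way this "same last line per sender" idea might look in Python. This is an untested-in-production sketch: the function name, the thresholds `min_count` and `min_share`, and the sample messages are all assumptions made up for the example, and the input is assumed to be a list of (sender, body) pairs.

```python
from collections import Counter, defaultdict

def _drop_trailing_blanks(lines):
    while lines and not lines[-1].strip():
        lines.pop()

def strip_repeated_endings(messages, min_count=2, min_share=0.3, max_rounds=10):
    """messages: list of (sender, body) pairs; returns cleaned bodies in the same order."""
    bodies = [body.splitlines() for _, body in messages]
    by_sender = defaultdict(list)
    for i, (sender, _) in enumerate(messages):
        by_sender[sender].append(i)

    for indices in by_sender.values():
        for _ in range(max_rounds):  # each round peels off at most one line
            for i in indices:
                _drop_trailing_blanks(bodies[i])
            # Count how often each last line occurs for this sender.
            endings = Counter(bodies[i][-1].strip() for i in indices if bodies[i])
            frequent = {
                line for line, n in endings.items()
                if n >= min_count and n >= min_share * len(indices)
            }
            if not frequent:
                break  # no last line repeats often enough any more
            for i in indices:
                if bodies[i] and bodies[i][-1].strip() in frequent:
                    bodies[i].pop()

    return ["\n".join(lines).rstrip() for lines in bodies]

# Hypothetical sample data.
MESSAGES = [
    ("rodrigo@example.com", "Hello Joos,\n\nFirst message.\n\nRodrigo Fuentealba\nChile"),
    ("rodrigo@example.com", "Hi,\n\nSecond message.\n\nRodrigo Fuentealba\nChile\n"),
    ("rodrigo@example.com", "Hi,\n\nThird one.\n\nSent from my iPhone"),
    ("rodrigo@example.com", "Hi,\n\nFourth one.\n\nSent from my iPhone"),
    ("joos@example.org", "Hi Rodrigo,\n\nI see.\n\nJoos\nNetherlands"),
]
cleaned = strip_repeated_endings(MESSAGES)
```

Note that a sender with only one e-mail is left untouched, and that a genuine closing sentence repeated in many e-mails would be stripped too, so the thresholds need tuning on real data.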

    Answering your other questions:
    • Yes, you can use Python code inside RapidMiner with the Python Scripting extension. However, the mail parser extension probably won't help you: this is a natural language processing (or pattern recognition) issue.
    • Yes, you can do the pattern recognition in Dutch. I don't speak it, but have done similar stuff in German.
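To make both points concrete, here is a rough sketch of a script for the Execute Python operator that cuts a mail at a Dutch sign-off line. It is a different heuristic from the ones above, not a tested solution. The column name `body`, the list of sign-off phrases and the `max_tail` limit are assumptions; `rm_main` is the function name the operator is expected to call, with the example set passed in as a pandas DataFrame.

```python
import re

TEXT_COLUMN = "body"  # hypothetical column name; use the one in your data

# Hypothetical list of Dutch sign-off phrases; extend it from your own mails.
SIGN_OFF = re.compile(
    r"^\s*(met vriendelijke groet(en)?|vriendelijke groet(en)?|groet(en|jes)?|mvg)\b[\s,.!]*$",
    re.IGNORECASE,
)

def strip_from_sign_off(text, max_tail=8):
    """Cut the text at the last sign-off line found within the final max_tail lines."""
    lines = str(text).splitlines()
    start = max(0, len(lines) - max_tail)
    for i in range(len(lines) - 1, start - 1, -1):
        if SIGN_OFF.match(lines[i]):
            return "\n".join(lines[:i]).rstrip()
    return str(text).rstrip()  # no sign-off near the end: keep the text

def rm_main(data):
    # data is assumed to be the pandas DataFrame RapidMiner hands to the script;
    # the returned DataFrame goes back to the process as an example set.
    data[TEXT_COLUMN] = data[TEXT_COLUMN].apply(strip_from_sign_off)
    return data
```

Only lines that consist of nothing but the sign-off are matched, so a sentence that merely starts with "Groeten" stays in the body.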
    All the best,

    Rodrigo.
  • Joos
    Joos New Altair Community Member
    Thanks, Rodrigo... I kind of fixed the issue in Excel with formulas.
