preprocessing: remove email signature

User: "Joos"
New Altair Community Member
Updated by Jocelyn
Hi
I am trying to apply LDA to emails. I have the mails in an Excel file. My model works, but I have to find a way to remove the email signatures. Does anyone have experience with this?

Thanks

    User: "rfuentealba"
    New Altair Community Member
    Accepted Answer
    Hi @Joos,

    The problem is that messages have two parts: a header with a number of fields (From, To and so on) and a body containing the text. You want to parse the body.
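If the cells in your Excel file hold complete raw messages (headers included, which is an assumption on my part), a minimal sketch of separating header and body with Python's standard `email` module could look like this. The function name `split_header_and_body` and the sample message are made up for illustration, and the sketch only covers plain-text, non-multipart messages.

```python
from email import message_from_string

# Sample raw message, modelled on the example e-mails in this thread.
raw_message = """\
From: Rodrigo <rodrigo@example.com>
To: Joos <joos@example.org>

Hello Joos,

This is an example message.
"""


def split_header_and_body(raw):
    """Return (headers, body) for a plain-text, non-multipart message."""
    msg = message_from_string(raw)
    headers = dict(msg.items())   # e.g. {"From": "...", "To": "..."}
    body = msg.get_payload()      # a str for non-multipart messages
    return headers, body


headers, body = split_header_and_body(raw_message)
print(headers["From"])
print(body)
```

If your cells only contain the body text already, this step is not needed and you can go straight to the signature handling.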

    The first answer can be demonstrated with the following two "e-mails":
    From: Rodrigo <rodrigo@example.com>
    To: Joos <joos@example.org>
    
    Hello Joos,
    
    This is an example message.
    
    --
    Rodrigo Fuentealba
    Chile

    From: Joos <joos@example.org>
    To: Rodrigo <rodrigo@example.com>
    
    Hi Rodrigo,
    
    I see. Even though the footers are different, there is something many users do, which is putting a -- before their signature. Not everyone follows this but a big part do.
    
    --
    Joos
    Netherlands
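
The `--` convention used in both example messages can be sketched in Python as follows. This is only a minimal sketch: the function name `strip_signature_at_delimiter` is made up for illustration, and it assumes the delimiter sits alone on its own line (the usual form is `-- ` with a trailing space, which `strip()` takes care of).

```python
def strip_signature_at_delimiter(body):
    """Drop everything from the last line that is exactly '--' onward."""
    lines = body.splitlines()
    # Walk backwards so that only the LAST delimiter counts.
    for index in range(len(lines) - 1, -1, -1):
        if lines[index].strip() == "--":
            return "\n".join(lines[:index]).rstrip()
    # No delimiter found: leave the message unchanged.
    return body


example = (
    "Hello Joos,\n\n"
    "This is an example message.\n\n"
    "--\n"
    "Rodrigo Fuentealba\n"
    "Chile"
)
print(strip_signature_at_delimiter(example))
```

Messages without the delimiter come back untouched, so they need a different approach.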
    Now, finding the last -- wouldn't work on all e-mails, because putting -- before the signature is a convention, not a guarantee.

    Let's think of another solution. Say you have 1000 e-mails from me. If 300 of them end with the classic "Sent from my iPhone" as the last line, you can identify that pattern and delete the line. But what about all the e-mails that I sent with my own signature? You may be able to identify that 600 e-mails from rodrigo@example.com always end with the "Rodrigo Fuentealba / Chile" signature, so it can be removed as well.
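
The per-sender idea can be sketched roughly like this: count which trailing lines recur for each sender and treat the frequent ones as signatures. Everything here is an assumption for illustration, including the function names, the `tail_lines` and `min_share` parameters and their defaults, and the sample data. Very short messages that repeat verbatim could be mistaken for signatures, so check the learned patterns before trusting them.

```python
from collections import Counter, defaultdict


def last_nonempty_lines(body, count):
    """Return the last `count` non-empty lines of a message as a tuple."""
    lines = [line.strip() for line in body.splitlines() if line.strip()]
    return tuple(lines[-count:])


def learn_signatures(messages, tail_lines=2, min_share=0.25):
    """messages: list of (sender, body). Returns {sender: set of line tuples}."""
    tails = defaultdict(Counter)
    totals = Counter()
    for sender, body in messages:
        totals[sender] += 1
        for count in range(1, tail_lines + 1):
            tail = last_nonempty_lines(body, count)
            if len(tail) == count:
                tails[sender][tail] += 1
    # Keep endings seen at least twice and in a large enough share of mails.
    return {
        sender: {
            tail for tail, seen in counter.items()
            if seen >= 2 and seen / totals[sender] >= min_share
        }
        for sender, counter in tails.items()
    }


def remove_signature(sender, body, signatures):
    """Cut a learned signature off the end of `body`, if one matches."""
    lines = body.splitlines()
    nonempty = [i for i, line in enumerate(lines) if line.strip()]
    # Try the longest known signature first.
    for tail in sorted(signatures.get(sender, ()), key=len, reverse=True):
        start = nonempty[-len(tail):]
        if len(start) == len(tail) and tuple(lines[i].strip() for i in start) == tail:
            return "\n".join(lines[:start[0]]).rstrip()
    return body


messages = [
    ("rodrigo@example.com", "Hello Joos,\n\nThis is an example message.\n\nRodrigo Fuentealba\nChile"),
    ("rodrigo@example.com", "Another note.\n\nRodrigo Fuentealba\nChile"),
    ("rodrigo@example.com", "Quick reply.\n\nSent from my iPhone"),
    ("rodrigo@example.com", "On my way.\n\nSent from my iPhone"),
]
signatures = learn_signatures(messages)
cleaned = [remove_signature(sender, body, signatures) for sender, body in messages]
```

With real data you would tune `min_share` against how often a sender actually uses each signature.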

    Answering your other questions:
    • Yes, you can use Python code inside RapidMiner with the Python Scripting extension. However, the mail parser extension probably won't help you, because this is a natural language processing (or pattern recognition) issue.
    • Yes, you can do the pattern recognition in Dutch. I don't speak it, but I have done similar work in German.
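
To make both points a bit more concrete, here is a rough sketch of a script for the Execute Python operator that cuts a message at a known closing line, with a couple of Dutch phrases included. As far as I know the operator calls a function named `rm_main` and hands it the input data as a pandas DataFrame. The column names `body` and `body_clean`, the helper `cut_at_closing` and the phrase list are all assumptions for illustration; build the real list from your own data.

```python
import re

# Illustrative closing phrases only; extend this from your own e-mails.
CLOSING_PATTERNS = [
    r"met vriendelijke groet(en)?",   # Dutch: "kind regards"
    r"verstuurd vanaf mijn iphone",   # Dutch: "sent from my iPhone"
    r"sent from my iphone",
]
CLOSING_LINE = re.compile(
    r"^\s*(" + "|".join(CLOSING_PATTERNS) + r")\s*[,.!]?\s*$",
    re.IGNORECASE,
)


def cut_at_closing(text):
    """Drop the first line that is a known closing phrase and all that follows."""
    lines = str(text).splitlines()
    for index, line in enumerate(lines):
        if CLOSING_LINE.match(line):
            return "\n".join(lines[:index]).rstrip()
    return str(text)


def rm_main(data):
    # `data` is expected to be a pandas DataFrame with a text column "body"
    # (hypothetical column name). The cleaned text goes into a new column.
    data["body_clean"] = data["body"].apply(cut_at_closing)
    return data
```

Outside RapidMiner you can try the helper directly, for example `cut_at_closing("Ok.\nVerstuurd vanaf mijn iPhone")`.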
    All the best,

    Rodrigo.