[SOLVED] Deleting text noise from large corpus

New Altair Community Member

Aug 28, 2012

Updated Nov 5, 2024 by Jocelyn

Hi

I have a PDF file which contains several thousand pages of emails. The problem is that each email contains its own set of noise (unique in the sense that it never repeats across emails). For example:

x-Mail: hbcFNvIWLDtFlpP.yxyP9bkreUY5ZzdUGPpkOhYIoR

This noise sometimes fills entire pages.

Can anyone point me in the right direction on how to minimize this noise, or somehow work around it?

Thanks.


MariusHelf

New Altair Community Member

Aug 29, 2012

Hi Sal,

If you use the TF-IDF measure, the noise will have little effect: each noise token appears in only one document (so its value is 0 in every other document) and thus does not bring any advantage for text classification.
Furthermore, the Process Documents operator has parameters to filter out words that appear too rarely (or too often).

Best,
Marius
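
For readers working outside RapidMiner, the pruning idea above can be sketched in plain Python. This is only an illustration under assumptions, not the Process Documents operator itself: the function name `prune_by_document_frequency`, its parameters `min_df` and `max_df_ratio`, and the second and third sample emails are all made up for this example. Only the first noise string comes from the question.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def prune_by_document_frequency(docs, min_df=2, max_df_ratio=1.0):
    """Hypothetical helper: keep only tokens whose document frequency is
    at least min_df and at most max_df_ratio * number of documents."""
    tokenized = [tokenize(d) for d in docs]

    # Count in how many documents each token occurs (not how often overall).
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    keep = {
        t for t, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }
    return [[t for t in tokens if t in keep] for tokens in tokenized]


# Made-up sample emails; only the first noise string is from the question.
emails = [
    "x-Mail: hbcFNvIWLDtFlpP.yxyP9bkreUY5ZzdUGPpkOhYIoR Please review the budget report",
    "x-Mail: Qz81LmmTTa0pRkeWv.Jd72HsnBBxcYt Budget meeting moved, please confirm",
    "x-Mail: pP0oLkJ9ihGfDsA.zXcVbNm123QwErTy Report attached, please confirm receipt",
]

# Drop tokens seen in fewer than 2 documents (the unique noise).
pruned_rare = prune_by_document_frequency(emails, min_df=2)

# Additionally drop tokens seen in more than 90% of documents.
pruned_both = prune_by_document_frequency(emails, min_df=2, max_df_ratio=0.9)

print(pruned_rare[0])
print(pruned_both[0])
```

Because every noise token occurs in exactly one email, `min_df=2` removes all of them without any knowledge of what the noise looks like. The upper bound then removes tokens such as the `x-Mail` header name that occur in nearly every email and so carry no information either.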
