Text mining of mailing list traffic
Andrea_g
New Altair Community Member
I've just installed RapidMiner 5.2 and noticed there is no importer for the mailbox format. I'm interested in extracting word frequencies from mailing list traffic.
Do you know any workflow or tutorial to perform this task with RapidMiner?
Right now I've managed to export the traffic in one big file in CSV format (from Thunderbird) but the RapidMiner CSV importer-parser gets very confused recognizing columns. Sample data can be found in the following list:
http://lists.gforge.inria.fr/pipermail/ecm-discuss/
Any help would be appreciated.
Answers
Hi,
did you try the Read Documents (Mail) operator from the text mining extension?
Best,
Marius
Marius wrote:
Hi,
did you try the Read Documents (Mail) operator from the text mining extension?
Best,
Marius
Yes, but it seems to read from a mail store, not from disk. I don't want to download my 4 GB of email and filter it again to start mining. Maybe there is a way to set up the connection to take mails from a file on the local disk?
Thanks
Andrea
OK, good point. Then let's return to the CSV file exported from Thunderbird. I can't find a downloadable CSV file at the link you provided. Can you post some sample data? Where does Read CSV fail?
Best, Marius
Hi Marius,
Just download and uncompress any of the files in gzip format: http://lists.gforge.inria.fr/pipermail/ecm-discuss/2012-March.txt.gz
Import it into a new Thunderbird folder.
Install this extension/add-on: ImportExportTools
Right click in the folder, Import/Export, Export all messages in the folder, Spreadsheet (CSV)
Let me know if you cannot reproduce the problem.
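As an alternative to the CSV export, the archive could also be parsed outside RapidMiner. The following is only a minimal sketch, assuming the uncompressed archive is a standard mbox file that Python's `mailbox` module can read. The function names `gunzip`, `body_text` and `word_frequencies` are made up for this example, and the tokenizer is deliberately crude (ASCII letters only).

```python
import gzip
import mailbox
import re
import shutil
from collections import Counter


def gunzip(src_path, dst_path):
    """Uncompress a .gz archive to a plain file on disk."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def body_text(msg):
    """Return the text/plain parts of a message as one string (best effort)."""
    parts = msg.walk() if msg.is_multipart() else [msg]
    chunks = []
    for part in parts:
        if part.get_content_type() != "text/plain":
            continue  # skips attachments, HTML parts and multipart containers
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name in the header is unknown to Python
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def word_frequencies(mbox_path):
    """Count lower-cased ASCII words over all message bodies in an mbox file."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            counts.update(re.findall(r"[a-z]+", body_text(msg).lower()))
    finally:
        box.close()
    return counts


# Example usage (file names taken from the archive linked above):
# gunzip("2012-March.txt.gz", "2012-March.txt")
# print(word_frequencies("2012-March.txt").most_common(20))
```

Headers such as the subject are not counted here, only the message bodies. Quoted replies and signatures are counted like any other text, so some filtering would probably be needed before the numbers are meaningful.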
Cheers,
Andrea
Hi,
the problem is that RapidMiner reads CSV files line by line. A line break inside a field is not treated as part of the field, even if the field is quoted. MS Excel seems to have the same problem. What I could do was:
1. Import the file with OpenOffice
2. Save it as an MS Excel file
3. Import the XLS file with RapidMiner
This worked for an exported folder of my own mailbox. However, I don't know whether that is scriptable for a huge number of files.
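One scriptable possibility is to pre-process the export so that every record fits on a single physical line before it reaches a line-wise reader. This is a rough sketch only: it assumes the export is comma-separated with double-quoted fields and UTF-8 encoded, which has not been checked against the actual ImportExportTools output. The name `flatten_csv` is made up for this example.

```python
import csv


def flatten_csv(src_path, dst_path, encoding="utf-8"):
    """Copy a CSV file, replacing line breaks inside fields with spaces.

    Returns the number of records written.
    """
    # Mail bodies can be longer than the csv module's default field limit.
    csv.field_size_limit(2**31 - 1)
    rows = 0
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(src_path, newline="", encoding=encoding) as src, \
            open(dst_path, "w", newline="", encoding=encoding) as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for record in csv.reader(src):
            writer.writerow([" ".join(field.splitlines()) for field in record])
            rows += 1
    return rows


# Example usage (hypothetical file names):
# flatten_csv("exported_folder.csv", "exported_folder_flat.csv")
```

The paragraph structure of the message bodies is lost this way, which should not matter for word frequencies. For many files the function could simply be called in a loop over the export directory.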
Happy Mining!
~Marius