Basic Text Mining From an Excel File
monamahfouz
New Altair Community Member
Hi everyone,
I would really appreciate some help / direction on how to tackle a basic text mining task. I have a spreadsheet with one column that I am interested in, titled "Hashtags". Using RapidMiner, I would like to count the occurrences of each unique hashtag and output those counts.
A single cell might hold several hashtags. For example, row #1's value is "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014", which contains FOUR hashtags, and each one should add to the count for that hashtag. Hence, I will need to tokenize every row's value.
If the tokenization is complex, I can skip it for now and only count rows that hold a single hashtag. My dataset is very large, so I can afford to ignore the rows that have multiple hashtags in one cell just to get it working.
I tried using SelectAttributes, Tokenize and DataToDocument but I am hitting a wall.
Any help / direction is appreciated, and hope this isn't too basic. Thanks for your help!
Mona
Answers
Hi Mona,
You don't need any Text Processing operators (in the RapidMiner sense) at all. First, let's ignore the multi-tag rows:
Load your data, and add a Filter Examples operator with the attribute_value_filter "Hashtags != .* .*" (without the quotes). The regular expression .* .* matches any value that contains a space, so the != condition keeps only the rows with a single hashtag.
Then add an Aggregate operator. Group by Hashtags and add the aggregation function count for Hashtags. That's it.
Best regards,
Marius
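For anyone who wants to sanity-check the counts outside RapidMiner, here is a minimal Python sketch of the same logic. It is not a RapidMiner process, only an illustration. The function names and the sample rows are made up for this example (only the first row comes from the question), and it assumes the filter's regular expression has to match the whole cell value. The first function mirrors the filter-then-aggregate approach above; the second splits each cell on whitespace, which is the tokenized count the question originally asked for.

```python
import re
from collections import Counter


def count_single_tag_rows(rows):
    """Drop rows whose value contains a space, then count the rest.

    Mirrors "Hashtags != .* .*" followed by a group-by count,
    assuming the regex must match the whole value.
    """
    kept = [row for row in rows if row and re.fullmatch(r".* .*", row) is None]
    return Counter(kept)


def count_all_hashtags(rows):
    """Split every cell on whitespace so multi-tag rows count each tag."""
    counts = Counter()
    for row in rows:
        counts.update(row.split())
    return counts


# Hypothetical sample data; only the first row is taken from the question.
rows = [
    "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014",
    "Oscars2014",
    "Oscars2014",
    "AmericanHustle",
]

single_counts = count_single_tag_rows(rows)
all_counts = count_all_hashtags(rows)

print(single_counts.most_common())
print(all_counts.most_common())
```

With this sample, the single-tag count ignores the first row entirely, while the tokenized count adds its four hashtags to the totals. Comparing the two on a small export is a quick way to see how much the multi-tag rows matter.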