Basic Text Mining From an Excel File

monamahfouz New Altair Community Member
edited November 5 in Community Q&A
Hi everyone,

I would really appreciate some help / direction on how to tackle a basic text mining task. I have a spreadsheet with one column that I am interested in, titled "Hashtags". Using RapidMiner, I would like to count the occurrences of each unique hashtag and output those counts.

A single cell might contain several hashtags. For example, row #1's value is "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014", which contains FOUR hashtags, and each of them should add one to the count for its own hashtag. Hence, I will need to tokenize every row's value.
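To make the intended result concrete, here is a minimal sketch of the counting logic in plain Python (outside RapidMiner). It assumes hashtags within a cell are separated by whitespace, as in the example above; the function name `count_hashtags` and the second sample row are made up for illustration.

```python
from collections import Counter


def count_hashtags(cells):
    """Count every whitespace-separated hashtag across all cells."""
    counts = Counter()
    for cell in cells:
        # split() with no argument splits on any run of whitespace,
        # so an empty cell simply contributes nothing.
        counts.update(cell.split())
    return counts


# Row 1 is the example from this post; row 2 is a hypothetical single-tag row.
rows = [
    "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014",
    "Oscars2014",
]
hashtag_counts = count_hashtags(rows)

for tag, n in hashtag_counts.most_common():
    print(tag, n)
```

With these two rows, Oscars2014 would be counted twice and the other three hashtags once each.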

If the tokenization is complex, I can skip it for now and only work with rows that contain a single hashtag. My dataset is very large, so I can afford to ignore the rows that have multiple hashtags in one cell to get it working.

I tried using SelectAttributes, Tokenize and DataToDocument but I am hitting a wall.

Any help / direction is appreciated, and I hope this isn't too basic. Thanks for your help!
Mona

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi Mona,

    You don't need any Text Processing operators (in the RapidMiner sense) at all. First, let's ignore the multi-tag rows:
    Load your data and add a Filter Examples operator with the attribute_value_filter "Hashtags != .* .*" (without the quotes). The regular expression .* .* matches any value that contains a space, so this filter keeps only the rows with a single hashtag.
    Then add an Aggregate operator. Group by Hashtags and add the aggregation function count for Hashtags. That's it :)

    Best regards,
    Marius
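For readers who want to check the logic of the two steps above outside RapidMiner, here is a rough Python sketch. It is not RapidMiner code: it only mirrors the idea of dropping every value that matches `.* .*` and then counting per group. It assumes the pattern is matched against the whole cell value; the function name and the sample values are hypothetical.

```python
import re
from collections import Counter

# Same pattern as in the filter: any value that contains a space.
MULTI_TAG = re.compile(r".* .*")


def count_single_tag_rows(values):
    """Drop multi-hashtag rows, then count occurrences of each remaining value."""
    # Mirrors the "!=" filter: keep rows that do NOT match the pattern.
    single = [v for v in values if not MULTI_TAG.fullmatch(v)]
    # Mirrors "group by the hashtag column and count".
    return Counter(single)


# Hypothetical sample column; the second row has two hashtags and is skipped.
sample = ["Oscars2014", "12YearsASlave Oscars2014", "Oscars2014", "AmericanHustle"]
counts = count_single_tag_rows(sample)

for tag, n in counts.most_common():
    print(tag, n)
```

Note that, as in the answer above, the hashtags inside skipped multi-tag rows are not counted at all, so the totals are a lower bound on the true counts.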