"Brand new user - text mining basics"

GizaJack
GizaJack New Altair Community Member
edited November 5 in Community Q&A
Hello. If anyone is willing to help point a newb in the right direction, I'd appreciate it very much. I am working on a personal project to get a feel for the software and concepts, which will lead into a school project. I have an Excel file with the lyrics and some other basic information for several hundred songs. I want to look for interesting relationships between word usage and attributes such as the artist's gender, the year the song was written, and whether the song was a hit.
For my first go I thought I'd try focusing on just the decade (70s, 80s, 90s) and the lyrics. Maybe certain words didn't appear until a certain timeframe or there are some interesting cultural references. I can import the data and get the word frequency lists and understand on a basic level how to use the association operators. However, I'm not sure what I need to do so that RapidMiner groups the text by years/decades. Will I be able to see easily that in different years/decades certain words appear together or at all? What operators should I use and what should my data be like? Is an Excel file with a separate row for each song sufficient?
Do you think this is even a good starter project, or will nothing interesting/useful come out of it?

Thanks in advance for any advice or pointers in the right direction.

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi GizaJack,

    First of all: any task in data mining needs some work and a rough knowledge of some data mining concepts, and there is almost no problem which you can solve "easily". To get an overview of how to work with RapidMiner, you should have a look at our video tutorials, which you can find on our webpage: http://rapid-i.com/content/view/189/212/lang,en/
    There you can also find an introduction to text mining with RapidMiner.

    Generally, an Excel file with one row per song should be fine. You presumably have a column for each feature of a song, e.g. decade, maybe genre, and one big column for the complete lyrics of each song. If that is the case, you should be fine. It may be a good idea to import the data only once and store it in the RapidMiner repository for easier access.

    To filter examples with a certain value in one attribute, you can use the "Filter Examples" operator.

    For anything else please have a look at the tutorials. If you have any further questions, just ask!

    Cheers, Marius
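
To make the idea in this thread concrete outside of RapidMiner, here is a minimal sketch in plain Python of the layout described above (one row per song) and of grouping word frequencies by decade. It is not a RapidMiner process and does not use any RapidMiner API. The sample songs, the field names (`title`, `year`, `lyrics`) and the helper functions are all hypothetical, and the tokenizer is deliberately naive (lowercase alphabetic tokens, no stop-word removal or stemming).

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for the Excel sheet: one row per song.
songs = [
    {"title": "Song A", "year": 1975, "lyrics": "dancing in the disco light"},
    {"title": "Song B", "year": 1978, "lyrics": "disco nights and dancing shoes"},
    {"title": "Song C", "year": 1984, "lyrics": "video on the radio tonight"},
    {"title": "Song D", "year": 1993, "lyrics": "radio silence on the internet"},
]


def decade_of(year):
    """Map a year such as 1984 to the start of its decade, 1980."""
    return year // 10 * 10


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts_by_decade(rows):
    """Return {decade: Counter of word frequencies over all songs in that decade}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[decade_of(row["year"])].update(tokenize(row["lyrics"]))
    return dict(counts)


def first_decade_of_each_word(counts):
    """Return {word: earliest decade in which the word occurs at all}."""
    first_seen = {}
    for decade in sorted(counts):
        for word in counts[decade]:
            first_seen.setdefault(word, decade)
    return first_seen


counts = word_counts_by_decade(songs)
first_seen = first_decade_of_each_word(counts)

# The two most frequent words of the 1970s songs, and the decade "radio" first shows up.
print(counts[1970].most_common(2))
print(first_seen["radio"])
```

Keeping the decade as its own derived value mirrors having a separate decade column in the spreadsheet: selecting a single decade is then just a lookup such as `counts[1980]`, which plays roughly the role that filtering rows on one attribute value plays in the answer above.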