Music Lyrics Analyzer: how to handle repeated lyrics?

mt_12345 New Altair Community Member
edited November 5 in Community Q&A

Hey guys,

I'm currently working on an automatic Music Lyrics Analyzer (MLA). The MLA uses text analytics methods on an established platform to analyze the vocabulary used in song lyrics of different artists / genres and to build clusters of songs based on their lyrics. In many songs, some sections of the lyrics are repeated, which is indicated by the string “x2”.

 

In my opinion, I have to account for those repetitions to avoid skewing the classification model's results. Do you agree? If so, how should I handle this? Which operators should I choose?

 

Many thanks for your help! Have a good day!

 

Answers

  • sgenzer
    Altair Employee

    Hmm, I'm not sure whether you should be weighting the repetitions, but if you use tokenization and TF-IDF, the repetitions will be weighted accordingly anyway.
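
    To illustrate why written-out repetitions matter for the weighting, here is a minimal sketch of a toy TF-IDF in plain Python. It is not the platform's operator; the function name `tf_idf`, the sample lyrics, and the exact formula (relative term frequency times log(N/df)) are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Toy formula (an assumption): tf = count / document length,
    idf = log(number of documents / documents containing the term).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

# Made-up lyrics: the same song with "hold on x2" ignored vs. written out.
other_songs = ["rain falls on the city", "we drive all night"]
as_written = "we dance all night hold on"
expanded = "we dance all night hold on hold on"

w_written = tf_idf([as_written] + other_songs)[0]
w_expanded = tf_idf([expanded] + other_songs)[0]

print("hold:", w_written["hold"], "->", w_expanded["hold"])
print("dance:", w_written["dance"], "->", w_expanded["dance"])
```

    In this toy setup the weight of "hold" rises once the repeated line is written out, while the weight of "dance" drops slightly because the song got longer.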


    Scott

     

  • mt_12345
    mt_12345 New Altair Community Member

    Thanks a lot for your answer. I will try it out!

     

    Cheers

  • mt_12345
    mt_12345 New Altair Community Member

    Just to make sure that everyone gets my question right: the repetitions are only indicated by the string x2; the text itself is not included twice in the song text. So we have to transform the text so that it really appears twice, right? Any ideas how we can do this?

    I think what Scott suggested comes one step later.

    Thanks

  • David_A
    David_A New Altair Community Member

    Hi,

     

    It depends a bit on how the lyrics are returned, for example as one token per line or per stanza. If that is the case, you can use regular expressions with the Replace operator.

    Perhaps a bit cumbersome, but something like this should do the trick:

    • Replace what: (.+) x2
    • Replace with: $1 $1

    Then you can repeat that pattern for x3, x4, ... 
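
    Outside the platform, the x2, x3, x4, ... cases could be covered by one pattern instead of one replacement per count. A rough Python sketch, assuming each line carries at most one marker; the helper name `expand_repeats` is made up for this example:

```python
import re

# Matches "<text> xN" for any positive integer N (greedy, like the pattern above).
MARKER = re.compile(r"(.+) x(\d+)")

def expand_repeats(line: str) -> str:
    """Replace '<text> xN' with '<text>' written out N times."""
    def repl(match: re.Match) -> str:
        text, count = match.group(1), int(match.group(2))
        return " ".join([text] * count)
    return MARKER.sub(repl, line)

print(expand_repeats("hold on x2"))      # hold on hold on
print(expand_repeats("la la x3"))        # la la la la la la
print(expand_repeats("no marker here"))  # no marker here
```

    Like the original pattern, this is not anchored, so a lyric that happens to contain something like " x5" in the middle of a line would also be expanded.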

     

    Hope this helps.

     

  • kayman
    kayman New Altair Community Member

    Regular expressions are indeed probably the best approach here, but the quality will depend on your original data. The one given by David would already work to some extent, but since it's greedy it can grab too much text if you have multiple x2's in your data. If your structure is as follows (so with line breaks):

     

    some sentence

    another sentence x2

    yet again another sentence

    and some other x2

     

    The regular expression that will work best in that case is (?m)^(.*?) x2$

     

    Roughly translated, this means: for each line, start at the beginning and group everything up to the x2 at the end of the line.

     

    Replacing it with $1 $1 will then give you the same string twice. If there is no x2 in the string/line, it will simply keep the original.
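
    For anyone who wants to check the pattern outside the platform, here is a small sketch with Python's re module on the four sample lines from above. Python writes the backreference as \1 where the Replace operator uses $1; everything else is the same pattern.

```python
import re

lyrics = (
    "some sentence\n"
    "another sentence x2\n"
    "yet again another sentence\n"
    "and some other x2"
)

# (?m) lets ^ and $ match at the start and end of every line,
# so each line is handled on its own.
expanded = re.sub(r"(?m)^(.*?) x2$", r"\1 \1", lyrics)
print(expanded)
# some sentence
# another sentence another sentence
# yet again another sentence
# and some other and some other
```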

     

    If everything is on one line, (.*?) x2 will also do fine, but make sure you use the question mark if x2 occurs more than once in your string. It ensures the capture stops as soon as it finds an x2; otherwise it will take everything until the last x2.
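
    The difference between the greedy and the lazy version can be seen on a single line with two markers. A short Python sketch with a made-up sample line (again with \1 in place of $1):

```python
import re

line = "hold on x2 let go x2"

# Greedy: (.*) runs to the LAST " x2", so the first marker ends up inside the group.
greedy = re.sub(r"(.*) x2", r"\1 \1", line)

# Lazy: (.*?) stops at the FIRST " x2", so each marker is handled separately.
lazy = re.sub(r"(.*?) x2", r"\1 \1", line)
lazy = " ".join(lazy.split())  # collapse the doubled space left by the second match

print(greedy)  # hold on x2 let go hold on x2 let go
print(lazy)    # hold on hold on let go let go
```

    Note that the lazy version still repeats everything between the previous marker and the current one, so on a single line it cannot tell where the repeated section really starts.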

     

    Note that if your x2 is in parentheses, the pattern becomes (.*?) \(x2\)
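
    If the data mixes both notations, one pattern could accept the marker with or without parentheses. This goes a step beyond what was described above and is only a sketch; the function name `expand` is made up for the example.

```python
import re

# \(? and \)? make the literal parentheses optional, so both
# "x2" and "(x2)" at the end of a line are recognised.
PATTERN = re.compile(r"(?m)^(.*?) \(?x2\)?$")

def expand(text: str) -> str:
    """Write out lines that end in 'x2' or '(x2)' twice."""
    return PATTERN.sub(r"\1 \1", text)

print(expand("hold on (x2)"))  # hold on hold on
print(expand("hold on x2"))    # hold on hold on
```

    The optional parentheses are not checked for balance, so a stray "(x2" would be accepted as well.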

  • mt_12345
    mt_12345 New Altair Community Member

    Thanks a lot guys! I need to try it out to see if the results are satisfying.