Music Lyrics Analyzer: how to handle repeated lyrics?
Hey guys,
I'm currently working on an automatic Music Lyrics Analyzer. The MLA uses text analytics methods based on an established platform to analyze the vocabulary used in song lyrics of different artists / genres and to build clusters of songs based on their lyrics. In many songs, some sections of the lyrics are repeated twice, which is indicated by the string "x2".
In my opinion, I have to account for those repetitions to avoid skewing the classification model's results. Do you agree? If so, how should I handle this? Which operators should I choose?
Many thanks for your help! Have a good day!
Answers
-
Hmm, I'm not really sure whether you should be weighting the repetitions, but if you use tokenization and TF-IDF, the repetitions will be weighted accordingly anyway.
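To illustrate the point about weighting, here is a minimal sketch in Python. It assumes plain whitespace tokenization and raw term counts, not any particular platform's TF-IDF operator, and `term_counts` is a made-up helper name. It only shows that a line written out twice doubles the counts that a TF-IDF weighting would start from.

```python
from collections import Counter


def term_counts(lyrics: str) -> Counter:
    """Count lowercase, whitespace-separated tokens (stand-in for a real tokenizer)."""
    return Counter(lyrics.lower().split())


# The same line, once as written and once with the repetition spelled out.
once = term_counts("la la love")
twice = term_counts("la la love la la love")

print(once["la"], twice["la"])      # 2 4
print(once["love"], twice["love"])  # 1 2
```

This only applies when the repeated text is actually present in the document; a bare marker would just be counted as one more token.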
Scott
0 -
Thanks a lot for your answer. I will try it out!
Cheers
0 -
Just to make sure that everyone gets my question right: the repetitions are only indicated by the string "x2"; the text itself is not included twice in the lyrics. So we have to transform the text so that it really appears twice, right? Any ideas how we can do this?
I think what Scott suggested comes one step later.
Thanks
0 -
Hi,
It depends a bit on how the lyrics are returned, for example as one token per line or per stanza. If that is the case, you can use regular expressions with the Replace operator.
Perhaps a bit cumbersome, but something like this should do the trick:
- Replace what: (.+) x2
- Replace with: $1 $1
Then you can repeat that pattern for x3, x4, ...
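Instead of one rule per count, the number after the x can be captured as well. A minimal sketch in Python, assuming one lyric line per string with the marker at the end of the line; `REPEAT` and `expand_line` are hypothetical names, not operators of the platform:

```python
import re

# "<text> xN" at the end of a line; group 1 is the text, group 2 is the count.
REPEAT = re.compile(r"^(.+) x(\d+)$")


def expand_line(line: str) -> str:
    """Return the line with a trailing ' xN' marker replaced by N copies of the text."""
    match = REPEAT.match(line)
    if match is None:
        return line  # no marker: keep the line unchanged
    text, count = match.group(1), int(match.group(2))
    return " ".join([text] * count)


print(expand_line("oh yeah x2"))  # oh yeah oh yeah
print(expand_line("hey x3"))      # hey hey hey
print(expand_line("no marker"))   # no marker
```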
Hope this helps.
2 -
Regular expressions are probably the best approach here indeed, but the quality will depend on your original data. The one given by David would already work to some extent, but since it's greedy it can capture too much text if you have multiple x2's in your data. If your structure is as follows (so with line breaks):
some sentence
another sentence x2
yet again another sentence
and some other x2
The regular expression that will work best in that case is (?m)^(.*?) x2$
Roughly translated, this means: for every line, start at the beginning and group everything that appears before an x2 at the end of that line.
Replacing it with $1 $1 will then give you the same string twice. If there is no x2 in the string/line, it will simply keep the original.
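If you want to check the pattern outside the platform, here is a minimal sketch in Python (my choice for the example, not something the thread uses). Note that Python writes the backreference as \1 where the Replace operator uses $1.

```python
import re

lyrics = "\n".join([
    "some sentence",
    "another sentence x2",
    "yet again another sentence",
    "and some other x2",
])

# (?m) makes ^ and $ match at every line break, so each line is handled on its own.
expanded = re.sub(r"(?m)^(.*?) x2$", r"\1 \1", lyrics)

print(expanded)
# some sentence
# another sentence another sentence
# yet again another sentence
# and some other and some other
```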
If everything is on one line, (.*?) x2 will do fine also, but make sure you use the question mark if x2 occurs more than once in your string. This ensures the capture stops as soon as it finds an x2; otherwise it will take everything up to the last x2.
Note that if your x2 is in parentheses, the expression becomes (.*?) \(x2\)
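The difference between the greedy and the lazy version is easy to see on a single line with two markers. Again a minimal sketch in Python with made-up sample text, using \1 in place of $1:

```python
import re

line = "first part x2 second part x2"

# Greedy: (.+) runs up to the LAST " x2", so the first marker stays in the text.
greedy = re.sub(r"(.+) x2", r"\1 \1", line)

# Lazy: (.*?) stops at the FIRST " x2", then the search continues for the next one.
lazy = re.sub(r"(.*?) x2", r"\1 \1", line)

# Marker in parentheses: the brackets have to be escaped.
paren = re.sub(r"(.*?) \(x2\)", r"\1 \1", "oh yeah (x2)")

print(greedy)  # first part x2 second part first part x2 second part
print(lazy.split())
print(paren)   # oh yeah oh yeah
```

The lazy result can contain a doubled space, because the second capture starts with the space that followed the first marker; the tokenizer should normally not care about that.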
2 -
Thanks a lot guys! I need to try it out to see if the results are satisfying.
1