Aggregating Categorical Values - Music Genre
zsteiner
New Altair Community Member
I'm working with a dataset that has multiple genres per entry. For example, one row might have g1 = rap, g2 = demotrack, g3 = polish trap. None of these genres can be said to be the "primary" genre, so all need to be retained. I am trying to train a model on this dataset to predict the genre value, but I am having a hard time finding a way to make a single genre column with multiple values per row. Is there a way to do this? Any suggestions are appreciated, and I am happy to clarify.
Best Answer
-
As Rodrigo said, you need to transform your data so you have only a single genre column but the same song can appear multiple times. This will allow you to build a single model to predict genre.
To accomplish this in RapidMiner, use the De-Pivot operator. The example process below works with your sample data (change the path to your data file in the Read CSV operator first).
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="120"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="read_csv" compatibility="9.2.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34"> <parameter key="csv_file" value="C:\Users\brian\Downloads\sample.csv"/> <parameter key="column_separators" value=","/> <parameter key="trim_lines" value="false"/> <parameter key="use_quotes" value="true"/> <parameter key="quotes_character" value="&quot;"/> <parameter key="escape_character" value="\"/> <parameter key="skip_comments" value="true"/> <parameter key="comment_characters" value="#"/> <parameter key="starting_row" value="1"/> <parameter key="parse_numbers" value="true"/> <parameter key="decimal_character" value="."/> <parameter key="grouped_digits" value="false"/> <parameter key="grouping_character" value=","/> <parameter key="infinity_representation" value=""/> <parameter key="date_format" value=""/> <parameter key="first_row_as_names" value="true"/> <list key="annotations"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="encoding" value="windows-1252"/> <parameter key="read_all_values_as_polynominal" value="false"/> <list key="data_set_meta_data_information"> <parameter key="0" value="artist.true.polynominal.attribute"/> <parameter key="1" 
value="genre_1.true.polynominal.attribute"/> <parameter key="2" value="genre_2.true.polynominal.attribute"/> <parameter key="3" value="genre_3.true.polynominal.attribute"/> <parameter key="4" value="genre_4.true.polynominal.attribute"/> <parameter key="5" value="genre_5.true.polynominal.attribute"/> <parameter key="6" value="genre_6.true.polynominal.attribute"/> <parameter key="7" value="genre_7.true.polynominal.attribute"/> <parameter key="8" value="genre_8.true.polynominal.attribute"/> <parameter key="9" value="genre_9.true.polynominal.attribute"/> <parameter key="10" value="genre_10.true.polynominal.attribute"/> <parameter key="11" value="genre_11.true.polynominal.attribute"/> <parameter key="12" value="genre_12.true.polynominal.attribute"/> <parameter key="13" value="genre_13.true.polynominal.attribute"/> <parameter key="14" value="genre_14.true.polynominal.attribute"/> <parameter key="15" value="genre_15.true.polynominal.attribute"/> <parameter key="16" value="genre_16.true.polynominal.attribute"/> <parameter key="17" value="genre_17.true.polynominal.attribute"/> <parameter key="18" value="genre_18.true.polynominal.attribute"/> <parameter key="19" value="genre_19.true.polynominal.attribute"/> <parameter key="20" value="genre_20.true.polynominal.attribute"/> <parameter key="21" value="genre_21.true.polynominal.attribute"/> <parameter key="22" value="genre_22.true.polynominal.attribute"/> <parameter key="23" value="genre_23.true.polynominal.attribute"/> <parameter key="24" value="genre_24.true.polynominal.attribute"/> <parameter key="25" value="genre_25.true.polynominal.attribute"/> <parameter key="26" value="title.true.polynominal.attribute"/> <parameter key="27" value="energy.true.real.attribute"/> <parameter key="28" value="liveness.true.real.attribute"/> <parameter key="29" value="speechiness.true.real.attribute"/> <parameter key="30" value="valence.true.real.attribute"/> <parameter key="31" value="acousticness.true.real.attribute"/> <parameter key="32" 
value="instrumentalness.true.real.attribute"/> <parameter key="33" value="danceability.true.real.attribute"/> <parameter key="34" value="time_signature.true.real.attribute"/> <parameter key="35" value="key.true.real.attribute"/> <parameter key="36" value="duration_ms.true.real.attribute"/> <parameter key="37" value="loudness.true.real.attribute"/> <parameter key="38" value="tempo.true.real.attribute"/> <parameter key="39" value="mode.true.real.attribute"/> </list> <parameter key="read_not_matching_values_as_missings" value="false"/> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> </operator> <operator activated="true" class="de_pivot" compatibility="9.2.000" expanded="true" height="82" name="De-Pivot" width="90" x="179" y="34"> <list key="attribute_name"> <parameter key="genre" value="genre.+"/> </list> <parameter key="index_attribute" value="index"/> <parameter key="create_nominal_index" value="false"/> <parameter key="keep_missings" value="false"/> </operator> <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34"> <parameter key="attribute_name" value="genre"/> <parameter key="target_role" value="label"/> <list key="set_additional_roles"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="De-Pivot" to_port="example set input"/> <connect from_op="De-Pivot" from_port="example set output" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
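For readers who want to see what the De-Pivot step does to the data outside RapidMiner, here is a minimal sketch in plain Python. It is not the RapidMiner implementation: the function name `de_pivot`, the sample rows, and the titles are made up for illustration, and unlike the operator it does not emit an index attribute. It reuses the same `genre.+` pattern as the process above and skips empty genre cells, which mirrors `keep_missings` being set to false.

```python
import re


def de_pivot(rows, pattern=r"genre.+", target="genre"):
    """Turn wide rows (genre_1, genre_2, ...) into one row per (song, genre).

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    matcher = re.compile(pattern)
    long_rows = []
    for row in rows:
        # Columns that do not match the pattern are copied to every output row.
        fixed = {k: v for k, v in row.items() if not matcher.fullmatch(k)}
        for key, value in row.items():
            # Skip empty cells, similar to keep_missings = false.
            if matcher.fullmatch(key) and value not in (None, ""):
                long_rows.append({**fixed, target: value})
    return long_rows


# Made-up sample rows; titles and numeric values are illustrative only.
rows = [
    {"artist": "Empire of the Sun", "title": "example title A", "energy": 0.8,
     "genre_1": "electropop", "genre_2": "indietronic", "genre_3": "new rave"},
    {"artist": "example artist B", "title": "example title B", "energy": 0.4,
     "genre_1": "rap", "genre_2": "", "genre_3": ""},
]

long_rows = de_pivot(rows)
for r in long_rows:
    print(r["artist"], "|", r["genre"])
```

Each song now appears once per genre with its audio features repeated, so the single `genre` column can be used as the label.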
Answers
-
Hi @zsteiner,
Can you share a sample of your dataset and, using that sample, give an example of the result you want to obtain?
Thank you.
Regards,
Lionel
@lionelderkrikor
Be happy to. My goal is to find song attributes that can be used to predict song/artist genre. A single song/artist pair can be described by more than one genre at a time, with none being more "correct" than another. In the attached data sample for instance, Empire of the Sun can be categorized as electropop, indietronic, and new rave simultaneously. This is why they should be listed together in a single field and not as "genre_1", "genre_2", because there is no inherent order here.
I want to train a model to predict the genre of a song using all of an artist's genres as training target variables. However, if I were to combine them all into one "genre" column, the model would treat each combination, however similar, as a different target. For example, the model would treat the artist genre arrays [rock, grunge, nu-metal] and [nu-metal, grunge, indie-rock] as totally distinct responses, despite their being virtually identical.
I'm looking for a way to train a model using all of a song's genres but receive only a single genre as the prediction output. So, is there a way to have multiple distinct genres in a single column that won't be treated as a single value?
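To make the concern concrete, here is a tiny sketch in Python using the two genre arrays from the example above. The variable names are made up for illustration. Joined into one string, the two labels compare as different classes even though they share two of three genres.

```python
a = ["rock", "grunge", "nu-metal"]
b = ["nu-metal", "grunge", "indie-rock"]

# Combined into a single nominal value, each array becomes its own class.
label_a = ", ".join(a)
label_b = ", ".join(b)
print(label_a == label_b)  # False

# Yet the underlying genre sets overlap heavily.
shared = set(a) & set(b)
print(sorted(shared))  # ['grunge', 'nu-metal']
```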
Hi @zsteiner,
I would do this:
- Put the same song from your training data with different genres on each row, like this:
Deep Purple, Child In Time, Rock, ...
Deep Purple, Child In Time, Ballad, ...
Deep Purple, Smoke on the Water, Rock, ...
- Create a list of genres (select attributes and filter duplicates might do the job).
- Use loops to train one or a few algorithms per genre (e.g., one for rock, one for pop, one for jazz...). You could use "Validate" and "Optimize" to get the best results for each. Naïve Bayes is probably a good choice.
So, if a song is in A minor and in 5/4, it will never be a Cumbia, but it can be Jazz, Rock or Classical.
I have built a few things in the past using this approach and it works reasonably well. Hint: it's not 5 minutes of work, more like 3 hours.
Hope this helps. I will elaborate more once I get my AC adapter.
Rodrigo.
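The data preparation behind these steps can be sketched in plain Python as follows. This is only an illustration under assumptions: the function name `one_vs_rest_targets` and the choice of (artist, title) as the song key are made up, and the per-genre model training itself (for example Naïve Bayes inside a loop) is left out.

```python
def one_vs_rest_targets(long_rows, key=("artist", "title"), label="genre"):
    """Build the genre list and one yes/no target per genre.

    Hypothetical helper for illustration only.
    """
    # Collect the set of genres seen for each song.
    genres_by_song = {}
    for row in long_rows:
        song = tuple(row[k] for k in key)
        genres_by_song.setdefault(song, set()).add(row[label])

    # Distinct genre list (the "filter duplicates" step).
    genres = sorted(set().union(*genres_by_song.values()))

    # One binary target per genre: True if the song carries that genre.
    targets = {
        g: {song: g in song_genres for song, song_genres in genres_by_song.items()}
        for g in genres
    }
    return genres, targets


# One row per (song, genre), as in the list above.
long_rows = [
    {"artist": "Deep Purple", "title": "Child In Time", "genre": "Rock"},
    {"artist": "Deep Purple", "title": "Child In Time", "genre": "Ballad"},
    {"artist": "Deep Purple", "title": "Smoke on the Water", "genre": "Rock"},
]

genres, targets = one_vs_rest_targets(long_rows)
for g in genres:
    print(g, targets[g])
```

Each entry in `targets` could then serve as the label for its own per-genre model, with the song's audio attributes as inputs.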