How to reconvert from numerical to nominal
jctorresp
New Altair Community Member
Hi,
I am writing my thesis on data mining, so I had to convert some data from nominal to numerical. I then exported the data to CSV and processed it in Python. Now the data comes back in a new order and I need to convert it to nominal values again. I was looking for a way to save a map, or something similar, of the original conversion. For example:
column genre:
male->1
female->2
other->3
If I had that mapping, I could convert from numerical back to nominal, but I couldn't find a way to do that.
I should add that I had to convert several columns, so I need something like one map per column.
Thanks for your help
Best Answer
-
If you have a limited number of nominal values, you could use the Replace (Dictionary) operator. This way you control the numeric values yourself. To revert, you can use the same logic the other way around (swap the from and to attributes).
As in the example below:
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.2.001" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="from,to&#10;male,1&#10;female,2&#10;other,3"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="replace_dictionary" compatibility="9.2.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="380" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="myField"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="from_attribute" value="from"/>
        <parameter key="to_attribute" value="to"/>
        <parameter key="use_regular_expressions" value="false"/>
        <parameter key="convert_to_lowercase" value="false"/>
        <parameter key="first_match_only" value="false"/>
      </operator>
      <connect from_port="input 1" to_op="Replace (Dictionary)" to_port="example set input"/>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
      <connect from_op="Replace (Dictionary)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
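For anyone doing the reverse step outside RapidMiner, the "switch from and to" idea can be sketched in a few lines of Python. This is only an illustration, assuming the dictionary is kept as the same from,to table used in the process above; the function name `swap_dictionary` is hypothetical and not part of any library.

```python
import csv
import io

def swap_dictionary(csv_text):
    """Return the from,to table with the two value columns swapped (header kept)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for source, target in body:
        # What used to be the replacement becomes the value to look for.
        writer.writerow([target, source])
    return out.getvalue()

# Forward dictionary, as in the Create ExampleSet operator above.
forward = "from,to\nmale,1\nfemale,2\nother,3\n"
backward = swap_dictionary(forward)
print(backward)
```

The swapped table could then be fed to a second Replace (Dictionary) step, or used directly as a lookup when reading the CSV back.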
Answers
-
Hi @jctorresp
Did you look at the Map operator in RapidMiner (RM)? It can be applied to both numerical and nominal values.
-
The problem is that I have 10 columns, and each column can have different values; some columns have around 7 possible values. I also need to repeat the same process on other data sets, so setting up a dictionary manually for each column is too much work. In the end, I think I will export the result of the Nominal to Numerical operator and do this step in Python.
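A minimal sketch of what that could look like in Python, using only the standard library. Everything here is an assumption for illustration: the helper names (`build_mappings`, `encode`, `decode`), the `status` and `age` columns, and the sample rows are made up, and codes start at 1 only to mirror the example in the question, which is not necessarily how Nominal to Numerical assigns them.

```python
import json

def build_mappings(rows, columns):
    """Give each column its own value -> code map, numbering values 1..k by first appearance."""
    mappings = {column: {} for column in columns}
    for row in rows:
        for column in columns:
            codes = mappings[column]
            codes.setdefault(row[column], len(codes) + 1)
    return mappings

def encode(rows, mappings):
    """Replace nominal values by their codes; unmapped columns pass through."""
    return [
        {col: (mappings[col][val] if col in mappings else val) for col, val in row.items()}
        for row in rows
    ]

def decode(rows, mappings):
    """Invert each per-column map and turn codes back into nominal values."""
    inverse = {
        col: {code: value for value, code in codes.items()}
        for col, codes in mappings.items()
    }
    return [
        {col: (inverse[col][val] if col in inverse else val) for col, val in row.items()}
        for row in rows
    ]

# Illustrative data only.
rows = [
    {"genre": "male", "status": "single", "age": 31},
    {"genre": "female", "status": "married", "age": 27},
    {"genre": "other", "status": "single", "age": 45},
    {"genre": "female", "status": "single", "age": 38},
]
mappings = build_mappings(rows, ["genre", "status"])

saved = json.dumps(mappings, indent=2)   # could be written to a file next to the CSV
restored = json.loads(saved)

encoded = encode(rows, mappings)
reordered = list(reversed(encoded))      # stands in for "a new order in the data"
decoded = decode(reordered, restored)
print(decoded[0])
```

Because the maps are built from the data, nothing has to be set up by hand per column, and the saved JSON can be reused for other data sets. Codes read back from a CSV file arrive as strings, so they would need an `int()` conversion before the lookup.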
-
Not to throw a monkey wrench in here, but why did you need to convert nominal data to integer coding in the first place? Doing it the way you have described is usually not recommended for truly nominal data (like gender), as opposed to ordinal data, because it implies numerical relationships that don't actually exist in the underlying categories if you are using coefficient-based algorithms. So you should probably be using dummy coding or effect coding instead of integer coding.
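To make the difference concrete, here is a rough Python sketch of dummy coding in its simplest form, with one 0/1 indicator per category. The function name `dummy_code` and the sample values are made up for illustration; for regression-style models one category is often dropped as the reference level, and effect coding is not shown.

```python
def dummy_code(values, prefix):
    """Turn one nominal column into one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}={category}": int(value == category) for category in categories}
        for value in values
    ]

# Illustrative values only.
coded = dummy_code(["male", "female", "other", "female"], prefix="genre")
for row in coded:
    print(row)
```

Each row has exactly one indicator set to 1, so no category ends up numerically "between" two others.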
-
I am working on clustering. I need to separate the data into different clusters, but most columns of the data are categorical, so I had to use k-modes, which is a variation of the k-means algorithm. The first step there is to convert the data to numerical to improve the process.
-
If you do the conversion to integer coding, then you are not representing the data in a way that is consistent with nominal categories. For example, if you have 4 nominal categories where the underlying data is not ordinal in any way (like the colors red, green, yellow, and blue), recode them as {1,2,3,4}, and then use that numerical value in any distance calculation, you are basically saying that the 1st and 4th values are much farther apart than the 2nd and 3rd values, when that isn't the case.
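The arithmetic behind that can be checked with a small Python sketch. The integer codes follow the {1,2,3,4} example; the simple matching distance (0 if equal, 1 otherwise) is shown only as one common alternative for nominal values, and the function names are made up for illustration.

```python
from itertools import combinations

colors = ["red", "green", "yellow", "blue"]
integer_code = {color: code for code, color in enumerate(colors, start=1)}

def integer_distance(a, b):
    """Distance implied by treating the integer codes as real numbers."""
    return abs(integer_code[a] - integer_code[b])

def matching_distance(a, b):
    """Simple matching: every pair of different categories is equally far apart."""
    return 0 if a == b else 1

for a, b in combinations(colors, 2):
    print(f"{a:>6} vs {b:<6} integer={integer_distance(a, b)} matching={matching_distance(a, b)}")
```

With integer codes, red and blue come out 3 apart while green and yellow are only 1 apart; with simple matching, both pairs are at distance 1.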
In RapidMiner, the k-medoids (I assume that is what you are referring to, since there is no k-modes operator) and k-means operators both handle nominal data just fine. Just set the measure types parameter to Mixed Measures, and also make sure you normalize your other numerical data (which you should do anyway whenever you are doing distance calculations).
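For intuition, here is a rough Python sketch of a mixed distance with normalization. It is not RapidMiner's Mixed Measures implementation, only an assumed illustration of the idea: nominal attributes contribute 0 or 1 depending on whether they match, numerical attributes are min-max scaled first, and the parts are combined Euclidean-style. The function names and sample data are made up.

```python
import math

def min_max_normalize(values):
    """Scale numerical values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]

def mixed_distance(a, b):
    """Euclidean-style distance where nominal attributes count 0 (equal) or 1 (different)."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return math.sqrt(total)

# Illustrative data only.
colors = ["red", "blue", "red"]
ages = [20, 35, 50]

raw = list(zip(colors, ages))
scaled = list(zip(colors, min_max_normalize(ages)))

print(mixed_distance(raw[0], raw[1]), mixed_distance(raw[0], raw[2]))
print(mixed_distance(scaled[0], scaled[1]), mixed_distance(scaled[0], scaled[2]))
```

On the raw values the age difference dominates and the color hardly matters; after normalization both attributes contribute on a comparable scale.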