Nominal to numerical and nominal again, failed.

laavila New Altair Community Member
edited November 5 in Community Q&A
Hello there.

I am working on a clustering problem and have already tried a few of the solutions proposed in other topics on the forum (yes, I did search). My data contains only polynominal attributes. I applied the Nominal to Numerical operator (coding type: unique integers), ran K-Means (mixed measures, MixedEuclideanDistance), and joined the final example set with the original one (the copy taken before Nominal to Numerical).

After all this, the final result is just the example set with the unique integer values, and I can't really interpret the data in that form.

I couldn't get the nominal values back. Does anyone have an idea what I am doing wrong?
Thanks! 

PS: Here's my XML process too.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.001" expanded="true" height="68" name="Retrieve 2017-2018" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../SAMU (Jaime)/2017-2018"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="9.0.001" expanded="true" height="103" name="Subprocess" width="90" x="179" y="34">
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="9.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="FF|Fecha|H-420|H-Llegada S.U.|H-Pedido|H-Positivo|H-Salida|Móvil|S|OP|Origen|QTC|Actividad"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="9.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
            <parameter key="parameter_expression" value=""/>
            <parameter key="condition_class" value="custom_filters"/>
            <parameter key="invert_filter" value="false"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Clave.is_not_missing."/>
              <parameter key="filters_entry_key" value="Base.is_not_missing."/>
              <parameter key="filters_entry_key" value="Comuna.is_not_missing."/>
            </list>
            <parameter key="filters_logic_and" value="true"/>
            <parameter key="filters_check_metadata" value="true"/>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="9.0.001" expanded="true" height="103" name="Replace Missing Values" width="90" x="313" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="QTC|Sexo|Origen|Destino|Actividad"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="default" value="value"/>
            <list key="columns"/>
            <parameter key="replenishment_value" value="unknow"/>
          </operator>
          <operator activated="true" class="replace" compatibility="9.0.001" expanded="true" height="82" name="Replace Sexo" width="90" x="179" y="187">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Sexo"/>
            <parameter key="attributes" value="Sexo|Destino"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="replace_what" value="0"/>
            <parameter key="replace_by" value="unknow"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="9.0.001" expanded="true" height="82" name="Generate ID" width="90" x="313" y="187">
            <parameter key="create_nominal_ids" value="false"/>
            <parameter key="offset" value="0"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.0.001" expanded="true" height="103" name="Multiply" width="90" x="447" y="238"/>
          <operator activated="true" class="nominal_to_numerical" compatibility="9.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="514" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="coding_type" value="unique integers"/>
            <parameter key="use_comparison_groups" value="false"/>
            <list key="comparison_groups"/>
            <parameter key="unexpected_value_handling" value="all 0 and warning"/>
            <parameter key="use_underscore_in_name" value="false"/>
          </operator>
          <connect from_port="in 1" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
          <connect from_op="Replace Missing Values" from_port="example set output" to_op="Replace Sexo" to_port="example set input"/>
          <connect from_op="Replace Sexo" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_port="out 2"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize" width="90" x="313" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="|id"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="method" value="Z-transformation"/>
        <parameter key="min" value="0.0"/>
        <parameter key="max" value="1.0"/>
        <parameter key="allow_negative_values" value="false"/>
      </operator>
      <operator activated="false" class="sample" compatibility="9.0.001" expanded="true" height="82" name="Sample" width="90" x="313" y="340">
        <parameter key="sample" value="absolute"/>
        <parameter key="balance_data" value="false"/>
        <parameter key="sample_size" value="5000"/>
        <parameter key="sample_ratio" value="0.1"/>
        <parameter key="sample_probability" value="0.1"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="false" class="k_medoids" compatibility="9.0.001" expanded="true" height="82" name="K-Medoids" width="90" x="179" y="340">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="5"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
      </operator>
      <operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="K-Means" width="90" x="447" y="34">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="5"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="false"/>
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="cluster_distance_performance" compatibility="9.0.001" expanded="true" height="103" name="Performance" width="90" x="581" y="34">
        <parameter key="main_criterion" value="Davies Bouldin"/>
        <parameter key="main_criterion_only" value="false"/>
        <parameter key="normalize" value="true"/>
        <parameter key="maximize" value="false"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="9.0.001" expanded="true" height="82" name="Guess Types" width="90" x="514" y="187">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="Tipo|Sexo|Destino|Comuna|Clave|Base"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="decimal_point_character" value="."/>
      </operator>
      <connect from_op="Retrieve 2017-2018" from_port="output" to_op="Subprocess" to_port="in 1"/>
      <connect from_op="Subprocess" from_port="out 1" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="K-Means" to_port="example set"/>
      <connect from_op="K-Means" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
      <connect from_op="K-Means" from_port="clustered set" to_op="Performance" to_port="example set"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_op="Guess Types" to_port="example set input"/>
      <connect from_op="Guess Types" from_port="example set output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


Best Answers

  • David_A
    David_A New Altair Community Member
    Answer ✓
    Hi,

    Using Nominal to Numerical (with unique integers) for clustering is unfortunately the wrong approach, because the Euclidean distance between the resulting integer codes has no real meaning.

    Take this simple example of mapping colors to numbers:

    • green -> 1
    • red -> 2
    • blue -> 3
    Now, when you compute distances, green ends up farther from blue (distance 2) than from red (distance 1), even though the colors have no such ordering.
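
The color example above can be checked with a few lines of Python. This is only a sketch outside RapidMiner: the function names are made up for illustration, and the mapping is the one from the post.

```python
# Mapping from the example above; any other assignment of integers would
# create a different, equally arbitrary ordering of the colors.
codes = {"green": 1, "red": 2, "blue": 3}


def integer_distance(a, b):
    """Distance on the integer codes (1-D Euclidean = absolute difference)."""
    return abs(codes[a] - codes[b])


def nominal_distance(a, b):
    """0 if both values are the same, 1 otherwise."""
    return 0 if a == b else 1


for a, b in [("green", "red"), ("green", "blue"), ("red", "blue")]:
    print(a, b, integer_distance(a, b), nominal_distance(a, b))
```

With the integer codes, green/blue gets distance 2 while the other two pairs get distance 1. With the nominal distance, all three pairs are equally far apart.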

    However, RapidMiner includes special distance measures for nominal values, so you don't have to transform your data at all. Just set measure types to NominalMeasures and select a fitting measure. You can also use an Optimize Parameters operator to find the measure that delivers the best performance.

    To understand what the different measures do, take a look at the help text of the k-Means operator, where they are briefly explained. For nominal values they use the following counts:
    e: number of attributes for which both examples have equal and non-zero values
    u: number of attributes for which both examples have unequal values
    z: number of attributes for which both examples have zero values

    NominalDistance: the distance of two values is 0 if both values are the same and 1 otherwise.
    DiceSimilarity: 2*e/(2*e+u)
    JaccardSimilarity: e/(e+u)
    KulczynskiSimilarity: e/u
    RogersTanimotoSimilarity: (e+z)/(e+2*u+z)
    RussellRaoSimilarity: e/(e+u+z)
    SimpleMatchingSimilarity: (e+z)/(e+u+z)
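
To make the formulas above concrete, here is a small Python sketch that counts e, u and z for two examples and evaluates each similarity. It is not RapidMiner's implementation. In particular, what counts as a "zero" value for a nominal attribute is an assumption here (the hypothetical `zero` parameter), and returning NaN for a zero denominator is a choice made for this sketch only.

```python
def count_euz(x, y, zero=0):
    """Count e (equal, non-zero), u (unequal) and z (both zero) attributes."""
    e = u = z = 0
    for a, b in zip(x, y):
        if a != b:
            u += 1
        elif a == zero:  # assumption: `zero` marks the "zero value"
            z += 1
        else:
            e += 1
    return e, u, z


def similarities(x, y, zero=0):
    """Evaluate the similarity formulas listed above for two examples."""
    e, u, z = count_euz(x, y, zero)

    def ratio(num, den):
        # Assumption for this sketch: undefined ratios become NaN.
        return num / den if den else float("nan")

    return {
        "DiceSimilarity": ratio(2 * e, 2 * e + u),
        "JaccardSimilarity": ratio(e, e + u),
        "KulczynskiSimilarity": ratio(e, u),
        "RogersTanimotoSimilarity": ratio(e + z, e + 2 * u + z),
        "RussellRaoSimilarity": ratio(e, e + u + z),
        "SimpleMatchingSimilarity": ratio(e + z, e + u + z),
    }


# Two made-up examples with five attributes each.
x = ["a", "b", 0, "c", 0]
y = ["a", "x", 0, "c", "d"]
print(count_euz(x, y))
for name, value in similarities(x, y).items():
    print(name, value)
```

For these two examples the counts are e = 2, u = 2 and z = 1, so for instance JaccardSimilarity is 2/(2+2) = 0.5 and SimpleMatchingSimilarity is (2+1)/(2+2+1) = 0.6.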

    I hope this helps you with your problem.

    Best,
    David
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Hi,
    I personally like to do the following:
    Nominal to Numerical with dummy coding, PCA to get rid of correlations, and then k-Means in that space. Afterwards I join the original data back and figure out what characterizes each cluster using Decision Trees.
    BR,
    Martin
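
The dummy coding and join-back steps of this workflow can be sketched in plain Python as follows. This is an illustration under assumptions, not a RapidMiner process: the attribute names `Sexo` and `Tipo` come from the process above, but the values and the cluster labels are made up, and the PCA, k-Means and Decision Tree steps are replaced by a hard-coded placeholder. The point is that the id is kept as a key, so the cluster label can be attached to the original rows, which still carry their nominal values.

```python
def dummy_code(rows, attributes):
    """Build one 0/1 column per (attribute, value) pair, keyed by row id."""
    columns = sorted({(a, r[a]) for r in rows for a in attributes})
    encoded = {
        r["id"]: [1 if r[a] == v else 0 for a, v in columns] for r in rows
    }
    return columns, encoded


def join_clusters(rows, cluster_by_id):
    """Attach the cluster label to the original (nominal) rows via the id."""
    return [dict(r, cluster=cluster_by_id[r["id"]]) for r in rows]


# Made-up rows; only the attribute names appear in the original process.
rows = [
    {"id": 0, "Sexo": "F", "Tipo": "A"},
    {"id": 1, "Sexo": "M", "Tipo": "A"},
    {"id": 2, "Sexo": "F", "Tipo": "B"},
]

columns, encoded = dummy_code(rows, ["Sexo", "Tipo"])
print(columns)
print(encoded)

# Placeholder for the PCA + k-Means result: a hypothetical id -> cluster map.
cluster_by_id = {0: "cluster_0", 1: "cluster_1", 2: "cluster_0"}

for row in join_clusters(rows, cluster_by_id):
    print(row)
```

Because the clustering only ever sees the encoded copy, nothing has to be converted back: the join on the id returns the untouched nominal values together with the cluster label.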

Answers

  • laavila
    laavila New Altair Community Member
    Hello David, I really appreciate your response. It was very helpful.

    I managed to run the K-Means operator with your settings, but I'm not sure how to evaluate the performance of the whole process. I have tried the Cluster Distance Performance and Cluster Count Performance operators, but I'm not sure which one is best for evaluating the clustering. I also tried Optimize Parameters, as you suggested, but I couldn't get it to work.

    Thanks for your comments again.