"k-means Clustering which data belongs to which cluster?"

Carlo New Altair Community Member
edited November 5 in Community Q&A
Hi Community,

I would like to cluster countries based on several factors such as purchasing power, competition, turnover, ease of doing business, tariffs, political stability, etc.
I am creating an input list with the aim of having a numerical value for every factor (that makes it easier to cluster).
As output I would like to have (say) 3 clusters, and I would like to see which country belongs to which cluster.
I am currently working with the k-means operator, which works quite well, but I am not able to see which country belongs to which cluster.

Does anybody have a suggestion?

Thanks in advance.

Best regards,
Carlo

Answers

  • YYH
    Altair Employee
    edited March 2019 Answer ✓
    Hi @Carlo,

    If you have a column for country name or country code, you can set it to a special role (id/name) with Set Role. Also make sure k-means adds the cluster attribute (add cluster attribute = true). The clustered set output will then be a data table with the country name as a reference column and a new column holding the cluster label.

    I used the ICU patient data as an example.


    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve ICU Morbidity (cour. Sven Van Poucke)" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Community Samples/Community Data Sets/Medical and Health/ICU Morbidity (cour. Sven Van Poucke)"/>
          </operator>
          <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="icustay_id"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="real"/>
            <parameter key="block_type" value="value_series"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_series_end"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="34">
            <parameter key="attribute_name" value="icustay_id"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
            <description align="center" color="transparent" colored="false" width="126">icustay_id is a unique identifier for the patients</description>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="9.2.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="581" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="gender"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="default" value="value"/>
            <list key="columns"/>
            <parameter key="replenishment_value" value="UNK"/>
          </operator>
          <operator activated="true" breakpoints="before" class="concurrency:k_means" compatibility="9.2.000" expanded="true" height="82" name="Clustering" width="90" x="715" y="34">
            <parameter key="add_cluster_attribute" value="true"/>
            <parameter key="add_as_label" value="true"/>
            <parameter key="remove_unlabeled" value="false"/>
            <parameter key="k" value="5"/>
            <parameter key="max_runs" value="10"/>
            <parameter key="determine_good_start_values" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="SquaredEuclideanDistance"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="max_optimization_steps" value="100"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <connect from_op="Retrieve ICU Morbidity (cour. Sven Van Poucke)" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
          <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
          <connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    


    YY
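For readers outside RapidMiner, the same idea (keep the identifier out of the distance computation, then attach the cluster label to each row) can be sketched in plain Python. This is only an illustration under stated assumptions: the country names and values are invented, the helper names `sq_dist` and `kmeans` are hypothetical, and the deterministic farthest-point start is a simplification that is not what the RapidMiner operator does.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_steps=100):
    """Lloyd's k-means; returns one cluster index per point."""
    # Deterministic farthest-point start (a simplification for this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    assignment = None
    for _ in range(max_steps):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the run has converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


# Invented, already-normalized example values (not real country data).
rows = [
    {"country": "Country A", "purchasing_power": 0.90, "political_stability": 0.80},
    {"country": "Country B", "purchasing_power": 0.85, "political_stability": 0.90},
    {"country": "Country C", "purchasing_power": 0.50, "political_stability": 0.40},
    {"country": "Country D", "purchasing_power": 0.45, "political_stability": 0.50},
    {"country": "Country E", "purchasing_power": 0.10, "political_stability": 0.20},
    {"country": "Country F", "purchasing_power": 0.15, "political_stability": 0.10},
]

# Only the numeric factors enter the clustering; "country" plays the id role.
features = ["purchasing_power", "political_stability"]
points = [tuple(r[f] for f in features) for r in rows]
labels = kmeans(points, k=3)

# Attach the cluster label as a new column next to the id column.
clustered = [{**r, "cluster": f"cluster_{c}"} for r, c in zip(rows, labels)]
for r in clustered:
    print(r["country"], r["cluster"])
```

Because the id column never takes part in the distance computation, it passes through unchanged and each row can be read as "country, cluster".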
  • Carlo
    Carlo New Altair Community Member
    edited March 2019
    That is great! Works perfectly! Thanks for your hint.

    I have one last question on this topic.
    My input data contains countries from all over the world, but I should only cluster within certain regions, e.g. Americas, APAC, EMEA. So my output should be 2 clusters per region.

    My solution was to split the input data beforehand, before bringing it into RapidMiner as a repository. So I have three repositories, and I then performed the clustering on each of them.

    Is it possible to tell RapidMiner to cluster together only those countries which belong to the same region (the region is named in column B)?

    Thanks and best regards,
    Carlo
  • YYH
    Altair Employee
    Answer ✓
    Hi @Carlo,
    We can convert the region codes from nominal to dummy coding (Nominal to Numerical operator) and then multiply the region dummy codes by a factor such as 3 or 5, which changes the range of the numerical region attributes to [0,3] or [0,5]. You would also need to normalize the other columns (purchasing power, competition, turnover, ease of doing business, tariffs, political stability) so that they have a smaller range, say [0,1]. A k-means model with Chebyshev distance will then treat region as the most important factor, since distance-based clustering models are sensitive to the scale of each attribute. This kind of manual intervention increases the weight of the region factor, and you would need to test different multipliers for region. To get guaranteed results, fitting a separate clustering model on the subset for each region would be ideal.
    YY
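The effect of the region multiplier on the Chebyshev distance can be checked with a small Python sketch. Everything here is an assumption made for illustration: the rows and values are invented, and `chebyshev`, `min_max_normalize` and `encode` are hypothetical helper names, not RapidMiner operators.

```python
def chebyshev(p, q):
    """Chebyshev distance: the largest absolute difference in any coordinate."""
    return max(abs(a - b) for a, b in zip(p, q))


def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def encode(rows, features, region_weight):
    """Normalized numeric features followed by weighted region dummy codes."""
    regions = sorted({r["region"] for r in rows})
    normalized = [min_max_normalize([r[f] for r in rows]) for f in features]
    vectors = []
    for i, r in enumerate(rows):
        dummies = [region_weight if r["region"] == g else 0.0 for g in regions]
        vectors.append(tuple(col[i] for col in normalized) + tuple(dummies))
    return vectors


# Invented example rows (not real country data).
rows = [
    {"country": "Country A", "region": "AMERICAS", "turnover": 120.0, "tariffs": 2.0},
    {"country": "Country B", "region": "AMERICAS", "turnover": 900.0, "tariffs": 9.0},
    {"country": "Country C", "region": "EMEA", "turnover": 130.0, "tariffs": 2.5},
    {"country": "Country D", "region": "APAC", "turnover": 880.0, "tariffs": 8.0},
]

vectors = encode(rows, ["turnover", "tariffs"], region_weight=5.0)

# A and B share a region but differ strongly in the other factors.
same_region = chebyshev(vectors[0], vectors[1])
# A and C are almost identical in the other factors but lie in different regions.
cross_region = chebyshev(vectors[0], vectors[2])
print(same_region, cross_region)
```

With the normalized factors capped at 1 and the region dummies scaled to 5, two countries in different regions are always farther apart than any two countries in the same region. That pushes a distance-based clusterer toward keeping regions together, but it does not guarantee a fixed number of clusters per region, which is why clustering each region's subset separately remains the safer route.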