Using Map Clustering on Labels to align cluster predictions with ground-truth

User: "nicolas_richard"
Updated by Jocelyn
Hello everybody,

My name is Nicolas. I'm a French student working on a research study.

To introduce my problem, here is a little background: my topic is the discovery of communities in social networks. I do this by clustering the friend graph of a given user.

Once the graph is clustered, I need to align the discovered clusters with the ground-truth communities in order to evaluate my method. In simpler terms: I want to compare the output of a clustering function with the real clusters, and to do so I need to find which discovered cluster matches which real cluster.
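
To make the alignment step concrete, here is a minimal Python sketch of one way to match clusters to ground-truth labels: try every one-to-one assignment and keep the one with the most agreeing examples. The function name `align_clusters` and the toy data are hypothetical, and this is only an illustration of the idea, not necessarily what "Map Clustering on Labels" does internally.

```python
from collections import Counter
from itertools import permutations


def align_clusters(cluster_ids, true_labels):
    """Return a dict cluster -> label that maximizes the number of agreeing examples.

    Brute force over all one-to-one assignments, so only suitable for a small
    number of clusters (hypothetical helper, for illustration only).
    """
    clusters = sorted(set(cluster_ids))
    labels = sorted(set(true_labels))
    if len(clusters) > len(labels):
        raise ValueError("more clusters than labels: no one-to-one mapping exists")

    # overlap[(cluster, label)] = number of examples with that cluster and that label
    overlap = Counter(zip(cluster_ids, true_labels))

    best_mapping, best_score = None, -1
    for perm in permutations(labels, len(clusters)):
        score = sum(overlap[(c, l)] for c, l in zip(clusters, perm))
        if score > best_score:
            best_mapping, best_score = dict(zip(clusters, perm)), score
    return best_mapping


# Toy data (made up): six examples, three clusters, three real communities.
cluster_ids = ["cluster_0", "cluster_0", "cluster_1", "cluster_1", "cluster_2", "cluster_2"]
true_labels = ["B", "B", "C", "A", "A", "A"]

mapping = align_clusters(cluster_ids, true_labels)
print(mapping)
```

For larger numbers of clusters, the same assignment problem is usually solved with the Hungarian algorithm instead of enumerating permutations.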

I'm trying to use the "Map Clustering on Labels" function.

I've created a simple example to get to know this function.
1. I generate some data
2. I multiply this data source
3. I apply two different clustering algorithms to it (k-Means and k-Medoids, each searching for 3 clusters).
=> I want to compare the results (ideally, compute the precision and recall for each cluster as well as the Balanced Error Rate).
4. I use "Extract Cluster Prototypes" to convert the result of my second clustering from a "model" to an "example set".
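
The measures mentioned in the steps above can be sketched in Python as follows. This assumes the clusters have already been mapped to label names, and it assumes that the balanced error rate is defined as one minus the mean per-class recall (other definitions exist, for example the per-cluster average of the false positive and false negative rates). The function names and the toy data are hypothetical.

```python
def per_class_metrics(predicted, actual):
    """Return {label: (precision, recall)} for every label present in `actual`."""
    metrics = {}
    for label in sorted(set(actual)):
        pairs = list(zip(predicted, actual))
        tp = sum(p == label and a == label for p, a in pairs)
        fp = sum(p == label and a != label for p, a in pairs)
        fn = sum(p != label and a == label for p, a in pairs)
        # Precision is reported as 0.0 when the label is never predicted.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)  # tp + fn > 0 because label occurs in `actual`
        metrics[label] = (precision, recall)
    return metrics


def balanced_error_rate(metrics):
    """Assumed definition: 1 - mean of the per-class recalls."""
    recalls = [recall for _, recall in metrics.values()]
    return 1.0 - sum(recalls) / len(recalls)


# Toy data (made up): cluster assignments already renamed to their matched labels.
predicted = ["B", "B", "C", "C", "A", "A"]
actual = ["B", "B", "C", "A", "A", "A"]

metrics = per_class_metrics(predicted, actual)
ber = balanced_error_rate(metrics)
for label, (precision, recall) in metrics.items():
    print(label, round(precision, 3), round(recall, 3))
print("BER:", round(ber, 3))
```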

My problem is that it doesn't work as expected, and I don't understand the error I get.

This makes me realize that I may not really understand how to use "Map Clustering on Labels" properly. I think it is the most appropriate function for what I want to do, but maybe it is not. All your remarks will be greatly appreciated.


Here is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="431" width="1016">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|a4|a3|a2|a1"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="30">
        <parameter key="k" value="3"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.2.008" expanded="true" height="76" name="Replace" width="90" x="313" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="replace_what" value="id_(.*)"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="5.2.008" expanded="true" height="76" name="Guess Types" width="90" x="447" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="5.2.008" expanded="true" height="76" name="Guess Types (2)" width="90" x="447" y="165">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="join" compatibility="5.2.008" expanded="true" height="76" name="Join" width="90" x="581" y="120">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="map_clustering_on_labels" compatibility="5.2.008" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="715" y="30"/>
      <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="849" y="30"/>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Select Attributes" from_port="original" to_op="Replace" to_port="example set input"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Guess Types (2)" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Guess Types" to_port="example set input"/>
      <connect from_op="Guess Types" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Guess Types (2)" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Guess Types (2)" from_port="original" to_port="result 2"/>
      <connect from_op="Join" from_port="join" to_op="Map Clustering on Labels" to_port="example set"/>
      <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


This is the error I get (from the log file):

"
Feb 20, 2013 12:20:15 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 20, 2013 12:20:15 PM SEVERE: Here:          Process[1] (Process)
          subprocess 'Main Process'
            +- Generate Data[1] (Generate Data)
            +- Multiply[1] (Multiply)
            +- Clustering (2)[1] (k-Medoids)
            +- Extract Cluster Prototypes[1] (Extract Cluster Prototypes)
            +- Clustering[1] (k-Means)
      ==>  +- Map Clustering on Labels[1] (Map Clustering on Labels)
Feb 20, 2013 12:20:15 PM SEVERE: java.lang.NullPointerException
"

Thanks a lot for your help.

Nicolas
