Using Map Clustering on Labels to align cluster predictions with ground-truth
Hello everybody,
My name is Nicolas, and I'm a French student working on a research study.
To introduce my problem, here is a little background: my topic is the discovery of communities in social networks. I do this by clustering the friend graph of a given user.
Once I have clustered the graph, I need to align the discovered clusters with the ground-truth communities in order to evaluate my method. In simpler words: I want to compare the results of a clustering function to the real clusters. To do so, I need to find which discovered cluster matches which real cluster.
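The alignment step described above can be sketched in plain Python, outside RapidMiner. This is only a minimal sketch under stated assumptions, not what "Map Clustering on Labels" does internally: the function name `align_clusters` and the toy data are hypothetical, the matching is one-to-one, and it is found by brute force over all assignments, which is only practical for a handful of clusters.

```python
from collections import Counter
from itertools import permutations


def align_clusters(true_labels, cluster_ids):
    """Map each cluster id to the ground-truth label that maximizes the
    total number of agreeing examples (one-to-one, brute force)."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("this sketch assumes no more clusters than labels")

    # overlap[(cluster, label)] = number of examples in both groups
    overlap = Counter(zip(cluster_ids, true_labels))

    best_score, best_map = -1, {}
    # Try every way of giving each cluster a distinct label.
    for perm in permutations(labels, len(clusters)):
        score = sum(overlap[(c, l)] for c, l in zip(clusters, perm))
        if score > best_score:
            best_score, best_map = score, dict(zip(clusters, perm))
    return best_map


# Hypothetical toy data: ground-truth communities and discovered cluster ids.
truth = ["a", "a", "a", "b", "b", "c", "c", "c"]
found = [2, 2, 0, 0, 0, 1, 1, 1]

mapping = align_clusters(truth, found)
print(mapping)
```

For more than a few clusters, an assignment solver (for example the Hungarian algorithm) would replace the brute-force loop; the overlap table stays the same.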
I'm trying to use the "Map Clustering on Labels" function.
I've created a simple example to get to know this function.
1. I generate some data
2. I multiply this data source
3. I apply two different clustering algorithms to it (k-Means and k-Medoids, searching for 3 clusters).
=> I want to compare the results (ideally, find the precision and recall for each cluster as well as the Balanced Error Rate)
4. I use "Extract Cluster Prototypes" to convert the result of my second clustering from a "model" to an "example set".
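The evaluation mentioned in the list above (precision and recall per cluster, plus the Balanced Error Rate) can be sketched in plain Python once every cluster has been translated into a ground-truth label. This is a sketch under assumptions: the function name `per_class_scores` and the toy data are hypothetical, and the BER is taken here as the one-vs-rest average of the false negative rate and the false positive rate for each class, which is one common definition and may differ from the one a given tool reports.

```python
def per_class_scores(true_labels, predicted_labels):
    """Return {label: {"precision", "recall", "ber"}} using one-vs-rest counts."""
    pairs = list(zip(true_labels, predicted_labels))
    n = len(pairs)
    scores = {}
    for label in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        tn = n - tp - fp - fn

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        fnr = fn / (tp + fn) if tp + fn else 0.0  # missed members of the class
        fpr = fp / (fp + tn) if fp + tn else 0.0  # outsiders assigned to the class

        scores[label] = {
            "precision": precision,
            "recall": recall,
            "ber": 0.5 * (fnr + fpr),
        }
    return scores


# Hypothetical toy data: predictions are cluster ids already mapped to labels.
truth = ["a", "a", "a", "b", "b", "c", "c", "c"]
predicted = ["a", "a", "b", "b", "b", "c", "c", "c"]

scores = per_class_scores(truth, predicted)
for label, s in scores.items():
    print(label, s)
```

A BER of 0 means the cluster matches its community exactly; 0.5 is what a trivial assignment would typically reach for that class.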
My problem is that it doesn't work as expected and I don't understand the error I get.
This makes me realize that I may not really understand how to use "Map Clustering on Labels" properly. I think it's the most appropriate function for what I want to do, but maybe it's not. All your remarks will be greatly appreciated.
Here is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="431" width="1016">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|a4|a3|a2|a1"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="30">
        <parameter key="k" value="3"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.2.008" expanded="true" height="76" name="Replace" width="90" x="313" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="replace_what" value="id_(.*)"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="5.2.008" expanded="true" height="76" name="Guess Types" width="90" x="447" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="5.2.008" expanded="true" height="76" name="Guess Types (2)" width="90" x="447" y="165">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="join" compatibility="5.2.008" expanded="true" height="76" name="Join" width="90" x="581" y="120">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="map_clustering_on_labels" compatibility="5.2.008" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="715" y="30"/>
      <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="849" y="30"/>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Select Attributes" from_port="original" to_op="Replace" to_port="example set input"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Guess Types (2)" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Guess Types" to_port="example set input"/>
      <connect from_op="Guess Types" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Guess Types (2)" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Guess Types (2)" from_port="original" to_port="result 2"/>
      <connect from_op="Join" from_port="join" to_op="Map Clustering on Labels" to_port="example set"/>
      <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
This is the error I get (from the log file):
"
Feb 20, 2013 12:20:15 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 20, 2013 12:20:15 PM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Generate Data[1] (Generate Data)
+- Multiply[1] (Multiply)
+- Clustering (2)[1] (k-Medoids)
+- Extract Cluster Prototypes[1] (Extract Cluster Prototypes)
+- Clustering[1] (k-Means)
==> +- Map Clustering on Labels[1] (Map Clustering on Labels)
Feb 20, 2013 12:20:15 PM SEVERE: java.lang.NullPointerException
"
Thanks a lot for your help.
Nicolas