Clustering with labels?
Hi,
is there a way to do clustering with labels in order to check performance (as in classification)? Which operator can I use for that (e.g. with k-means)?
And is there a way to cluster the data with "help" from the labels when the class is known? I mean clustering based on the given labels, e.g. finding out which class labels are clustered together, then getting the centroid of each local cluster, and so on.
Is there an existing operator that uses labels for clustering? I just want to learn more about my dataset and my classes (e.g. local cluster labels, centroid tables, etc.).
Answers
-
If you have labeled data, clustering is most of the time like bringing owls to Athens...
Of course you can use 'Set Role' to turn the label column into a regular attribute and pretend you have no label information. Without the special attribute 'label', you can run any clustering you want on the data.
Hope that makes sense...
-
I know the purpose of clustering, but I want to compare the found clusters with the labeled "clusters", if you know what I mean, to judge the "goodness" of the clusters against some ground truth...
Is there any sophisticated way to do that? Any ideas?
-
Did you try Map Clustering on Labels and then the performance operators?
-
Yeah, thanks, that seemed to work, but I still don't know how that operator works.
How does it choose which cluster gets which label?
-
Mh, good question. The important code is in ClusterToPrediction.java - but it's quite a chunk.
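As a rough orientation before the listing: the first thing the operator does is count, for every cluster, how often each label occurs in it. A minimal standalone sketch of just that counting step is shown here, using plain arrays instead of RapidMiner's ExampleSet. The class name, method name, and the explicit label list are assumptions made for illustration and are not part of the RapidMiner API.

```java
import java.util.Arrays;
import java.util.List;

class CountTableSketch {

    /**
     * Builds counts[c][l] = number of examples that sit in cluster c
     * and carry the label labelNames.get(l).
     * Assumes cluster ids run from 0 to numClusters - 1.
     */
    static int[][] countTable(int[] clusters, String[] labels,
                              List<String> labelNames, int numClusters) {
        int[][] counts = new int[numClusters][labelNames.size()];
        for (int n = 0; n < clusters.length; n++) {
            int labelIndex = labelNames.indexOf(labels[n]);
            counts[clusters[n]][labelIndex]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 1, 1, 1, 0};
        String[] labels = {"a", "a", "b", "b", "a", "b"};
        int[][] counts = countTable(clusters, labels, Arrays.asList("a", "b"), 2);
        System.out.println(Arrays.deepToString(counts)); // prints [[2, 1], [1, 2]]
    }
}
```

The operator's own doWork() method follows; it builds the same kind of table (called mappingTable there) and then searches it for a one-to-one assignment.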
@Override
public void doWork() throws OperatorException {
    ExampleSet exampleSet = exampleSetInput.getData(ExampleSet.class);
    ClusterModel model = clusterModelInput.getData(ClusterModel.class);
    // generate the predicted attribute
    Attribute labelAttribute = exampleSet.getAttributes().getLabel();
    PredictionModel.createPredictedLabel(exampleSet, labelAttribute);
    Attribute predictedLabel = exampleSet.getAttributes().getPredictedLabel();
    HashMap<Integer, String> intToClusterMapping = new HashMap<Integer, String>();
    int[][] mappingTable = new int[model.getNumberOfClusters()][model.getNumberOfClusters()];
    // count the occurrence of each label with every cluster
    int a = 0;
    for (int i = 0; i < model.getNumberOfClusters(); i++) {
        HashMap<String, Integer> labelOccurrence = new HashMap<String, Integer>();
        for (Example example : exampleSet) {
            String label = example.getValueAsString(labelAttribute);
            if (!labelOccurrence.containsKey(label)) {
                labelOccurrence.put(label, 0);
                if (i == 0) {
                    intToClusterMapping.put(a, label);
                    a++;
                }
            }
            if (example.getValue(example.getAttributes().getCluster()) == i) {
                labelOccurrence.put(label, labelOccurrence.get(label) + 1);
            }
        }
        if (i == 0 && model.getNumberOfClusters() != labelOccurrence.size()) {
            throw new UserError(this, 943, labelOccurrence.size(), model.getNumberOfClusters());
        }
        for (int j = 0; j < mappingTable[i].length; j++) {
            String clusterName = intToClusterMapping.get(j);
            int occ = labelOccurrence.get(clusterName);
            mappingTable[i][j] = occ;
        }
    }
    /*
     * Munkres-algorithm or the hungarian method
     */
    // find the maximum
    int maxValue = -1;
    for (int i = 0; i < mappingTable.length; i++) {
        for (int j = 0; j < mappingTable[i].length; j++) {
            if (mappingTable[i][j] > maxValue) {
                maxValue = mappingTable[i][j];
            }
        }
    }
    // compute the new (inverted) table (and column-minima)
    for (int i = 0; i < mappingTable.length; i++) {
        int minimum = Integer.MAX_VALUE;
        for (int j = 0; j < mappingTable[i].length; j++) {
            mappingTable[i][j] = maxValue - mappingTable[i][j];
            if (mappingTable[i][j] < minimum) {
                minimum = mappingTable[i][j];
            }
        }
        // subtract the column-minima
        if (minimum > 0) {
            for (int j = 0; j < mappingTable[i].length; j++) {
                mappingTable[i][j] = mappingTable[i][j] - minimum;
            }
        }
    }
    // compute and subtract the row-minima
    for (int i = 0; i < mappingTable[0].length; i++) {
        int minimum = Integer.MAX_VALUE;
        for (int j = 0; j < mappingTable.length; j++) {
            if (mappingTable[j][i] < minimum) {
                minimum = mappingTable[j][i];
            }
        }
        // subtract the row-minima
        if (minimum > 0) {
            for (int j = 0; j < mappingTable.length; j++) {
                mappingTable[j][i] = mappingTable[j][i] - minimum;
            }
        }
    }
    while (!assignmentAvailable(mappingTable)) {
        Vector<Integer> markedRows = new Vector<Integer>();
        Vector<Integer> markedColumns = new Vector<Integer>();
        // mark all rows which have no marked zero (start labeling)
        for (int i = 0; i < mappingTable[0].length; i++) {
            boolean markedZero = false;
            for (int j = 0; j < mappingTable.length; j++) {
                if (mappingTable[j][i] == Integer.MIN_VALUE) {
                    markedZero = true;
                    break;
                }
            }
            if (!markedZero) {
                markedRows.add(i);
            }
        }
        boolean newMarked = true;
        while (newMarked) {
            newMarked = false;
            // mark all columns with a slashed zero in a marked row
            for (int i = 0; i < mappingTable.length; i++) {
                for (int j = 0; j < mappingTable[i].length; j++) {
                    if (mappingTable[i][j] == Integer.MAX_VALUE) {
                        if (markedRows.contains(j) && !markedColumns.contains(i)) {
                            newMarked = true;
                            markedColumns.add(i);
                        }
                    }
                }
            }
            // mark all rows with a marked zero in a marked column
            for (int i = 0; i < mappingTable[0].length; i++) {
                for (int j = 0; j < mappingTable.length; j++) {
                    if (mappingTable[j][i] == Integer.MIN_VALUE) {
                        if (markedColumns.contains(j) && !markedRows.contains(i)) {
                            newMarked = true;
                            markedRows.add(i);
                        }
                    }
                }
            }
        } // end while (newMarked)
        // inverting of the marked columns
        for (int i = 0; i < mappingTable.length; i++) {
            if (!markedColumns.contains(i)) {
                markedColumns.add(i);
            } else {
                markedColumns.removeElement(i);
            }
        }
        // find the minimum in the marked range
        int minimum = Integer.MAX_VALUE;
        for (int i = 0; i < markedRows.size(); i++) {
            for (int j = 0; j < markedColumns.size(); j++) {
                if (mappingTable[markedColumns.get(j)][markedRows.get(i)] < minimum) {
                    minimum = mappingTable[markedColumns.get(j)][markedRows.get(i)];
                }
            }
        }
        // subtract the minimum from all elements in the marked range
        for (int i = 0; i < markedRows.size(); i++) {
            for (int j = 0; j < markedColumns.size(); j++) {
                mappingTable[markedColumns.get(j)][markedRows.get(i)] =
                        mappingTable[markedColumns.get(j)][markedRows.get(i)] - minimum;
            }
        }
        // add the minimum to all elements which are neither marked in a row nor in a column
        for (int i = 0; i < mappingTable.length; i++) {
            if (!markedColumns.contains(i)) {
                for (int j = 0; j < mappingTable[i].length; j++) {
                    if (!markedRows.contains(j)) {
                        mappingTable[i][j] = mappingTable[i][j] + minimum;
                    }
                }
            }
        }
        // reset the Integer.MIN_VALUE and Integer.MAX_VALUE to zero
        for (int i = 0; i < mappingTable.length; i++) {
            for (int j = 0; j < mappingTable[i].length; j++) {
                if (mappingTable[i][j] == Integer.MAX_VALUE) {
                    mappingTable[i][j] = 0;
                }
                if (mappingTable[i][j] == Integer.MIN_VALUE) {
                    mappingTable[i][j] = 0;
                }
            }
        }
    } // end while(!assignmentAvailable)
    // compute the mapping (there must be a possible assignment)
    HashMap<Integer, String> clusterToPrediction = new HashMap<Integer, String>();
    for (int i = 0; i < mappingTable.length; i++) {
        int result = -1;
        for (int j = 0; j < mappingTable[i].length; j++) {
            if (mappingTable[i][j] == Integer.MIN_VALUE) {
                result = j;
                break;
            }
        }
        String resultCluster = intToClusterMapping.get(result);
        clusterToPrediction.put(i, resultCluster);
    }
    // insert the result in the predicted attribute
    HashMap<String, Integer> predictionToCluster = new HashMap<String, Integer>();
    // set the predictedLabel in the example table and compute to each prediction the cluster
    int i = 0;
    Attribute clusterAttribute = exampleSet.getAttributes().getCluster();
    for (Example example : exampleSet) {
        String resultLabel = clusterToPrediction.get((int) example.getValue(example.getAttributes().getCluster()));
        example.setValue(predictedLabel, resultLabel);
        if (predictionToCluster.size() < model.getNumberOfClusters()) {
            if (!predictionToCluster.containsKey(example.getValueAsString(example.getAttributes().getPredictedLabel()))) {
                String clusterNumber = example.getValueAsString(clusterAttribute).replaceAll("[^\\d]+", "");
                try {
                    int number = Integer.parseInt(clusterNumber);
                    predictionToCluster.put(example.getValueAsString(example.getAttributes().getPredictedLabel()),
                            number);
                } catch (NumberFormatException e) {
                    throw new UserError(this, 145, clusterAttribute.getName());
                }
            }
        }
        i++;
    }
    // set the confidence in the example table
    i = 0;
    for (Example example : exampleSet) {
        if (model.getClass() == FlatFuzzyClusterModel.class) {
            FlatFuzzyClusterModel fuzzyModel = (FlatFuzzyClusterModel) model;
            for (int j = 0; j < clusterToPrediction.size(); j++) {
                String label = clusterToPrediction.get(j);
                example.setConfidence(label,
                        fuzzyModel.getExampleInClusterProbability(i, predictionToCluster.get(label)));
            }
        } else {
            example.setConfidence(clusterToPrediction.get((int) example.getValue(example.getAttributes().getCluster())),
                    1);
        }
        i++;
    }
    exampleSetOutput.deliver(exampleSet);
    clusterModelOutput.deliver(model);
}

/* Returns true, if there is a solution available. */
private boolean assignmentAvailable(int[][] mappingTable) {
    int markedZeros = 0;
    boolean modificationDone = true;
    while (modificationDone) {
        while (modificationDone) {
            modificationDone = false;
            // column by column
            for (int i = 0; i < mappingTable.length; i++) {
                int position = -1;
                for (int j = 0; j < mappingTable[i].length; j++) {
                    if (mappingTable[i][j] == 0) {
                        if (position == -1) {
                            position = j;
                        } else {
                            position = -1;
                            break;
                        }
                    }
                }
                if (position != -1) {
                    modificationDone = true;
                    mappingTable[i][position] = Integer.MIN_VALUE; // marked zero
                    for (int k = 0; k < mappingTable.length; k++) {
                        if (mappingTable[k][position] == 0) {
                            mappingTable[k][position] = Integer.MAX_VALUE; // slashed zeros
                        }
                    }
                    markedZeros++;
                }
            }
            if (markedZeros == mappingTable.length) {
                return true;
            }
            // line by line
            for (int i = 0; i < mappingTable[0].length; i++) {
                int position = -1;
                for (int j = 0; j < mappingTable.length; j++) {
                    if (mappingTable[j][i] == 0) {
                        if (position == -1) {
                            position = j;
                        } else {
                            position = -1;
                            break;
                        }
                    }
                }
                if (position != -1) {
                    modificationDone = true;
                    mappingTable[position][i] = Integer.MIN_VALUE; // marked zero
                    for (int k = 0; k < mappingTable[0].length; k++) {
                        if (mappingTable[position][k] == 0) {
                            mappingTable[position][k] = Integer.MAX_VALUE; // slashed zeros
                        }
                    }
                    markedZeros++;
                }
            }
            if (markedZeros == mappingTable.length) {
                return true;
            }
        }
        // modificationDone is here always false
        // ambiguous zeros
        int aktMarkedZeros = markedZeros;
        for (int i = 0; i < mappingTable.length; i++) {
            for (int j = 0; j < mappingTable[i].length; j++) {
                if (mappingTable[i][j] == 0) {
                    mappingTable[i][j] = Integer.MIN_VALUE; // marked zero
                    for (int k = j + 1; k < mappingTable[i].length; k++) {
                        if (mappingTable[i][k] == 0) {
                            mappingTable[i][k] = Integer.MAX_VALUE; // slashed zeros in the same column
                        }
                    }
                    for (int k = 0; k < mappingTable.length; k++) {
                        if (mappingTable[k][j] == 0) {
                            mappingTable[k][j] = Integer.MAX_VALUE; // slashed zeros
                        }
                    }
                    modificationDone = true;
                    markedZeros++;
                    break;
                }
            }
            if (aktMarkedZeros != markedZeros) {
                break;
            }
        }
        if (markedZeros == mappingTable.length) {
            return true;
        }
    }
    return false;
}
-
Hi, how should I use this code in the program? Where should I copy it to and use it?
Thanks.
Sorry for asking.
-
Hi,
sorry, please help me.
Thanks
-
Hi @mschmitz,
One further question in this context: which classification model does the "Map Clustering on Labels" operator use for the subsequent calculation of performance values?
Thank you in advance for your response!
Best regards!
-
The Map Clustering on Labels "model" simply chooses a cluster for each class and maps to that, minimizing the total number of errors produced by the mapping. Assignments are exclusive: each cluster is mapped to exactly one class, and each class to exactly one cluster. It then calculates the performance metrics by comparing the "predictions" (based on the mapped clusters) with the "actual" values (the label). You need to have the same number of clusters as label classes for this operator to work.
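To make the "exclusive mapping with the fewest errors" idea concrete, here is a small sketch in plain Java. It is not the operator's code: the operator uses the Hungarian method shown in the listing earlier in the thread, while this sketch simply tries every one-to-one assignment, which is only practical for a handful of clusters. The class and method names are made up for illustration, and counts[c][l] is assumed to hold the number of examples in cluster c that carry label l.

```java
import java.util.Arrays;

class ExclusiveMappingSketch {

    /** Returns mapping[c] = label index assigned to cluster c, maximizing the matched examples. */
    static int[] bestMapping(int[][] counts) {
        int k = counts.length;
        int[] best = new int[k];
        int[] bestScore = {-1};
        search(counts, 0, new boolean[k], new int[k], 0, best, bestScore);
        return best;
    }

    // Tries every unused label for the current cluster (brute force over all permutations).
    private static void search(int[][] counts, int cluster, boolean[] used,
                               int[] current, int score, int[] best, int[] bestScore) {
        int k = counts.length;
        if (cluster == k) {
            if (score > bestScore[0]) {
                bestScore[0] = score;
                System.arraycopy(current, 0, best, 0, k);
            }
            return;
        }
        for (int label = 0; label < k; label++) {
            if (!used[label]) {
                used[label] = true;
                current[cluster] = label;
                search(counts, cluster + 1, used, current,
                        score + counts[cluster][label], best, bestScore);
                used[label] = false;
            }
        }
    }

    /** Fraction of examples whose mapped cluster label equals their actual label. */
    static double accuracy(int[][] counts, int[] mapping) {
        int correct = 0;
        int total = 0;
        for (int c = 0; c < counts.length; c++) {
            correct += counts[c][mapping[c]];
            for (int n : counts[c]) {
                total += n;
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Both clusters are dominated by label 0, but only one of them may claim it.
        int[][] counts = {
            {30, 20}, // cluster 0
            {25, 5}   // cluster 1
        };
        int[] mapping = bestMapping(counts);
        System.out.println(Arrays.toString(mapping)); // prints [1, 0]
        System.out.println(accuracy(counts, mapping));
    }
}
```

In this example a plain majority vote would map both clusters to label 0. Because assignments are exclusive, cluster 0 ends up with label 1 instead, since that costs fewer errors overall (45 of 80 examples matched, versus 35 for the other assignment).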