"cluster model implementation"

Question

Hi,

I am going to write operators for clustering for rapid miner, especially fuzzy c-means and related algorithms. Up until 5.1, I did not find them and they are not in the core implementation of rapid miner. Also the current implementation of FlatFuzzyClusterModel seems to be a probability oriented implementation rather than a fuzzy one. Therefore, I dont only have to implement the algoithms them selfs, but also some infrastructure. Especially a fuzzy cluster model class is needed. The question is, what is the best way to approach this? I want to do this right so I want it to blend into the current rapid miner implementation of existing clusterers and their infrastructure. There are some direct questions:

1. I have looked at the current cluster model implementation and I find it kind of complicated for the functionality it provides. Why is for each cluster a own object stored with a list of indices of their members? getClusterIndexOfId, getClusterAssignments and getClusterAssignments would be better of with just storing an index array. The number of clusters as well as the number of data objects per cluster could be stored separately? The function getClusterIndexOfId even asks every cluster object if a specific ID is contained which creates a lot of overhead to just gain the information to which cluster a specific example is assigned. Where is the reasoning behind this and should I implement it the same way for fuzzy clusters?

2. I have the feeling that it is quite inefficient to do distance calculations as they are implemented. The way it works is, that each numerical attribute is assigned through a chain of function calls. Let me explain further:
From the KMeans class:for (int step = 0; (step < maxOptimizationSteps) && !stable; step++) {
	// assign examles to new centroids
	i = 0;
	for (Example example : exampleSet) {
		double[] exampleValues = getAsDoubleArray(example, attributes);
		double nearestDistance = measure.calculateDistance(model.getCentroidCoordinates(0), exampleValues);
		int nearestIndex = 0;
		for (int centroidIndex = 1; centroidIndex < k; centroidIndex++) {
			double distance = measure.calculateDistance(model.getCentroidCoordinates(centroidIndex), exampleValues);
			if (distance < nearestDistance) {
				nearestDistance = distance;
				nearestIndex = centroidIndex;
			}
		}
		centroidAssignments = nearestIndex;
		model.getCentroid(nearestIndex).assignExample(exampleValues);
		i++;
	}
		// finishing assignment
	stable = model.finishAssign();
}

(124) double distance = measure.calculateDistance(model.getCentroidCoordinates(centroidIndex), exampleValues)
The centroids are stored as double array, no problem here.
But the example values are calculated from the data:
(120) double[] exampleValues = getAsDoubleArray(example, attributes);
withprivate double[] getAsDoubleArray(Example example, Attributes attributes) {
	double[] values = new double[attributes.size()];
	int i = 0;
	for (Attribute attribute : attributes) {
		values = example.getValue(attribute);
		i++;
	}
	return values;
}

That means, for each data object and each iteration, the constructor has to create a double array to determine the location of the data object.

But to calculate the location, for each entry, getValue triggers the function implemented in AbstractAttribute:public double getValue(DataRow row) {
	double tableValue = row.get(getTableIndex(), getDefault()); 
	double result = tableValue;
	for (AttributeTransformation transformation : transformations) {
		result = transformation.transform(this, result);
	}
	return result;
}

Which means that all transformations has to be invoked in every iteration, for every example. Depending on the number of transformations (which can be a lot of data cleaning is used before clustering) that is a lot of work for simply determining the value of an attribute. And I am not only concerned about the actual computation time all this requires. The array is not stored after it is used. Or recycled. Its just thrown away which invokes the GC quite often and requires a lot of dynamic memory handling (which is terribly slow compared to just use the same double array over and over again). I am not sure if this is detected by the compiler and replaced accordingly.

Is it possible to speed things up by not doing all this stuff as implemented for KMeans but to copy the relevant data and use it instead of rebuilding the values of the attributes? Or does this contradict some of Rapid Miners code style rules?

3. I hate to ask but when faced with the issue above: but the Euclidean distance implementation every time uses the Math.sqrt function.. that is not needed in most cases as the squared value will do as well (or is even required sometimes).. It hurts me to implement an algorithm that throws away performance like that..

Best Regards,
Roland

land · Answer

Hi Roland,

nice to hear that you are going to implement some additional clustering operators. I'm glad to hear that someone wants to contribute.

The way you are asking shows, that you have already made yourself familiar with the RapidMiner code on a very detailed basis. I would like to give you answers to all your questions but since this will be a very technical discussion I would suggest shifting the medium to the special interest group for development. If you send me your email address by personal mail, I will invite you and we can continue the discussion there.

Greetings,
  Sebastian