"clustering atomic data files? [UPDATE]"

Question

hi,

so I'm a total noob on data mining - never used such programs, and thought I'd post a quick question about the clustering options of the tool.

I have a dataset of between 1000 and 10^6 lines; each line has 6 values, namely x y z vx vy vz (for the interested: coordinates and velocities). The data in 2d looks like this:

So the dataset consists of several clusters, ranging from 1 single point up to maybe 100-250 points (or so ...). For my studies I created an artificial one with 1000 points: two big groups of points and 10-20 single dots (for testing the program, obviously).

The goal as an output would be at least two histograms like:

"number of clusters of size N (the grey lines in the picture)" vs "size N (black dots inside)" or
"number of clusters of size N" vs "center of mass velocity"

I tried a few operators: Learner -> Unsupervised -> Clustering -> EM_Clustering or W-XMEans worked best, but I have to give initial values like the number of clusters, which would be 8 in the above example (every single dot counts as a single "cluster"). If the program groups each of the big ones, it also lumps the single ones together as one sort :-(. So n=6 would give the best result (the big ones, and one sort of single clusters). So that's not what I want.

For the histograms: I would need a new column (which I get from the clustering: cluster 0, cluster 1, ...), then a sort of if-condition to sum up the important columns (velocities, ...), and then again do some analysis on the new columns, and so on.
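To make the first histogram concrete, here is a minimal sketch in plain Python (not a RapidMiner process). It assumes the clustering step has already produced one label per data line; the function name `cluster_size_histogram` and the sample labels are made up for illustration.

```python
from collections import Counter

def cluster_size_histogram(labels):
    """Map cluster size N -> number of clusters that have exactly N members."""
    sizes = Counter(labels)                # cluster label -> member count
    return dict(Counter(sizes.values()))   # member count -> number of clusters

# Hypothetical labels: one cluster of 3, one of 2, and two single points.
labels = [0, 0, 0, 1, 1, 2, 3]
histogram = cluster_size_histogram(labels)
print(sorted(histogram.items()))  # [(1, 2), (2, 1), (3, 1)]
```

The result reads as "two clusters of size 1, one of size 2, one of size 3", which is the "number of clusters of size N" vs "size N" data.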

Is something like this possible? I read, or tried to read, the manual, but it's 600 pages and searching for clustering just leads me to the clusterwrite/read operator.

Maybe I'm applying the operators wrong? I mean, I cannot give a certain number of clusters, because a huge dataset has 10^5 lines and maybe 1000 groups, all of different sizes. Also, I didn't get the thing with the attributes of the dataset.
But the distance in coordinates, say less than a certain value, would qualify some atoms as a cluster. Can I give a cutoff radius or some other criterion to the cluster algorithm?

grateful for every hint,
Stever

Edit: the way I did this was: new Operator -> IO -> Examples -> ExampleSource, then New Operator -> Unsupervised -> Clustering -> ...

Edit2: the ideal cluster criterion would be: are there points inside a specific radius R_0 of particle n? If so, they count towards cluster c1; then: is there another particle inside R_0? If yes, go on; if not, move on to the next particle. If a particle has no neighbours inside R_0, it counts as an independent cluster c2, etc.
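The criterion in Edit2 can be sketched in plain Python as follows. This is only an illustration of the idea, not a RapidMiner operator: the function name `cluster_by_radius` and the sample points are made up, "inside R_0" is taken to mean a distance less than or equal to R_0, and only the coordinates (not the velocities) are used for the distance.

```python
import math

def cluster_by_radius(points, r0):
    """Label points so that any two within r0 of each other share a cluster.

    The rule is applied transitively: neighbours of neighbours join the
    same cluster. A point with no neighbour within r0 gets its own label.
    """
    labels = [-1] * len(points)  # -1 means "not visited yet"
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= r0:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three points in a chain plus one far-away single point.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(cluster_by_radius(points, r0=0.6))  # [0, 0, 0, 1]
```

The loop compares every pair of points, so it scales quadratically; for the 10^6-line case some spatial index (a grid of cells or a k-d tree) would be needed to find neighbours quickly.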

stever1k · Answer

hi,

yes, everything works great; I just had to find the right cluster algorithm. As you said, DBScan works best for me. There is still one problem left, which I haven't figured out how to solve:

The cluster algorithm adds a new column to my data, which I convert to a number ("cluster1" -> 1.0, "cluster87" -> 87.0) to do some awk stuff later on. The table has a few entries like

cluster id x y z ...
...

I would like to calculate an average over the x (y, z, ...) columns, but only per cluster. It should sum up all values in the x column only for cluster1, then give me the avg, then cluster2 -> avg, and so on. I haven't found the right operator for doing this. Any hints?
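Outside the tool, the per-cluster average I am after would look roughly like this in plain Python. This is just a sketch to show what I mean: the function name `cluster_means` and the sample rows are made up, and each row is assumed to start with the numeric cluster id followed by the value columns.

```python
from collections import defaultdict

def cluster_means(rows):
    """rows: iterable of (cluster_id, v1, v2, ...).

    Returns {cluster_id: [mean of v1, mean of v2, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for cluster_id, *values in rows:
        if cluster_id not in sums:
            sums[cluster_id] = [0.0] * len(values)
        for k, v in enumerate(values):
            sums[cluster_id][k] += v   # running sum per column, per cluster
        counts[cluster_id] += 1
    return {cid: [s / counts[cid] for s in total] for cid, total in sums.items()}

# Hypothetical rows: cluster id, x, y
rows = [(1.0, 0.0, 2.0), (1.0, 4.0, 6.0), (87.0, 1.0, 1.0)]
print(cluster_means(rows))  # {1.0: [2.0, 4.0], 87.0: [1.0, 1.0]}
```

Applied to the vx, vy, vz columns, the same averaging gives the center of mass velocity per cluster, assuming all atoms have equal mass.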

best wishes,
Stever

land · Answer

Hi,
first of all: Nothing you mentioned seems to be impossible. But it will need some more or less complex process construction. If I understood you correctly, you should take a look at the operators named Aggregation and AttributeConstruction. You will probably need them to build the data for the histograms.
The algorithm you proposed for clustering (cutting off above a distance) is similar to the behavior of the DBScan clustering, which uses a density measure to cluster and returns all outlying data points as noise (the first cluster).

One final hint on clustering performance: You can't really say whether a clustering is good or bad, because usually you don't have an objective criterion to measure the cluster assignment against. So there are two ways to assess a clustering: using a heuristic measure like the Davies Bouldin Index (see ClusterCentroidEvaluator), or having real labels available, which is impossible in real-world applications. In the latter case you could use Cluster2Prediction and afterwards evaluate the clustering like a classification algorithm.
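To give an idea of what such a heuristic measures, here is a minimal plain-Python sketch of the standard Davies Bouldin definition: for each cluster, take the worst ratio of (scatter of cluster i + scatter of cluster j) to the distance between their centroids, then average over all clusters. It is not the code behind ClusterCentroidEvaluator; the function name `davies_bouldin` and the sample data are made up for illustration.

```python
import math

def davies_bouldin(points, labels):
    """Davies Bouldin index: lower values mean compact, well separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Centroid of each cluster (coordinate-wise mean of its members).
    centroids = {
        lab: tuple(sum(c) / len(members) for c in zip(*members))
        for lab, members in clusters.items()
    }
    # Scatter: mean distance of the members to their own centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Two tight pairs that are far apart give a small index.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # 0.2
```

Note that the sketch divides by the centroid distance, so it fails if two clusters have exactly the same centroid.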

Greetings,
  Sebastian