"Process Documents

marvinrj
marvinrj New Altair Community Member
edited November 5 in Community Q&A
I was wondering whether there is a clustering technique that determines k (the number of clusters to use) automatically. I can't classify the data manually to confirm the result. Could anyone advise me on this?

Answers

  • Hello

    The DBSCAN clustering algorithm does not take k as an input; the number of clusters falls out of the result. You still have to choose good values for two parameters, namely epsilon and min_points, so there is, unfortunately, no free lunch. You can use RapidMiner to try combinations of these parameters and count the number of clusters found for each one; you can then spot regions of the search space that tend to produce the same number of clusters.

    <shameless self promotion>
    You could download an example that I made here: http://rapidminernotes.blogspot.com/2010/12/counting-clusters.html
    </shameless self promotion>
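
    For anyone who wants to try the same idea outside RapidMiner, here is a minimal Python sketch of the parameter sweep, not the linked RapidMiner process. Everything in it is an assumption made for illustration: the plain O(n^2) DBSCAN, the helper names (region_query, dbscan, sweep) and the toy data of three small grids of points. It runs DBSCAN for each (epsilon, min_points) pair and counts the clusters found, ignoring noise.

```python
import math
from collections import Counter


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:  # core point: keep expanding
                queue.extend(j_neighbours)
    return labels


def sweep(points, eps_values, min_points_values):
    """Map each (eps, min_points) pair to the number of clusters found."""
    results = {}
    for eps in eps_values:
        for min_points in min_points_values:
            labels = dbscan(points, eps, min_points)
            results[(eps, min_points)] = len(set(labels) - {-1})
    return results


def blob(cx, cy):
    """Hypothetical toy data: a 4x4 grid of points with spacing 0.1."""
    return [(cx + dx * 0.1, cy + dy * 0.1) for dx in range(4) for dy in range(4)]


points = blob(0, 0) + blob(5, 0) + blob(0, 5)
results = sweep(points, [0.05, 0.15, 0.5, 10.0], [3, 5])

for (eps, min_points), k in sorted(results.items()):
    print(f"eps={eps:<5} min_points={min_points} -> {k} clusters")

# Cluster counts that recur across many parameter settings are the
# "stable regions" worth a closer look.
print(Counter(results.values()).most_common())
```

    A cluster count that keeps turning up over a wide range of settings is a reasonable candidate for k, but someone still has to look at the clusters themselves.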

    Many other techniques exist for finding clusters. The key point is that they are all unsupervised, so a person always has to look at the results to judge whether the clusters are right or not.

    regards

    Andrew
  • marvinrj
    marvinrj New Altair Community Member
    hi,

    That would be the solution to my problem. But when I applied that clustering algorithm, the process ran for over 2 hours, and then I stopped it.
    Is that normal?

    thanks awchisholm.
  • Hello

    Long run times are common: the number of examples, the number of attributes and the algorithm itself all contribute, as does the brute-force nature of the parameter search. To see what time to expect, you could reduce the example set by using the Sample operator. Start with a very small number of examples, say 1% of the total, and see whether the clustering completes at all. Then increase to 2%, 5% and so on. From those timings you should be able to predict how long the full data set might take.
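
    The same sampling idea can be sketched in Python as a rough illustration. The function names (time_on_samples, fit_power_law, predict_seconds) are made up for this example, and fitting a power law t = a * n^b to the timings is my assumption about how to extrapolate; the real growth depends on the algorithm and the data, so treat the prediction as an order-of-magnitude guide only.

```python
import math
import random
import time


def time_on_samples(run, data, fractions, seed=0):
    """Time run(sample) for each sample fraction; returns [(n, seconds), ...]."""
    rng = random.Random(seed)
    timings = []
    for fraction in fractions:
        n = max(1, int(len(data) * fraction))
        sample = rng.sample(data, n)
        start = time.perf_counter()
        run(sample)
        elapsed = time.perf_counter() - start
        timings.append((n, max(elapsed, 1e-9)))  # avoid log(0) in the fit
    return timings


def fit_power_law(timings):
    """Least-squares fit of log(t) = log(a) + b * log(n); returns (a, b).

    Needs at least two different sample sizes.
    """
    xs = [math.log(n) for n, _ in timings]
    ys = [math.log(t) for _, t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_seconds(a, b, n):
    """Extrapolated run time for n examples under the fitted power law."""
    return a * n ** b


def pairwise_workload(sample):
    """Hypothetical stand-in for the clustering step: all pairwise distances."""
    total = 0.0
    for p in sample:
        for q in sample:
            total += abs(p - q)
    return total


data = [random.Random(1).random() for _ in range(10_000)]
timings = time_on_samples(pairwise_workload, data, [0.01, 0.02, 0.05])
a, b = fit_power_law(timings)
print(f"fitted exponent b = {b:.2f}")
print(f"predicted time for all {len(data)} examples: "
      f"{predict_seconds(a, b, len(data)):.1f} s")
```

    If the fitted exponent is close to 2, doubling the number of examples roughly quadruples the run time, which is why a process that finishes quickly on 1% of the data can still take hours on all of it.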

    Regards

    Andrew