How to perform precision and recall with k-means and DBSCAN algorithms?

pvds90
pvds90 New Altair Community Member
edited November 5 in Community Q&A
Hi all,

I want to compute precision and recall for the k-means and DBSCAN algorithms. I've added a target label (Workaround) to the sample data set. Because I'm using Map Clustering on Labels, I'm only able to set k=2. Other values don't work because the number of clusters has to match the number of label values. Is there another way in RM to compute precision/recall for clustering algorithms without Map Clustering on Labels, so I can experiment with different values of k?

I'm hoping that somebody can help me out. Thanks in advance.

Regards,
Patrick

Best Answer

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    To get a supervised ML performance metric like precision or recall for an unsupervised ML method like clustering, you need to map the clusters to labels so they can be evaluated as predictions versus a known actual state. So if you have only 2 label values, then you can only have 2 clusters when using the "Map Clustering on Labels" operator (because it will treat those clusters as label values so they can be mapped).

    You could theoretically do this with more than two clusters, but you would then need to map the extra clusters manually to your two labels, so in the end you would still be effectively measuring the performance of only two clusters (or "superclusters" since they are just combinations of smaller clusters).

    Alternatively you could increase the number of label values, so if you had three label values then you could support 3 clusters, etc.
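
The "supercluster" idea above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the toy data, the majority-vote rule for assigning each cluster to a label, and the helper names `map_clusters_to_labels` and `precision_recall` are all hypothetical and not part of any RapidMiner operator.

```python
from collections import Counter


def map_clusters_to_labels(clusters, labels):
    """Assign each cluster the most frequent true label among its members."""
    counts_by_cluster = {}
    for cluster_id, label in zip(clusters, labels):
        counts_by_cluster.setdefault(cluster_id, Counter())[label] += 1
    return {
        cluster_id: counts.most_common(1)[0][0]
        for cluster_id, counts in counts_by_cluster.items()
    }


def precision_recall(labels, predictions, positive=True):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if p == positive and y == positive)
    fp = sum(1 for y, p in pairs if p == positive and y != positive)
    fn = sum(1 for y, p in pairs if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: cluster ids from a k=3 run and a boolean label.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
labels = [True, True, False, False, False, False, True, False, False, False]

# Three clusters collapse onto two label values ("superclusters").
mapping = map_clusters_to_labels(clusters, labels)
predictions = [mapping[c] for c in clusters]
precision, recall = precision_recall(labels, predictions)
print(mapping, precision, recall)
```

Because the mapping is derived from the same labels it is scored against, the resulting figures tend to be optimistic, and with a rare positive label every cluster may end up mapped to the majority value.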

Answers

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Hi,

    you can always set the roles of your attributes yourself. Set up one attribute with the role "label", another with "prediction", and Performance should work on the example set. It might need confidences, too, for some measures like AUC. So you might want to generate those using Generate Attributes.

    Regards,
    Balázs
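
One possible way to derive a prediction and a confidence from cluster assignments is sketched below in plain Python. It assumes, purely for illustration, that the confidence for the positive class is the share of positive labels inside each cluster and that the prediction is that confidence thresholded at 0.5; neither rule is specified in the post, and `cluster_confidences` is a hypothetical helper, not a RapidMiner function.

```python
from collections import defaultdict


def cluster_confidences(clusters, labels, positive=True):
    """Return, per cluster, the fraction of members with the positive label."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cluster_id, label in zip(clusters, labels):
        totals[cluster_id] += 1
        if label == positive:
            positives[cluster_id] += 1
    return {c: positives[c] / totals[c] for c in totals}


# Hypothetical toy data: cluster ids for any k, plus the boolean label.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
labels = [True, True, False, False, False, False, True, False, False, False]

conf_by_cluster = cluster_confidences(clusters, labels)

# Per-example "confidence" and "prediction" columns (assumed 0.5 threshold).
confidence = [conf_by_cluster[c] for c in clusters]
prediction = [score >= 0.5 for score in confidence]
print(conf_by_cluster, prediction)
```

With a label column, a prediction column, and a confidence column in place, threshold-based measures such as AUC become computable in addition to precision and recall.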
  • pvds90
    pvds90 New Altair Community Member
    edited January 2020
    Hi Balázs @BalazsBarany, thank you for your answer. There are 15 attributes in the set, and one of them contains an extreme date value (2099-01-01). That date is our workaround, which is why I added an extra boolean attribute that flags whether an example contains it. That extra attribute is the target label. No further "labels" are really needed, because only the flagged examples contain the workaround I'm looking for. I do not completely understand why it would make sense to add extra roles?

  • Telcontar120
    Telcontar120 New Altair Community Member
    To map clusters onto labels, you need to have the same number of clusters as you have label values. In the screenshot above, the error you are getting is because you have 2 label values but 3 clusters.
    Try rerunning your k-means with k=2 and then doing the cluster mapping. That should allow you to get the performance metrics you want.
  • pvds90
    pvds90 New Altair Community Member
    @Telcontar120 I know. It works when I run it with k=2, but is there another approach that would let me change k to other values? It feels very limited right now because it only runs with k=2.