Where in the process to place the 'Cross validation' operator?

tonyboy9
New Altair Community Member
In the customer segmentation process below, I believe the k-means cluster model has already answered which cluster of customers (by ID) to use. That would be the answer to my problem statement.
I'm confused over where to place 'Cross Validation'. The tutorials seem to indicate placing the operator right after the 'Retrieve' operator for the data set. At that point, how can RapidMiner validate a model that hasn't yet been built by k-means clustering further down the line?
Any helpful suggestions are greatly appreciated.
Best Answer
I think the question is what do you mean by validating a clustering model? Validation normally implies that you have a set of observations where you know the correct answer so you can check the ML algorithm's prediction against a known outcome and "grade" its performance.
With clustering (or any unsupervised learning problem) there is no known correct answer in advance. You are simply using an algorithm to explore structures in your data and return results. You may or may not be happy with the outcome of any particular algorithm, but there is no objective way for the algorithm to "self-assess" its performance relative to other possible clustering solutions.
Now, there are performance operators for clustering in RapidMiner that you might want to take a look at; you can use them to understand the outcome of any particular clustering solution and to compare outcomes. People have also suggested various methods for evaluating or comparing different clustering outcomes (like the elbow method), but these remain somewhat subjective: there is no clear and compelling objective way to say that one clustering outcome is superior to another unless you specify in advance the exact criteria you will use for that determination.
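To make the elbow method and cluster performance measures mentioned above concrete, here is a minimal sketch in Python with scikit-learn (as an analogue to RapidMiner's cluster performance operators, not the operators themselves). The synthetic three-segment data set is purely illustrative: inertia (within-cluster sum of squares) always drops as k grows, so you look for the "elbow" where improvement levels off, while the silhouette score gives a single comparable number per k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three synthetic, well-separated customer "segments" in 2-D feature space
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia keeps shrinking with k; silhouette peaks where clusters
    # are compact and well separated, helping compare candidate k values.
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={sil:.3f}")
```

On this toy data the silhouette score favors k=3, but on real customer data these measures only rank candidate solutions; they do not "validate" a clustering in the supervised sense discussed above.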
Answers
Cross validation is an approach to model validation for supervised machine learning problems when you have a defined target variable (called the label in RapidMiner). If you look at the tutorial process for that operator, you can see that inside it, you put the training learner on the left part of the process, and the validation on the right side.
But clustering is an unsupervised machine learning problem, where there is no defined label in advance that you are trying to obtain. So generally speaking Cross Validation is not applicable when you are doing clustering.
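As an illustration of what the Cross Validation operator does in the supervised case, here is a minimal Python/scikit-learn sketch (an analogue, not RapidMiner itself; the iris data set and decision tree are stand-ins). Note that the known label y is what makes cross-validation possible at all: each fold trains on part of the data and scores predictions against held-out known answers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# y is the known label (RapidMiner's "label" role) -- required for CV
X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: train on 9 folds, score on the held-out fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A k-means process has no such y to score against, which is exactly why the operator doesn't fit into the clustering workflow.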
Thanks for that, Brian. You wrote: "But clustering is an unsupervised machine learning problem, where there is no defined label in advance that you are trying to obtain. So generally speaking Cross Validation is not applicable when you are doing clustering."
So is there another way to validate a clustering model?