How to find the traits of each cluster?
Hi everyone,
Thanks in advance for checking out my question and providing help if you know how!
I am developing a customer segmentation model. My question is this: after performing the clustering, I get a result that shows me the various clusters. How do I find out what these clusters represent? Said differently, I'm looking to discover the traits of customers who fall within a particular cluster. For example, the customers in one cluster might tend to be frequent purchasers with a high volume of items per transaction who like to shop on Saturdays.
Is this information available?
Thanks again!
Matt
Answers
-
Hi,
Sure, that is easy. You can simply turn your "cluster" attribute into a label attribute with the operator "Set Role". Afterwards you can use any of the classification or weighting algorithms to tell you what the clusters are about. Attached below is an example where we first cluster the Iris data set into 3 clusters and then learn a decision tree to describe what the clusters are about.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" height="82" name="Clustering" width="90" x="313" y="34">
<parameter key="k" value="3"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.5.003" expanded="true" height="82" name="Decision Tree" width="90" x="581" y="34"/>
<connect from_op="Retrieve Iris" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Extra cool: You can even combine this with the new operator "Get Decision Tree Path" to enrich each data point with an explanation of why exactly it landed in its cluster. Check out this extended process here:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" height="82" name="Clustering" width="90" x="313" y="34">
<parameter key="k" value="3"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.5.003" expanded="true" height="82" name="Decision Tree" width="90" x="581" y="34"/>
<operator activated="true" class="operator_toolbox:get_dectree_path" compatibility="0.3.000" expanded="true" height="82" name="Get Decision Tree Path" width="90" x="715" y="34"/>
<connect from_op="Retrieve Iris" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_op="Get Decision Tree Path" to_port="mod"/>
<connect from_op="Decision Tree" from_port="exampleSet" to_op="Get Decision Tree Path" to_port="exa"/>
<connect from_op="Get Decision Tree Path" from_port="exa" to_port="result 1"/>
<connect from_op="Get Decision Tree Path" from_port="mod" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
You can find a link which describes how to import those XML files in my footer below.
Hope this helps,
Ingo
-
Hi,
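My personal favourite is to do what Ingo proposed in a 1-vs-All fashion: relabel the data so that one cluster is the positive class and all other clusters together form the negative class, then learn a model on that. This way you get the answer to the question: what makes cluster_x different from the other clusters?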
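To make the 1-vs-All idea concrete outside of RapidMiner, here is a minimal Python sketch using only the standard library. It is not the RapidMiner Decision Tree operator: it swaps in a depth-1 tree (a "decision stump") that searches for the single rule best separating one cluster from the rest. The function name `one_vs_rest_stump`, the two attributes, and the customer data are all hypothetical and only there for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical customers: [purchases_per_month, items_per_transaction]
rows = [
    [12, 8], [10, 9], [11, 7],  # frequent buyers, large baskets
    [2, 8], [3, 9], [1, 7],     # infrequent buyers, large baskets
    [2, 1], [3, 2], [1, 1],     # infrequent buyers, small baskets
]
# Cluster assignments as a clustering step might produce them (made up here).
clusters = ["cluster_0"] * 3 + ["cluster_1"] * 3 + ["cluster_2"] * 3


def one_vs_rest_stump(
    rows: Sequence[Sequence[float]],
    clusters: Sequence[str],
    target: str,
) -> Tuple[int, float, str, float]:
    """Find the single rule 'attribute <= t' or 'attribute > t' that best
    separates `target` from all other clusters.

    Returns (attribute index, threshold, direction, accuracy).
    """
    # 1-vs-All relabelling: True for the target cluster, False for the rest.
    is_target = [c == target for c in clusters]
    n = len(rows)
    best = (0, 0.0, "<=", -1.0)

    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between neighbouring distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            # Rows on which the rule "attribute <= t" agrees with the label.
            hits = sum((row[j] <= t) == y for row, y in zip(rows, is_target))
            # The opposite rule "attribute > t" is right exactly where
            # the first rule is wrong, so it gets n - hits rows correct.
            for direction, correct in (("<=", hits), (">", n - hits)):
                accuracy = correct / n
                if accuracy > best[3]:
                    best = (j, t, direction, accuracy)
    return best


for name in ("cluster_0", "cluster_1", "cluster_2"):
    print(name, one_vs_rest_stump(rows, clusters, name))
```

On this toy data, cluster_0 is described by attribute 0 being greater than 6.5 and cluster_2 by attribute 1 being at most 4.5, both with accuracy 1.0. cluster_1 cannot be separated by a single rule, because it needs two conditions at once (few purchases and large baskets). That is why a full decision tree, as in Ingo's process, is usually the better choice in practice.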
Best,
Martin