Cluster-Analysis with wholesale customer dataset
Hello everyone,
We are a group of marketing students taking a course called "Marketing Analytics". Our task is to perform a cluster analysis, using different clustering methods, on the following dataset:
https://archive.ics.uci.edu/ml/datasets/wholesale+customers
The exact description is the following:
"The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. Goal: Find Clusters of Customers"
For that, we should try out different clustering methods (our professor told us to try DBSCAN and hierarchical clustering in addition to k-means).
So far we have done the following:
Added Operator: Read CSV -> Loaded in the Data-Set
Added Operator: Select Attributes -> Filtered out the nominal attributes Channel & Region
Added Operator: K-Means
First off, we do not know how to find the optimal "k" to use in RapidMiner. How do we get there, and how can we see the intra-cluster distance and the "elbow" graph in RapidMiner for this dataset? (I attached a graphic from a presentation I found.)
As we have more than 2 attributes (Milk, Frozen, Fresh, Delicassen, Grocery, etc.), how can we visualize the clusters? What kind of clusters can we get out of this dataset?
Also, how can we use DBSCAN clustering? If we just connect it to the Select Attributes operator and run it, we get only one cluster...
Our professor also told us to use some kind of loop. Is it also necessary to filter out outliers?
Please help, we are struggling a lot with this task. If someone is able to explain it, he or she can also contact me privately and I would offer something for the effort.
Thanks a lot!!
Answers
-
Hi @mluethke87,
Can you share your process and your dataset(s), please?
Some elements of an answer:
For the optimal number of clusters "k", there is a theoretical method (though it is not guaranteed to work every time).
You can put the K-Means model together with the Performance (Cluster Distance Performance) operator, with Davies Bouldin as the
main criterion, inside the Optimize Parameters operator, and choose "k" as the parameter to optimize:
the value of "k" that minimizes the Davies Bouldin index is the optimal value of k (in theory, if such a value exists).
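To make the idea behind the Davies Bouldin index more concrete, here is a minimal sketch in plain Python (standard library only). It uses the textbook definition; RapidMiner's operator may differ in details such as sign or distance measure, so treat it as an illustration rather than a reproduction of what the operator computes. The four points are made up and are not from the wholesale dataset.

```python
from math import dist


def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index for a list of clusters (lists of points).

    Lower is better: compact clusters that lie far apart give small values.
    """
    centroids = [centroid(c) for c in clusters]
    # Scatter: average distance of each cluster's points to its own centroid.
    scatter = [
        sum(dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    total = 0.0
    for i in range(len(clusters)):
        # Worst-case "similarity" of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)


# Toy data, made up for illustration only.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(10.0, 0.0), (10.0, 2.0)]

good_split = davies_bouldin([left, right])
bad_split = davies_bouldin([[left[0], right[0]], [left[1], right[1]]])
print(good_split, bad_split)
```

The sensible split (left group vs right group) gets a much lower index than the split that mixes the two groups, which is why the k with the lowest index is taken as the candidate optimum.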
Experts, please correct me if I'm wrong.
Regards,
Lionel
1 -
Hi again @mluethke87,
Sorry, I read your post too fast. OK: your process is in the attached file, and there is a link to your dataset...
Regards,
Lionel
0 -
Hey,
I attached the process. It also includes other clustering methods connected to a Multiply operator.
For now it is still all about k-means and how to determine the correct "k" to use for this task.
People share graphics showing the "elbow", but I have not seen any explanation showing step by step how this is done in RapidMiner.
You said:
"..associated to the Performance (Cluster Distance Performance) operator - with the Davies Bouldin as
Main criterion - inside the Optimize Parameters operator and choose "k" as parameter to optimize.."
I inserted the Optimize Parameters (Grid) operator, but it does not show any of the functions / steps that you described.
Can you show this visually? I cannot get there on my own.
Thank you
0 -
@mluethke87 You can compute the Davies Bouldin index with one of the Cluster Performance operators and use it inside the Optimize Parameters operator, as in the process below.
You should definitely review the Optimization video tutorial to get familiar with it; it's a very powerful operator.
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="open_file" compatibility="8.0.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="resource_type" value="URL"/>
<parameter key="url" value="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv"/>
</operator>
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
<parameter key="csv_file" value="C:\Users\lueth\Desktop\Wholesale customers data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Channel.true.binominal.attribute"/>
<parameter key="1" value="Region.true.polynominal.attribute"/>
<parameter key="2" value="Fresh.true.integer.attribute"/>
<parameter key="3" value="Milk.true.integer.attribute"/>
<parameter key="4" value="Grocery.true.integer.attribute"/>
<parameter key="5" value="Frozen.true.integer.attribute"/>
<parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
<parameter key="7" value="Delicassen.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Region"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="145" name="Multiply" width="90" x="447" y="34"/>
<operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.0.001" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="715" y="391">
<list key="parameters">
<parameter key="Clustering.k" value="[2.0;100.0;10;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="Clustering" width="90" x="112" y="34"/>
<operator activated="true" class="cluster_distance_performance" compatibility="8.0.001" expanded="true" height="103" name="Performance" width="90" x="313" y="34">
<parameter key="main_criterion" value="Davies Bouldin"/>
</operator>
<connect from_port="input 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
<connect from_op="Performance" from_port="performance" to_port="performance"/>
<connect from_op="Performance" from_port="cluster model" to_port="model"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_means" compatibility="8.0.001" expanded="true" height="82" name="X-Means" width="90" x="715" y="136"/>
<operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-Means" width="90" x="715" y="34">
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<operator activated="true" class="agglomerative_clustering" compatibility="8.0.001" expanded="true" height="82" name="Agglomerative Clustering" width="90" x="715" y="238">
<parameter key="mode" value="AverageLink"/>
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="k-Means" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="Agglomerative Clustering" to_port="example set"/>
<connect from_op="Multiply" from_port="output 4" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 6"/>
<connect from_op="Optimize Parameters (Grid)" from_port="model" to_port="result 7"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter set" to_port="result 8"/>
<connect from_op="X-Means" from_port="cluster model" to_port="result 4"/>
<connect from_op="X-Means" from_port="clustered set" to_port="result 5"/>
<connect from_op="k-Means" from_port="cluster model" to_port="result 1"/>
<connect from_op="k-Means" from_port="clustered set" to_port="result 2"/>
<connect from_op="Agglomerative Clustering" from_port="cluster model" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
<portSpacing port="sink_result 9" spacing="0"/>
</process>
</operator>
</process>
0 -
Hi @mluethke87,
To answer your question "how can we see the intra-cluster distance and the "elbow" graph in RapidMiner for this dataset? (I attached a graphic from a presentation I found)":
I don't know exactly what an elbow graph is, and I don't think RapidMiner provides such graphs.
Your first graph shows the within sum of squares vs k (the number of clusters). RapidMiner doesn't calculate the within sum of squares but the average within centroid distance.
You can obtain a similar curve by plotting the average within centroid distance vs k, using the Log operator. In your case, we obtain this curve:
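For readers who want to see the logic of "loop over k and log the average within centroid distance" outside RapidMiner, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: the tiny k-means, the deterministic seeding, the toy points, and the definition of the measure as the mean Euclidean distance to the nearest centroid (RapidMiner's exact definition may differ).

```python
from math import dist


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        updated = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def avg_within_distance(points, centroids):
    """Mean distance from each point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in points) / len(points)


# Toy data: three tight, well-separated blobs (made up, not the wholesale data).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
    (5.0, 10.0), (5.0, 11.0), (6.0, 10.0), (6.0, 11.0),
]

# The "Log" step: one row per k.
curve = {k: avg_within_distance(points, kmeans(points, k)) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 3))
```

On this toy data the value drops sharply up to k = 3 and only slightly afterwards; that bend is what the "elbow" refers to.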
Here is the RapidMiner process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="open_file" compatibility="8.0.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="resource_type" value="URL"/>
<parameter key="url" value="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv"/>
</operator>
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
<parameter key="csv_file" value="C:\Users\lueth\Desktop\Wholesale customers data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Channel.true.binominal.attribute"/>
<parameter key="1" value="Region.true.polynominal.attribute"/>
<parameter key="2" value="Fresh.true.integer.attribute"/>
<parameter key="3" value="Milk.true.integer.attribute"/>
<parameter key="4" value="Grocery.true.integer.attribute"/>
<parameter key="5" value="Frozen.true.integer.attribute"/>
<parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
<parameter key="7" value="Delicassen.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Region"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="166" name="Multiply" width="90" x="447" y="34"/>
<operator activated="true" class="concurrency:loop_parameters" compatibility="8.0.001" expanded="true" height="103" name="Loop Parameters" width="90" x="715" y="391">
<list key="parameters">
<parameter key="Clustering.k" value="[2.0;10;10;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="85">
<parameter key="k" value="3"/>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="8.0.001" expanded="true" height="103" name="Performance" width="90" x="447" y="85"/>
<operator activated="true" class="log" compatibility="8.0.001" expanded="true" height="82" name="Log" width="90" x="581" y="85">
<list key="log">
<parameter key="k" value="operator.Clustering.parameter.k"/>
<parameter key="Average within centroid distance" value="operator.Performance.value.avg_within_distance"/>
</list>
</operator>
<connect from_port="input 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
<connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
<connect from_op="Performance" from_port="cluster model" to_port="output 1"/>
<connect from_op="Log" from_port="through 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_means" compatibility="8.0.001" expanded="true" height="82" name="X-Means" width="90" x="715" y="136"/>
<operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-Means" width="90" x="715" y="34">
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<operator activated="true" class="agglomerative_clustering" compatibility="8.0.001" expanded="true" height="82" name="Agglomerative Clustering" width="90" x="715" y="238">
<parameter key="mode" value="AverageLink"/>
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="k-Means" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="Agglomerative Clustering" to_port="example set"/>
<connect from_op="Multiply" from_port="output 4" to_port="result 6"/>
<connect from_op="Multiply" from_port="output 5" to_op="Loop Parameters" to_port="input 1"/>
<connect from_op="Loop Parameters" from_port="output 2" to_port="result 7"/>
<connect from_op="X-Means" from_port="cluster model" to_port="result 4"/>
<connect from_op="X-Means" from_port="clustered set" to_port="result 5"/>
<connect from_op="k-Means" from_port="cluster model" to_port="result 1"/>
<connect from_op="k-Means" from_port="clustered set" to_port="result 2"/>
<connect from_op="Agglomerative Clustering" from_port="cluster model" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
</process>
</operator>
</process>
Regards,
Lionel
1 -
Hey,
thanks a lot, guys, for your help so far.
So in the graph you showed, it would make sense to use k = 3, because that is where the avg. centroid distance relative to the number of clusters is "optimal": adding more clusters beyond that point no longer changes the avg. centroid distance much, correct?
Still, within the subprocess you also put in a k-means that is preconfigured with k = 3. Does this number only affect how many runs the loop makes, or does it affect anything at all?
Also, if I delete the Multiply operator and connect Loop Parameters directly to Select Attributes, the avg. centroid distance graph created in this subprocess looks different. Why is it affected by that?
See the attached screenshots and the XML code:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="open_file" compatibility="8.0.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="resource_type" value="URL"/>
<parameter key="url" value="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv"/>
</operator>
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
<parameter key="csv_file" value="C:\Users\lueth\Desktop\Wholesale customers data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Channel.true.binominal.attribute"/>
<parameter key="1" value="Region.true.polynominal.attribute"/>
<parameter key="2" value="Fresh.true.integer.attribute"/>
<parameter key="3" value="Milk.true.integer.attribute"/>
<parameter key="4" value="Grocery.true.integer.attribute"/>
<parameter key="5" value="Frozen.true.integer.attribute"/>
<parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
<parameter key="7" value="Delicassen.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Region"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="concurrency:loop_parameters" compatibility="8.0.001" expanded="true" height="82" name="Loop Parameters" width="90" x="447" y="136">
<list key="parameters">
<parameter key="k-means.k" value="[2.0;100.0;10;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-means" width="90" x="179" y="85"/>
<operator activated="true" class="cluster_distance_performance" compatibility="8.0.001" expanded="true" height="103" name="Cluster Distance Perf." width="90" x="380" y="85"/>
<operator activated="true" class="log" compatibility="8.0.001" expanded="true" height="103" name="Log" width="90" x="514" y="34">
<list key="log">
<parameter key="k" value="operator.k-means.parameter.k"/>
<parameter key="Average within centroid distance" value="operator.Cluster Distance Perf\..value.avg_within_distance"/>
</list>
</operator>
<connect from_port="input 1" to_op="k-means" to_port="example set"/>
<connect from_op="k-means" from_port="cluster model" to_op="Cluster Distance Perf." to_port="cluster model"/>
<connect from_op="k-means" from_port="clustered set" to_op="Cluster Distance Perf." to_port="example set"/>
<connect from_op="Cluster Distance Perf." from_port="performance" to_op="Log" to_port="through 1"/>
<connect from_op="Cluster Distance Perf." from_port="cluster model" to_op="Log" to_port="through 2"/>
<connect from_op="Log" from_port="through 1" to_port="performance"/>
<connect from_op="Log" from_port="through 2" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Loop Parameters" to_port="input 1"/>
<connect from_op="Loop Parameters" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Thank you!
0 -
Hi @mluethke87,
Just a short answer about the optimal value of "k":
- according to the graph, the optimal value of "k" seems to be 3, for the reason you give.
- according to the Davies Bouldin index, the optimal value of "k" is 6 when running the process provided by @Thomas_Ott.
Maybe you can consider both cases in parallel in your project.
Regards,
Lionel
0 -
Okay, thank you!
Still, can you guys please tell me why DBSCAN returns only 1 cluster (all the data ends up in a single cluster)? Why did our professor even mention this algorithm if it does not fit our dataset?
Also, is there a way to show the correlation value for each pair of attributes (Milk vs Grocery, etc.), so I can see whether some of these categories are correlated at all?
Thank you!
0 -
Hi @mluethke87,
1. Maybe your professor doesn't want to give you the "right answer", but wants you to experiment with DBSCAN yourself to see
how it behaves and what its advantages and disadvantages are.
So I suggest you try different combinations of the DBSCAN parameters (epsilon / min points) and check, in each case, how many clusters it finds.
2. To determine the correlations between your attributes you can:
- Plot the attributes against each other, pair by pair, to see visually whether they are correlated.
- Use the Correlation Matrix operator to see whether there is a linear correlation between any 2 of your attributes. This matrix gives a value in the range [-1, 1] for each pair of attributes, where 0 = no linear correlation, 1 = perfect positive correlation and -1 = perfect negative correlation.
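As an illustration of what such a matrix contains, here is a small plain-Python sketch of the Pearson correlation coefficient. The column names are borrowed from the dataset, but the values are invented for the example; it is not meant to reproduce the output of the Correlation Matrix operator.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values, for illustration only (not taken from the wholesale data).
columns = {
    "Milk": [1, 2, 3, 4, 5],
    "Grocery": [2, 4, 6, 8, 10],   # rises exactly with Milk
    "Frozen": [5, 4, 3, 2, 1],     # falls exactly as Milk rises
    "Delicassen": [2, 1, 3, 1, 2], # no linear relation to Milk
}

# Pairwise matrix as a nested dict: matrix[a][b] = correlation of a and b.
matrix = {
    a: {b: pearson(columns[a], columns[b]) for b in columns}
    for a in columns
}

for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate none (a non-linear relationship can still exist, which is why the plots remain useful).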
Regards,
Lionel
1 -
You can also check your k-means work by using the X-means operator, which recommends/selects an optimal value for k based on the BIC (similar but not identical to the DBI method you are using manually above).
1 -
Hey,
thanks again for the reply.
The problem is that I only get 1 cluster, so DBSCAN does not seem to work here. I looked everywhere but could not find a proper explanation of why that is. I did of course play around with epsilon and min points.
Can you tell me why this dataset does not get clustered by DBSCAN? It is frustrating, and I am not even sure whether it is possible to make it work.
XML file attached.
Thanks!
0 -
Hi @mluethke87,
Here is one possible explanation:
"DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters." (extract from the DBSCAN article on Wikipedia):
https://en.wikipedia.org/wiki/DBSCAN
Maybe that is the case for your dataset, and that is why DBSCAN has trouble "isolating" separate clusters.
Regards,
Lionel
0 -
Hi @mluethke87,
Your project is interesting: it highlights how difficult it can be to cluster some data.
I started with what is normally the first step of the data science methodology: visual data analysis.
We are in a high-dimensional space, but we can always plot one attribute against another in 2D.
For example, here is Milk vs Grocery:
How many clusters do you see?
NB: Many attribute_x vs attribute_y combinations in your dataset show this same kind of distribution.
Visually, it is difficult to answer the question "how many clusters are there?". It is subjective (everyone sees it differently),
but we could answer number of clusters = 1, meaning either:
- 1 cluster for the whole dataset, or
- 1 cluster (larger or smaller) in the bottom-left corner, with the rest of the data being unclassifiable noise.
Now let's look at the clusters produced by the k-Means model with k = 6 (as a reminder, k = 6 comes from optimizing the Davies Bouldin index in the process by @Thomas_Ott):
Next, here are the clusters produced by the X-Means model recommended by @Telcontar120 (which concludes k = 4):
In both cases we see that, when we "force" an algorithm (k-Means or X-Means) to find clusters, these clusters have very different "densities" for your dataset.
With DBSCAN, however, we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for those points to be considered a cluster, so in effect we define an "estimate of the density of the clusters".
So for DBSCAN to find clusters, the clusters must have similar densities. That is why it cannot handle clusters of different densities, and in your case it always ends up with number of clusters = 1.
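To illustrate this behaviour, here is a minimal DBSCAN written in plain Python. It is only a sketch of the standard algorithm, not the code used by RapidMiner or by any Python library, and the points are made up: a dense group followed by a tail of points that get further and further apart, loosely imitating the shape seen in the plots.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Made-up data: a dense group, then a tail with growing gaps.
dense = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
tail = [(8.0, 0.0), (13.0, 0.0), (19.0, 0.0), (26.0, 0.0)]
points = dense + tail

tight = dbscan(points, eps=1.5, min_pts=3)  # small radius
loose = dbscan(points, eps=8.0, min_pts=3)  # large radius
print(tight)
print(loose)
```

With the small radius only the dense group forms a cluster and the tail is noise; with the large radius everything is chained into a single cluster. In both cases the number of clusters is 1, and only the amount of noise changes.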
Second part: RapidMiner vs Python (sorry, this post is not finished yet...)
Because of this number of clusters = 1 result, I decided to compare the results of RapidMiner's DBSCAN with those of Python's DBSCAN (sorry @sgenzer if you read this post). In both cases, the conclusion is number of clusters = 1.
But depending on the epsilon / min points setting, Python's DBSCAN concludes that some data points are "unlabelled" (noise), whereas RapidMiner puts all the data into the single cluster.
I think the conclusion of Python's DBSCAN is logical. Indeed, as said previously, by the definition of the DBSCAN algorithm we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for those points to be considered a cluster. From my point of view, some data points in this dataset are isolated, so they should not belong to a cluster and should be considered noise (depending on the epsilon / min points setting). For example, for epsilon = 1 / min points = 5, here are the results of Python's DBSCAN:
NB: clustered data in red, "unlabelled" data in blue.
I thought I would get this behaviour by checking the remove unlabelled parameter of RapidMiner's DBSCAN, but that is not the case.
So my question is: why does RapidMiner's DBSCAN cluster all the data regardless of the epsilon / min points setting?
In conclusion, I hope I have contributed to the thinking about DBSCAN and your project.
And now the post is actually finished (phew!)
Best regards,
Lionel
1 -
Hi again,
To complete my previous post, here is the associated process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Downloads\Wholesale customers data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Channel.true.integer.attribute"/>
<parameter key="1" value="Region.true.integer.attribute"/>
<parameter key="2" value="Fresh.true.integer.attribute"/>
<parameter key="3" value="Milk.true.integer.attribute"/>
<parameter key="4" value="Grocery.true.integer.attribute"/>
<parameter key="5" value="Frozen.true.integer.attribute"/>
<parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
<parameter key="7" value="Delicassen.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Region"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="380" y="34"/>
<operator activated="true" class="x_means" compatibility="8.0.001" expanded="true" height="82" name="X-Means" width="90" x="581" y="238"/>
<operator activated="true" class="dbscan" compatibility="8.0.001" expanded="true" height="82" name="DBSCAN" width="90" x="581" y="136">
<parameter key="remove_unlabeled" value="true"/>
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-Means" width="90" x="581" y="34">
<parameter key="k" value="6"/>
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV (2)" width="90" x="112" y="493">
<parameter key="csv_file" value="C:\Users\Lionel\Downloads\Wholesale customers data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Channel.true.integer.attribute"/>
<parameter key="1" value="Region.true.integer.attribute"/>
<parameter key="2" value="Fresh.true.integer.attribute"/>
<parameter key="3" value="Milk.true.integer.attribute"/>
<parameter key="4" value="Grocery.true.integer.attribute"/>
<parameter key="5" value="Frozen.true.integer.attribute"/>
<parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
<parameter key="7" value="Delicassen.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="DBSCAN_Python" width="90" x="313" y="493">
<parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.cluster import DBSCAN&#10;from sklearn.preprocessing import StandardScaler&#10;from collections import Counter&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    path= 'C:/Users/Lionel/Downloads'&#10;    file = 'Wholesale customers data.csv'&#10;    epsilon = 1&#10;    min_points = 5&#10;    data = pd.read_csv(path + '/' + file)&#10;    X = data.iloc[:,2:]&#10;    Y = StandardScaler().fit_transform(X)&#10;    db = DBSCAN(eps=epsilon, min_samples=min_points).fit(Y)&#10;    labels = db.labels_&#10;    label = pd.DataFrame(data = labels,columns = ['cluster'])&#10;    label = label.join(X)&#10;    counter = Counter(label.cluster)&#10;    count = pd.DataFrame.from_dict(counter, orient='index').reset_index()&#10;    count = count.rename(columns={'index':'cluster', 0:'number of elements'})&#10;    count.cluster = [0,&quot;-1 (unlabelled)&quot;]&#10;    # connect 2 output ports to see the results&#10;    return label,count"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="k-Means" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="DBSCAN" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="X-Means" to_port="example set"/>
<connect from_op="X-Means" from_port="cluster model" to_port="result 8"/>
<connect from_op="X-Means" from_port="clustered set" to_port="result 7"/>
<connect from_op="DBSCAN" from_port="cluster model" to_port="result 3"/>
<connect from_op="DBSCAN" from_port="clustered set" to_port="result 4"/>
<connect from_op="k-Means" from_port="cluster model" to_port="result 1"/>
<connect from_op="k-Means" from_port="clustered set" to_port="result 2"/>
<connect from_op="Read CSV (2)" from_port="output" to_op="DBSCAN_Python" to_port="input 1"/>
<connect from_op="DBSCAN_Python" from_port="output 1" to_port="result 5"/>
<connect from_op="DBSCAN_Python" from_port="output 2" to_port="result 6"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
<portSpacing port="sink_result 9" spacing="0"/>
</process>
</operator>
</process>
Best regards,
Lionel