Cluster-Analysis with wholesale customer dataset

mluethke87
mluethke87 New Altair Community Member
edited November 5 in Community Q&A

Hello everyone,

 

we are a group of marketing students taking a course called "Marketing Analytics", and our task is to run a cluster analysis with different clustering methods on the dataset found here:

 

https://archive.ics.uci.edu/ml/datasets/wholesale+customers

 

The exact description is the following:

 

"The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. Goal: Find Clusters of Customers"

 

For that, we should try out different clustering methods (our professor told us to try DBSCAN and hierarchical clustering in addition to k-means).

 

 

So far we have done the following:

Added Operator: Read CSV -> Loaded in the Data-Set

Added Operator: Select Attributes -> Filtered out the nominal attributes Channel & Region

Added Operator: K-Means

 

First off, we do not know how to find the optimal "k" to use in RapidMiner. How can we get to it, and how can we see the intra-cluster distance and the "elbow" graph in RapidMiner for this dataset? (I attached a graphic from a presentation I found.)

 

As we have more than 2 attributes (Milk, Frozen, Fresh, Delicassen, Grocery, etc.), how can we visualize the clusters? What kind of clusters can we get out of this dataset?

 

Also, how can we use the DBSCAN Clustering ? If we just connect it with the Select Attributes operator and run it, we get only one cluster...

 

Our professor also told us to use some kind of loop. Is it also necessary to filter out outliers?

 

Please help, we are struggling a lot with this task. If someone is able to explain it, he or she can also contact me privately and I would offer something for the effort.

 

Thanks a lot!!

Best Answer

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓

    Hi @mluethke87,

     

    Your project is interesting: it highlights how difficult it can be to cluster some data.

    I started with what is normally the first step of a data science methodology: visual data analysis.

    We are in a high-dimensional space, but we can always plot one attribute against another in 2D.

    For example, here is Milk vs Grocery:

    Marketing_Clustering.png

    How many clusters do you see?

    NB: We find this particular "distribution" of the data for many attribute_x vs attribute_y combinations in your dataset.

    Visually, it is difficult to answer the question "how many clusters are there?". It is subjective (every person sees it differently),

    but one defensible answer is number of clusters = 1, meaning either:

     - 1 for the whole dataset, or

     - 1 for the "larger or smaller" cluster in the bottom-left corner, the rest of the data being unclassifiable, i.e. noise.

     

    Now let's look at the clusters obtained from the k-Means model with k = 6 (as a reminder, k = 6 comes from optimizing the Davies-Bouldin index in @Thomas_Ott's process):

    Marketing_Clustering_2.png

    Secondly, here are the clusters obtained from the X-Means model recommended by @Telcontar120 (this model concludes k = 4):

    Marketing_Clustering_3.png

     

    In both cases we see that when we "force" an algorithm (k-Means or X-Means) to find clusters, these clusters have very different "densities" in your dataset.

    But when we use DBSCAN, we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for those points to be considered a cluster, so we are defining an "estimate of the density of the clusters".

    So for DBSCAN to find clusters, the clusters must have similar densities. That is why it cannot handle clusters of different densities, and in the end it always concludes, in your case, with number of clusters = 1.

     

    Second Part : RapidMiner vs Python (sorry this post is not finished yet....)

     

    First, regarding this story of number of clusters = 1, I decided to compare the results of RapidMiner's DBSCAN with the results of Python's DBSCAN (sorry @sgenzer if you read this post): in both cases, the conclusion is number of clusters = 1.

    But depending on the Epsilon / Min Points setting, Python's DBSCAN concludes that some data points are "unlabelled" (noise), while in RapidMiner all the data points are placed in the single cluster.

    I think the conclusion of Python's DBSCAN is the logical one. Indeed, as said previously, by the definition of the DBSCAN algorithm we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for those points to be considered a cluster. From my point of view, there are data points in this dataset which are isolated, so they should not belong to a cluster and should be considered noise (depending on the Epsilon / Min Points setting). For example, for epsilon = 1 / min points = 5, here are the conclusions of Python's DBSCAN:

    Marketing_Clustering_4.png

    NB: in red, the clustered data; in blue, the "unlabelled" data.
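For readers who want to see how DBSCAN ends up with "one cluster plus noise", here is a minimal, dependency-free Python sketch of the algorithm. It is only an illustration under stated assumptions: the seven toy points are hypothetical (they are not the wholesale data), noise is marked with -1 (the same convention scikit-learn's DBSCAN uses), and `min_pts` is assumed to count the point itself.

```python
import math

NOISE = -1  # label for points that belong to no cluster

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical toy data: one dense group near the origin and two isolated points.
toy_points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (0.25, 0.25), (5, 5), (9, 0)]
labels = dbscan(toy_points, eps=1.0, min_pts=5)
print(labels)  # [0, 0, 0, 0, 0, -1, -1]
```

With these settings the dense group forms the single cluster and the two isolated points stay unlabelled, which is the behaviour described above for Python's DBSCAN.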

     

    I thought I would get this behaviour by checking the remove unlabelled parameter of DBSCAN in RapidMiner, but that is not the case.

    So my question is: why does RapidMiner's DBSCAN cluster all the data regardless of the epsilon / min points setting?

     

    In conclusion, I hope I have contributed to the thinking about DBSCAN and your project.

    And now the post is actually finished (phew!)

     

    Best regards, 

     

    Lionel

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @mluethke87,

     

    Can you share your process and your dataset(s), please ?

     

    Some elements of an answer:

     

    For the optimal number of clusters "k", there is a theoretical method (though it is not guaranteed to work every time).

    You can use the K-Means model together with the Performance (Cluster Distance Performance) operator, with Davies Bouldin as the main criterion, inside the Optimize Parameters operator, and choose "k" as the parameter to optimize:

    the value of "k" that minimizes the Davies-Bouldin index is the optimal value of k (in theory, if such a value exists).

    Experts, please correct me if I'm wrong.
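To make the criterion concrete, here is a minimal, dependency-free Python sketch of the Davies-Bouldin index in its usual textbook form (mean distance to the centroid as the cluster scatter, Euclidean distance between centroids). The toy clusters are hypothetical, and RapidMiner's operator may scale or sign the value differently, so treat this as an illustration rather than a reproduction of its output.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (lower means better separated)."""
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: average distance of its points to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        # For cluster i, take its worst (largest) similarity to any other cluster.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        ))
    return sum(worst_ratios) / k

# Hypothetical toy data: the same eight points, grouped well and grouped badly.
good_split = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
]
bad_split = [
    [(0, 0), (0, 1), (10, 10), (10, 11)],
    [(1, 0), (1, 1), (11, 10), (11, 11)],
]
db_good = davies_bouldin(good_split)
db_bad = davies_bouldin(bad_split)
print(db_good, db_bad)
```

The compact, well-separated grouping gets a much smaller index than the mixed one, which is why the value of k with the smallest index is preferred.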

     

    Regards, 

     

    Lionel

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi again @mluethke87,

     

    Sorry, I read your post too fast. OK: your process is in the attached file, and there is a link to your dataset.

     

    Regards,

     

    Lionel

  • mluethke87
    mluethke87 New Altair Community Member

    Hey,

     

    I attached the process. It also includes other clustering methods connected to a Multiply operator.

    Still, right now it is all about k-Means and how to determine the correct value of "k" for this task.

     

    People share graphics showing the "elbow", but I have not seen any step-by-step explanation of how this is done in RapidMiner.

     

    You said:

    "..associated to the Performance (Cluster Distance Performance) operator - with the Davies Bouldin as

    Main  criterion - inside the Optimize Parameters operator and choose "k" as parameter to optimize.."

     

    I inserted the Optimize Parameters (Grid) operator, but it does not show any of the functions / steps that you explained :/

    Can you show this visually? I cannot get there on my own.

     

    Thank you

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @mluethke87 You can compute the Davies-Bouldin index with one of the Cluster Performance operators and use it inside the Optimize Parameters operator, like below.

     

    You should definitely review the Optimization video tutorial to get familiar with it; it's a very powerful operator. 

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="8.0.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="resource_type" value="URL"/>
    <parameter key="url" value="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
    <parameter key="csv_file" value="C:\Users\lueth\Desktop\Wholesale customers data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Channel.true.binominal.attribute"/>
    <parameter key="1" value="Region.true.polynominal.attribute"/>
    <parameter key="2" value="Fresh.true.integer.attribute"/>
    <parameter key="3" value="Milk.true.integer.attribute"/>
    <parameter key="4" value="Grocery.true.integer.attribute"/>
    <parameter key="5" value="Frozen.true.integer.attribute"/>
    <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
    <parameter key="7" value="Delicassen.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Channel|Region"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="145" name="Multiply" width="90" x="447" y="34"/>
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.0.001" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="715" y="391">
    <list key="parameters">
    <parameter key="Clustering.k" value="[2.0;100.0;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="Clustering" width="90" x="112" y="34"/>
    <operator activated="true" class="cluster_distance_performance" compatibility="8.0.001" expanded="true" height="103" name="Performance" width="90" x="313" y="34">
    <parameter key="main_criterion" value="Davies Bouldin"/>
    </operator>
    <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
    <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
    <connect from_op="Performance" from_port="performance" to_port="performance"/>
    <connect from_op="Performance" from_port="cluster model" to_port="model"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="x_means" compatibility="8.0.001" expanded="true" height="82" name="X-Means" width="90" x="715" y="136"/>
    <operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-Means" width="90" x="715" y="34">
    <parameter key="measure_types" value="NumericalMeasures"/>
    </operator>
    <operator activated="true" class="agglomerative_clustering" compatibility="8.0.001" expanded="true" height="82" name="Agglomerative Clustering" width="90" x="715" y="238">
    <parameter key="mode" value="AverageLink"/>
    <parameter key="measure_types" value="NumericalMeasures"/>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="k-Means" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Agglomerative Clustering" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 4" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 6"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="model" to_port="result 7"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="parameter set" to_port="result 8"/>
    <connect from_op="X-Means" from_port="cluster model" to_port="result 4"/>
    <connect from_op="X-Means" from_port="clustered set" to_port="result 5"/>
    <connect from_op="k-Means" from_port="cluster model" to_port="result 1"/>
    <connect from_op="k-Means" from_port="clustered set" to_port="result 2"/>
    <connect from_op="Agglomerative Clustering" from_port="cluster model" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    <portSpacing port="sink_result 8" spacing="0"/>
    <portSpacing port="sink_result 9" spacing="0"/>
    </process>
    </operator>
    </process>

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @mluethke87,

     

    To answer your question "how can we see the intra-cluster distance and the "elbow" graph in RapidMiner for this dataset? (I attached a graphic from a presentation I found)":

    I don't know exactly what an elbow graph is, and I don't think RapidMiner provides such graphs.

    Your first graph shows the within-cluster sum of squares vs k (the number of clusters). RapidMiner doesn't calculate the within-cluster sum of squares but the average within centroid distance.

    You can obtain a similar curve by plotting the average within centroid distance vs k using the Log operator. In your case, we obtain this curve:

    Optimize_k_Kmeans_2.png

    and here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="8.0.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="resource_type" value="URL"/>
    <parameter key="url" value="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
    <parameter key="csv_file" value="C:\Users\lueth\Desktop\Wholesale customers data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Channel.true.binominal.attribute"/>
    <parameter key="1" value="Region.true.polynominal.attribute"/>
    <parameter key="2" value="Fresh.true.integer.attribute"/>
    <parameter key="3" value="Milk.true.integer.attribute"/>
    <parameter key="4" value="Grocery.true.integer.attribute"/>
    <parameter key="5" value="Frozen.true.integer.attribute"/>
    <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
    <parameter key="7" value="Delicassen.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Channel|Region"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="166" name="Multiply" width="90" x="447" y="34"/>
    <operator activated="true" class="concurrency:loop_parameters" compatibility="8.0.001" expanded="true" height="103" name="Loop Parameters" width="90" x="715" y="391">
    <list key="parameters">
    <parameter key="Clustering.k" value="[2.0;10;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="85">
    <parameter key="k" value="3"/>
    </operator>
    <operator activated="true" class="cluster_distance_performance" compatibility="8.0.001" expanded="true" height="103" name="Performance" width="90" x="447" y="85"/>
    <operator activated="true" class="log" compatibility="8.0.001" expanded="true" height="82" name="Log" width="90" x="581" y="85">
    <list key="log">
    <parameter key="k" value="operator.Clustering.parameter.k"/>
    <parameter key="Average within centroid distance" value="operator.Performance.value.avg_within_distance"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
    <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
    <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Performance" from_port="cluster model" to_port="output 1"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="x_means" compatibility="8.0.001" expanded="true" height="82" name="X-Means" width="90" x="715" y="136"/>
    <operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-Means" width="90" x="715" y="34">
    <parameter key="measure_types" value="NumericalMeasures"/>
    </operator>
    <operator activated="true" class="agglomerative_clustering" compatibility="8.0.001" expanded="true" height="82" name="Agglomerative Clustering" width="90" x="715" y="238">
    <parameter key="mode" value="AverageLink"/>
    <parameter key="measure_types" value="NumericalMeasures"/>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="k-Means" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Agglomerative Clustering" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 4" to_port="result 6"/>
    <connect from_op="Multiply" from_port="output 5" to_op="Loop Parameters" to_port="input 1"/>
    <connect from_op="Loop Parameters" from_port="output 2" to_port="result 7"/>
    <connect from_op="X-Means" from_port="cluster model" to_port="result 4"/>
    <connect from_op="X-Means" from_port="clustered set" to_port="result 5"/>
    <connect from_op="k-Means" from_port="cluster model" to_port="result 1"/>
    <connect from_op="k-Means" from_port="clustered set" to_port="result 2"/>
    <connect from_op="Agglomerative Clustering" from_port="cluster model" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    <portSpacing port="sink_result 8" spacing="0"/>
    </process>
    </operator>
    </process>
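
The idea behind the loop above (run k-Means for several values of k and log the average within centroid distance) can also be sketched in a few lines of dependency-free Python. This is only an illustration under assumptions: the twelve toy points are hypothetical, the k-means uses a simple deterministic farthest-first initialization, and the metric is the plain mean Euclidean distance to the nearest centroid, which may not match RapidMiner's exact definition.

```python
import math

def farthest_first(points, k):
    """Deterministic initial centers: start at points[0], then repeatedly add
    the point that is farthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns the final list of centroids."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:  # assignments are stable
            break
        centers = updated
    return centers

def avg_within_centroid_distance(points, centers):
    """Mean distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)

# Hypothetical toy data: three small, well-separated groups of four points each.
corners = [(0, 0), (10, 0), (0, 10)]
toy_points = [(bx + dx, by + dy) for bx, by in corners
              for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]

curve = {k: avg_within_centroid_distance(toy_points, kmeans(toy_points, k))
         for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 3))
```

On this toy data the distance drops sharply up to k = 3 and only marginally afterwards; that bend is the "elbow" people refer to.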

    Regards,

     

    Lionel

     

     

  • mluethke87
    mluethke87 New Altair Community Member

    Hey,

     

    thanks a lot guys for your help already.

     

    So in the graph you showed, it would make sense to use k=3, because the avg. centroid distance in relation to the number of clusters would be "optimal": if you kept adding clusters, the avg. centroid distance wouldn't change as much anymore, correct?

     

    Still, within the subprocess you also put in a k-Means operator that is preconfigured to k=3. In this case, does that number only affect the number of runs the loop makes, or does it affect anything at all?

     

    Also, if I delete the Multiply operator and just connect Loop Parameters to Select Attributes, the graph of the avg. centroid distance created in this subprocess is different. Why is it affected by this?

     

    See screenshots attached & XML Code

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="8.0.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="resource_type" value="URL"/>
    <parameter key="url" value="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
    <parameter key="csv_file" value="C:\Users\lueth\Desktop\Wholesale customers data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Channel.true.binominal.attribute"/>
    <parameter key="1" value="Region.true.polynominal.attribute"/>
    <parameter key="2" value="Fresh.true.integer.attribute"/>
    <parameter key="3" value="Milk.true.integer.attribute"/>
    <parameter key="4" value="Grocery.true.integer.attribute"/>
    <parameter key="5" value="Frozen.true.integer.attribute"/>
    <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
    <parameter key="7" value="Delicassen.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Channel|Region"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="concurrency:loop_parameters" compatibility="8.0.001" expanded="true" height="82" name="Loop Parameters" width="90" x="447" y="136">
    <list key="parameters">
    <parameter key="k-means.k" value="[2.0;100.0;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-means" width="90" x="179" y="85"/>
    <operator activated="true" class="cluster_distance_performance" compatibility="8.0.001" expanded="true" height="103" name="Cluster Distance Perf." width="90" x="380" y="85"/>
    <operator activated="true" class="log" compatibility="8.0.001" expanded="true" height="103" name="Log" width="90" x="514" y="34">
    <list key="log">
    <parameter key="k" value="operator.k-means.parameter.k"/>
    <parameter key="Average within centroid distance" value="operator.Cluster Distance Perf\..value.avg_within_distance"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="k-means" to_port="example set"/>
    <connect from_op="k-means" from_port="cluster model" to_op="Cluster Distance Perf." to_port="cluster model"/>
    <connect from_op="k-means" from_port="clustered set" to_op="Cluster Distance Perf." to_port="example set"/>
    <connect from_op="Cluster Distance Perf." from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Cluster Distance Perf." from_port="cluster model" to_op="Log" to_port="through 2"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <connect from_op="Log" from_port="through 2" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Loop Parameters" to_port="input 1"/>
    <connect from_op="Loop Parameters" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

    Thank you!

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @mluethke87,

     

    Just one remark about the optimal "k" value:

     - according to the graph, the optimal value of "k" seems to be 3, for the reason you give.

     - according to the Davies-Bouldin index, the optimal value of "k" is 6 when running the process provided by @Thomas_Ott.

     

    Maybe you can consider the two cases in parallel in your project.

     

    Regards, 

     

    Lionel

     

  • mluethke87
    mluethke87 New Altair Community Member

    okay thank you!

     

    Still, can you guys please tell me why DBSCAN spits out only 1 cluster (all the data ends up in one cluster)? Why did our professor even mention this algorithm if it does not fit our dataset?

     

    Also, is there a way to show the correlation values when I compare pairs such as Milk - Grocery, etc.? That way I could see whether these categories are correlated at all.

     

    Thank you!

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @mluethke87,

     

    1. Maybe your professor doesn't want to give you the "right answer", but wants you to experiment with the DBSCAN model yourself to see its behaviour and its advantages and disadvantages.

    So I suggest you try different combinations of DBSCAN's parameters (epsilon / min points) to determine, in each case, how many clusters it finds.

     

    2. To determine the correlations between your attributes you can:

     - Plot the attributes against each other to see visually whether they are correlated.

     - Use the Correlation Matrix operator to see whether there is a linear correlation between two of your attributes. This matrix gives a number in the range [-1, 1] for each pair of attributes, where 0 = no linear correlation, 1 = perfect positive correlation and -1 = perfect negative correlation.
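
As a small illustration of what such a matrix contains, here is a dependency-free Python sketch of Pearson's correlation coefficient for one pair of attributes. The three short lists are hypothetical values chosen to give clean results; they are not taken from the wholesale dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical spending values for five customers (not real data).
milk = [1, 2, 3, 4, 5]
grocery = [2, 4, 6, 8, 10]   # rises exactly with milk
frozen = [5, 4, 3, 2, 1]     # falls exactly as milk rises

r_up = pearson(milk, grocery)
r_down = pearson(milk, frozen)
print(round(r_up, 6), round(r_down, 6))
```

Computing this value for every pair of attributes gives the full matrix; values close to 0 indicate that two spending categories have little linear relationship.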

     

     

    Regards,

     

    Lionel 

     

  • Telcontar120
    Telcontar120 New Altair Community Member

    You can also check your k-means work by using the X-means operator, which recommends/selects an optimal value for k based on the BIC (similar but not identical to the DBI method you are using manually above).

     

  • mluethke87
    mluethke87 New Altair Community Member

    Hey, 

     

    thanks again for the reply.

     

    The problem is that I only get 1 cluster, so DBSCAN does not seem to work. I looked everywhere but could not find a proper explanation of why that is. I did play around with epsilon and min points, of course.

     

    Can you tell me why this dataset does not get clustered by DBSCAN? It is frustrating, and I am not even sure whether it can work at all :/

     

     

    XML File attached

     

    Thanks!


  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @mluethke87,

     

    Here is one possible explanation:

     

    "DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters." (excerpt from the DBSCAN article on Wikipedia):

     

    https://en.wikipedia.org/wiki/DBSCAN

     

    Maybe this is the case for your dataset, and that is why DBSCAN has trouble "isolating" some clusters.

     

    Regards, 

     

    Lionel

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓

    Hi @mluethke87,

     

    Your project is interesting : it highlights the difficulty of clustering some data.

    I investigated by beginning, what is normally a first step of the data science methodology : the Visual Data Analysis.

    We are in high dimensionnal space, but we can always represent an attribute x vs an attribute y in 2D.

    For example here Milk vs Grocery : 

    Marketing_Clustering.pngHow many cluster(s) did you see ?

     NB : We find this particular "distribution" of data with lot of combinaisons attribute_x vs attribute_y in your dataset.

    Visually, it's difficult to answer to the question "how many clusters are there ?" . It's subjectiv - every human is different - 

    but we can respond number of cluster = 1 with : 

     - 1 for the whole dataset or

     - 1 for the "the bigger or smaller cluster" in the corner at the bottom left, the rest of the data being unclassifiable, it's noise.

     

    Now we can see, what are the clusters got from the KMeans model with k = 6 (for recall k= 6 is given by the optimization of the Davies Bouldin in the process of @Thomas_Ott) : 

    Marketing_Clustering_2.png

    Secondly, we can see, what are the clusters got from the XMeans model recommanded by @Telcontar120 (model which conclude k = 4):

    Marketing_Clustering_3.png

     

    In both cases, we see that when we "force" an algorithm (k-Means or X-Means) to find clusters, these clusters have very different "densities" in the case of your dataset.

    But when we use DBSCAN, we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for these points to be considered a cluster, so in effect we define a single "estimate of the density of the clusters". 

    So for the DBSCAN algorithm to find clusters, the clusters must have similar densities. That is why it cannot handle clusters of different densities, and in the end it always concludes, in your case, with number of clusters = 1.
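
    The density argument above can be sketched with a small, self-contained DBSCAN written in plain Python. This is a simplified illustration on made-up points (two dense groups close together and one sparse group far away), not the RapidMiner or scikit-learn implementation; `dbscan`, `dense_a`, `dense_b` and `sparse` are hypothetical names.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point, NOISE (-1) for unclustered points."""
    # Neighbourhoods include the point itself.
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points
    ]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Two dense 3x3 grids (spacing 0.1) separated by a gap of 0.4 ...
dense_a = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
dense_b = [(0.6 + x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
# ... and one sparse 3x3 grid (spacing 1.0) far away.
sparse = [(10.0 + x, 10.0 + y) for x in range(3) for y in range(3)]
points = dense_a + dense_b + sparse

labels_small = dbscan(points, eps=0.15, min_pts=4)
labels_large = dbscan(points, eps=1.5, min_pts=4)
print(labels_small)
print(labels_large)
```

    With the small epsilon, the two dense groups are found but the whole sparse group is labelled as noise. With the large epsilon, the sparse group becomes a cluster but the two dense groups are merged into one. In this toy example no single epsilon recovers all three groups, which is the limitation described in the Wikipedia extract.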

     

    Second part: RapidMiner vs Python (sorry, this post is not finished yet...)

     

    First, given this recurring result of number of clusters = 1, I decided to compare the results of RapidMiner's DBSCAN with the

    results of Python's DBSCAN (sorry @sgenzer if you read this post): in both cases, the conclusion is number of clusters = 1.

    But depending on the Epsilon / Min Points setting, Python's DBSCAN concludes that some data points are "unlabelled" (noise), while in RapidMiner all the data end up in the single cluster. 

    I think the conclusion of Python's DBSCAN is logical. Indeed, as said previously, by the definition of the DBSCAN algorithm we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for these points to be considered a cluster. From my point of view, some data points in this dataset are isolated, so they should not belong to a cluster and should be considered noise (depending on the Epsilon / Min Points setting). For example, for epsilon = 1 / min points = 5, here are the conclusions of Python's DBSCAN: 

    Marketing_Clustering_4.png

    NB: in red, the clustered data; in blue, the "unlabelled" data
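
    For reference, scikit-learn's DBSCAN marks noise points with the label -1, so the number of clusters and the number of "unlabelled" points can be read from the labels. A minimal sketch, where `summarize_labels` is a hypothetical helper and the label list is made up:

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Count clusters and noise points in a list of DBSCAN-style labels."""
    counts = Counter(labels)
    n_noise = counts.pop(noise_label, 0)  # noise is not counted as a cluster
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}


# Made-up labels: one cluster (label 0) and two noise points (label -1).
summary = summarize_labels([0, 0, -1, 0, -1])
print(summary)
```

    With real results, the list passed in would be the `labels_` attribute of the fitted scikit-learn model.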

     

    I expected to get this behaviour by checking the remove unlabelled parameter of DBSCAN in RapidMiner, but it is not the case.

    So my question is: why does RapidMiner's DBSCAN cluster all the data regardless of the epsilon / min points setting?
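
    One thing that may be worth checking when comparing the two tools is the scale of the data: the Python script in the process posted below standardizes the attributes with StandardScaler before running DBSCAN, whereas the RapidMiner DBSCAN operator in that process receives the raw values. Whether this explains the difference is only a hypothesis. The sketch below uses made-up spending values (not the real dataset) and hypothetical helper names to show how strongly scaling changes the number of neighbours within epsilon = 1.

```python
from math import dist
from statistics import fmean, pstdev


def standardize(rows):
    """Z-score each column (mean 0, population standard deviation 1)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def neighbour_counts(rows, eps):
    """Number of points within eps of each point, the point itself included."""
    return [sum(dist(p, q) <= eps for q in rows) for p in rows]


# Made-up annual spending in monetary units: four similar customers, one outlier.
spending = [
    (12000, 9000),
    (12500, 9500),
    (11800, 9100),
    (12200, 8800),
    (40000, 30000),
]

raw = neighbour_counts(spending, eps=1)
scaled = neighbour_counts(standardize(spending), eps=1)
print(raw)
print(scaled)
```

    On the raw values every point is alone within a radius of 1 monetary unit, while after standardization the four similar customers are neighbours of each other and only the outlier stays isolated. The same epsilon therefore means something very different with and without normalization.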

     

    In conclusion, I hope I have contributed to the thinking about DBSCAN and your project.

    And now the post is actually finished (phew...!)

     

    Best regards, 

     

    Lionel

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi again,

     

    To complete my previous post, here is the associated process: 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Downloads\Wholesale customers data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Channel.true.integer.attribute"/>
    <parameter key="1" value="Region.true.integer.attribute"/>
    <parameter key="2" value="Fresh.true.integer.attribute"/>
    <parameter key="3" value="Milk.true.integer.attribute"/>
    <parameter key="4" value="Grocery.true.integer.attribute"/>
    <parameter key="5" value="Frozen.true.integer.attribute"/>
    <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
    <parameter key="7" value="Delicassen.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Channel|Region"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="380" y="34"/>
    <operator activated="true" class="x_means" compatibility="8.0.001" expanded="true" height="82" name="X-Means" width="90" x="581" y="238"/>
    <operator activated="true" class="dbscan" compatibility="8.0.001" expanded="true" height="82" name="DBSCAN" width="90" x="581" y="136">
    <parameter key="remove_unlabeled" value="true"/>
    <parameter key="measure_types" value="NumericalMeasures"/>
    </operator>
    <operator activated="true" class="k_means" compatibility="8.0.001" expanded="true" height="82" name="k-Means" width="90" x="581" y="34">
    <parameter key="k" value="6"/>
    <parameter key="measure_types" value="NumericalMeasures"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV (2)" width="90" x="112" y="493">
    <parameter key="csv_file" value="C:\Users\Lionel\Downloads\Wholesale customers data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Channel.true.integer.attribute"/>
    <parameter key="1" value="Region.true.integer.attribute"/>
    <parameter key="2" value="Fresh.true.integer.attribute"/>
    <parameter key="3" value="Milk.true.integer.attribute"/>
    <parameter key="4" value="Grocery.true.integer.attribute"/>
    <parameter key="5" value="Frozen.true.integer.attribute"/>
    <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
    <parameter key="7" value="Delicassen.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="DBSCAN_Python" width="90" x="313" y="493">
    <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;&#10;from sklearn.cluster import DBSCAN&#10;from sklearn.preprocessing import StandardScaler&#10;&#10;from collections import Counter&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; path= 'C:/Users/Lionel/Downloads'&#10; file = 'Wholesale customers data.csv'&#10;&#10; epsilon = 1&#10; min_points = 5&#10; &#10; &#10; data = pd.read_csv(path + '/' + file)&#10; # drop Channel and Region, then standardize the spending attributes&#10; X = data.iloc[:,2:]&#10; Y = StandardScaler().fit_transform(X)&#10; &#10; db = DBSCAN(eps=epsilon, min_samples=min_points).fit(Y)&#10; &#10; # label -1 marks the noise points&#10; labels = db.labels_&#10; label = pd.DataFrame(data = labels,columns = ['cluster'])&#10; label = label.join(X)&#10;&#10; counter = Counter(label.cluster)&#10; count = pd.DataFrame.from_dict(counter, orient='index').reset_index()&#10; count = count.rename(columns={'index':'cluster', 0:'number of elements'})&#10; # name the noise label without assuming which labels DBSCAN returned&#10; count['cluster'] = count['cluster'].map(lambda c: '-1 (unlabelled)' if c == -1 else c)&#10; &#10;&#10; # connect 2 output ports to see the results&#10; return label,count"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="k-Means" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="DBSCAN" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 3" to_op="X-Means" to_port="example set"/>
    <connect from_op="X-Means" from_port="cluster model" to_port="result 8"/>
    <connect from_op="X-Means" from_port="clustered set" to_port="result 7"/>
    <connect from_op="DBSCAN" from_port="cluster model" to_port="result 3"/>
    <connect from_op="DBSCAN" from_port="clustered set" to_port="result 4"/>
    <connect from_op="k-Means" from_port="cluster model" to_port="result 1"/>
    <connect from_op="k-Means" from_port="clustered set" to_port="result 2"/>
    <connect from_op="Read CSV (2)" from_port="output" to_op="DBSCAN_Python" to_port="input 1"/>
    <connect from_op="DBSCAN_Python" from_port="output 1" to_port="result 5"/>
    <connect from_op="DBSCAN_Python" from_port="output 2" to_port="result 6"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    <portSpacing port="sink_result 8" spacing="0"/>
    <portSpacing port="sink_result 9" spacing="0"/>
    </process>
    </operator>
    </process>

    Best regards, 

     

    Lionel