Let N be the number of items, K the number of clusters, and S = ceil(N/K) the maximum cluster size.
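1. Create a list of tuples (item_id, cluster_id, distance), one for every pair of item and cluster centroid.
2. Sort the tuples by distance, smallest first.
3. For each tuple (item_id, cluster_id, distance) in the sorted list: if cluster_id already holds S items, skip the tuple; otherwise, if item_id is still unassigned, assign it to cluster_id.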
```python
# items, centroids and dist(a, b) are expected to be defined by the caller.
N = len(items)
K = len(centroids)
S = -(-N // K)  # ceil(N / K), the maximum cluster size

dists = []            # (item_id, cluster_id, distance) tuples
clusts = [None] * N   # cluster assigned to each item
counts = [0] * K      # number of items per cluster

for i, v in enumerate(items):
    for k, c in enumerate(centroids):
        dists.append((i, k, dist(c, v)))

dists.sort(key=lambda t: t[2])  # closest pairs first

for item_id, cluster_id, d in dists:
    if counts[cluster_id] >= S:
        continue  # cluster is full
    if clusts[item_id] is None:
        clusts[item_id] = cluster_id
        counts[cluster_id] += 1
```
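To try the greedy assignment above without RapidMiner, here is a minimal self-contained sketch. The function name `assign_capped`, the toy points and the fixed centroids are my own assumptions for illustration, and Euclidean distance via `math.dist` (Python 3.8+) stands in for whatever distance you actually use.

```python
import math

def assign_capped(items, centroids):
    """Greedy assignment: each item goes to its nearest centroid that still has room."""
    n, k = len(items), len(centroids)
    cap = math.ceil(n / k)  # S in the description above
    # All (item_id, cluster_id, distance) tuples, closest pairs first.
    triples = sorted(
        ((i, c, math.dist(item, centre))
         for i, item in enumerate(items)
         for c, centre in enumerate(centroids)),
        key=lambda t: t[2],
    )
    labels = [None] * n
    counts = [0] * k
    for i, c, _ in triples:
        if labels[i] is None and counts[c] < cap:
            labels[i] = c
            counts[c] += 1
    return labels

# Hypothetical toy data: four points near the first centroid, one on the second.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (5.0, 5.0)]
centres = [(0.0, 0.0), (5.0, 5.0)]
labels = assign_capped(points, centres)
print(labels)  # [0, 0, 0, 1, 1]
```

With N = 5 and K = 2 the cap is S = 3, so the fourth point near the first centroid is pushed to the second cluster. Since K * S >= N, every item ends up assigned, but note that S only limits the largest cluster; the sizes are not forced to be exactly equal.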
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="read_excel" compatibility="9.2.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85"> <parameter key="excel_file" value="C:\Users\Lionel\Downloads\k-means.xlsx"/> <parameter key="sheet_selection" value="sheet number"/> <parameter key="sheet_number" value="1"/> <parameter key="imported_cell_range" value="A1"/> <parameter key="encoding" value="SYSTEM"/> <parameter key="first_row_as_names" value="true"/> <list key="annotations"/> <parameter key="date_format" value=""/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="read_all_values_as_polynominal" value="false"/> <list key="data_set_meta_data_information"> <parameter key="0" value="ürün ID.true.integer.attribute"/> <parameter key="1" value="hacim.true.integer.attribute"/> <parameter key="2" value="ağırlık.true.integer.attribute"/> <parameter key="3" value="satış miktar.true.integer.attribute"/> <parameter key="4" value="kırılganlık.true.polynominal.attribute"/> <parameter key="5" value="F.true.polynominal.attribute"/> </list> <parameter key="read_not_matching_values_as_missings" value="false"/> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> </operator> <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="85"> <parameter key="attribute_name" 
value="ürün ID"/> <parameter key="target_role" value="id"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="313" y="85"> <parameter key="return_preprocessing_model" value="false"/> <parameter key="create_view" value="false"/> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="kırılganlık"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="coding_type" value="dummy coding"/> <parameter key="use_comparison_groups" value="false"/> <list key="comparison_groups"/> <parameter key="unexpected_value_handling" value="all 0 and warning"/> <parameter key="use_underscore_in_name" value="false"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="85"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="F"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter 
key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="set_macros" compatibility="9.2.001" expanded="true" height="82" name="Set Macros" width="90" x="715" y="85"> <list key="macros"> <parameter key="cluster_number" value="3"/> </list> </operator> <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="103" name="Execute Python" width="90" x="849" y="85"> <parameter key="script" value="import pandas as pd from operator import itemgetter import numpy as np import random import sys from scipy.spatial import distance from sklearn.cluster import KMeans # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) C = %{cluster_number} def k_means(X) : kmeans = KMeans(n_clusters=C, random_state=0).fit(X) return kmeans.cluster_centers_ def samesizecluster( D ): """ in: point-to-cluster-centre distances D, Npt x C out: xtoc, X -> C, equal-size clusters """ Npt, C = D.shape clustersize = (Npt + C - 1) // C xcd = list( np.ndenumerate(D) ) # ((0,0), d00), ((0,1), d01) ... 
xcd.sort( key=itemgetter(1) ) xtoc = np.ones( Npt, int ) * -1 nincluster = np.zeros( C, int ) nall = 0 for (x,c), d in xcd: if xtoc[x] < 0 and nincluster[c] < clustersize: xtoc[x] = c nincluster[c] += 1 nall += 1 if nall >= Npt: break return xtoc def rm_main(data): data_2 = data.values #centres = random.sample(list(data_2), C ) centres = k_means(data_2) D = distance.cdist( data_2, centres ) xtoc = samesizecluster( D ) data['cluster'] = xtoc # connect 2 output ports to see the results return data"/> <parameter key="use_default_python" value="true"/> <parameter key="package_manager" value="conda (anaconda)"/> </operator> <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role (2)" width="90" x="983" y="85"> <parameter key="attribute_name" value="cluster"/> <parameter key="target_role" value="cluster"/> <list key="set_additional_roles"/> </operator> <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/> <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Set Macros" to_port="through 1"/> <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/> <connect from_op="Execute Python" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/> <connect from_op="Set Role (2)" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="85"> <parameter key="repository_entry" value="//Samples/data/Iris"/> </operator> <operator activated="true" class="dbscan" compatibility="9.2.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="85"> <parameter key="epsilon" value="0.8"/> <parameter key="min_points" value="40"/> <parameter key="add_cluster_attribute" value="true"/> <parameter key="add_as_label" value="false"/> <parameter key="remove_unlabeled" value="false"/> <parameter key="measure_types" value="MixedMeasures"/> <parameter key="mixed_measure" value="MixedEuclideanDistance"/> <parameter key="nominal_measure" value="NominalDistance"/> <parameter key="numerical_measure" value="EuclideanDistance"/> <parameter key="divergence" value="GeneralizedIDivergence"/> <parameter key="kernel_type" value="radial"/> <parameter key="kernel_gamma" value="1.0"/> <parameter key="kernel_sigma1" value="1.0"/> <parameter key="kernel_sigma2" value="0.0"/> <parameter key="kernel_sigma3" value="2.0"/> <parameter key="kernel_degree" value="3.0"/> <parameter key="kernel_shift" value="1.0"/> <parameter key="kernel_a" value="1.0"/> <parameter key="kernel_b" value="0.0"/> </operator> <connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/> <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/> 
<portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Note: This solution requires the "XML" panel, which you can open from the "View" menu under "Show Panel". Activate the XML panel if you have not done so already.
Open your process in RapidMiner and open the XML panel. If you can't find it, make sure to follow the note above.
Copy the XML code from there and paste it somewhere else, for example into a forum post here on the community portal. By the way, if you post your XML here, please use the code environment which you get by clicking on the </> icon in the toolbar of the post.
To import such an XML description of a process, e.g. to use a process someone else has posted here in the forum, follow these steps:
Don't forget step 3 above - you need to accept the changed XML code first before you will see any changes in the process!
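If a pasted process refuses to import, it can help to check outside RapidMiner that the XML survived the copy and paste. The following is only a rough sketch using Python's standard library: the helper name `list_operators` and the shortened XML string are made up for illustration and are not a complete process. `ET.fromstring` raises `xml.etree.ElementTree.ParseError` if the text is not well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened process XML for illustration only.
PROCESS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.001">
  <operator activated="true" class="process" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" name="Retrieve Iris"/>
      <operator activated="true" class="dbscan" name="Clustering"/>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
    </process>
  </operator>
</process>"""

def list_operators(xml_text):
    """Parse the process XML and return (name, class) for every operator, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))  # raises ParseError on broken XML
    return [(op.get("name"), op.get("class")) for op in root.iter("operator")]

operators = list_operators(PROCESS_XML)
for name, op_class in operators:
    print(name, "->", op_class)
```

If this parses and lists the operators you expect, the XML itself is intact and the problem is more likely in the import steps.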
Regards,
Lionel