Mini Batch K-means in RapidMiner
Hi
I have a huge dataset (4,000,000 records) of text data that I want to cluster.
Because of memory problems and the time complexity of text pre-processing, I want to read small batches from the database and, after pre-processing, use mini-batch K-Means to cluster the data. But I wonder how to use mini-batch clustering in RapidMiner.
Thanks in advance for your answers.
Answers
Hi,
there are different Loop operators in RapidMiner.
You can easily implement this batching behaviour by using a loop with a numeric counter and selecting data from your database with LIMIT n OFFSET (i - 1) * n.
n would be your preferred batch size, and i the current iteration number, starting at 1. Usually you need to calculate the offset yourself outside of the statement, e.g. with Generate Macro. Not all databases support the LIMIT ... OFFSET syntax, but most offer the same functionality under a different name.
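As an illustration, the offset arithmetic looks like this (a small Python sketch with a hypothetical table name, not something you need inside RapidMiner itself):

# Batched read with LIMIT n OFFSET (i - 1) * n; "documents" is a made-up table name.
def batch_query(i, n=10000):
    offset = (i - 1) * n   # this is what Generate Macro would calculate for you
    return f"SELECT * FROM documents LIMIT {n} OFFSET {offset}"

print(batch_query(3))      # skips the first 20000 rows and reads the next 10000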
Regards,
Balázs
Hi, thanks for your answer.
The mini-batch K-Means algorithm takes a small batch of the dataset in each iteration. It assigns each data point in the batch to a cluster based on the previous positions of the cluster centroids, then updates the centroid positions using the new points from the batch.
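Roughly, one iteration does something like this (a minimal NumPy sketch of the idea, not a RapidMiner process):

import numpy as np

def minibatch_step(centroids, counts, batch):
    # Assign every point in the batch to the nearest existing centroid ...
    distances = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    # ... then nudge each centroid towards its assigned points with a
    # per-centroid learning rate that shrinks as the centroid sees more data.
    for j, x in zip(assignment, batch):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids, counts

The centroids and counts have to be carried over from one batch to the next.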
How could I build a process like this?
The Loop operator creates new clusters for the current batch in each iteration and does not assign the new points to the previous clusters.
Hi,
for this algorithm you'd need an operator that remembers the cluster centroids from the previous clustering and a clustering operator that can take these as its input. Extract Cluster Prototypes does something like this for the first step, but I don't know a way to push these centroids into a new clustering.
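For comparison, outside of RapidMiner this "remember the centroids and continue from them" behaviour is what, for example, scikit-learn's MiniBatchKMeans offers via partial_fit (a sketch with random stand-in batches):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
batches = (rng.normal(size=(100, 5)) for _ in range(10))  # stand-ins for your real batches

km = MiniBatchKMeans(n_clusters=3, random_state=0)
for batch in batches:
    km.partial_fit(batch)   # keeps the centroids and refines them with each new batch
print(km.cluster_centers_)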
Regards,
Balázs
I was actually working on a cluster model that I wanted to update with new data; rather than re-running the whole thing, I planned to use the centroids to update it. (Limited resources on a Hadoop cluster mean I can only cluster 1,000,000 records at a time.)
This is what I considered, and it sounds similar to mini-batch. I'm about to test it, so maybe you guys could have a look?
The idea was to weight the centroids produced by Extract Cluster Prototypes by simply duplicating them. My thinking was that this would bias the next clustering towards those centroid values without necessarily forcing it to accept them as final.
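In other words: cluster the first batch, extract the centroids, duplicate them a number of times, add a little noise, and append them to the next batch before clustering again. Roughly this, as a NumPy sketch (the names are made up; the actual RapidMiner process is below):

import numpy as np

def weighted_recluster_input(next_batch, prev_centroids, copies=100, noise=0.01):
    # Duplicate the previous centroids `copies` times and jitter them slightly,
    # so the next k-means run is pulled towards the old centroid positions
    # without being forced to keep them exactly.
    rng = np.random.default_rng(0)
    duplicated = np.repeat(prev_centroids, copies, axis=0)
    duplicated = duplicated + rng.normal(scale=noise, size=duplicated.shape)
    return np.vstack([next_batch, duplicated])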
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="9.0.002" expanded="true" height="68" name="Generate Data" width="90" x="45" y="289">
<parameter key="target_function" value="interaction classification"/>
<parameter key="number_examples" value="1000"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.0.002" expanded="true" height="82" name="Select Attributes (2)" width="90" x="112" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="split_data" compatibility="9.0.002" expanded="true" height="103" name="Split Data" width="90" x="246" y="136">
<enumeration key="partitions">
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
<parameter key="ratio" value="0.1"/>
</enumeration>
</operator>
<operator activated="true" class="subprocess" compatibility="9.0.002" expanded="true" height="82" name="First batch" width="90" x="380" y="34">
<process expanded="true">
<operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
<parameter key="k" value="3"/>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<operator activated="true" class="extract_prototypes" compatibility="9.0.002" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="179" y="34"/>
<operator activated="true" class="concurrency:loop" compatibility="9.0.002" expanded="true" height="82" name="Loop" width="90" x="313" y="34">
<parameter key="number_of_iterations" value="100"/>
<process expanded="true">
<connect from_port="input 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Stupid way to 'weight the centroids'</description>
</operator>
<operator activated="true" class="append" compatibility="9.0.002" expanded="true" height="82" name="Append" width="90" x="447" y="34"/>
<operator activated="true" class="select_attributes" compatibility="9.0.002" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="cluster"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="add_noise" compatibility="9.0.002" expanded="true" height="103" name="Add Noise" width="90" x="246" y="187">
<list key="noise"/>
<description align="center" color="transparent" colored="false" width="126">Add a bit of noise... not sure why, but it feels good.</description>
</operator>
<connect from_port="in 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Loop" to_port="input 1"/>
<connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Add Noise" to_port="example set input"/>
<connect from_op="Add Noise" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">100 records as a batch</description>
</operator>
<operator activated="true" class="append" compatibility="9.0.002" expanded="true" height="103" name="Append (2)" width="90" x="514" y="289"/>
<operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering (2)" origin="GENERATED_TUTORIAL" width="90" x="648" y="289">
<parameter key="k" value="3"/>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="First batch" to_port="in 1"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Append (2)" to_port="example set 2"/>
<connect from_op="First batch" from_port="out 1" to_op="Append (2)" to_port="example set 1"/>
<connect from_op="Append (2)" from_port="merged set" to_op="Clustering (2)" to_port="example set"/>
<connect from_op="Clustering (2)" from_port="cluster model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>