Text Clustering using RapidMiner

singing_bird_1

Hi all,

I am new to RapidMiner.

I have documents and I want to cluster them using the k-medoids algorithm with cosine distance.

I watched many videos, read tutorials, and tried a lot, but it gives me wrong results (I compared the results with the results of another program).

So please write out the full steps to load, cluster, and evaluate the documents.

Note: the documents are stored in a CSV file such that each document is in a single cell, one per row, for a total of 396 rows (documents).

Please help me.

Telcontar120

What do you mean by "it gives you wrong results"? Can you be more specific? Also if you can attach your RapidMiner process xml it would be easier to troubleshoot.

Thanks,

singing_bird_1

Thank you so much for your help.

By "wrong results" I mean the distribution of the documents among the clusters:

cluster0:22 items

cluster1:31 items

cluster2:343 items

I attached my process.

thanks

123.rmp

singing_bird_1

Untitled.png

Telcontar120

Clustering using k-means (or any of its variations) is not designed to divide the records evenly into clusters, but rather to minimize distance within clusters while maximizing distance between clusters. So if your only reason for thinking the clustering didn't work is that the documents are distributed very unevenly across clusters, I don't think that is a valid inference.
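To illustrate the point about uneven cluster sizes, here is a minimal sketch in Python (not RapidMiner) of a plain k-means loop on made-up one-dimensional data. The function name `kmeans_1d`, the points, and the starting centroids are all hypothetical and chosen only for illustration. The algorithm simply follows the natural groups in the data, so the cluster sizes come out lumpy.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain Lloyd-style k-means on 1-D points with fixed starting centroids."""
    centroids = list(centroids)  # work on a copy so the caller's list is untouched
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignments


# Made-up data: two points near 0, seven near 10, one near 20.
points = [0.0, 0.1, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 20.0]
labels = kmeans_1d(points, centroids=[0.0, 10.0, 20.0])
sizes = [labels.count(c) for c in range(3)]
print(sizes)  # [2, 7, 1]
```

Nothing in the objective rewards equal sizes, so a 2/7/1 split here is the expected outcome rather than a sign of failure.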

I looked at your process, but since I don't have access to the data, I was not able to run it to validate the results. There did not appear to be any process errors, but a couple of things are unusual. For example, why are you running "Data to Similarity" after text processing and then running the clustering on that output? "Data to Similarity" is going to generate a record for every pairwise comparison among your original data elements, so you end up with many more records than you started with. More conventionally, you would run the clustering directly on the output of the text processing. I was also not able to interpret your performance operator. Is it a custom extension you coded or purchased in the marketplace? If not, which extension is it from? My installation of RapidMiner did not recognize it.
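To give a sense of how quickly pairwise comparisons grow, here is a small Python sketch (not RapidMiner). It assumes each unordered pair of documents is counted once and self-comparisons are excluded; how "Data to Similarity" counts pairs internally is not stated in this thread, so treat the exact number as an assumption. The helper name `pairwise_record_count` is hypothetical.

```python
from itertools import combinations


def pairwise_record_count(n_docs):
    """Number of unordered pairs among n_docs items (assumption: no self-pairs)."""
    return n_docs * (n_docs - 1) // 2


# Tiny check by explicit enumeration on four placeholder documents.
docs = ["d1", "d2", "d3", "d4"]
pairs = list(combinations(docs, 2))
print(len(pairs))  # 6

# The data set discussed here has 396 documents.
print(pairwise_record_count(396))  # 78210
```

Under that assumption, 396 documents turn into tens of thousands of pair records, which is a very different input from the 396 document vectors a clustering operator normally expects.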

singing_bird_1

Thank you so much for your reply.

I don't know how to attach my process.

I am attaching the dataset that I am using.

The performance operator is from an extension that I added to RapidMiner (it computes the silhouette coefficient).

I used Data to Similarity to convert or represent the docs as binary vectors.

Can you please tell me how to attach the process, so that you can see exactly what the problem is?

All_clusters_RM.csv

Telcontar120

With your data file I created a modified version of your process. This version runs without errors. I substituted k-means clustering for k-medoids since it is much faster. I also changed your word vector to term frequency (you had it set to binary term occurrences) and changed your distance metric to cosine similarity. I deactivated the Data to Similarity operator since it was not needed.
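To show why the choice between binary term occurrences and term frequency matters under cosine similarity, here is a minimal Python sketch (not RapidMiner). The two sample documents are made up, the helper names are hypothetical, and term frequency is computed here as count divided by document length, which is an assumption; the exact normalization RapidMiner applies is not shown in this thread.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple whitespace tokenizer (illustration only)."""
    return text.lower().split()


def term_frequency(tokens):
    """Relative frequency of each term in one document (assumed normalization)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def binary_occurrence(tokens):
    """1.0 for every term that appears at least once."""
    return {term: 1.0 for term in set(tokens)}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up documents with the same vocabulary but different emphasis.
tokens_a = tokenize("data data data mining")
tokens_b = tokenize("data mining mining mining")

sim_binary = cosine_similarity(binary_occurrence(tokens_a), binary_occurrence(tokens_b))
sim_tf = cosine_similarity(term_frequency(tokens_a), term_frequency(tokens_b))

print(round(sim_binary, 3), round(sim_tf, 3))
```

With binary vectors the two documents look identical because they share the same words, while term frequency lets the clustering see that they emphasize different terms.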

What extension is the performance operator from? I could not find it, so I left it off. But the new clusters are more evenly distributed, if you are concerned about that.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.000" expanded="true" name="Process">
<parameter key="random_seed" value="2001"/>
<process expanded="true">
<operator activated="false" class="data_to_similarity" compatibility="7.6.000" expanded="true" height="82" name="Data to Similarity" width="90" x="380" y="187">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<operator activated="false" class="multiply" compatibility="7.6.000" expanded="true" height="68" name="Multiply" width="90" x="514" y="187"/>
<operator activated="false" class="dummy" compatibility="7.6.000" expanded="true" height="68" name="Performance" width="90" x="648" y="187"/>
<operator activated="true" class="read_csv" compatibility="7.6.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="csv_file" value="C:\Users\brian\Downloads\All_clusters_RM.csv"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" breakpoints="after" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text" width="90" x="112" y="136"/>
<operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
<list key="specify_weights"/>
</operator>
<operator activated="true" breakpoints="after" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="313" y="34">
<parameter key="vector_creation" value="Term Frequency"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="85"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="7.6.000" expanded="true" height="82" name="Clustering (2)" width="90" x="581" y="34">
<parameter key="k" value="3"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<connect from_op="Data to Similarity" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Clustering (2)" to_port="example set"/>
<connect from_op="Clustering (2)" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering (2)" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

singing_bird_1

Thank you so much.

I have a question:

How can I run your modified process?

How can I attach it and run it?

Telcontar120

See the instructions here: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-can-I-share-processes-without-RapidMiner-Server/ta-p/37047

Basically, just copy the XML into the XML tab in RapidMiner and then click the green check mark.

singing_bird_1

thank you so much for your help