Clustering of the Text

gunjanamit · June 2012

I want to cluster survey comments into categories, for example:

Comment | Category
Restrooms Stinks | FMG
Food was costly | Restaurant
Poor service in restaurant | Restaurant

I want to read the comments from Excel and write them back to Excel with the category added.

Can anyone please suggest how to do this?

MariusHelf · June 2012

Hi,

If you already know which categories you are looking for, you should label your training data manually with these categories and then train a classification algorithm on it. A good choice for text processing could be an SVM.
If you can't or don't want to label your data, just run a clustering algorithm like k-Means on your preprocessed documents, and have a look at the clusters afterwards to see if they make sense for you.
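To make the labeled-data route concrete, here is a minimal sketch in plain Python rather than RapidMiner. It is not an SVM: it uses a much simpler nearest-centroid classifier over word counts as a stand-in, only to show the idea of training on labeled comments and then predicting a category. The training rows are the three examples from the first post; the function names (`tokenize`, `train`, `classify`) are made up for this illustration.

```python
import math
from collections import Counter

# Labeled examples taken from the first post of this thread.
TRAINING = [
    ("Restrooms Stinks", "FMG"),
    ("Food was costly", "Restaurant"),
    ("Poor service in restaurant", "Restaurant"),
]

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def train(examples):
    """Sum the word counts of all comments per category (one centroid each)."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids

def classify(centroids, text):
    """Return the category whose word counts are most similar to the text."""
    vector = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(centroids[label], vector))

centroids = train(TRAINING)
print(classify(centroids, "washroom stinks"))
print(classify(centroids, "service is poor"))
```

A comment that shares no word with any training comment scores 0 for every category, so with only three labeled rows many new comments cannot be assigned meaningfully. In practice you would label far more comments before training.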

Best, Marius

gunjanamit · June 2012

I followed the process below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
<process expanded="true" height="252" width="681">
<operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
<parameter key="excel_file" value="C:\Users\guagg\Desktop\All\RapidMiner\read.xls"/>
<parameter key="imported_cell_range" value="A1:A6"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="313" y="75">
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="true"/>
<parameter key="k" value="3"/>
<parameter key="measure_types" value="NominalMeasures"/>
<parameter key="nominal_measure" value="RussellRaoSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="5.2.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="514" y="120"/>
<connect from_op="Read Excel" from_port="output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

But it's not giving me the correct results.

Results
cluster_0 I love food
cluster_1 washroom stinks
cluster_2 service is poor
cluster_0 food is great
cluster_0 not great service

The last one should be in cluster_2, not cluster_0.

Please suggest!!!

MariusHelf · June 2012

You are processing texts, so you should have a close look at the Text Extension. You'll find links to tutorials in the post linked in my signature.
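As a rough illustration of what text preprocessing does, the sketch below (plain Python, not RapidMiner) splits each comment from the results above into lowercase words, drops a few stopwords, and compares the word sets with cosine similarity. The stopword list and the helper names are assumptions made up for this example; they are not what the Text Extension uses internally.

```python
import math

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"i", "is", "the", "was", "in"}

# The five comments from the results posted above.
COMMENTS = [
    "I love food",
    "washroom stinks",
    "service is poor",
    "food is great",
    "not great service",
]

def tokens(text):
    """Lowercase, split on whitespace and drop stopwords; returns a set of words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def cosine(a, b):
    """Cosine similarity between two binary word vectors given as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Compare the last comment against each of the others.
target = tokens("not great service")
for comment in COMMENTS[:-1]:
    print(f"{comment!r}: {cosine(target, tokens(comment)):.3f}")
```

Under these assumptions, "not great service" shares exactly one word with "food is great" and one with "service is poor", so on word overlap alone it is equally close to both. With so few and such short comments, a purely word-based clustering may not separate them the way a human would.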

Best, Marius

gunjanamit · June 2012

Marius,

I can't find the link. Please post it again.

Regards
gunjan

MariusHelf · June 2012

Just click my signature where it says in big red letters "click here" and read the first item in the linked post.
