Clustering of the Text
gunjanamit
New Altair Community Member
I want to cluster survey comments into different categories, for example:
Comment | Category
Restrooms Stinks | FMG
Food was costly | Restaurant
Poor service in restaurant | Restaurant
I want to read the comments from Excel and write them back to Excel with the category.
Can anyone please suggest how to do this?
Answers
-
Hi,
If you already know which categories you are looking for, you should label your training data manually with these categories and then train a classification algorithm on it. A good choice for text classification could be an SVM.
If you can't or don't want to label your data, just run a clustering algorithm like k-Means on your preprocessed documents, and have a look at the clusters afterwards to see if they make sense for you.
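The labeled route described above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not a RapidMiner process: it swaps the suggested SVM for a much simpler nearest-centroid classifier over word counts so that it runs without extra libraries. The training data is just the three examples from the question, and the names `TRAINING`, `tokenize`, `train`, `cosine` and `classify` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: the three labeled comments from the question serve as training data.
TRAINING = [
    ("Restrooms Stinks", "FMG"),
    ("Food was costly", "Restaurant"),
    ("Poor service in restaurant", "Restaurant"),
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Sum the word counts of all comments that share a category."""
    profiles = defaultdict(Counter)
    for text, category in examples:
        profiles[category].update(tokenize(text))
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, profiles):
    """Return the category whose word profile is most similar to the text.

    If the text shares no word with any category, the first category wins,
    so a real setup would need far more training data than this.
    """
    vector = Counter(tokenize(text))
    return max(profiles, key=lambda c: cosine(vector, profiles[c]))


model = train(TRAINING)
for comment in ["service was poor", "the restrooms stinks"]:
    print(comment, "->", classify(comment, model))
```

With only three training comments this can do no more than match shared words. The point is the workflow: label some comments, learn a profile per category, then assign new comments to the closest one.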
Best, Marius
-
I have followed the process below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
    <process expanded="true" height="252" width="681">
      <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
        <parameter key="excel_file" value="C:\Users\guagg\Desktop\All\RapidMiner\read.xls"/>
        <parameter key="imported_cell_range" value="A1:A6"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="313" y="75">
        <parameter key="add_as_label" value="true"/>
        <parameter key="remove_unlabeled" value="true"/>
        <parameter key="k" value="3"/>
        <parameter key="measure_types" value="NominalMeasures"/>
        <parameter key="nominal_measure" value="RussellRaoSimilarity"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="5.2.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="514" y="120"/>
      <connect from_op="Read Excel" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
But it's not giving me the correct results.
Results:
cluster_0 I love food
cluster_1 washroom stinks
cluster_2 service is poor
cluster_0 food is great
cluster_0 not great service
The last one should be in cluster_2, not cluster_0.
Please suggest!!!
-
You are processing texts, so you should have a close look at the Text Extension. You'll find links to tutorials in the post linked in my signature.
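To illustrate why text preprocessing matters here, the sketch below tokenizes the five comments from the results above into word-count vectors before clustering, so that comments sharing a word such as "service" become measurably similar instead of being compared as whole strings. It is plain Python rather than a RapidMiner process and is not the Text Extension's own implementation. The stop-word list, the choice of the first k comments as initial centroids, and the function names are all assumptions made for the example.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOPWORDS = {"a", "i", "is", "the"}

# The five comments from the results posted above.
COMMENTS = [
    "I love food",
    "washroom stinks",
    "service is poor",
    "food is great",
    "not great service",
]


def tokenize(text):
    """Lower-case, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    vectors = [[float(tokenize(d).count(w)) for w in vocab] for d in docs]
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=10):
    """Plain k-means. Assumption: the first k vectors seed the centroids,
    which keeps the run reproducible but is not a recommended initialization."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


vocab, vectors = vectorize(COMMENTS)
for comment, cluster in zip(COMMENTS, k_means(vectors, 3)):
    print(f"cluster_{cluster}", comment)
```

With this particular seeding, "not great service" ends up in the same cluster as "service is poor" because both contain the token "service". On such a small data set the outcome depends heavily on the initial centroids and the stop-word list (the word "great" also links that comment to "food is great"), so treat the grouping as an illustration rather than a guarantee.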
Best, Marius
-
Marius,
I can't find the link. Please post it again.
Regards
gunjan
-
Just click my signature where it says in big red letters "click here" and read the first item in the linked post.