"Text Clustering"
Ingo - I've taken this as far as I can and now I'm stuck! I've created the following experiment, which attempts to cluster text extracted from a sample Excel file containing 14 examples, 0 special attributes, and 8 regular attributes. Here's the process XML so far:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="datamanagement" value="long_array"/>
        <parameter key="excel_file" value="C:\feedback.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="parameter_string" value="comments"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="target_function" value="gaussian mixture clusters"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
</operator>
This process produces 3 clusters: Cluster 0 has 33 items, Cluster 1 has 55 items, and Cluster 2 has 12, out of a total of 100 examples. At this point, I want to give each cluster a meaningful, user-friendly label that captures its key theme. How can I figure out the key theme of each cluster? What are the next steps?
Please help!
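To make the goal concrete: one common way to name a cluster is to list the terms with the highest average weight among its members. Below is a minimal sketch of that idea in plain Python, not RapidMiner. It assumes the term-occurrence vectors produced by StringTextInput and the cluster assignments have been exported, and that the clustering was run on those term vectors. The function name `top_terms_per_cluster` and the toy vocabulary and data are hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def top_terms_per_cluster(vectors, assignments, vocabulary, n_terms=3):
    """Return, per cluster, the n_terms terms with the highest mean occurrence."""
    sums = defaultdict(lambda: [0.0] * len(vocabulary))
    counts = defaultdict(int)
    # Accumulate term counts for every document in each cluster.
    for vector, cluster in zip(vectors, assignments):
        counts[cluster] += 1
        for i, value in enumerate(vector):
            sums[cluster][i] += value
    labels = {}
    for cluster, totals in sums.items():
        # The centroid is the mean term vector of the cluster's documents.
        centroid = [total / counts[cluster] for total in totals]
        # Rank by descending weight; break ties alphabetically for stable output.
        ranked = sorted(range(len(vocabulary)),
                        key=lambda i: (-centroid[i], vocabulary[i]))
        labels[cluster] = [vocabulary[i] for i in ranked[:n_terms]
                           if centroid[i] > 0]
    return labels


# Hypothetical toy data: stemmed terms and term-occurrence vectors.
vocabulary = ["deliveri", "late", "price", "expens", "staff", "friendli"]
vectors = [
    [2, 1, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 2, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 2],
]
assignments = [0, 0, 1, 1, 2]  # cluster index assigned to each document

labels = top_terms_per_cluster(vectors, assignments, vocabulary, n_terms=2)
for cluster in sorted(labels):
    print(cluster, labels[cluster])
```

The top terms are stems (because of the PorterStemmer step), so a person would still need to turn something like "deliveri, late" into a readable label such as "Late deliveries".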