"ERROR using K-means clustering algorithm with text data"

Hello,

I'm using text data that contains three attributes (NAME, LABEL, DOMAIN). This is a sample of the data:

NAME             LABEL   DOMAIN
------------------------------------------------------------------
origin           from    string
destination      to      string
departure day    day     date
departure month  month   date

I want to use the k-Means clustering operator to cluster the data, but unfortunately I get this error before execution:

" The setup does not seem to contain any obvious error, but you should check the log messages or activate the debug mode in the setting dialog in order to get more information about this problem"

Here are the log messages:

Dec 26, 2012 1:23:44 AM INFO: Process //NewLocalRepository/IOS/EM starts
Dec 26, 2012 1:23:44 AM INFO: Loading initial data.
Dec 26, 2012 1:23:45 AM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Dec 26, 2012 1:23:45 AM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Retrieve[1] (Retrieve)
==> +- Clustering[1] (k-Means)
Dec 26, 2012 1:23:45 AM SEVERE: java.lang.NullPointerException

And here is the process XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="341" width="480">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="126" y="140">
        <parameter key="repository_entry" value="../EXPIO/DDP"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="120">
        <parameter key="k" value="10"/>
        <parameter key="measure_types" value="NominalMeasures"/>
        <parameter key="nominal_measure" value="DiceSimilarity"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Any advice would be greatly appreciated. Thanks!

Skirzynski

I have executed your process with your short sample of the data, but couldn't reproduce the error. Can you provide a minimal amount of data (CSV) which does not work?

P.S.: Please use the code-tags in this forum for your processes and data.

basel_deeb

Thank you so much, Mr. Marcin, for your reply.
To my surprise, after I uninstalled RapidMiner and reinstalled it, the process worked.

However, I have a question, if you don't mind: after k-Means generates the clusters, how can I find out what is in each of them and what their centroids are? The clusters are only named as follows:

Cluster_0
Cluster_1
Cluster_2

Again thanks a lot

Skirzynski

If you look at the cluster model in the result view, you can see several different views. For instance, the "Folder View" shows every cluster that actually contains examples as a folder. If you click on an item inside a cluster, you can see its details. What is interesting for you is the "Centroid Table", which lists all cluster centroids with their values. If a cluster was created but does not contain any examples (because your k was too high), its centroid will show question marks instead of values.
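
To illustrate why an empty cluster has nothing to show in a centroid table, here is a minimal Python sketch. It is not RapidMiner's implementation: the function name `centroid_table`, the toy numeric points, and the hand-written cluster labels are all assumptions made for the example, and the centroid is taken to be the plain per-attribute mean of a cluster's members.

```python
from collections import defaultdict


def centroid_table(points, labels, k):
    """Map each of the k cluster names to its centroid, or None if it is empty.

    Hypothetical helper for illustration only; the centroid is the
    per-attribute mean of the points assigned to the cluster.
    """
    # Group the points by their assigned cluster index.
    members = defaultdict(list)
    for point, label in zip(points, labels):
        members[label].append(point)

    table = {}
    for c in range(k):
        rows = members.get(c, [])
        if rows:
            dims = len(rows[0])
            table[f"cluster_{c}"] = [
                sum(row[d] for row in rows) / len(rows) for d in range(dims)
            ]
        else:
            # No members, so there is nothing to average.
            table[f"cluster_{c}"] = None
    return table


# Toy data (assumed): three points, but k = 3 with cluster 2 left empty.
points = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = [0, 0, 1]

table = centroid_table(points, labels, k=3)
for name, centroid in table.items():
    # Print "?" for an empty cluster, mimicking the question marks above.
    print(name, centroid if centroid is not None else "?")
```

With k larger than the number of groups that actually receive examples, the leftover cluster has no members to average, which is why only a placeholder can be shown for it.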