"Kmeans Cluster"

Question

Hi, I am running k-means clustering in RapidMiner. At first I gave RapidMiner 2 GB of memory and got a Java heap space error. I then increased the machine's RAM to 8 GB and now give RapidMiner 6 GB, but the same error still occurs. My input dataset contains 65K records and the input file is 25 MB. I am a bit confused: if RapidMiner cannot run k-means on such a small input file, how can it deal with big data?

Apr 03, 2014 5:03:22 PM com.rapidminer.gui.ProcessThread run
SEVERE: Process failed: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at com.rapidminer.example.table.DoubleArrayDataRow.ensureNumberOfColumns(DoubleArrayDataRow.java:72)
    at com.rapidminer.example.table.MemoryExampleTable.addAttributes(MemoryExampleTable.java:209)
    at com.rapidminer.operator.preprocessing.filter.NominalToNumericModel.applyOnDataDummyCoding(NominalToNumericModel.java:250)
    at com.rapidminer.operator.preprocessing.filter.NominalToNumericModel.applyOnData(NominalToNumericModel.java:196)
    at com.rapidminer.operator.preprocessing.PreprocessingModel.apply(PreprocessingModel.java:95)
    at com.rapidminer.operator.preprocessing.PreprocessingOperator.apply(PreprocessingOperator.java:130)
    at com.rapidminer.operator.AbstractExampleSetProcessing.doWork(AbstractExampleSetProcessing.java:116)
    at com.rapidminer.operator.Operator.execute(Operator.java:867)
    at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
    at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
    at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
    at com.rapidminer.operator.Operator.execute(Operator.java:867)
    at com.rapidminer.Process.run(Process.java:949)
    at com.rapidminer.Process.run(Process.java:873)
    at com.rapidminer.Process.run(Process.java:832)
    at com.rapidminer.Process.run(Process.java:827)
    at com.rapidminer.Process.run(Process.java:817)
    at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)
Apr 03, 2014 5:03:22 PM com.rapidminer.gui.ProcessThread run
SEVERE: Here:
    Root[1] (Process)
    subprocess 'Main Process'
      +- Retrieve sample[1] (Retrieve)
      +- Normalize[1] (Normalize)
      +- Set Role[1] (Set Role)
      +- Sample[1] (Sample)
==>   +- Nominal to Numerical[1] (Nominal to Numerical)
      +- KMeans[0] (k-Means)
      +- SVDReduction[0] (Singular Value Decomposition)

My XML is like this:

In many cases, no target attribute (label) can be defined and the data should be grouped automatically. This procedure is called "Clustering". RapidMiner supports a wide range of clustering schemes, which can be used in the same way as any other learning scheme. This includes the combination with all preprocessing operators.

In this experiment, the well-known Iris data set is loaded (the label is loaded, too, but it is only used for visualization and comparison and not for building the clusters themselves). One of the simplest clustering schemes, namely KMeans, is then applied to this data set. Afterwards, a dimensionality reduction is performed in order to better support the visualization of the data set in two dimensions.

Just perform the process and compare the clustering result with the original label (e.g. in the plot view of the example set). You can also visualize the cluster model itself.

Please let me know if you have any suggestions. I am now unsure whether RapidMiner is suitable for large data. Thanks in advance, Venkat

MariusHelf · Answer

As you can see from the error log, the problematic operator is Nominal to Numerical. With dummy coding, it creates one column for each nominal value in your dataset. If you have a lot of different values, it creates a great many columns, which need a lot of memory. For example, if you had a nominal ID attribute, that attribute alone would give you 65k columns.
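To make the column blow-up concrete, here is a minimal Python sketch of dummy coding together with a rough memory estimate. It is not RapidMiner code: the function names, the toy rows, and the assumption of a dense table of 8-byte doubles (suggested by DoubleArrayDataRow in the stack trace, and ignoring all JVM overhead) are only for illustration.

```python
def dummy_code(rows, nominal_columns):
    """Dummy-code (one-hot encode) the given nominal columns of a list of dicts."""
    # Collect the distinct values (levels) of each nominal column.
    levels = {c: sorted({row[c] for row in rows}) for c in nominal_columns}
    encoded = []
    for row in rows:
        # Keep the non-nominal columns unchanged.
        out = {k: v for k, v in row.items() if k not in nominal_columns}
        # Add one 0/1 column per level of each nominal column.
        for c in nominal_columns:
            for level in levels[c]:
                out[f"{c} = {level}"] = 1.0 if row[c] == level else 0.0
        encoded.append(out)
    return encoded


def dense_double_bytes(n_rows, n_columns):
    """Bytes needed for a dense table of 8-byte doubles (overhead ignored)."""
    return n_rows * n_columns * 8


# Hypothetical toy data: "id" is unique per row, "colour" has two levels.
rows = [
    {"id": "a1", "colour": "red", "size": 1.5},
    {"id": "a2", "colour": "blue", "size": 2.0},
    {"id": "a3", "colour": "red", "size": 0.5},
]
encoded = dummy_code(rows, ["id", "colour"])
print(len(encoded[0]))  # 1 numeric + 3 id levels + 2 colour levels = 6

# 65,000 rows with a unique nominal id become 65,000 extra columns.
print(dense_double_bytes(65_000, 65_000) / 1e9, "GB")  # 33.8 GB
```

Under these assumptions, a single unique nominal attribute on 65k records would need roughly 33.8 GB for the dummy columns alone, far more than a 6 GB heap, even though the input file is only 25 MB.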

k-Means can also work directly with nominal values if you select mixed measures as the distance measure type.
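As a rough illustration of why a mixed measure avoids the extra columns, the Python sketch below computes a mixed Euclidean-style distance: squared differences for numerical attributes and a 0/1 mismatch for nominal ones. This is an assumption about how such a measure can be defined, not RapidMiner's actual implementation, and the function and attribute names are hypothetical.

```python
import math


def mixed_euclidean(a, b, nominal):
    """Distance over dict-shaped examples with numerical and nominal attributes."""
    total = 0.0
    for key in a:
        if key in nominal:
            # Nominal attribute: contributes 0 if the values match, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
        else:
            # Numerical attribute: contributes the squared difference.
            diff = a[key] - b[key]
            total += diff * diff
    return math.sqrt(total)


x = {"colour": "red", "size": 1.0}
y = {"colour": "blue", "size": 4.0}
d = mixed_euclidean(x, y, nominal={"colour"})  # sqrt(1 + 3**2)
print(d)
```

Because nominal values are compared directly, no dummy columns are created and the table keeps its original width.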

Best regards,
Marius