Time Optimization

Updated Nov 5, 2024 by Jocelyn

Hi,
I am running KMedoids clustering on 1.7 MB of text data, but it has been running for the last three and a half days. The other operators took only 10 minutes; KMedoids accounts for all the remaining time. Is there any way to optimize the process? The process is shown below.

<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Optimizing vector creation for text classification#ylt#/h3#ygt##ylt#p#ygt#This experiments shows how to apply a cross validation to a classifier that learns to separate two sets of texts.#ylt#/p#ygt#"/>
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
<parameter key="first_row_as_names" value="true"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<list key="namespaces">
</list>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
</operator>
<operator name="KMedoids" class="KMedoids">
<parameter key="k" value="25"/>
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="is_nominal"/>
</operator>
<operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\cluster1.xls"/>
</operator>
</operator>

Thanks
Ratheesan

New Altair Community Member

Hi,
unfortunately it takes time to calculate all the distances needed. One hint: it might be useful to switch to CosineSimilarity, which is more suitable for text mining than Euclidean distance.

Greetings,
Sebastian
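
To illustrate both points in the reply above, here is a minimal Python sketch. It is not RapidMiner code: the function names, the three-word vocabulary, and the term counts are all made up for illustration. It shows how the number of pairwise distances grows if every pair of examples is compared, and why cosine similarity treats a short and a long text on the same topic as alike while Euclidean distance does not.

```python
import math

def pairwise_distance_count(n):
    """Number of distinct example pairs among n examples."""
    return n * (n - 1) // 2

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Assumes neither vector is all zeros (the norm would be 0).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term counts over the vocabulary ["cluster", "text", "football"].
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same topic as short_doc, ten times longer
other_doc = [0, 1, 2]    # different topic, similar length to short_doc

# Quadratic growth: ten times more examples means roughly 100 times more pairs.
print(pairwise_distance_count(100), pairwise_distance_count(1000))

# Euclidean distance puts short_doc closer to the unrelated other_doc.
print(euclidean_distance(short_doc, long_doc),
      euclidean_distance(short_doc, other_doc))

# Cosine similarity ignores length and ranks long_doc as the closer match.
print(cosine_similarity(short_doc, long_doc),
      cosine_similarity(short_doc, other_doc))
```

Each single distance also costs time proportional to the number of attributes, which in text mining is the size of the vocabulary.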

New Altair Community Member

Thanks Sebastian,
Suppose I were using the RM Enterprise Edition: would it take the same amount of time as with the RM Community version?

Thanks
Ratheesan

New Altair Community Member

Hi,
we have parallelized many important operators for the Enterprise Edition, but KMedoids is not among them. For the price of an Enterprise Edition, though, we could write you a parallelized KMedoids. One could even think about optimizing the operator for small example sets with many attributes, as is common in text mining tasks.

Greetings,
Sebastian
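
As a rough idea of what an optimization for "few examples, many attributes" could look like, here is a Python sketch under my own assumptions; it does not describe how RapidMiner's KMedoids works or would be changed. The helper names `to_sparse`, `sparse_dot`, and `cosine_matrix` are hypothetical. The sketch stores only the non-zero term weights of each text and computes every pairwise cosine similarity once, so that later optimization steps could look values up instead of recomputing them.

```python
import math

def to_sparse(dense):
    """Keep only the non-zero entries as {attribute_index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(a, b):
    # Iterate over the vector with fewer non-zero entries.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

def cosine_matrix(sparse_docs):
    """Precompute all pairwise cosine similarities once.

    Assumes every document has at least one non-zero entry.
    """
    norms = [math.sqrt(sparse_dot(d, d)) for d in sparse_docs]
    n = len(sparse_docs)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = sparse_dot(sparse_docs[i], sparse_docs[j]) / (norms[i] * norms[j])
            sim[i][j] = sim[j][i] = s  # the matrix is symmetric
    return sim

# Hypothetical term-weight rows; most entries are zero, as is typical for text.
docs = [to_sparse(row) for row in ([1, 2, 0, 0], [10, 20, 0, 0], [0, 0, 3, 0])]
similarities = cosine_matrix(docs)
print(similarities)
```

The trade-off is memory: the cached matrix holds one value per pair of examples, which is only practical while the number of examples stays small.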

New Altair Community Member

Hi Sebastian,

I have tried the above process with cosine similarity, but I always get the message "There is no obvious error, check the log file". Before applying KMedoids I used the AttributeFilter operator and selected the numeric attributes, because in KMedoids cosine similarity is only available under the numerical measures.

Thanks
Ratheesan

New Altair Community Member

Hi,
please send me your process. I will check if there's a bug.

Greetings,
Sebastian

New Altair Community Member

Hi Sebastian,

Thanks for your valuable help. This is my process:

<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
<parameter key="first_row_as_names" value="true"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<list key="namespaces">
</list>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="is_numerical"/>
<parameter key="parameter_string" value="sample"/>
<parameter key="apply_on_special" value="true"/>
</operator>
<operator name="KMedoids" class="KMedoids">
<parameter key="k" value="3"/>
<parameter key="max_runs" value="5"/>
<parameter key="max_optimization_steps" value="10"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\modelcluster.xls"/>
</operator>
</operator>

With up to 250 records it works properly, but with more than 250 records I get the above message.

Thanks
Ratheesan.

New Altair Community Member

Dec 15, 2009

Hi,
the process runs fine here. I used 722 texts, and there was no error, at least not in the first few minutes of the KMedoids run.

Of course I don't have exactly the same setup, because I'm using different texts. I suggest you switch RapidMiner to debug mode so that you can post the detailed error message. Go to the Tools menu and select Preferences, then enable the rapidminer.general.debugmode checkbox in the General tab.
Then please re-execute the process and send me the error message.

Greetings,
Sebastian

New Altair Community Member

Dec 15, 2009

Hi Sebastian,

I re-executed the process after switching to debug mode. The error message is attached below.

Root[1] (Process)
+- ExcelExampleSource[1] (ExcelExampleSource)
+- Nominal2String[1] (Nominal2String)
+- StringTextInput[1] (StringTextInput)
| +- ToLowerCaseConverter[600] (ToLowerCaseConverter)
| +- StringTokenizer[600] (StringTokenizer)
| +- EnglishStopwordFilter[600] (EnglishStopwordFilter)
| +- TokenLengthFilter[600] (TokenLengthFilter)
+- AttributeFilter (2)[1] (AttributeFilter)
here ==> +- KMedoids[1] (KMedoids)
java.lang.NullPointerException
at com.rapidminer.operator.clustering.clusterer.KMedoids.generateClusterModel(KMedoids.java:176)
at com.rapidminer.operator.clustering.clusterer.AbstractClusterer.apply(AbstractClusterer.java:60)
at com.rapidminer.operator.Operator.apply(Operator.java:671)
at com.rapidminer.operator.OperatorChain.apply(OperatorChain.java:424)
at com.rapidminer.operator.Operator.apply(Operator.java:671)
at com.rapidminer.Process.run(Process.java:735)
at com.rapidminer.Process.run(Process.java:704)
at com.rapidminer.Process.run(Process.java:694)
at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:59)

Thanks
Ratheesan.

New Altair Community Member

Dec 16, 2009

Hi,
that's quite strange. The distance measure seems to return NaN; that's the only way this could happen.
Unfortunately I cannot debug this in more detail, because I can't reproduce the error. Do you have any missing values in your data?

Greetings,
Sebastian
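
One hypothetical way a cosine measure can produce NaN even without missing values is a text whose vector is all zeros, for example if every token was removed by the stopword or length filters. Whether that is what happened here is not established in this thread. The Python sketch below is illustrative only and is not RapidMiner's code; `find_empty_rows` and `closest_medoid` are made-up helpers. In Java, dividing 0.0 by 0.0 yields NaN; Python raises an exception instead, so the sketch returns NaN explicitly to mimic that behaviour.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    denom = norm_a * norm_b
    if denom == 0.0:
        # 0/0 is undefined; return NaN to mirror Java's double arithmetic.
        return math.nan
    return dot / denom

def find_empty_rows(rows):
    """Indices of rows in which every attribute value is zero."""
    return [i for i, row in enumerate(rows) if not any(row)]

def closest_medoid(distances):
    """Index of the smallest distance, or None if no comparison succeeds."""
    best_index, best = None, math.inf
    for i, d in enumerate(distances):
        if d < best:  # any comparison with NaN is False
            best_index, best = i, d
    return best_index

# Hypothetical term-weight rows; row 1 lost all of its tokens.
rows = [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
print(find_empty_rows(rows))
print(cosine_similarity(rows[0], rows[1]))
print(closest_medoid([math.nan, math.nan]))
```

Because every comparison with NaN is false, a "pick the closest" loop like the one sketched can finish without selecting anything, which is one way a NaN could turn into a null reference later on. Checking the example set for all-zero rows before clustering is a cheap way to rule this out.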

New Altair Community Member

Dec 16, 2009

Hi Sebastian,

I have no missing values. However, I do get output when using Dice similarity. Is Dice similarity meaningful for text mining?

Thanks
Ratheesan.
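
For reference when weighing that question, here is a small Python sketch of the usual set-based Dice coefficient, 2|A∩B| / (|A| + |B|). This is the textbook definition applied to token sets; I am assuming, not asserting, that RapidMiner's Dice measure on numerical attributes behaves comparably. The example sentences are made up.

```python
def dice_similarity(tokens_a, tokens_b):
    """Set-based Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch; the ratio is undefined
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical tokenized texts.
doc_a = "kmedoids clustering text data".split()
doc_b = "text clustering speed".split()

# Two shared tokens out of 4 + 3 distinct tokens: 2 * 2 / 7.
print(dice_similarity(doc_a, doc_b))
```

In this form the coefficient only looks at which terms occur, not how often, so it discards term-frequency information that cosine similarity on weighted vectors would use.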