KernelKMeans now produces an error when classifying text
B_
New Altair Community Member
RM team
I have switched to RM 4.2. I began testing by using an existing project that classifies text by KernelKMeans. Text is read from a database and passed through StringTextInput and StringTokenizer. This operator chain worked before. Now I receive an error message:
Error 104 - non-numeric
Error in: KernelKMeans (KernelKMeans) The example set contains non-numerical attribute #0: StockItemDesc (nominal/single_value)/values=
Using KMedoids to classify text works. Looking at the metadata with ExampleVisualizer, there are string vectors and weights.
Here is the project.
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/XXX"/>
<parameter key="id_attribute" value="IDNbr"/>
<parameter key="password" value="y6sa3JX9Wrc="/>
<parameter key="query" value="SELECT [Text], [IDNbr] FROM [Classify]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer" breakpoints="before">
</operator>
<operator name="KernelKMeans" class="KernelKMeans" breakpoints="after">
<parameter key="k" value="500"/>
<parameter key="kernel_type" value="KernelDot"/>
</operator>
<operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
<parameter key="keep_cluster_model" value="false"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="Example.dat"/>
<parameter key="special_format" value="$i $v[cluster]"/>
</operator>
</operator>
Thanks for your help.
B
Answers
-
After an hour or two of work, KMedoids fails with an Index Out of Bounds error message.
It does not fail immediately on starting, as KernelKMeans does now.
-
KMeans also stops with the error "example set contains non numerical attributes #0".
-
Hello,
Are you sure this process worked with RM 4.1 and earlier? I am asking because, as far as I can see, the "usual" kernel functions of RapidMiner are used, and those never supported nominal values...
However, you could of course use the operator Nominal2Numeric before the clustering. It might even be more appropriate to apply Nominal2Binominal first.
Cheers,
Ingo
-
Ingo
I reinstalled RM 4.1 alongside RM 4.2. I tested this project. It runs under 4.1 and fails under 4.2.
Same SQL query to pull records and same text in the records.
+++++++++++++
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/SqlServer"/>
<parameter key="id_attribute" value="RecID"/>
<parameter key="password" value="y6sa3JX9Wrc="/>
<parameter key="query" value="SELECT [Text1], [Text2], [RecID] FROM"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer">
</operator>
<operator name="KernelKMeans" class="KernelKMeans">
<parameter key="k" value="500"/>
<parameter key="kernel_type" value="KernelDot"/>
</operator>
<operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
<parameter key="keep_cluster_model" value="false"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="C:\TestDataOutput.dat"/>
<parameter key="special_format" value="$i $v[cluster]"/>
</operator>
</operator>
+++
4.2 error message
Error in: KernelKMeans (KernelKMeans) The example set contains non-numerical attribute #0: StockItemDesc
++++++++++++++++++
<as far as I can see the "usual" kernel functions of RapidMiner are used and those never supported nominal values>
Doesn't the FilterNominalAttributes convert the attributes to a usable format for further processing?
Thanks for your help.
B.
-
Hi,
B. wrote: "I reinstalled RM 4.1 alongside RM 4.2. I tested this project. It runs under 4.1 and fails under 4.2."
Thanks for this info. I have now found the reason for this behaviour. It actually has nothing to do with the clustering operator but with StringTextInput. There is a new parameter "remove_original_attributes" whose default is unfortunately "false" rather than "true" (which would have kept backwards compatibility), so the original nominal (or string) attributes are no longer removed. This caused the error in the clustering, since the kernel cannot handle the nominal values that remain in the data set unless "remove_original_attributes" is set to "true". So the solution is quite simple: just set this parameter to "true" and everything should work as usual. You could add a breakpoint after the StringTextInput operator to see the difference with and without this setting.
B. wrote: "Doesn't the FilterNominalAttributes convert the attributes to a usable format for further processing?"
Yes, but with the new parameter the original attributes are also still kept as part of the example set as long as "remove_original_attributes" is set to "false". Instead of removing them directly here (with the parameter setting mentioned above), you could of course also use the operator "AttributeFilter" after the text processing to filter out all nominal attributes and keep only the numerical ones.
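For reference, here is a rough sketch of how the StringTextInput block from the posted process might look with that parameter set. The parameter name "remove_original_attributes" comes from the explanation in this reply. That it is written into the process XML as an ordinary parameter entry, like "filter_nominal_attributes", is an assumption, so please check it against what RM 4.2 actually saves.

```xml
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<!-- Assumed serialization of the new RM 4.2 parameter; it reportedly defaults to "false" -->
<parameter key="remove_original_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
```

With a breakpoint after this operator, the original nominal attribute (StockItemDesc in the error message) should no longer appear in the example set that is passed on to KernelKMeans.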
Cheers,
Ingo
-
Ingo
This runs successfully now. Thanks for the help.
B.