"Clustering and writing back results into MySQl Database"
comsysto
New Altair Community Member
Hi,
I did a webinar a few weeks ago.
I ran text clustering and got 7 clusters of texts.
Now I want to write the result back into the MySQL database, so that each text article is assigned its cluster.
For example:
t1 is cluster 1
t2 is cluster 2
t3 is cluster 1
I have a new column in my table, and every article (text) should get a cluster.
But how do I do this? I didn't find such an option in the DatabaseExampleSetWriter.
Regards
Stefan
Answers
Hi Stefan,
Unfortunately, I don't understand where your problem is. The DatabaseExampleSetWriter writes the example set into a table in your database. If you only want a subset of the example set in your database, you have to filter out the undesired attributes first, for example with the AttributeFilter.
Greetings,
Sebastian
Hi Sebastian,
When I run KMeans, I get the cluster model. The Folder View shows 9 different clusters into which the articles are grouped.
In each cluster I can see the IDs of the text files.
I want to write the assigned cluster for each article into the database. I created a new column in the database table, and the cluster_x label that RapidMiner assigned should be written into it.
Regards
Stefan
Hi,
And if this is the case, where is the problem with using the DatabaseExampleSetWriter?
Greetings,
Sebastian
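For readers who want to see the write-back step outside RapidMiner, here is a minimal sketch of what "one cluster label per article in a new column" amounts to. It is only an illustration, not what the DatabaseExampleSetWriter does internally: an in-memory SQLite database stands in for MySQL, the `CLUSTER` column name and the `assignments` dictionary are hypothetical, and only `DIM_ARTICLE`, `ARTICLE_ID` and `CONTENT` come from the thread.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database of the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DIM_ARTICLE ("
    "ARTICLE_ID INTEGER PRIMARY KEY, CONTENT TEXT, CLUSTER TEXT)"  # CLUSTER is a hypothetical column name
)
conn.executemany(
    "INSERT INTO DIM_ARTICLE (ARTICLE_ID, CONTENT) VALUES (?, ?)",
    [(1, "t1"), (2, "t2"), (3, "t3")],
)

# Hypothetical clustering result: article id -> cluster label.
assignments = {1: "cluster_1", 2: "cluster_2", 3: "cluster_1"}

# Write each label into the new column of the matching article row.
conn.executemany(
    "UPDATE DIM_ARTICLE SET CLUSTER = ? WHERE ARTICLE_ID = ?",
    [(label, article_id) for article_id, label in assignments.items()],
)
conn.commit()

rows = conn.execute(
    "SELECT ARTICLE_ID, CLUSTER FROM DIM_ARTICLE ORDER BY ARTICLE_ID"
).fetchall()
print(rows)
```

The point of the sketch is that the label can only be matched to its article if an identifier such as `ARTICLE_ID` travels together with the cluster value.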
Hi Sebastian,
Yeah!!! It worked. It was as simple as you said.
But I see a strange effect.
The first time, I extracted each article from the database into a text file and placed the text file in a subdirectory whose name came from the database,
because every article is categorized by its poster. If I cluster the text files, I get a different result than when I cluster the same articles from the database. Do you have any idea why this is happening?
Regards
Stefan
Hi,
I have some suspicions, but this would only be a guess. Please post your process here; otherwise I cannot see what you are doing at all.
Greetings,
Sebastian
Hi Sebastian,
So I made some screenshots:
http://img691.imageshack.us/img691/1376/database1.jpg
http://img690.imageshack.us/img690/6488/text1d.jpg
If that is not enough information, please let me know what you need.
Regards
Stefan
Hi,
Please post the XML of your process. I cannot check the parameters of the operators from the images.
Greetings,
Sebastian
Hi Sebastian,
OK, here is the XML for the database input:
<operator name="Root" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
<parameter key="username" value="profiler"/>
<parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
<parameter key="query" value="SELECT `CONTENT` FROM `DIM_ARTICLE` where ARTICLE_ID &lt;&gt; -1"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<parameter key="vector_creation" value="TermFrequency"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="true"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="GermanStopwordFilter" class="GermanStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="GermanStemmer" class="GermanStemmer">
</operator>
</operator>
<operator name="Nominal2Binominal" class="Nominal2Binominal" activated="no">
</operator>
<operator name="Nominal2Numerical" class="Nominal2Numerical">
</operator>
<operator name="KMeans" class="KMeans">
<parameter key="k" value="9"/>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer">
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="attribute_name_filter"/>
<parameter key="parameter_string" value="cluster"/>
</operator>
<operator name="DatabaseExampleSetWriter" class="DatabaseExampleSetWriter">
<parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
<parameter key="username" value="profiler"/>
<parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
<parameter key="table_name" value="DIM_CLUSTER"/>
<parameter key="overwrite_mode" value="overwrite"/>
<parameter key="set_default_varchar_length" value="true"/>
</operator>
</operator>
And here is the XML for the text file input:
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\Dokumente und Einstellungen\Administrator\Eigene Dateien\rm_workspace\test.log"/>
<operator name="TextInput" class="TextInput" expanded="yes">
<list key="texts">
<parameter key="3D Visualisierung" value="C:\xampp\htdocs\pentaho\3D Visualisierung"/>
<parameter key="Events" value="C:\xampp\htdocs\pentaho\Events"/>
<parameter key="Facility Management" value="C:\xampp\htdocs\pentaho\Facility Management"/>
<parameter key="Innenarchitektur" value="C:\xampp\htdocs\pentaho\Innenarchitektur"/>
<parameter key="Jobs" value="C:\xampp\htdocs\pentaho\Jobs"/>
<parameter key="Landschaftsarchitektur" value="C:\xampp\htdocs\pentaho\Landschaftsarchitektur"/>
<parameter key="Lichtplanung" value="C:\xampp\htdocs\pentaho\Lichtplanung"/>
<parameter key="Produkte" value="C:\xampp\htdocs\pentaho\Produkte"/>
<parameter key="Stadtplanung" value="C:\xampp\htdocs\pentaho\Stadtplanung"/>
<parameter key="Studium &amp; Ausbildung" value="C:\xampp\htdocs\pentaho\Studium &amp; Ausbildung"/>
<parameter key="Wettbewerbe" value="C:\xampp\htdocs\pentaho\Wettbewerbe"/>
<parameter key="News" value="C:\xampp\htdocs\pentaho\News"/>
<parameter key="Architektur" value="C:\xampp\htdocs\pentaho\Architektur"/>
</list>
<parameter key="default_content_language" value="german"/>
<parameter key="vector_creation" value="TermFrequency"/>
<parameter key="output_word_list" value="C:\Dokumente und Einstellungen\Administrator\Eigene Dateien\rm_workspace\training_words.list"/>
<parameter key="id_attribute_type" value="long"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="true"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="GermanStopwordFilter" class="GermanStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
</operator>
<operator name="GermanStemmer" class="GermanStemmer">
</operator>
</operator>
<operator name="KMeans" class="KMeans">
<parameter key="k" value="9"/>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer">
</operator>
</operator>
Regards
Stefan
Hi,
It seems to me that you are working on different data. The TextInput operator, which reads from files, uses only the texts themselves, whereas when loading from the database you are using every available nominal attribute for clustering. You should use the Nominal2String operator to declare the text attribute as a string and then uncheck the "filter_nominal_attributes" parameter. Then only the text is used, and not every other nominal attribute such as label, path and so on.
Greetings,
Sebastian
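To illustrate why extra nominal attributes can change a clustering, here is a small sketch outside RapidMiner. It assumes, purely for illustration, that a nominal attribute is turned into one-hot columns and appended to the term-frequency vector; how RapidMiner's operators actually encode nominal values may differ. The vectors and category names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-frequency vectors of two articles with identical text.
tf_a = [1.0, 0.0, 2.0]
tf_b = [1.0, 0.0, 2.0]

# Same articles with a nominal attribute appended as one-hot columns
# (assumed encoding): article a is in one category, article b in another.
with_cat_a = tf_a + [1.0, 0.0]
with_cat_b = tf_b + [0.0, 1.0]

text_only = euclidean(tf_a, tf_b)                 # texts alone are identical
with_nominal = euclidean(with_cat_a, with_cat_b)  # category columns add distance
print(text_only, with_nominal)
```

With the text alone the two articles coincide, but the extra columns push them apart, so a distance-based method such as KMeans can group them differently.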
Hi Sebastian,
I made the changes you suggested, but I still get the same effect with the database process.
<operator name="Root" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
<parameter key="username" value="profiler"/>
<parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
<parameter key="query" value="SELECT `CONTENT` FROM `DIM_ARTICLE` where ARTICLE_ID &lt;&gt; -1"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="vector_creation" value="TermFrequency"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="true"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="GermanStopwordFilter" class="GermanStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="GermanStemmer" class="GermanStemmer">
</operator>
</operator>
<operator name="Nominal2Binominal" class="Nominal2Binominal" activated="no">
</operator>
<operator name="Nominal2Numerical" class="Nominal2Numerical">
</operator>
<operator name="KMeans" class="KMeans">
<parameter key="k" value="9"/>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer">
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="attribute_name_filter"/>
<parameter key="parameter_string" value="cluster"/>
</operator>
<operator name="DatabaseExampleSetWriter" class="DatabaseExampleSetWriter">
<parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
<parameter key="username" value="profiler"/>
<parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
<parameter key="table_name" value="DIM_CLUSTER"/>
<parameter key="overwrite_mode" value="overwrite"/>
<parameter key="set_default_varchar_length" value="true"/>
</operator>
</operator>
Regards
Stefan
Hi,
There's no obvious error in your process. Did you set a breakpoint after loading the data and check whether the attribute definitions are the same?
Greetings,
Sebastian
Hi Sebastian,
Now it's working. I don't know why :-)
Just one more question: is it possible to see how close a neighbor is in KMeans?
For example, that the text with id1 is closer to the cluster center than the text with id2 from cluster_1?
Regards
Stefan
Hi,
You could build the pairwise similarity table using the ExampleSet2Similarity or ExampleSet2SimilarityExampleSet operator. This will list all pairwise distances. If you choose Euclidean distance, this is the same measure that KMeans uses.
Greetings,
Sebastian
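As a rough illustration of the question above, the following sketch ranks the texts of one cluster by their Euclidean distance to the cluster center. It is not the pairwise table produced by the operators mentioned; it is a simpler distance-to-center variant, and it assumes the center is the mean of the vectors assigned to the cluster. The ids and vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical term-frequency vectors of the texts assigned to cluster_1.
cluster_1 = {
    "id1": [1.0, 0.0, 2.0],
    "id2": [4.0, 0.0, 2.0],
    "id3": [1.0, 3.0, 5.0],
}

# Assumption: the cluster center is the mean of its assigned vectors.
center = centroid(list(cluster_1.values()))

# Distance of every text to the center, then the ids sorted from closest to farthest.
distances = {text_id: euclidean(vec, center) for text_id, vec in cluster_1.items()}
ranking = sorted(distances, key=distances.get)
print(ranking)
```

Texts at the front of `ranking` are the most typical members of the cluster, while those at the end sit near its border.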