Classify text stored in SQL records
IngoRM
New Altair Community Member
Original message from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2044574&;forum_id=390413
HI
Congratulations on releasing RapidMiner 4.1!!!
I am trying to classify text stored in SQL records. Each text is an email transaction I want to cluster (no supervised learning yet). I have not found an easy method to get the text into the classifier.
1- TextInput uses directories as document source, and not individual records or files.
2- DatabaseExampleSource -> StringTextInput does not provide a way to specify which field is text. I select FilterNominalAttributes but I still have an error message about stream input.
3- Operator->Preprocessing->Attributes does not have a nominal to string converter. There is ChangeAttributeType operator, but it seems to be the same as ChangeAttributeRole. (Shouldn't Type be string, boolean, integer, etc.?)
Is it possible to move the text fields directly into the text classifier, or do I have to export and transform them to a RM format (aml/dat) or import?
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/Rapid"/>
<parameter key="id_attribute" value="RecIDNbr"/>
<parameter key="password" value="xxx"/>
<parameter key="query" value="SELECT [RecIDNbr], [Service] FROM [CustomerHist]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
</operator>
</operator>
Error message:
Error in: StringTextInput (StringTextInput) The input example set does not contain any attribues with value type string. Some operators require example sets with attributes of a specific value type. Please refer to the documentation of the used operators for further details.
The Input Example Set does not contain any attributes with value type string.
Thank you.
Answer by Ingo:
Hello,
could you please provide us a screenshot of the result right (meta data view and data view) after loading it in from your database, i.e. right after
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/Rapid"/>
<parameter key="id_attribute" value="RecIDNbr"/>
<parameter key="password" value="xxx"/>
<parameter key="query" value="SELECT [RecIDNbr], [Service] FROM [CustomerHist]"/>
<parameter key="username" value="sa"/>
</operator>
You could for example upload the screenshots here:
http://tinypic.com/
and post the links. Please considering blacking out sensible contents if necessary. I just have to be sure that everything is fine for using the StringTextInput operator.
Thanks in advance. Cheers,
Ingo
HI
Congratulations on releasing RapidMiner 4.1!!!
I am trying to classify text stored in SQL records. Each text is an email transaction I want to cluster (no supervised learning yet). I have not found an easy method to get the text into the classifier.
1- TextInput uses directories as document source, and not individual records or files.
2- DatabaseExampleSource -> StringTextInput does not provide a way to specify which field is text. I select FilterNominalAttributes but I still have an error message about stream input.
3- Operator->Preprocessing->Attributes does not have a nominal to string converter. There is ChangeAttributeType operator, but it seems to be the same as ChangeAttributeRole. (Shouldn't Type be string, boolean, integer, etc.?)
Is it possible to move the text fields directly into the text classifier, or do I have to export and transform them to a RM format (aml/dat) or import?
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/Rapid"/>
<parameter key="id_attribute" value="RecIDNbr"/>
<parameter key="password" value="xxx"/>
<parameter key="query" value="SELECT [RecIDNbr], [Service] FROM [CustomerHist]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
</operator>
</operator>
Error message:
Error in: StringTextInput (StringTextInput) The input example set does not contain any attribues with value type string. Some operators require example sets with attributes of a specific value type. Please refer to the documentation of the used operators for further details.
The Input Example Set does not contain any attributes with value type string.
Thank you.
Answer by Ingo:
Hello,
could you please provide us a screenshot of the result right (meta data view and data view) after loading it in from your database, i.e. right after
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/Rapid"/>
<parameter key="id_attribute" value="RecIDNbr"/>
<parameter key="password" value="xxx"/>
<parameter key="query" value="SELECT [RecIDNbr], [Service] FROM [CustomerHist]"/>
<parameter key="username" value="sa"/>
</operator>
You could for example upload the screenshots here:
http://tinypic.com/
and post the links. Please considering blacking out sensible contents if necessary. I just have to be sure that everything is fine for using the StringTextInput operator.
Thanks in advance. Cheers,
Ingo
Tagged:
0
Answers
-
Ingo
Here is a sample table with some text fields from a SQL database. I use the DatabaseExample operator to link to the database and pull the text fields into RM.
These pictures show the initial setup and linking with text fiels. I can specify ID, label and attribute. I cannot declare an attribute type, such as string or integer.
http://tinypic.com/view.php?pic=n6ylud&;s=3
http://tinypic.com/view.php?pic=app1d&;s=3
Database Example connection is prepared
http://tinypic.com/view.php?pic=25kkcr8&;s=3
StringTextInput is setup after Database Example. Database Example output is Example Set, and input for StringTextInput is Example Set. However, the attribute type is not carried from the SQL database through the Example connection, and I can not declare it in the first step.
http://tinypic.com/view.php?pic=dxl538&;s=3
Metadata view
http://tinypic.com/view.php?pic=2l9o3z9&;s=3
Data view
http://tinypic.com/view.php?pic=xfp7k0&;s=3
Error message
http://tinypic.com/view.php?pic=1z4czt0&;s=3
http://tinypic.com/view.php?pic=vhzqwx&;s=3
(These pictures don't seem clear enough. If you can't read them, I can send them to you directly.)
The Operator Chain
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/datarapid"/>
<parameter key="id_attribute" value="RecId"/>
<parameter key="password" value="y6sa3JX9Wrc="/>
<parameter key="query" value="SELECT * FROM [tblArticles]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer" breakpoints="after">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
</operator>
</operator>
Thanks for your help.
0 -
Hello,
thanks for your great pictures and description. So it was very easy to find the reason for this behavior: there was a bug in the StringTextInput operator so that it was only possible to work on String attributes. Since there is no operator "Nominal2String" or someting similar available, this is of course only possible when you work with .aml files but not (as you have noticed) when you load the data from a database.
We fixed the Text plugin and uploaded a new version at:
Windows Installer: http://rapid-i.com/snapshot/rapidminer-text-4.1-installer.exe
Other platforms: http://rapid-i.com/snapshot/rapidminer-text-4.1.jar
Sorry for the inconvenience and thanks again for the great report and for pointing this bug out.
Cheers,
Ingo0 -
Ingo
Thank you for the very fast response and posting the update in a Windows installer.
Now, the text in each record is brought into RM via DatabaseExample and passed to the next operator. The import process runs without errors when I set FilterNominalAttributes = On. The process runs to the end but the StringTextInput does not parse the text and create vectors (or I am missing something.)
Also, when I set FilterNominalAttributes = Off, then the process stops and generates an error.
I checked the installation, and the /lib/plugin file is dated 5/29/2008.
Here are screen captures.
=========================
DatabaseExample pulls from the SQL database.
1
http://tinypic.com/view.php?pic=2qjf57d&;s=3
StringTextInput
2
Filter NominalAttributes = On
http://tinypic.com/view.php?pic=2qjf57d&;s=3
Breakpoint 1 metadata
3
http://tinypic.com/view.php?pic=121y0qw&;s=3
http://i27.tinypic.com/2rmxni0.jpg
Breakpoint 1 data
4
http://i31.tinypic.com/21mf3ae.jpg
Breakpoint 2 Metadata result
StringTextInput should convert text to word vectors but the text remains as one string
5
http://i27.tinypic.com/14tt4io.jpg
Breakpoint 2 data
Text records are not parsed to individual word vectors
6
http://i30.tinypic.com/bdldw2.jpg
This time FilterNominalAttributes = Off
7
http://i26.tinypic.com/ei81go.jpg
Error message
8
http://i32.tinypic.com/2hqur79.jpg0 -
Hi,
thanks again for the detailed description!
That's the way it should work. Usually, the StringTextInput operator works on String attributes only and does not transform nominal attributes. This is for example useful if you have stored the texts in attributes with value type string and have other (nominal) attributes which store additional information about the texts (e.g. author...) which should not be transformed. Hence the name StringTextInput. But since some of the IO operators do not support String attributes and nominal ones are by far more common to the users, we decided to add the parameter so that the operator can also directly work on nominal attributes. So in your case you simply can activate the parameter and everything is fine.Also, when I set FilterNominalAttributes = Off, then the process stops and generates an error.
Yes, and I totally forgot to mention this in my last post, sorry...Now, the text in each record is brought into RM via DatabaseExample and passed to the next operator. The import process runs without errors when I set FilterNominalAttributes = On. The process runs to the end but the StringTextInput does not parse the text and create vectors (or I am missing something.)
The new Text plugin concept (introduced in RM 4.0 I think) makes the text preprocessing much more modular than before. This means that all steps of preprocessing can now defined as operators (with own parameters each) which have to be added as children to the TextInput / SingleTextInput / StringTextInput operators. Usually, you want to add a tokenizer first (e.g. StringTokenizer), then maybe Stemming etc. If no preprocessing operators were added to the *TextInput operator, no vectors are created at all.
Actually, there shold be a warning message stating something like "There are no suboperators for this operator. This is usually not intended. You would probably like to at least add a tokenizer".
So, the solution is quite simple: just add some text preprocessing operators like for example in this setup:
Cheers,
<operator name="Root" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
</operator>
</operator>
Ingo0 -
OK
The extra information about the sub-operators is what I didn't understand. It works when I add Tokenizer and Stopword. Thank you for the help.
Setting FilterNominalAttributes to On seemed to be the correct thing to do, but I wasn't getting parsed results so I tried setting it to Off.
So much to learn.
You should consider selling training videos. RM is an excellent tool to improve Data Mining skills. It's the only tool I've found that is easy enough for non-developers to use and less expensive than SAS and the other systems. Many of the open source systems are academic-based and require Java or other programming knowledge.0 -
Hi again,
great to hear that it works now.
We actually also considered to create and sell training videos and they will probably come some time. For now, we decided to provide in-house training courses and courses here in Germany for very small groups so we can discuss exactly the problems of the users and work on data brought with them (if the data is not too sensible of course). The training effect is much larger than and all attendees confirmed this after the course. We would of course not get the same amount of personal and problem-related discussions with videos alone.
On the other hands, for standard processes it is definitely a good idea to provide web casts and / or training videos so they will come for basic tasks.
Thanks for your kind words and all the best,
Ingo0 -
This question might be similar. My texts are stored in an Excel file with one column identifying the member and one containing comments. Would I use the ExcelExample operator to get the data into RM and then the SingleText operator to create vectors? Thanks!0
-
Hello,
almost. It's ExcelExampleSource together with the StringTextInput (not SingleTextInput) with the parameter filter_nominal_attributes set to true. And don't forget to add inner operators to the text preprocessing operator
Cheers,
Ingo0 -
I have a similar question... I have 2 fields, one for data ID and one for a description of each record. However, the descriptions are so long (max = 65,535 chars) that I can't set them as string types... I'm using MySQL and I'm trying to use the text plugin to create word vectors for each record. When I use other data (in string form) I can get RM to create vectors from it, but I haven't had the same luck with what I really want to do. I hope that kind of makes sense...0
-
AmI
Go to this post
http://rapid-i.com/rapidforum/index.php/topic,78.0.html
There are some screen shots of setting up your SQL link and the operators and parameters to set. Download the STI patch and set filter_nominal_attributes to ON/TRUE. Place your ID field as the first field to select so it will show up as the left part of the string, and set ID_attribute_type to short or long so the text is used as the ID. Otherwise RM will overwrite your ID.
This is a workaround until STI can correctly pass ID fields through the operator.
Let us know how well your long descriptions work. I haven't used strings more than a few sentences or an email's amount of text, and can't tell you if such large amounts of text will work correctly.
B.0 -
Hello,
just wanted to let you know that we just made a new release of RapidMiner (version: 4.2) containing a bug fix for the ID problem of the StringTextInput operator. The links will be available on our website during the next hours.
Cheers,
Ingo0