Classify text stored in SQL records

IngoRM
IngoRM New Altair Community Member
edited November 5 in Community Q&A
Original message from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2044574&;forum_id=390413

HI

Congratulations on releasing RapidMiner 4.1!!!

I am trying to classify text stored in SQL records. Each text is an email transaction I want to cluster (no supervised learning yet). I have not found an easy method to get the text into the classifier. 

1- TextInput uses directories as document source, and not individual records or files.
2- DatabaseExampleSource -> StringTextInput does not provide a way to specify which field is text. I select FilterNominalAttributes but I still have an error message about stream input.
3- Operator->Preprocessing->Attributes does not have a nominal to string converter. There is ChangeAttributeType operator, but it seems to be the same as ChangeAttributeRole. (Shouldn't Type be string, boolean, integer, etc.?)

Is it possible to move the text fields directly into the text classifier, or do I have to export and transform them to a RM format (aml/dat) or import?




<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/Rapid"/>
<parameter key="id_attribute" value="RecIDNbr"/>
<parameter key="password" value="xxx"/>
<parameter key="query" value="SELECT [RecIDNbr], [Service] FROM [CustomerHist]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
</operator>
</operator>
Error message:

Error in: StringTextInput (StringTextInput) The input example set does not contain any attribues with value type string. Some operators require example sets with attributes of a specific value type. Please refer to the documentation of the used operators for further details.

The Input Example Set does not contain any attributes with value type string.


Thank you.


Answer by Ingo:

Hello,

could you please provide us a screenshot of the result right (meta data view and data view) after loading it in from your database, i.e. right after 

<operator name="DatabaseExampleSource" class="DatabaseExampleSource"> 
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/> 
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/Rapid"/> 
<parameter key="id_attribute" value="RecIDNbr"/> 
<parameter key="password" value="xxx"/> 
<parameter key="query" value="SELECT [RecIDNbr], [Service] FROM [CustomerHist]"/> 
<parameter key="username" value="sa"/> 
</operator> 

You could for example upload the screenshots here:

http://tinypic.com/

and post the links. Please considering blacking out sensible contents if necessary. I just have to be sure that everything is fine for using the StringTextInput operator.

Thanks in advance. Cheers,
Ingo

Answers

  • B_
    B_ New Altair Community Member
    Ingo

    Here is a sample table with some text fields from a SQL database.  I use the DatabaseExample operator to link to the database and pull the text fields into RM. 

    These pictures show the initial setup and linking with text fiels.  I can specify ID, label and attribute.  I cannot declare an attribute type, such as string or integer.

    http://tinypic.com/view.php?pic=n6ylud&;s=3
    http://tinypic.com/view.php?pic=app1d&;s=3

    Database Example connection  is prepared
    http://tinypic.com/view.php?pic=25kkcr8&;s=3

    StringTextInput is setup after Database Example.  Database Example output is Example Set, and input for StringTextInput is Example Set.  However, the attribute type is not carried from the SQL database through the Example connection, and I can not declare it in the first step.
    http://tinypic.com/view.php?pic=dxl538&;s=3

    Metadata view
    http://tinypic.com/view.php?pic=2l9o3z9&;s=3

    Data view
    http://tinypic.com/view.php?pic=xfp7k0&;s=3


    Error message
    http://tinypic.com/view.php?pic=1z4czt0&;s=3
    http://tinypic.com/view.php?pic=vhzqwx&;s=3

    (These  pictures don't seem clear enough.  If you can't read them, I can send them to you directly.)

    The Operator Chain

    <operator name="Root" class="Process" expanded="yes">
        <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information  for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
        <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
            <parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
            <parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/datarapid"/>
            <parameter key="id_attribute" value="RecId"/>
            <parameter key="password" value="y6sa3JX9Wrc="/>
            <parameter key="query" value="SELECT * FROM [tblArticles]"/>
            <parameter key="username" value="sa"/>
        </operator>
        <operator name="ExampleVisualizer" class="ExampleVisualizer" breakpoints="after">
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="filter_nominal_attributes" value="true"/>
            <list key="namespaces">
            </list>
        </operator>
    </operator>

    Thanks for your help. 
  • IngoRM
    IngoRM New Altair Community Member
    Hello,

    thanks for your great pictures and description. So it was very easy to find the reason for this behavior: there was a bug in the StringTextInput operator so that it was only possible to work on String attributes. Since there is no operator "Nominal2String" or someting similar available, this is of course only possible when you work with .aml files but not (as you have noticed) when you load the data from a database.

    We fixed the Text plugin and uploaded a new version at:

    Windows Installer: http://rapid-i.com/snapshot/rapidminer-text-4.1-installer.exe
    Other platforms: http://rapid-i.com/snapshot/rapidminer-text-4.1.jar

    Sorry for the inconvenience and thanks again for the great report and for pointing this bug out.

    Cheers,
    Ingo
  • B_
    B_ New Altair Community Member
    Ingo

    Thank you for the very fast response and posting the update in a Windows installer.

    Now, the text in each record is brought into RM via DatabaseExample and passed to the next operator.  The import process runs without errors when I set FilterNominalAttributes = On.  The process runs to the end but the StringTextInput does not parse the text and create vectors (or I am missing something.) 

    Also, when I set FilterNominalAttributes = Off, then the process stops and generates an error.

    I checked the installation, and the /lib/plugin file is dated 5/29/2008.

    Here are screen captures.
    =========================

    DatabaseExample pulls from the SQL database. 
    1
    http://tinypic.com/view.php?pic=2qjf57d&;s=3


    StringTextInput
    2
    Filter NominalAttributes = On
    http://tinypic.com/view.php?pic=2qjf57d&;s=3


    Breakpoint 1 metadata
    3
    http://tinypic.com/view.php?pic=121y0qw&;s=3
    http://i27.tinypic.com/2rmxni0.jpg


    Breakpoint 1 data
    4
    http://i31.tinypic.com/21mf3ae.jpg


    Breakpoint 2 Metadata result
    StringTextInput should convert text to word vectors but the text remains as one string
    5
    http://i27.tinypic.com/14tt4io.jpg


    Breakpoint 2 data
    Text records are not parsed to individual word vectors
    6
    http://i30.tinypic.com/bdldw2.jpg




    This time FilterNominalAttributes = Off
    7
    http://i26.tinypic.com/ei81go.jpg


    Error message
    8
    http://i32.tinypic.com/2hqur79.jpg
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    thanks again for the detailed description!
    Also, when I set FilterNominalAttributes = Off, then the process stops and generates an error.
    That's the way it should work. Usually, the StringTextInput operator works on String attributes only and does not transform nominal attributes. This is for example useful if you have stored the texts in attributes with value type string and have other (nominal) attributes which store additional information about the texts (e.g. author...) which should not be transformed. Hence the name StringTextInput. But since some of the IO operators do not support String attributes and nominal ones are by far more common to the users, we decided to add the parameter so that the operator can also directly work on nominal attributes. So in your case you simply can activate the parameter and everything is fine.
    Now, the text in each record is brought into RM via DatabaseExample and passed to the next operator.  The import process runs without errors when I set FilterNominalAttributes = On.  The process runs to the end but the StringTextInput does not parse the text and create vectors (or I am missing something.)
    Yes, and I totally forgot to mention this in my last post, sorry...

    The new Text plugin concept (introduced in RM 4.0 I think) makes the text preprocessing much more modular than before. This means that all steps of preprocessing can now defined as operators (with own parameters each) which have to be added as children to the TextInput / SingleTextInput / StringTextInput operators. Usually, you want to add a tokenizer first (e.g. StringTokenizer), then maybe Stemming etc. If no preprocessing operators were added to the *TextInput operator, no vectors are created at all.

    Actually, there shold be a warning message stating something like "There are no suboperators for this operator. This is usually not intended. You would probably like to at least add a tokenizer".

    So, the solution is quite simple: just add some text preprocessing operators like for example in this setup:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="filter_nominal_attributes" value="true"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
        </operator>
    </operator>
    Cheers,
    Ingo
  • B_
    B_ New Altair Community Member
    OK

    The extra information about the sub-operators is what I didn't understand.  It works when I add Tokenizer and Stopword.  Thank you for the help.

    Setting FilterNominalAttributes to On seemed to be the correct thing to do, but I wasn't getting parsed  results so I tried setting it to Off.

    So much to learn. 

    You should consider selling training videos.  RM is an excellent tool to improve Data Mining skills.  It's the only tool I've found that is easy enough for non-developers to use and less expensive than SAS and the other systems.  Many of the open source systems are academic-based and require Java or other programming knowledge.
  • IngoRM
    IngoRM New Altair Community Member
    Hi again,

    great to hear that it works now.

    We actually also considered to create and sell training videos and they will probably come some time. For now, we decided to provide in-house training courses and courses here in Germany for very small groups so we can discuss exactly the problems of the users and work on data brought with them (if the data is not too sensible of course). The training effect is much larger than and all attendees confirmed this after the course. We would of course not get the same amount of personal and problem-related discussions with videos alone.

    On the other hands, for standard processes it is definitely a good idea to provide web casts and / or training videos so they will come for basic tasks.

    Thanks for your kind words and all the best,
    Ingo
  • rdmckinney
    rdmckinney New Altair Community Member
    This question might be similar. My texts are stored in an Excel file with one column identifying the member and one containing comments. Would I use the ExcelExample operator to get the data into RM and then the SingleText operator to create vectors? Thanks!
  • IngoRM
    IngoRM New Altair Community Member
    Hello,

    almost.  It's ExcelExampleSource together with the StringTextInput (not SingleTextInput) with the parameter filter_nominal_attributes set to true. And don't forget to add inner operators to the text preprocessing operator  ;)

    Cheers,
    Ingo
  • amithatobvious
    amithatobvious New Altair Community Member
    I have a similar question... I have 2 fields, one for data ID and one for a description of each record. However, the descriptions are so long (max = 65,535 chars) that I can't set them as string types... I'm using MySQL and I'm trying to use the text plugin to create word vectors for each record. When I use other data (in string form) I can get RM to create vectors from it, but I haven't had the same luck with what I really want to do. I hope that kind of makes sense...
  • B_
    B_ New Altair Community Member
    AmI

    Go to this post
    http://rapid-i.com/rapidforum/index.php/topic,78.0.html

    There are some screen shots of setting up your SQL link and the operators and parameters to set.  Download the STI patch and set filter_nominal_attributes to ON/TRUE.  Place your ID field as the first field to select so it will show up as the left part of the string, and set ID_attribute_type  to short or long so the text is used as  the ID.  Otherwise RM will overwrite your ID.

    This is a workaround until STI can correctly pass ID fields through the operator.

    Let us know how well your long descriptions work.  I haven't used strings more than a few sentences or an email's amount of text, and can't tell you if such large amounts of text will work correctly.

    B.
  • IngoRM
    IngoRM New Altair Community Member
    Hello,

    just wanted to let you know that we just made a new release of RapidMiner (version: 4.2) containing a bug fix for the ID problem of the StringTextInput operator. The links will be available on our website during the next hours.

    Cheers,
    Ingo