StringTextInput discards original ID values, replaces with different values

I use custom ID values for each record in my database. StringTextInput discards these ID values and the ID attribute name, and inserts its own attribute name and ID values.

Is there a way to keep my original ID values? The new ID values do not match the original records, so I cannot match my records to the cluster results. See step 3 in the screen capture to see how they change.

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_system"	value="Microsoft SQL Server (JTDS)"/>
        <parameter key="database_url"	value="jdbc:jtds:sqlserver://localhost:1433/xxx"/>
        <parameter key="id_attribute"	value="RecID"/>
        <parameter key="password"	value="qqqqq"/>
        <parameter key="query"	value="SELECT * FROM [tblGolfTest]"/>
        <parameter key="username"	value="sa"/>
    </operator>
    <operator name="ExampleVisualizer (Step1)" class="ExampleVisualizer" breakpoints="before">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes"	value="true"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
    <operator name="ExampleVisualizer (Step2)" class="ExampleVisualizer" breakpoints="before">
    </operator>

[attachment deleted by admin]

TobiasMalbrecht

Hi,

I just checked and reproduced your problem. The StringTextInput operator seems to simply create a new ID attribute instead of keeping the old one. As I am not that familiar with the text plugin and its implementation, I do not know whether this is intended or an accidental implementation artefact. I assume the latter. I will check this and post again when I know more about this issue.

Regards,
Tobias

Tobias,

If the original ID is discarded, there is no easy way to link results back to the original data. I think this is an error.

B..

Tobias

Have you determined what the problem is with ID not carrying through the process? thanks

jdouet

Hi Tobias, Hi All,

I have a problem with both DatabaseExampleSource and StringTextInput. I have PHP/MySQL blogs, and I took the dump/backup files and imported them into my local MySQL server.
Then I wrote my SQL query, expecting "post_content" and "post_title" to be of type "string". In the original DB they were "text", but after importing they are "nominal". What can I do?
I have tried "filter nominal attributes", but it is refused because there is no string attribute in my resulting ExampleSet.

Cheers,
Jean-Charles.

Jean-Charles

For StringTextInput, set filter_nominal_attributes to ON/TRUE. With that setting I am able to get text into the STI operator from my SQL database. However, the record ID is discarded in STI, so you will not be able to match results back to the original data in your SQL database.

HTH
B.

<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/database"/>
<parameter key="id_attribute" value="RecID"/>
<parameter key="password" value="zzz"/>
<parameter key="query" value="SELECT "/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">

<parameter key="filter_nominal_attributes" value="true"/>

<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>

Jean-Charles

I forgot to mention that the SQL query in DBExampleSource should pull the text fields from your database.

<parameter key="query" value="SELECT TextField1, TextField2, ....... From Table"/>

I haven't mixed text and other data types such as numeric or dates so I can't tell you what will happen.

B.

jdouet

Hi B.,

I have tried "filter nominal attributes", with no effect...
Here is my experiment:

<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource" breakpoints="after">
<description text="see the text/nominal type problem: on the SQL side with charset, cast and convert; on the RM side by tinkering with the AML, and by asking Ingo"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/installer0018843"/>
<parameter key="id_attribute" value="ID"/>
<parameter key="label_attribute" value="post_category"/>
<parameter key="password" value="dummy"/>
<parameter key="query" value="select ID, post_issue_date, post_content, post_title, post_category from evo_posts;"/>
<parameter key="table_name" value="evo_posts"/>
<parameter key="username" value="root"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter" activated="no">
<parameter key="attribute_description_file" value="C:\Documents and Settings\JCD\Bureau\outils de recherche\analyses statistiques\analyse_site\table_posts_blogs.aml"/>
<parameter key="example_set_file" value="C:\Documents and Settings\JCD\Bureau\outils de recherche\analyses statistiques\analyse_site\data\table_posts_blogs.dat"/>
</operator>
<operator name="texte" class="OperatorChain" expanded="yes">
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_encoding" value="windows-1252"/>
<parameter key="default_content_language" value="french"/>
<parameter key="default_content_type" value="html"/>
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<parameter key="vector_creation" value="TermOccurrences"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
</operator>
</operator>

"post_title" and "post_content" are "text" types, whereas the "enum(published, private)" columns are the ones that should be of nominal type. But it does not seem to work this way...

Jean-Charles.

PS: I have also tried "GuessValueType", with no effect. It would be useful if nominal attributes whose values all contain blank spaces were recognized as "string", wouldn't it?
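
To make the suggestion in the PS concrete, here is a minimal sketch of that heuristic in plain Python. It is not RapidMiner code and says nothing about how GuessValueType actually works; the function name, the "post_status" column name and the sample values are made up for illustration.

```python
def guess_value_type(values):
    """Return "string" if every non-empty value contains whitespace, else "nominal".

    Hypothetical heuristic: free text usually has blanks, category labels usually do not.
    """
    non_empty = [v.strip() for v in values if v and v.strip()]
    if non_empty and all(any(ch.isspace() for ch in v) for v in non_empty):
        return "string"
    return "nominal"


# Made-up sample columns, loosely modelled on the evo_posts table.
columns = {
    "post_title": ["My first post", "Holidays in Brittany"],
    "post_status": ["published", "private"],
}

guessed = {name: guess_value_type(vals) for name, vals in columns.items()}
print(guessed)
```

A rule like this would misclassify one-word titles as nominal, so it could only ever be a default that the user can override.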

Jean-Charles

I notice your process structure is a little different from mine.

You read data from MySQL and save it with ExampleSetWriter, then continue to an OperatorChain with StringTextInput. You do not read your example set back into the process. You also select a date type (post_issue_date).

DBExampleSource
ExampleSetWriter
OperatorChain (Text)
STI
StringTokenizer

My structure is more direct and I don't have date or non-text fields.

DBExampleSource
STI
StringTokenizer

These are the only differences I see between the two processes. Can you rearrange your process to match mine (leave out ExampleSetWriter and don't put STI and the tokenizer in an OperatorChain) and use only text fields (no date or other non-text fields) to see what results you obtain?

Also, I remember now there was an issue with the text operators reading from SQL databases:
http://rapid-i.com/rapidforum/index.php/topic,19.0.html

<We fixed the Text plugin and uploaded a new version at:>

Windows Installer: http://rapid-i.com/snapshot/rapidminer-text-4.1-installer.exe

Good luck.

jdouet

Hi B.,

About ExampleSetWriter, it was disabled, I use that kind of option instead of deleting it...

Here is my new experiment:
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource" breakpoints="after">
<description text="see the text/nominal type problem: on the SQL side with charset, cast and convert; on the RM side by tinkering with the AML, and by asking Ingo"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/installer0018843"/>
<parameter key="id_attribute" value="ID"/>
<parameter key="password" value="..."/>
<parameter key="query" value="select ID, post_content, post_title from evo_posts;"/>
<parameter key="username" value="root"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_encoding" value="windows-1252"/>
<parameter key="default_content_language" value="french"/>
<parameter key="default_content_type" value="html"/>
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<parameter key="vector_creation" value="TermOccurrences"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
</operator>

I have launched the experiment, but I still have nominal attributes.
Now I am going to have a look at the bugfix. Thank you!

Cheers,
Jean-Charles.

jdouet

Ok, Hello all again...

I have used the corrected plugin: the behavior and results are different, but a problem still remains. Here is my experiment:
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<description text="blablabla"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/installer0018843"/>
<parameter key="id_attribute" value="ID"/>
<parameter key="label_attribute" value="cat_name"/>
<parameter key="password" value="---"/>
<parameter key="query" value="select evo_posts.ID, evo_posts.post_issue_date, evo_posts.post_content, evo_posts.post_title, evo_categories.cat_name, evo_categories.cat_blog_ID from evo_posts, evo_categories where evo_posts.post_category=evo_categories.cat_ID and ID &lt; 20;"/>
<parameter key="username" value="root"/>
</operator>
<operator name="ChangeAttributeRole (2)" class="ChangeAttributeRole">
<parameter key="name" value="post_issue_date"/>
<parameter key="target_role" value="id"/>
</operator>
<operator name="ChangeAttributeRole" class="ChangeAttributeRole" breakpoints="after">
<parameter key="name" value="cat_blog_ID"/>
<parameter key="target_role" value="batch"/>
</operator>
<operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
<parameter key="attribute_name_regex" value="post_content|post_title"/>
<parameter key="deliver_inner_results" value="true"/>
<operator name="StringTextInput (2)" class="StringTextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_encoding" value="windows-1252"/>
<parameter key="default_content_language" value="french"/>
<parameter key="default_content_type" value="html"/>
<parameter key="filter_nominal_attributes" value="true"/>
<parameter key="id_attribute_type" value="short"/>
<list key="namespaces">
</list>
<parameter key="vector_creation" value="TermFrequency"/>
<operator name="StringTokenizer (2)" class="StringTokenizer">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
</operator>
<operator name="SnowballStemmer" class="SnowballStemmer">
</operator>
</operator>
</operator>
</operator>

Now, comparing the data before and after the breakpoint:
- The "batch" attribute disappears. If I create a "label_2" attribute, it disappears too!
- If I activate "extend exampleset", the behaviour is strange: all vectors are NULL, and the old attributes from before vectorization remain.

Is that normal, doctor?

Cheers,
Jean-Charles.

IngoRM

Hi,

about the special attributes which got lost: I think there is an option like "append_to_example_set" or "extend_example_set" or something similar. I think this parameter was added in order to keep the former attributes (at least the id attribute but probably also the others like batch etc.).

Cheers,
Ingo

jdouet

B. wrote:

Tobias

Have you determined what the problem is with ID not carrying through the process? thanks

@B.: I realized that I have the same problem as you...
@Ingo: I deactivated "extend exampleset"; now I understand why all my vectors were flat (!!)

To sum up: I have lost "batch" and equivalent attributes, and my IDs have been modified...

Cheers,
Jean-Charles.

Jean-Charles, Ingo

The problem is probably in the STI operator and how it handles ID attributes.

I set id_attribute_type to short and then to long, and the text fields from my SQL records were merged into one field and used as the ID, in place of a number generated by STI.

When I select one text field from the database, only that field is used as the ID. So if I have several words or a sentence, those words or that sentence become the ID values.

I suggest expanding the functionality of STI to include a fourth type of ID: a pass-through (external) ID that is passed into STI and not altered. Then we can match RM output back to the original source data.
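
To illustrate what a pass-through ID would mean, here is a small Python sketch. It is not how STI is implemented and does not use any RapidMiner API; the function name and the "Notes" column are hypothetical, and only "RecID" comes from the process above. The idea is simply that the vector for each record stays keyed by the ID that came in with it.

```python
from collections import Counter


def vectorize_with_ids(records, id_key, text_keys):
    """Build term-occurrence vectors, keeping each record's own ID untouched."""
    vectors = {}
    for record in records:
        # Merge the selected text fields, then split on whitespace.
        text = " ".join(str(record[key]) for key in text_keys)
        tokens = text.lower().split()
        # The external ID is used as the key and is never regenerated.
        vectors[record[id_key]] = Counter(tokens)
    return vectors


# Made-up records; "Notes" is a hypothetical text column.
records = [
    {"RecID": 1017, "Notes": "long drive off the tee"},
    {"RecID": 2044, "Notes": "short putt short of the hole"},
]

vectors = vectorize_with_ids(records, id_key="RecID", text_keys=["Notes"])
print(sorted(vectors))          # the original RecID values
print(vectors[2044]["short"])   # occurrences of "short" in record 2044
```

With the IDs carried through like this, cluster results could be matched back to the source rows with an ordinary lookup on RecID.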

B.

jdouet

Hi B.,

I see the same behaviour: the "post_content" field becomes the ID field! To work around it, I have to reload the original table and "ExampleJoin" it with the vector table...
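
A rough sketch of that workaround, written in plain Python rather than as a RapidMiner process. It assumes the vector table's ID column holds the full "post_content" text, as described above; the function name, row layout and sample rows are made up, and it does not show how the join operator itself is configured.

```python
def rejoin_ids(original_rows, vector_rows, text_key="post_content"):
    """Attach the original ID to each vector row by matching on the text."""
    ids_by_text = {}
    for row in original_rows:
        ids_by_text.setdefault(row[text_key], []).append(row["ID"])

    joined = []
    for vec in vector_rows:
        candidates = ids_by_text.get(vec["id"], [])
        # Only accept an unambiguous match; duplicate or unknown texts give None.
        original_id = candidates[0] if len(candidates) == 1 else None
        joined.append({"ID": original_id, "vector": vec["vector"]})
    return joined


# Made-up rows: the original table, and vectors whose "id" is the text itself.
original_rows = [
    {"ID": 12, "post_content": "hello world"},
    {"ID": 15, "post_content": "second post"},
]
vector_rows = [
    {"id": "second post", "vector": {"second": 1, "post": 1}},
    {"id": "hello world", "vector": {"hello": 1, "world": 1}},
]

joined = rejoin_ids(original_rows, vector_rows)
print([row["ID"] for row in joined])
```

Note that this only works while every text is unique; two posts with identical content cannot be told apart this way, which is one more reason a real pass-through ID is preferable.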

Cheers,
Jean-Charles.

IngoRM

Hello all,

I see. I have added this to our todo list and we will try to incorporate this into the next release which will probably come next week.

Cheers and thanks for pointing this out,
Ingo

IngoRM

Hi again,

I just wanted to let you know that we have just made a new release of RapidMiner (version 4.2). The links to the new version will be available on our web site in a few hours. It also contains a bugfix for the ID problem.

Cheers,
Ingo