StringTextInput discards original ID values, replaces with different values
B_
New Altair Community Member
I use custom ID values for each record in my database. StringTextInput discards these ID values and the ID attribute name, and inserts its own attribute name and ID values.
Is there a way to keep my original ID values? The new ID values do not match the original records, so I cannot match my records to the cluster results. See step 3 in the screen capture to see how these change.
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example
set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for
setting up the directories from which the text documents are read. Sometimes, however, a
more flexible solution is needed. If, for instance, your text documents have different types
of encoding or are written in different languages, you might wish to provide this
information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by
using an example set that contains one row for each input directory and corresponding
attributes for source, encoding, type and class. If such an example set is provided, the
texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/xxx"/>
<parameter key="id_attribute" value="RecID"/>
<parameter key="password" value="qqqqq"/>
<parameter key="query" value="SELECT * FROM [tblGolfTest]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="ExampleVisualizer (Step1)" class="ExampleVisualizer"
breakpoints="before">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="ExampleVisualizer (Step2)" class="ExampleVisualizer"
breakpoints="before">
</operator>
</operator>
Hi,
I just checked and reproduced your problem. The StringTextInput operator seems to simply create a new ID attribute instead of keeping the old one. As I am not that familiar with the text plugin and its implementation, I do not know whether this is intended or an accidental implementation artifact. I assume the latter. I will check this and post again when I know more about this issue.
Regards,
Tobias
Tobias,
If the original ID is discarded, there is no easy way to link results back to the original data. I think this is an error.
B.
Tobias
Have you determined what the problem is with the ID not carrying through the process? Thanks.
Hi Tobias, Hi All,
I have a problem with both DatabaseExampleSource and StringTextInput. I have PHP/MySQL blogs, and I took their dump/backup files and imported them into my local MySQL server.
Then I built my SQL query, expecting "post_content" and "post_title" to be of "string" type. In the original DB they were "text", but after importing they are "nominal". What can I do?
I have used "filter nominal attributes", but it is refused since there is no string attribute in my resulting ExampleSet.
Cheers,
Jean-Charles.
Jean-Charles
For StringTextInput, set filter_nominal_attributes to ON/TRUE. With that setting I am able to get text into the STI operator from my SQL database. However, the record ID is discarded in STI, so you will not be able to match results back to the original data in your SQL database.
HTH
B.
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/database"/>
<parameter key="id_attribute" value="RecID"/>
<parameter key="password" value="zzz"/>
<parameter key="query" value="SELECT "/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
Jean-Charles
I forgot to mention your SQL query in DBExampleSource will be pulling the text fields from your database.
<parameter key="query" value="SELECT TextField1, TextField2, ....... From Table"/>
I haven't mixed text with other data types such as numeric or dates, so I can't tell you what will happen.
B.
Hi B.,
I have tried with "filter nominal attributes": nothing...
Here is my experiment:
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource" breakpoints="after">
<description text="see the text/nominal type problem: on the SQL side with charset, cast and convert; on the RM side by tinkering with the AML, and by asking Ingo"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/installer0018843"/>
<parameter key="id_attribute" value="ID"/>
<parameter key="label_attribute" value="post_category"/>
<parameter key="password" value="dummy"/>
<parameter key="query" value="select ID, post_issue_date, post_content, post_title, post_category from evo_posts;"/>
<parameter key="table_name" value="evo_posts"/>
<parameter key="username" value="root"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter" activated="no">
<parameter key="attribute_description_file" value="C:\Documents and Settings\JCD\Bureau\outils de recherche\analyses statistiques\analyse_site\table_posts_blogs.aml"/>
<parameter key="example_set_file" value="C:\Documents and Settings\JCD\Bureau\outils de recherche\analyses statistiques\analyse_site\data\table_posts_blogs.dat"/>
</operator>
<operator name="texte" class="OperatorChain" expanded="yes">
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_encoding" value="windows-1252"/>
<parameter key="default_content_language" value="french"/>
<parameter key="default_content_type" value="html"/>
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<parameter key="vector_creation" value="TermOccurrences"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
</operator>
</operator>
"post_title" and "post_content" are "text" types, while there are "enum(published, private)" types that should rather be of nominal type. But it does not seem to work this way...
Jean-Charles.
PS: I have used "GuessValueType" and nothing... It would be interesting if nominal attributes in which every value contains blank spaces were recognized as "string", wouldn't it?
Jean-Charles
I notice your process structure is a little different from mine.
You read data from MySQL and save it with ExampleSetWriter, then continue to an OperatorChain with StringTextInput. You do not read your example set back into the process. You also select a date type (post_issue_date).
DBExampleSource
ExampleSetWriter
OperatorChain (Text)
    STI
        StringTokenizer
My structure is more direct and I don't have date or non-text fields.
DBExampleSource
STI
    StringTokenizer
These are the only differences I see between the two processes. Can you rearrange your process to match mine (leave out ExampleSetWriter and don't put STI and the tokenizer in an OperatorChain) and use only text fields (no date or non-text) to see what results you obtain?
Also, I remember now there was an issue with the text operators reading from SQL databases:
http://rapid-i.com/rapidforum/index.php/topic,19.0.html
"We fixed the Text plugin and uploaded a new version at:"
Windows Installer: http://rapid-i.com/snapshot/rapidminer-text-4.1-installer.exe
Good luck.
Hi B.,
About ExampleSetWriter: it was disabled. I use that option instead of deleting the operator...
Here is my new experiment:
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource" breakpoints="after">
<description text="see the text/nominal type problem: on the SQL side with charset, cast and convert; on the RM side by tinkering with the AML, and by asking Ingo"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/installer0018843"/>
<parameter key="id_attribute" value="ID"/>
<parameter key="password" value="..."/>
<parameter key="query" value="select ID, post_content, post_title from evo_posts;"/>
<parameter key="username" value="root"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_encoding" value="windows-1252"/>
<parameter key="default_content_language" value="french"/>
<parameter key="default_content_type" value="html"/>
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<parameter key="vector_creation" value="TermOccurrences"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
</operator>
I have launched the experiment, but I still have nominal attributes.
Now I am going to have a look at the bugfix, thank you!
Cheers,
Jean-Charles.
OK, hello all again...
I have used the corrected plugin: behavior and results are different, but a problem still remains. Here is my experiment:
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<description text="blablabla"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/installer0018843"/>
<parameter key="id_attribute" value="ID"/>
<parameter key="label_attribute" value="cat_name"/>
<parameter key="password" value="---"/>
<parameter key="query" value="select evo_posts.ID, evo_posts.post_issue_date, evo_posts.post_content, evo_posts.post_title, evo_categories.cat_name, evo_categories.cat_blog_ID from evo_posts, evo_categories where evo_posts.post_category=evo_categories.cat_ID and ID &lt; 20;"/>
<parameter key="username" value="root"/>
</operator>
<operator name="ChangeAttributeRole (2)" class="ChangeAttributeRole">
<parameter key="name" value="post_issue_date"/>
<parameter key="target_role" value="id"/>
</operator>
<operator name="ChangeAttributeRole" class="ChangeAttributeRole" breakpoints="after">
<parameter key="name" value="cat_blog_ID"/>
<parameter key="target_role" value="batch"/>
</operator>
<operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
<parameter key="attribute_name_regex" value="post_content|post_title"/>
<parameter key="deliver_inner_results" value="true"/>
<operator name="StringTextInput (2)" class="StringTextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_encoding" value="windows-1252"/>
<parameter key="default_content_language" value="french"/>
<parameter key="default_content_type" value="html"/>
<parameter key="filter_nominal_attributes" value="true"/>
<parameter key="id_attribute_type" value="short"/>
<list key="namespaces">
</list>
<parameter key="vector_creation" value="TermFrequency"/>
<operator name="StringTokenizer (2)" class="StringTokenizer">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
</operator>
<operator name="SnowballStemmer" class="SnowballStemmer">
</operator>
</operator>
</operator>
</operator>
Now, comparing before and after the breakpoint:
- The "batch" attribute disappears...? If I create a "label_2" attribute type, it disappears too!
- If I activate "extend exampleset", there is strange behaviour: all vectors are NULL, and the old attributes from before vectorization remain.
Is that normal, doctor?
Cheers,
Jean-Charles.
Hi,
about the special attributes which got lost: I think there is an option like "append_to_example_set" or "extend_example_set" or something similar. I think this parameter was added in order to keep the former attributes (at least the id attribute but probably also the others like batch etc.).
Cheers,
Ingo
@B.: I realized that I have the same problem as you.
B. wrote: "Tobias, have you determined what the problem is with ID not carrying through the process? thanks"
@Ingo: I deactivated "extend exampleset"; I have understood why all my vectors were flat (!!)
To sum up: I have lost "batch" or equivalents, and my IDs have been modified...
Cheers,
Jean-Charles.
Jean-Charles, Ingo
The problem is probably in the STI operator and how it handles ID attributes.
I set id_attribute_type to short and to long, and the text fields from my SQL records were merged into one field and used as the ID in place of a number generated by STI.
When I select one text field from the database, only that field is used as the ID. So if a field holds several words or a sentence, those words or that sentence become the ID values.
I suggest expanding the functionality of STI to include a fourth type of ID: a pass-through or external ID that is passed into STI and not altered. Then we can match RM output back to the original source data.
B.
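The pass-through ID suggested above can be illustrated outside RapidMiner. The following is only a minimal sketch in Python, not the STI implementation: the function names, the field names (RecID, TextField1, TextField2) and the sample rows are all hypothetical, and the tokenizer is a simple stand-in for StringTokenizer. It shows term-occurrence vectors being built while each record's own ID is carried through unchanged.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (a simple stand-in for a tokenizer)."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def vectorize_with_passthrough_id(records, id_field, text_fields):
    """Build term-occurrence vectors, carrying each record's own ID through unchanged."""
    vectors = []
    for record in records:
        # Merge the selected text fields into one document per record.
        text = " ".join(str(record[field]) for field in text_fields)
        vectors.append({"id": record[id_field], "terms": Counter(tokenize(text))})
    return vectors


# Hypothetical rows standing in for a database query result.
rows = [
    {"RecID": 1017, "TextField1": "Sunny and hot", "TextField2": "no play"},
    {"RecID": 2042, "TextField1": "Overcast, mild", "TextField2": "play"},
]

vectors = vectorize_with_passthrough_id(rows, "RecID", ["TextField1", "TextField2"])
for vector in vectors:
    print(vector["id"], dict(vector["terms"]))
```

Because the ID is copied rather than generated, any later result (for example a cluster assignment per vector) can be matched back to the source rows by that ID.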
Hi B.,
I have the same behaviour: the "post_content" field becomes the ID field!! To work around it, I have to reload the original table and "ExampleJoin" it with the vector table...
Cheers,
Jean-Charles.
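The idea behind this join workaround can be sketched in plain Python. This is an illustration under assumptions, not the RapidMiner operator: the function name, the sample rows and the column names of the vector table are hypothetical. It assumes the text field has replaced the ID in the vector table, so the text itself is used as the join key to recover the original ID.

```python
def rejoin_by_text(original_rows, vector_rows, id_field, text_field):
    """Attach the original ID to each vector row by matching on the text that replaced the ID."""
    id_by_text = {}
    for row in original_rows:
        # setdefault keeps the first ID when two records share identical text,
        # which is an ambiguity this workaround cannot resolve.
        id_by_text.setdefault(row[text_field], row[id_field])
    # Rows whose text is not found in the original table get original_id = None.
    return [dict(vector, original_id=id_by_text.get(vector["id"])) for vector in vector_rows]


# Hypothetical original table, reloaded from the database.
original = [
    {"ID": 11, "post_content": "first post about golf"},
    {"ID": 12, "post_content": "second post about weather"},
]

# Hypothetical vector table in which the text has become the ID.
vectors = [
    {"id": "second post about weather", "weather": 1, "golf": 0},
    {"id": "first post about golf", "weather": 0, "golf": 1},
]

joined = rejoin_by_text(original, vectors, "ID", "post_content")
for row in joined:
    print(row["original_id"], row["weather"], row["golf"])
```

Matching on long text is fragile (duplicate posts, or text altered during processing, would break the link), which is why an ID that passes through unchanged is the cleaner fix.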
Hello all,
I see. I have added this to our todo list and we will try to incorporate this into the next release which will probably come next week.
Cheers and thanks for pointing this out,
Ingo
Hi again,
I just wanted to let you know that we have just made a new release of RapidMiner (version 4.2). The links to the new version will be available on our web site in a few hours. It also contains a bugfix for the ID problem.
Cheers,
Ingo