StopwordfilterFile
nguyenxuanhau
I'm using the StopwordFilterFile operator, but it does not filter many of my stop words, such as: với, ới, tời, đỗ
My process XML is as follows:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.6">
<operator name="Root" class="Process" expanded="yes">
<description text="Text Hau"/>
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<operator name="TextInput" class="TextInput" expanded="yes">
<list key="texts">
<parameter key="graphics" value="../../data/dulieu"/>
</list>
<parameter key="default_content_type" value=""/>
<parameter key="default_content_encoding" value="UTF-8"/>
<parameter key="default_content_language" value=""/>
<parameter key="prune_below" value="1"/>
<parameter key="prune_above" value="-1"/>
<parameter key="vector_creation" value="TFIDF"/>
<parameter key="use_content_attributes" value="false"/>
<parameter key="use_given_word_list" value="false"/>
<parameter key="return_word_list" value="false"/>
<parameter key="id_attribute_type" value="short"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="false"/>
<parameter key="on_the_fly_pruning" value="-1"/>
<parameter key="extend_exampleset" value="false"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StopwordFilterFile" class="StopwordFilterFile">
<parameter key="file" value="../../data/dulieu/stopword/stopword.dat"/>
<parameter key="case_sensitive" value="false"/>
</operator>
</operator>
</operator>
</process>
The stopword file contains one stop word per line.
What do I need to do to get the StopwordFilterFile operator to work?
Greetings!
haddock
Hi there,
Thanks for posting the process; however, most folks now use version 5 and will not be able to load it. Upgrade to commune!
As to your problem, my guess is that it is about the characters in those words, and whether their encoding is set correctly, both in RapidMiner and in the stopword file (I notice you use both windows-1252 and UTF-8 in your RapidMiner XML). There are also problems specific to Vietnamese, detailed here: http://vietunicode.sourceforge.net/main.html. Obviously, if the letters are encoded differently, the texts will not match; but if they are encoded the same way throughout, then I'd need to look into the source.
Which I don't have, because the Text plugin has also been updated!
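To illustrate the guess above outside RapidMiner, here is a minimal Python sketch of the two ways a Vietnamese stop word can fail to match: the file is read with the wrong encoding, or the same letters are stored as different Unicode code points (precomposed versus decomposed). The helper names `normalize_token` and `load_stopwords` and the sample file content are hypothetical and exist only for this example; nothing here reflects how StopwordFilterFile is actually implemented.

```python
import unicodedata


def normalize_token(token):
    """Return the token in NFC form and lowercase, so visually identical words compare equal."""
    return unicodedata.normalize("NFC", token).lower()


def load_stopwords(raw_bytes, encoding="utf-8"):
    """Decode a one-word-per-line stopword file and return a set of normalized words."""
    text = raw_bytes.decode(encoding).lstrip("\ufeff")  # drop a BOM if one is present
    return {normalize_token(line.strip()) for line in text.splitlines() if line.strip()}


# "với" written with escapes so the exact code points are unambiguous (precomposed form).
word = "v\u1edbi"

# 1) Encoding mismatch: UTF-8 bytes read as windows-1252 turn into different characters.
misread = word.encode("utf-8").decode("cp1252", errors="replace")
print(misread == word)  # False

# 2) windows-1252 cannot represent this word at all.
try:
    word.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False
print(fits_cp1252)  # False

# 3) Composed vs. decomposed: same appearance on screen, different code points.
decomposed = unicodedata.normalize("NFD", word)
print(decomposed == word)  # False
print(normalize_token(decomposed) == normalize_token(word))  # True

# Hypothetical stopword file content, stored as UTF-8 in decomposed form.
stopwords = load_stopwords((decomposed + "\n\u0111\u1ed7\n").encode("utf-8"))
print(normalize_token(word) in stopwords)  # True
```

If a check like this shows that the stopword file and the input texts disagree, saving both as UTF-8 in the same normalization form would be the first thing to try before looking at the operator itself.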