"Stem (Dictionary) Indonesia Language with regex"
Hello,
I have a problem when trying to use regex for Stem (Dictionary) Indonesia language
This is for example indonesian language:
saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus
and I want to make it as below:
saya sangat senang dengan kalian, tampil dan suara sangat bagus
That is working when I used stem like this:
kalian:kalian.*
tampil:tampil.*
suara:suara.*
But failed, when I'am trying to used another regex function:
:-(.*)$
:(ku|mu|nya|lah|kah|tah|pun)$
How can I used stem, besides with function "text: text. *"
Please help me for this case
Thanks
Best Regards,
Bay
Answers
-
hello @baybay - hmm I don't speak Indonesian and am very puzzled on what you're trying to do with your first RegEx expression
-(.*)$
the second one seems ok. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120
Scott
0 -
Hi @baybay,
You definitely can use rule based stemmer. A preferred way is "stem tokens using example set" operator from toolbox extension.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
<parameter key="text" value="saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus "/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=", "/>
</operator>
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="dictionary kalian:kalian.* tampil:tampil.* suara:suara.*"/>
</operator>
<operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
<parameter key="attribute" value="dictionary"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem Tokens Using ExampleSet" to_port="document"/>
<connect from_op="Create ExampleSet" from_port="output" to_op="Stem Tokens Using ExampleSet" to_port="example set"/>
<connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>A comprehensive study of stemming on Indonesia
https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf
HTH,
YY
1 -
@sgenzer wrote:hello @baybay - hmm I don't speak Indonesian and am very puzzled on what you're trying to do with your first RegEx expression
-(.*)$
the second one seems ok. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120
Scott
Hi @sgenzer,
I sent by attachment for dataset, XML and stemming
Thanks
Bay
0 -
@yyhuang wrote:
Hi @baybay,
You definitely can use rule based stemmer. A preferred way is "stem tokens using example set" operator from toolbox extension.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
<parameter key="text" value="saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus "/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=", "/>
</operator>
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="dictionary kalian:kalian.* tampil:tampil.* suara:suara.*"/>
</operator>
<operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
<parameter key="attribute" value="dictionary"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem Tokens Using ExampleSet" to_port="document"/>
<connect from_op="Create ExampleSet" from_port="output" to_op="Stem Tokens Using ExampleSet" to_port="example set"/>
<connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>A comprehensive study of stemming on Indonesia
https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf
HTH,
YY
Hi @yyhuang,
So we must input stem text one by one like "suara:suara.*"?
I just want to make automaticaly remove stem text like on this link
Thanks
Bay
0