"Stem (Dictionary) for Indonesian language with regex"

baybay New Altair Community Member
edited November 5 in Community Q&A

Hello,

I am having trouble using regular expressions with Stem (Dictionary) on Indonesian text.
Here is an example sentence in Indonesian:

saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus

and I want to turn it into this:

saya sangat senang dengan kalian, tampil dan suara sangat bagus

This works when I use stem rules like these:

kalian:kalian.*
tampil:tampil.*
suara:suara.*

But it fails when I try to use other regex rules:

 :-(.*)$
:(ku|mu|nya|lah|kah|tah|pun)$

How can I write stem rules other than in the form "text:text.*"?

Please help me with this case :)


Thanks

Best Regards,
Bay

Answers

  • sgenzer
    Altair Employee

    hello @baybay - hmm I don't speak Indonesian and am very puzzled about what you're trying to do with your first RegEx:

    -(.*)$

    The second one seems ok. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120 :)

    Scott
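
To make the two expressions from the question concrete, here is a minimal Python sketch (not RapidMiner code) that applies them as plain regex substitutions to the three affected tokens. The helper name `strip_with` and the variable names are made up for illustration, and the sketch says nothing about how Stem (Dictionary) itself interprets these patterns.

```python
import re

tokens = ["kalian-kalian", "tampilannya", "suaranya"]

# The two patterns from the question, used here as plain substitutions.
# Matches a hyphen and everything after it (the repeated half of a word).
REDUPLICATION = re.compile(r"-(.*)$")
# Matches one trailing particle or possessive suffix.
SUFFIXES = re.compile(r"(ku|mu|nya|lah|kah|tah|pun)$")


def strip_with(pattern, token):
    """Delete whatever the pattern matches and return the rest of the token."""
    return pattern.sub("", token)


# Apply the reduplication pattern first, then the suffix pattern.
results = {t: strip_with(SUFFIXES, strip_with(REDUPLICATION, t)) for t in tokens}
print(results)
```

Read this way, the first expression drops the hyphen and the repeated half of a reduplicated word, and the second drops one trailing particle or possessive. Note that `tampilannya` becomes `tampilan` rather than `tampil`, so the `-an` suffix would need a rule of its own.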

     

  • YYH
    Altair Employee

    Hi @baybay,

    You can definitely use a rule-based stemmer. The preferred way is the "Stem Tokens Using ExampleSet" operator from the Operator Toolbox extension.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
    <parameter key="text" value="saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus&#10;"/>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
    <parameter key="mode" value="specify characters"/>
    <parameter key="characters" value=", "/>
    </operator>
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="dictionary&#10;kalian:kalian.*&#10;tampil:tampil.*&#10;suara:suara.*"/>
    </operator>
    <operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
    <parameter key="attribute" value="dictionary"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Stem Tokens Using ExampleSet" to_port="document"/>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Stem Tokens Using ExampleSet" to_port="example set"/>
    <connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    A comprehensive study of stemming for Indonesian:

    https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf

    HTH,

    YY
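
The dictionary rules in the process above have the form `stem:pattern`. The sketch below imitates that idea in Python, assuming a token is replaced by the stem when the whole token matches the pattern. That matching behaviour is an assumption, not something confirmed for the operator, and `parse_rules` and `stem_token` are hypothetical helper names.

```python
import re

# Same three rules as in the "dictionary" attribute of the process.
RULES_TEXT = """kalian:kalian.*
tampil:tampil.*
suara:suara.*"""


def parse_rules(text):
    """Turn 'stem:pattern' lines into (stem, compiled pattern) pairs."""
    rules = []
    for line in text.splitlines():
        stem, pattern = line.split(":", 1)
        rules.append((stem, re.compile(pattern)))
    return rules


def stem_token(token, rules):
    """Return the stem of the first rule whose pattern matches the whole token."""
    for stem, pattern in rules:
        if pattern.fullmatch(token):  # assumption: whole-token matching
            return stem
    return token  # no rule applies, keep the token unchanged


rules = parse_rules(RULES_TEXT)
sentence = "saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus"
# Split on commas and spaces, like the Tokenize step with characters ", ".
tokens = [t for t in re.split(r"[, ]", sentence) if t]
stemmed = [stem_token(t, rules) for t in tokens]
print(" ".join(stemmed))
```

Under that assumption, a rule such as `:-(.*)$` could not work as a dictionary entry: it has no stem on the left side, and the pattern never matches a whole token like `kalian-kalian`.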

  • baybay
    baybay New Altair Community Member

    Hi @sgenzer,

    I have attached the dataset, the XML, and the stemming file.

    Thanks

    Bay

  • baybay
    baybay New Altair Community Member

    Hi @yyhuang,

    So we must enter the stem rules one by one, like "suara:suara.*"?

    I just want the suffixes to be removed automatically, like on this link

    Thanks

    Bay
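
For the automatic suffix removal asked about here, one possible direction is a small rule-based stemmer that strips suffix groups in order and stops as soon as the word is found in a list of root words. The Python sketch below is only an illustration under that assumption. `ROOT_WORDS` is a hypothetical list covering just the example sentence, the suffix groups are a small subset of Indonesian affixes, and prefixes are not handled at all.

```python
import re

# Hypothetical root-word list for this one sentence; a real stemmer would
# load a full Indonesian lexicon instead.
ROOT_WORDS = {
    "saya", "sangat", "senang", "dengan", "kalian",
    "tampil", "dan", "suara", "bagus",
}

# Suffix groups tried in this order: particles, possessives, derivational.
SUFFIX_RULES = [
    re.compile(r"(lah|kah|tah|pun)$"),
    re.compile(r"(ku|mu|nya)$"),
    re.compile(r"(kan|an|i)$"),
]


def stem(token):
    """Strip a reduplicated half, then suffix groups, until a root word is reached."""
    # For reduplicated forms such as "kalian-kalian", keep the part before the hyphen.
    word = token.lower().split("-", 1)[0]
    for rule in SUFFIX_RULES:
        if word in ROOT_WORDS:
            break  # already a known root, stop stripping
        candidate = rule.sub("", word)
        # Accept the stripped form only if at least three letters remain.
        if len(candidate) >= 3:
            word = candidate
    return word


sentence = "saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus"
tokens = [t for t in re.split(r"[, ]", sentence) if t]
print(" ".join(stem(t) for t in tokens))
```

The root-word check is what keeps `kalian` and `dengan` from losing their final `an`. Without it, the `-an` rule would cut them down to `kali` and `deng`.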