"Stem (Dictionary) Indonesia Language with regex"

User: "baybay"
New Altair Community Member
Updated by Jocelyn

Hello,

I have a problem using regex with Stem (Dictionary) for the Indonesian language. Here is an example sentence in Indonesian:

saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus

I want to turn it into this:

saya sangat senang dengan kalian, tampil dan suara sangat bagus

This works when I use stem entries like this:

kalian:kalian.*
tampil:tampil.*
suara:suara.*

But it fails when I try to use other regex rules:

:-(.*)$
:(ku|mu|nya|lah|kah|tah|pun)$

How can I use the stemmer with rules other than the "text:text.*" form?

Please help me with this case :)

Thanks

Best regards,

Bay

    User: "sgenzer"
    Altair Employee

    Hello @baybay - hmm, I don't speak Indonesian, and I'm puzzled about what you're trying to do with your first regular expression:

    -(.*)$

    The second one seems OK. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120 :)

    Scott
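
To make the first pattern concrete, here is a minimal Python sketch (outside RapidMiner, so only an approximation of what the stemmer does) that reads `-(.*)$` as an ordinary regular expression. The helper names `matched_tail` and `strip_tail` are hypothetical. Read this way, the pattern matches a hyphen and everything after it, which would drop the repeated half of a reduplicated word such as kalian-kalian.

```python
import re

# The first pattern from the question, read as a plain regular expression.
REDUPLICATION_TAIL = re.compile(r"-(.*)$")


def matched_tail(token):
    """Return the part of the token that the pattern matches, or None."""
    match = REDUPLICATION_TAIL.search(token)
    return match.group(0) if match else None


def strip_tail(token):
    """Delete the matched part (hypothetical search-and-replace reading)."""
    return REDUPLICATION_TAIL.sub("", token)


print(matched_tail("kalian-kalian"))  # the hyphen plus the repeated word
print(strip_tail("kalian-kalian"))    # the word without its repetition
print(matched_tail("suaranya"))       # no hyphen, so nothing matches
```

Whether Stem (Dictionary) applies the pattern as a search-and-replace like this is an assumption here, not something confirmed in the thread.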

     

    User: "YYH"
    Altair Employee

    Hi @baybay,

    You can definitely use a rule-based stemmer. A preferred way is the "Stem Tokens Using ExampleSet" operator from the Operator Toolbox extension.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
    <parameter key="text" value="saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus&#10;"/>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
    <parameter key="mode" value="specify characters"/>
    <parameter key="characters" value=", "/>
    </operator>
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="dictionary&#10;kalian:kalian.*&#10;tampil:tampil.*&#10;suara:suara.*"/>
    </operator>
    <operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
    <parameter key="attribute" value="dictionary"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Stem Tokens Using ExampleSet" to_port="document"/>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Stem Tokens Using ExampleSet" to_port="example set"/>
    <connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    A comprehensive study of stemming for Indonesian:

    https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf

     

    HTH,

    YY
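
The dictionary idea in the process above can be sketched in plain Python. This is an illustration, not the operator's actual implementation: it assumes each row has the form `stem:pattern`, that a token is replaced by the stem only when the pattern matches the whole token, and that the first matching row wins. The function names are hypothetical.

```python
import re

# Dictionary rows copied from the process above: "stem:pattern".
DICTIONARY = ["kalian:kalian.*", "tampil:tampil.*", "suara:suara.*"]


def parse_rules(rows):
    """Split each row at the first colon into (stem, compiled pattern)."""
    rules = []
    for row in rows:
        stem, pattern = row.split(":", 1)
        rules.append((stem, re.compile(pattern)))
    return rules


def tokenize(text):
    """Split on commas and spaces, like the Tokenize step with ', '."""
    return [tok for tok in re.split(r"[, ]+", text.strip()) if tok]


def stem_tokens(tokens, rules):
    """Replace a token by its stem when a pattern matches the whole token."""
    result = []
    for token in tokens:
        for stem, pattern in rules:
            if pattern.fullmatch(token):  # assumption: whole-token match
                token = stem
                break
        result.append(token)
    return result


text = "saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus"
print(" ".join(stem_tokens(tokenize(text), parse_rules(DICTIONARY))))
```

In this sketch, a row like `:-(.*)$` never fires on kalian-kalian, because the pattern would have to match the whole token, and it would map the token to an empty stem if it did.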

    User: "baybay"
    New Altair Community Member
    OP

    Hi @sgenzer,

     

    I have attached the dataset, the XML, and the stemming file.

     

    Thanks

    Bay

    User: "baybay"
    New Altair Community Member
    OP

    Hi @yyhuang,

     

    So we must enter the stems one by one, like "suara:suara.*"?

    I just want the words to be stemmed automatically, like in this link

     

    Thanks

    Bay
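
For the automatic removal asked about here, one possible approach is to treat the two patterns from the original question as substitution rules rather than dictionary rows. The sketch below is plain Python, not a RapidMiner process, and `strip_affixes` is a hypothetical helper; it simply deletes whatever each pattern matches.

```python
import re

# The two patterns from the question, used as "delete what matches" rules.
RULES = [
    re.compile(r"-(.*)$"),                        # repeated half after a hyphen
    re.compile(r"(ku|mu|nya|lah|kah|tah|pun)$"),  # suffixes listed in the question
]


def strip_affixes(token):
    """Apply each rule in order, removing the matched text from the token."""
    for rule in RULES:
        token = rule.sub("", token)
    return token


sentence = "saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus"
tokens = [tok for tok in re.split(r"[, ]+", sentence) if tok]
print(" ".join(strip_affixes(tok) for tok in tokens))
```

With only these two rules, tampilannya becomes tampilan rather than tampil, because -an is not in the suffix list. Adding `an` to the list would also clip dengan to deng, so purely suffix-based rules likely need a list of protected words or a minimum stem length.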