"Stem (Dictionary) for the Indonesian language with regex"

Hello,

I have a problem using regex with Stem (Dictionary) for the Indonesian language.
Here is an example sentence in Indonesian:

saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus

and I want to turn it into this:

saya sangat senang dengan kalian, tampil dan suara sangat bagus

This works when I use stem rules like these:

kalian:kalian.*
tampil:tampil.*
suara:suara.*

But it fails when I try other regex rules:

 :-(.*)$
 :(ku|mu|nya|lah|kah|tah|pun)$

How can I use the stemmer with something other than rules of the form "text:text.*"?

Please help me with this case.

Thanks

Best Regards,

Bay

sgenzer

Hello @baybay - hmm, I don't speak Indonesian and am very puzzled about what you're trying to do with your first RegEx expression:

-(.*)$

The second one seems OK. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120.
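
For anyone else puzzling over the two expressions, here is one possible reading of them, sketched in Python rather than RapidMiner. It assumes both are meant as search-and-replace patterns (match, then delete the match), which is a guess at the intent and not confirmed in this thread. The function name `strip_once` is made up for illustration.

```python
import re

# First pattern from the question: a hyphen and everything after it.
# On a reduplicated word such as "kalian-kalian" this drops the second half.
REDUPLICATION_TAIL = re.compile(r"-(.*)$")

# Second pattern from the question: one trailing particle or possessive.
SUFFIX = re.compile(r"(ku|mu|nya|lah|kah|tah|pun)$")


def strip_once(token):
    """Delete whatever each pattern matches, one pass per pattern."""
    token = REDUPLICATION_TAIL.sub("", token)
    token = SUFFIX.sub("", token)
    return token


tokens = ["kalian-kalian", "tampilannya", "suaranya", "bagus"]
print([strip_once(t) for t in tokens])
# ['kalian', 'tampilan', 'suara', 'bagus']
```

Note that "tampilannya" only gets as far as "tampilan": reaching "tampil" would need an extra rule for the "-an" suffix, which neither pattern covers.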

Scott

YYH

Hi @baybay,

You can definitely use a rule-based stemmer. A preferred way is the "Stem Tokens Using ExampleSet" operator from the Operator Toolbox extension.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
        <parameter key="text" value="saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus&#10;"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=", "/>
          </operator>
          <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
            <parameter key="generator_type" value="comma_separated_text"/>
            <list key="function_descriptions"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="input_csv_text" value="dictionary&#10;kalian:kalian.*&#10;tampil:tampil.*&#10;suara:suara.*"/>
          </operator>
          <operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
            <parameter key="attribute" value="dictionary"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Stem Tokens Using ExampleSet" to_port="document"/>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Stem Tokens Using ExampleSet" to_port="example set"/>
          <connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
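
To make the rule format in the process above more concrete, here is a rough Python sketch of how `stem:pattern` rules could be applied. It rests on two assumptions that are not verified against the operator itself: that a pattern must match the whole token, and that several patterns on one line would be separated by spaces. The helper names `parse_rules` and `stem_token` are hypothetical.

```python
import re

# The same three rules that the Create ExampleSet operator supplies above.
RULES_TEXT = "kalian:kalian.*\ntampil:tampil.*\nsuara:suara.*"


def parse_rules(text):
    """Turn 'stem:pattern' lines into (stem, compiled regex) pairs."""
    rules = []
    for line in text.splitlines():
        stem, _, patterns = line.partition(":")
        for pattern in patterns.split():  # assumption: space-separated patterns
            rules.append((stem, re.compile(pattern)))
    return rules


def stem_token(token, rules):
    """Return the stem of the first rule whose pattern matches the whole token."""
    for stem, pattern in rules:
        if pattern.fullmatch(token):  # assumption: whole-token matching
            return stem
    return token


rules = parse_rules(RULES_TEXT)
sentence = "saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus"
# Split on commas and spaces, like the Tokenize operator configured above.
tokens = [t for t in re.split(r"[, ]+", sentence) if t]
stemmed = [stem_token(t, rules) for t in tokens]
print(" ".join(stemmed))
# saya sangat senang dengan kalian tampil dan suara sangat bagus
```

Under the whole-token assumption, a rule such as `(ku|mu|nya|lah|kah|tah|pun)$` would only ever match a token that consists of nothing but the suffix, which may be why it appeared to do nothing.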

Here is a comprehensive study of stemming for Indonesian:

https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf

HTH,

baybay

Hi @sgenzer,

I have attached the data set, the XML, and the stemming file.

Thanks

Bay

Sentiment Analysis Twitter.zip

baybay

Hi @yyhuang,

So do we have to enter the stem rules one by one, like "suara:suara.*"?

I just want the affixes to be removed automatically, as shown at this link.

Thanks

Bay
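
As a rough idea of what automatic removal could look like without one rule per word, here is a minimal Python sketch. It is not a RapidMiner process and not the algorithm from the linked study; it simply strips particles, possessives, and a few derivational suffixes in that order, and accepts a result only if it appears in a root-word list. The `ROOT_WORDS` set is a tiny hypothetical lexicon for this one sentence, and the `stem` function name is likewise made up; a real setup would need a proper Indonesian root-word dictionary.

```python
import re

# Hypothetical mini lexicon; a real one would hold thousands of root words.
ROOT_WORDS = {"saya", "sangat", "senang", "dengan", "kalian",
              "tampil", "dan", "suara", "bagus"}

# Full reduplication such as "kalian-kalian" collapses to one copy.
REDUPLICATION = re.compile(r"^(.+)-\1$")

# Suffix classes, tried in this order, each at most once.
SUFFIX_STAGES = [
    re.compile(r"(lah|kah|tah|pun)$"),  # particles
    re.compile(r"(ku|mu|nya)$"),        # possessives
    re.compile(r"(kan|an|i)$"),         # derivational suffixes
]


def stem(token, roots=ROOT_WORDS):
    """Strip suffixes step by step; stop as soon as a known root appears."""
    candidate = REDUPLICATION.sub(r"\1", token)
    if candidate in roots:
        return candidate
    for stage in SUFFIX_STAGES:
        candidate = stage.sub("", candidate)
        if candidate in roots:
            return candidate
    return token  # no known root found, so leave the token unchanged


sentence = "saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus"
tokens = [t for t in re.split(r"[, ]+", sentence) if t]
print(" ".join(stem(t) for t in tokens))
# saya sangat senang dengan kalian tampil dan suara sangat bagus
```

The root-word check is the important design choice here: without it, blind stripping would also cut "dengan" to "deng" and "kalian" to "kali", since both happen to end in "-an".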