Tokenize Chinese
xiaobo_sxb
New Altair Community Member
Does anybody know how to tokenize Chinese (or Japanese, Korean etc). The current operator in text processing extension works for English quite well but does not work for Chinese.
Steven
Answers
You probably want to use Text Processing > Transformation > Generate n-Grams (Characters) for Chinese or other Oriental languages.
Please let us know your results!
Neil
I learned how to do text analytics after watching your video. It works nicely for English, but for Chinese I don't know how to tokenize. For example, here is one Chinese sentence: 这是一个关于如何实现文本分析的视频. The problem is that there is naturally no blank or other non-letter character separating the words. Translated word by word into English it is: "这(This)是(is)一个(a)关于(about)如何(how to)实现(realize)文本(text)分析(analytics)的()视频(video)" (of course the sentence would need to be re-ordered). The operator "Generate n-Grams (Characters)" does not work for Chinese: it creates a lot of n-gram items, but most of them are meaningless. I'm new to this area, so I'm not sure whether I did it correctly. I simply processed the document by extracting the content and generating n-grams. If you have a real example to share, that would be wonderful.
Seems to work on my end; try this.
Replace the question mark characters with your Chinese text in the Create Document operator; that's just how it turned out in the RapidMiner XML. The process splits the document into all of its characters, produces a word list with those characters, and creates an example set with those characters as well.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="145" width="145">
<operator activated="true" class="text:create_document" compatibility="5.1.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="?????????????????"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="192" y="30">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true" height="449" width="658">
<operator activated="true" class="text:generate_n_grams_characters" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Characters)" width="90" x="179" y="30">
<parameter key="length" value="1"/>
</operator>
<connect from_port="document" to_op="Generate n-Grams (Characters)" to_port="document"/>
<connect from_op="Generate n-Grams (Characters)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
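For readers without RapidMiner at hand, a rough sketch of what the "Generate n-Grams (Characters)" step does, written in plain Python (the function name is made up for this illustration, not part of RapidMiner):

```python
# Sketch of character n-gram generation: slide a window of n characters
# across the text. With length = 1 this simply splits the document into
# its individual characters, which is what the process above produces.

def char_ngrams(text, n=1):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("文本分析", 1))  # ['文', '本', '分', '析']
print(char_ngrams("文本分析", 2))  # ['文本', '本分', '分析']
```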
I tried the process and noticed that you use the operator "Generate n-Grams (Characters)" with the parameter "length" set to 1. This may work for English, where each individual character is at least a valid unit, but in Chinese an individual character may mean nothing on its own. Take this example: I have the sample text "这是一句范文", which translates to English as "This is an example". Taking each character out of the English sentence is fine. However, the Chinese sentence should be tokenized as "这", "是", "一句" and "范文"; some pairs of characters have to be grouped together.
这 -> This
是 -> is
一句 -> a
范文 -> example
Furthermore, there is no fixed rule for how many characters should be grouped together; it depends entirely on the context. The same characters can even be grouped differently in different contexts.
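This is why Chinese tokenization usually needs a dictionary or a statistical model rather than fixed-length n-grams. As a minimal illustration, here is a sketch of forward maximum matching in Python; the toy dictionary and the function name are invented for this example, and a real system would use a large lexicon or a dedicated segmenter (such as jieba for Chinese):

```python
# Sketch of forward maximum matching (FMM), a simple dictionary-based
# approach to Chinese word segmentation: at each position, greedily take
# the longest dictionary word; fall back to a single character otherwise.

DICTIONARY = {"这", "是", "一句", "范文"}  # toy lexicon, for illustration only
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def fmm_segment(text):
    """Segment `text` by forward maximum matching against DICTIONARY."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(fmm_segment("这是一句范文"))  # ['这', '是', '一句', '范文']
```

Note that this greedy approach cannot resolve the context-dependent ambiguity described above; that is where statistical and neural segmenters come in.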