Tokenizing Chinese text

User: "xiaobo_sxb"
New Altair Community Member
Updated by Jocelyn
Does anybody know how to tokenize Chinese (or Japanese, Korean, etc.)? The current operator in the Text Processing extension works quite well for English but does not work for Chinese.

Steven

    User: "el_chief"
    New Altair Community Member
    You probably want to use Text Processing > Transformation > Generate n-Grams (Characters) for Chinese or other East Asian languages.

    Please let us know your results! :)
    User: "xiaobo_sxb"
    New Altair Community Member
    OP
    Neil

    I learned how to do text analytics after watching your video. It works nicely for English, but I don't know how to tokenize Chinese. For example, here is a Chinese sentence: 这是一个关于如何实现文本分析的视频. The problem is that written Chinese naturally has no spaces or other non-letter characters to separate the words in a sentence. Glossed word by word into English, it reads: "这(This)是(is)一个(a)关于(about)如何(how to)实现(realize)文本(text)分析(analytics)的()视频(video)"; the English words would of course need to be reordered. The operator "Generate n-Grams (Characters)" does not work for Chinese: it creates a lot of n-gram items, but most of them are meaningless. I'm new to this area, so I'm not sure whether I did it correctly. I simply processed the document by extracting the content and generating n-grams. If you have a real example to share, that would be wonderful.
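To make the complaint concrete outside RapidMiner, here is a minimal Python sketch of character n-gram generation applied to the sentence above. The helper name `char_ngrams` is made up for this illustration and is not part of the Text Processing extension; the sketch only assumes that character n-grams are all runs of n consecutive characters.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text (illustrative helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "这是一个关于如何实现文本分析的视频"

# Character bigrams: a 17-character sentence yields 16 overlapping pairs.
bigrams = char_ngrams(sentence, 2)
print(bigrams)

# Pairs that match the word-by-word gloss given in the post.
words_in_gloss = {"一个", "关于", "如何", "实现", "文本", "分析", "视频"}
crossing = [b for b in bigrams if b not in words_in_gloss]
print(len(bigrams), len(crossing))
```

With the gloss from the post as the reference, 9 of the 16 bigrams (for example "个关" or "析的") straddle a word boundary, which is why the resulting item list looks mostly meaningless.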
    User: "el_chief"
    New Altair Community Member
    seems to work on my end

    try this

    Replace the question mark characters in the Create Document operator with your Chinese text; that's just how the text came out in the RapidMiner XML.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
       <process expanded="true" height="145" width="145">
         <operator activated="true" class="text:create_document" compatibility="5.1.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
           <parameter key="text" value="?????????????????"/>
         </operator>
         <operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="192" y="30">
           <parameter key="vector_creation" value="Term Occurrences"/>
           <parameter key="add_meta_information" value="false"/>
           <parameter key="keep_text" value="true"/>
           <process expanded="true" height="449" width="658">
             <operator activated="true" class="text:generate_n_grams_characters" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Characters)" width="90" x="179" y="30">
               <parameter key="length" value="1"/>
             </operator>
             <connect from_port="document" to_op="Generate n-Grams (Characters)" to_port="document"/>
             <connect from_op="Generate n-Grams (Characters)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
         <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
         <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    It splits the document into its individual characters, produces a word list of those characters, and creates an example set from those characters as well.
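As a rough analogue of what this process computes, the following Python sketch splits a text into single characters and counts them. It is an assumption-laden stand-in, not RapidMiner's implementation: the function name `char_term_occurrences` is hypothetical, and the real operator's ordering and output format may differ.

```python
from collections import Counter


def char_term_occurrences(text: str) -> Counter:
    """Count how often each single character occurs (length-1 character n-grams)."""
    return Counter(text)


counts = char_term_occurrences("这是一个关于如何实现文本分析的视频")

# The "word list" is simply the set of distinct characters.
word_list = sorted(counts)
print(word_list)

# The term-occurrence vector for the single document, in word-list order.
vector = [counts[ch] for ch in word_list]
print(vector)
```

For this sentence every character is distinct, so the word list has 17 entries and each occurrence count is 1.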
    User: "xiaobo_sxb"
    New Altair Community Member
    OP
    I tried the process and noticed that you use the operator "Generate n-Grams (Characters)" with the parameter "length" set to 1. This may work for English, since the individual units are meaningful in English. In Chinese, however, an individual character may mean nothing by itself. Take this example: my sample text is "这是一句范文", which I translate into English as "This is an example". Splitting the English into its individual pieces gives correct tokens. The Chinese sentence, however, should be tokenized as "这", "是", "一句" and "范文": some pairs of characters have to be grouped together.
    这 -> This
    是 -> is
    一句 -> a
    范文 -> example
    Furthermore, there is no fixed rule for how many characters should be grouped together; it depends entirely on the context. The same characters can even be grouped differently in different contexts.
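One common way to get variable-length tokens like these is dictionary-based segmentation. The sketch below uses greedy forward maximum matching, a technique swapped in for illustration and not something the Text Processing extension is known to offer. The function name `forward_max_match` and the tiny vocabulary `toy_vocab` are assumptions made up for this example; a real vocabulary would be far larger.

```python
def forward_max_match(text: str, vocab: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: always take the longest known word."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no vocabulary entry matches at this position.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary covering only the sample sentence.
toy_vocab = {"这", "是", "一句", "范文"}
tokens = forward_max_match("这是一句范文", toy_vocab)
print(tokens)  # ['这', '是', '一句', '范文']
```

Because the matching is greedy and ignores context, it cannot resolve the case raised above where the same characters should be grouped differently in different sentences; that requires a context-aware segmenter.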