Tokenize Chinese
xiaobo_sxb
New Altair Community Member
Does anybody know how to tokenize Chinese (or Japanese, Korean etc). The current operator in text processing extension works for English quite well but does not work for Chinese.
Steven
Answers
You probably want to use Text Processing > Transformation > Generate n-Grams (Characters) for Chinese or other Oriental languages.
Please let us know your results!
Neil
I learned how to do text analytics after watching your video. It works nicely for English, but for Chinese I don't know how to tokenize. For example, here is one Chinese sentence: 这是一个关于如何实现文本分析的视频. The problem is that there is naturally no blank or other non-letter character separating the words. Translated word by word into English it is: "这(This)是(is)一个(a)关于(about)如何(how to)实现(realize)文本(text)分析(analytics)的()视频(video)" (of course the sentence would need to be re-ordered). The operator "Generate n-Grams (Characters)" does not work for Chinese: it creates a lot of n-gram items, but most of them are meaningless. I'm new to this area, so I'm not sure whether I did it correctly. I simply processed the document by extracting the content and generating n-grams. If you have a real example to share, that would be wonderful.
Seems to work on my end; try this.
Replace the question mark characters with your Chinese text in the Create Document operator; that's just how it turned out in the RapidMiner XML. The process splits the document into all of its characters, produces a word list with those characters, and creates an example set with those characters as well.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="145" width="145">
<operator activated="true" class="text:create_document" compatibility="5.1.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="?????????????????"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="192" y="30">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true" height="449" width="658">
<operator activated="true" class="text:generate_n_grams_characters" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Characters)" width="90" x="179" y="30">
<parameter key="length" value="1"/>
</operator>
<connect from_port="document" to_op="Generate n-Grams (Characters)" to_port="document"/>
<connect from_op="Generate n-Grams (Characters)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
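For readers without RapidMiner at hand, a rough sketch of what the "Generate n-Grams (Characters)" step does, written in plain Python (the function name is made up for this illustration, not part of RapidMiner):

```python
# Sketch of character n-gram generation: slide a window of n characters
# across the text. With length = 1 this simply splits the document into
# its individual characters, which is what the process above produces.

def char_ngrams(text, n=1):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("文本分析", 1))  # ['文', '本', '分', '析']
print(char_ngrams("文本分析", 2))  # ['文本', '本分', '分析']
```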
I tried the process and noticed that you use the operator "Generate n-Grams (Characters)" with the parameter "length" set to 1. This may work for English, where each individual character is at least a valid unit, but in Chinese an individual character may mean nothing on its own. Take this example: I have the sample text "这是一句范文", which translates to English as "This is an example". Taking each character out of the English sentence is fine. However, the Chinese sentence should be tokenized as "这", "是", "一句" and "范文"; some pairs of characters have to be grouped together.
这 -> This
是 -> is
一句 -> a
范文 -> example
Furthermore, there is no fixed rule for how many characters should be grouped together; it depends entirely on the context. The same characters can even be grouped differently in different contexts.
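This is why Chinese tokenization usually needs a dictionary or a statistical model rather than fixed-length n-grams. As a minimal illustration, here is a sketch of forward maximum matching in Python; the toy dictionary and the function name are invented for this example, and a real system would use a large lexicon or a dedicated segmenter (such as jieba for Chinese):

```python
# Sketch of forward maximum matching (FMM), a simple dictionary-based
# approach to Chinese word segmentation: at each position, greedily take
# the longest dictionary word; fall back to a single character otherwise.

DICTIONARY = {"这", "是", "一句", "范文"}  # toy lexicon, for illustration only
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def fmm_segment(text):
    """Segment `text` by forward maximum matching against DICTIONARY."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(fmm_segment("这是一句范文"))  # ['这', '是', '一句', '范文']
```

Note that this greedy approach cannot resolve the context-dependent ambiguity described above; that is where statistical and neural segmenters come in.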