Find more posts tagged with
Sort by:
1 - 4 of
41
Neil
I learned how to do text analytics after watching your vedio. It's nice for English. But for Chinese, I don't know how to tokenize. For example, here is one Chinese sentence 这是一个关于如何实现文本分析的视频. The problem is, in natual, there is no blank or other non-letter character to separate the sentence. If I translate it to English, it is: "这(This)是(is)一个(a)关于(about)如何(how to)实现(realize)文本(text)分析(analytics)的()视频(video)" of course the sentence should be re-ordered. The operator "Generate n-Grams (Characters)" does not work for Chinese. It create a lot of n-grams items but most of them are meaningless. I'm new to this area, not sure whether I did it correctly or not. I simply process the document by extract content and generate n-grams. If you have some real example to share, that's wonderful.
I learned how to do text analytics after watching your vedio. It's nice for English. But for Chinese, I don't know how to tokenize. For example, here is one Chinese sentence 这是一个关于如何实现文本分析的视频. The problem is, in natual, there is no blank or other non-letter character to separate the sentence. If I translate it to English, it is: "这(This)是(is)一个(a)关于(about)如何(how to)实现(realize)文本(text)分析(analytics)的()视频(video)" of course the sentence should be re-ordered. The operator "Generate n-Grams (Characters)" does not work for Chinese. It create a lot of n-grams items but most of them are meaningless. I'm new to this area, not sure whether I did it correctly or not. I simply process the document by extract content and generate n-grams. If you have some real example to share, that's wonderful.
seems to work on my end
try this
replace the question mark characters with your chinese text in the create document operator, that's just how it turned out in rapidminer xml
try this
replace the question mark characters with your chinese text in the create document operator, that's just how it turned out in rapidminer xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>it splits the document into all of its characters, and produces a wordlist with those characters, and creates an exampleset with those characters as well
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="145" width="145">
<operator activated="true" class="text:create_document" compatibility="5.1.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="?????????????????"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="192" y="30">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true" height="449" width="658">
<operator activated="true" class="text:generate_n_grams_characters" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Characters)" width="90" x="179" y="30">
<parameter key="length" value="1"/>
</operator>
<connect from_port="document" to_op="Generate n-Grams (Characters)" to_port="document"/>
<connect from_op="Generate n-Grams (Characters)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
I tried the code and noticed that you use the operator "Generate n-Grams (Characters)" with parameter "length" equal to 1. This may work for English as it's meaningful in English for each individual characters. But for Chinese, individual character may mean nothing. Take this example. I have the sample text as "这是一句范文". I translate it to English as "This is a example". If you take each character out from the English, that's correct. However. The Chinese sentence should be tokenized as "这", "是", "一句" and "范文". some of the two characters should be categorized together.
这 -> This
是 -> is
一句 -> a
范文 -> example
Furthermore, there is no rule for how many characters should be categorized together, fully depends on the context. Or even the same characters can be categorized differently in different contexts.
这 -> This
是 -> is
一句 -> a
范文 -> example
Furthermore, there is no rule for how many characters should be categorized together, fully depends on the context. Or even the same characters can be categorized differently in different contexts.
Please let us know your results!