Problem: Mandarin text mining with HanMiner
YoGVA
New Altair Community Member
Hi everyone,
I am a newbie here but here is my situation.
I need to conduct a qualitative content analysis of a large number of Chinese reports. However, RapidMiner needs an extension to handle Chinese characters - I found one called HanMiner posted by another member.
I followed the instructions and installed the extension from GitHub, but the extension does not show up in RapidMiner ...
Any ideas how to solve that issue? Or another way to text-mine Chinese documents?
Any help would be much appreciated!
Yoyo
Best Answer
Hi,
the third-party HanMiner extension has no option to define the encoding of the imported file. As a workaround you can use macros: read the file with the Text Processing extension's Read Document operator, which does let you set the encoding, convert the result to data, pull the text into a macro with Extract Macro, and hand that macro to HanMiner's Read Document as inline text:
<?xml version="1.0" encoding="UTF-8"?><process version="10.1.002"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="10.1.002" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="UTF-8"/> <process expanded="true"> <operator activated="true" class="open_file" compatibility="10.1.002" expanded="true" height="68" name="Open File" width="90" x="112" y="34"> <parameter key="resource_type" value="URL"/> <parameter key="filename" value=""/> <parameter key="url" value="https://us.v-cdn.net/6030995/uploads/editor/sf/nq6mm23abhpa.txt"/> </operator> <operator activated="true" class="multiply" compatibility="10.1.002" expanded="true" height="103" name="Multiply" width="90" x="246" y="85"/> <operator activated="true" class="text:read_document" compatibility="10.0.000" expanded="true" height="68" name="Read Document (2)" width="90" x="380" y="34"> <parameter key="extract_text_only" value="true"/> <parameter key="use_file_extension_as_type" value="true"/> <parameter key="content_type" value="txt"/> <parameter key="encoding" value="UTF-8"/> </operator> <operator activated="true" class="text:documents_to_data" compatibility="10.0.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="514" y="34"> <parameter key="text_attribute" value="text"/> <parameter key="add_meta_information" value="false"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <parameter key="use_processed_text" value="false"/> </operator> <operator activated="true" class="extract_macro" compatibility="10.1.002" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="34"> <parameter key="macro" value="text"/> <parameter key="macro_type" value="data_value"/> <parameter key="statistics" value="average"/> <parameter key="attribute_name" value="text"/> <parameter key="example_index" value="1"/> <list key="additional_macros"/> </operator> <operator activated="true" class="hanminer:read_document" compatibility="1.0.003" expanded="true" height="68" name="Read Document" width="90" x="782" y="136"> <parameter key="encoding" value="UTF-8"/> <parameter key="import_from_file" value="false"/> <parameter key="text" value="%{text}"/> <parameter key="file" value="C:/Users/Rui/Downloads/archive (6)-chinese/chinese-dataset-subset.txt"/> </operator> <operator activated="true" class="hanminer:tokenize" compatibility="1.0.003" expanded="true" height="68" name="Tokenize" width="90" x="916" y="136"> <parameter key="high_speed_mode" value="false"/> </operator> <connect from_op="Open File" from_port="file" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="Read Document (2)" to_port="file"/> <connect from_op="Multiply" from_port="output 2" to_op="Read Document" to_port="file"/> <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/> <connect from_op="Documents to Data (2)" from_port="example set" to_op="Extract Macro" to_port="example set"/> <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document set"/> <connect from_op="Tokenize" from_port="document set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Greetings,
Jonas
Answers
Hi Scott,
Yes it is.
I'm trying to install the following, but no success so far:
https://github.com/joeyhaohao/rapidminer-Hanminer
Nothing happens at step 4 when I try to install the extension.
I am also trying to look at other options but it is harder than I expected...
Any help would be great, cheers!
Yoyo
Hi @YoGVA,
here is a compiled version of the GitHub repository, which you can just unzip and copy to .RapidMiner/extensions. This works, but I have not tested the operators, of course.
Best,
Martin
Thanks for sharing the compiled extension, Dr. @mschmitz!
After installing it manually by copying the unzipped .jar file into my local extensions folder C:\Users\Yy\.RapidMiner\extensions and restarting, everything is working fine. Hi @YoGVA, you can follow the instructions here: https://community.rapidminer.com/discussion/50333/install-extensions-manually-for-rapidminer-studio
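For reference, the manual install described above boils down to a couple of shell commands. This is only a sketch: the jar name is a placeholder, the temp directory stands in for your Downloads folder, and the extensions path shown is the Linux/macOS default (on Windows it is C:\Users\&lt;you&gt;\.RapidMiner\extensions).

```shell
# Sketch of a manual extension install. File names are placeholders;
# use the actual archive you downloaded from the forum post.
set -e
WORK=$(mktemp -d)                       # stand-in for your Downloads folder
RM_EXT="$WORK/.RapidMiner/extensions"   # default extensions folder (here under a temp dir)
mkdir -p "$RM_EXT"

# Stand-in for the unzipped download: the extension's .jar file.
touch "$WORK/rapidminer-HanMiner-1.0.3.jar"

# The actual install step: copy the jar into the extensions folder,
# then restart RapidMiner Studio so it picks the extension up.
cp "$WORK/rapidminer-HanMiner-1.0.3.jar" "$RM_EXT/"
ls "$RM_EXT"
```

After the copy, the new operators only appear once Studio has been restarted.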
Six new operators appear under the new operator group "Text Miner".
A quick test on the news data looks reasonable:
<?xml version="1.0" encoding="UTF-8"?><process version="9.5.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.5.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value="yhuang@rapidminer.com"/> <parameter key="process_duration_for_mail" value="1"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="text_miner:read_text" compatibility="1.0.000" expanded="true" height="68" name="Read Text" width="90" x="112" y="34"> <parameter key="encoding" value="SYSTEM"/> <parameter key="text" value=" 这是默认的文本 每年到了这个时候,市场经济学家都会发布对未来12个月的详细宏观预测。令我自己都讶异的是,我正在为进行这项困难尝试的第五个十年画上句号,到目前为止离完美的成功预测还差得很远。经济以及市场的重大动荡可不会整整齐齐地把自己挤进一个自然年。"/> <parameter key="import_from_file" value="false"/> </operator> <operator activated="true" class="text_miner:tokenization" compatibility="1.0.000" expanded="true" height="68" name="Tokenization" width="90" x="313" y="34"/> <operator activated="true" class="text_miner:filter_stopwords" compatibility="1.0.000" expanded="true" height="68" name="Filtering" width="90" x="514" y="34"/> <operator activated="true" class="text_miner:word_count" compatibility="1.0.000" expanded="true" height="68" name="Word Count" width="90" x="782" y="34"/> <connect from_op="Read Text" from_port="output" to_op="Tokenization" to_port="text"/> <connect from_op="Tokenization" from_port="text" to_op="Filtering" to_port="text"/> <connect from_op="Filtering" from_port="text" to_op="Word Count" to_port="text"/> <connect from_op="Word Count" from_port="example set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Hi.
My apologies if I should have opened a new question. My question is related to the latest version of HanMiner, v1.0.3. I noticed that the Read Text operator is now named Read Document.
My problem is that when I import from file using this operator, the Chinese characters become unidentified symbols.
I have tried several ways:
1. I tried the different encodings listed and installed Chinese language support on my Windows PC, but it made no difference.
2. I imported the dataset as an example set and used the Data to Documents operator as below. However, I received an error.
3. I tried connecting the Data to Documents operator to the Read Document operator, but this resulted in a wrong input/output connection.
Perhaps @yyhuang can help shed some light here. Really appreciate it.
Thank you kindly.
Hi,
have you tried to change encoding to UTF-8?
Greetings,
Jonas
Hi Jonas,
Yes, I have, but still nothing.
Thank you, Jonas. That worked fine. I didn't think of macros here.