Text mining in utf-8
Hello all,
I need to use RapidMiner for text mining in Cyrillic.
I tried setting the encoding to UTF-8. It gives me some results, but they are displayed as garbled characters instead of Cyrillic words.
Thanks,
Best Answer
-
Hi guys,
I have solved my problem.
I had set the UTF-8 encoding everywhere except at the process level.
I changed this and it works!
Thank you all for your replies.
Ana,
1
Answers
-
Hi,
could you maybe post an example?
~Martin
0 -
It could be that your original document isn't in UTF-8, but in another encoding.
One way to be sure is to create a loop that uses macros to change the encoding parameter of your Process Documents operator, and then to look at all the resulting outputs. The encoding whose output looks 'right' is most likely the one your document actually uses.
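The same idea can be tried outside RapidMiner as a quick sanity check. Below is a minimal Python sketch, not a RapidMiner process: it decodes the same bytes with several candidate encodings so you can see which result reads correctly. The candidate list and the function name `try_encodings` are assumptions made for illustration; replace the sample bytes with the contents of your own file.

```python
# Hypothetical helper: decode the same bytes with several candidate encodings.
# The candidate list is an assumption; adjust it to the encodings you suspect.
CANDIDATES = ["utf-8", "windows-1251", "koi8-r", "iso-8859-5"]


def try_encodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text, or None if decoding failed}."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            results[enc] = None
    return results


# Sample data: Cyrillic text saved as Windows-1251, not UTF-8.
original = "Для поиска нажмите Ввод"
sample = original.encode("windows-1251")

results = try_encodings(sample)
for enc, text in results.items():
    # Inspect the output by eye: only one line should read as real words.
    print(f"{enc:>14}: {text!r}")
```

To check a real document, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `try_encodings`. An encoding that fails to decode can be ruled out, but one that decodes without error may still produce nonsense, so the final check is still reading the output.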
1 -
Agreed. I just did a quick check and there's no problem with Cyrillic in UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<process version="7.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
        <parameter key="text" value="Для поиска нажмите Ввод"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
        <parameter key="text_attribute" value="text"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Scott
0 -
I want to use the Tamil language for text mining.
Where did you change the UTF-8 option for this?
I tried it at the process level but could not get it to work.
Could anybody please give the answer?
0 -
To switch the encoding to UTF-8 (for processing the Tamil language), I changed the encoding in the RapidMiner Studio preferences to UTF-8.
I simply read the document using the Read Document operator from the text mining extension.
But it is not working; the screenshot is attached (doc7.docx).
Kindly help me sort out this problem.
Thank you
0 -
Hello @arunasethupathy - so Tamil is not a language I have worked with before. Could you please post your XML process AND your text document (in Tamil) so I can take a look?
Thank you.
Scott
0 -