Text mining in utf-8

New Altair Community Member

Nov 7, 2016

Updated Nov 5, 2024 by Jocelyn

Hello all,

I need to use RapidMiner for text mining in Cyrillic.
I tried setting the encoding to UTF-8, but some of the results are displayed as garbled characters instead of Cyrillic words.

Thanks,


MartinLiebig

Altair Employee

Nov 7, 2016

Hi,

Could you post an example?
~Martin

JEdward

New Altair Community Member

Nov 7, 2016

It could be that your original document isn't in UTF-8, but in another encoding.

One way to be sure is to create a loop that uses macros to change the encoding parameter in your Process Documents operator, and then to look at all the resulting outputs. The one that looks 'right' tells you the encoding of your original document.
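
Outside RapidMiner, the same "try every encoding and look at the output" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the candidate list and the helper name `try_encodings` are made up for this example, and the sample bytes simulate a file that was saved as Windows-1251 rather than UTF-8.

```python
# Candidate encodings often seen for Cyrillic text (an assumed list; extend as needed).
CANDIDATES = ["utf-8", "cp1251", "koi8-r", "iso8859-5", "cp866"]


def try_encodings(raw, candidates=CANDIDATES):
    """Decode raw bytes with each candidate; map encoding name to text, or None on failure."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None  # the bytes are not valid in this encoding
    return results


# Simulate a document that was actually saved as Windows-1251, not UTF-8.
sample = "Для поиска нажмите Ввод".encode("cp1251")

for enc, text in try_encodings(sample).items():
    print(f"{enc:10} -> {text!r}")
```

Several encodings may decode without an error and still produce nonsense, so the final check is still a human reading the output and picking the line that looks right.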

sgenzer

Altair Employee

Nov 7, 2016

Agreed. I just did a quick check and there's no problem with Cyrillic in UTF-8.

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
        <parameter key="text" value="Для поиска нажмите Ввод"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
        <parameter key="text_attribute" value="text"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Scott

i_anicka

New Altair Community Member

Accepted Answer

Nov 7, 2016

Hi guys,

I have solved my problem.

I had set the utf-8 encoding everywhere except on the process level.

I changed this and it works!

Thank you all for your replies.

Ana
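
For readers wondering why one missed setting produces this symptom, here is a small Python sketch of the general effect. It does not reproduce RapidMiner's internals; Latin-1 is only an assumed stand-in for whatever encoding a layer falls back to when UTF-8 is not set there.

```python
original = "Для поиска нажмите Ввод"

# One layer writes the text as UTF-8 bytes ...
utf8_bytes = original.encode("utf-8")

# ... and another layer reads those bytes with a different encoding.
# Latin-1 is an assumption here; the real fallback depends on the system.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Undoing the wrong step recovers the text: the bytes were intact all along,
# only their interpretation was wrong.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

This is why the encoding has to agree at every level: a single layer that interprets the bytes differently is enough to turn readable words into strings of unrelated characters.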

arunasethupathy

New Altair Community Member

Dec 8, 2017

I want to use the Tamil language for text mining.

Where do I change the UTF-8 option for this?

I have tried at the process level, but it does not work.

Could anybody please help?

arunasethupathy

New Altair Community Member

Dec 8, 2017

To change the encoding to UTF-8 (for processing the Tamil language):

I changed the encoding to UTF-8 in the RapidMiner Studio preferences.

I then simply read the document using the Read Document operator from the Text Mining extension.

But it is not working; a screenshot is attached (Doc7.docx).

Please help me sort out this problem.

Thank you

Doc7.docx
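
Before changing more settings, it can help to confirm that the text file itself really is UTF-8. The Python sketch below is one way to check, under assumptions: the helper name `describe_encoding` is hypothetical, and the Tamil sample file is created on the fly purely for the demonstration.

```python
import codecs
import os
import tempfile


def describe_encoding(path):
    """Report whether a file starts with a UTF-8 BOM and whether it decodes as UTF-8."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"utf8_bom": raw.startswith(codecs.BOM_UTF8), "valid_utf8": valid}


# Demonstration with throwaway files; replace the path with your own document.
with tempfile.TemporaryDirectory() as folder:
    utf8_path = os.path.join(folder, "sample_utf8.txt")
    utf16_path = os.path.join(folder, "sample_utf16.txt")
    with open(utf8_path, "wb") as handle:
        handle.write("தமிழ்".encode("utf-8"))
    with open(utf16_path, "wb") as handle:
        handle.write("தமிழ்".encode("utf-16"))
    utf8_report = describe_encoding(utf8_path)
    utf16_report = describe_encoding(utf16_path)

print(utf8_report)
print(utf16_report)
```

If the file does not decode as UTF-8, no UTF-8 setting in the tool can display it correctly until the file is re-saved in UTF-8 or the reader is given the file's real encoding.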

sgenzer

Altair Employee

Dec 12, 2017

Hello @arunasethupathy - so Tamil is not a language I have worked with before. Could you please post your XML process AND your text document (in Tamil) so I can take a look?

Thank you.

Scott

arunasethupathy

New Altair Community Member

Dec 18, 2017

Sir,

Please find attached the sample Tamil text document.

vikatan.txt

sgenzer

Altair Employee

Dec 18, 2017

Thank you @arunasethupathy. Can you please also post your XML process?