Text mining in utf-8

i_anicka
i_anicka New Altair Community Member
edited November 2024 in Community Q&A

Hello all,

 

I need to use RapidMiner for text mining in Cyrillic.
I tried setting the encoding to UTF-8, but the results are displayed as unreadable characters instead of Cyrillic words.

 

Thanks,

 

 

 

Best Answer

  • i_anicka
    i_anicka New Altair Community Member
    Answer ✓

    Hi guys,

     

    I have solved my problem.

    I had set the UTF-8 encoding everywhere except at the process level.

    I changed this and it works!

     

    Thank you all for your replies.

     

    Ana,
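
    Why a single stage left on a different default is enough to garble the output can be illustrated outside RapidMiner. This is a hedged Python sketch, not a description of RapidMiner's internals: it assumes, purely for illustration, that one stage writes UTF-8 bytes and a later stage decodes them as Latin-1.

```python
# Test phrase taken from the example process posted in this thread.
original = "Для поиска нажмите Ввод"

# Stage 1 writes the text as UTF-8 bytes (two bytes per Cyrillic letter).
utf8_bytes = original.encode("utf-8")

# Stage 2 (hypothetical) was left on a different default and decodes
# the same bytes as Latin-1, turning every byte into its own character.
garbled = utf8_bytes.decode("latin-1")

# Undoing the wrong decoding step recovers the text, which shows the
# bytes were fine and only the interpretation was wrong.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))  # unreadable characters instead of Cyrillic words
print(repaired)       # the original phrase again
```

    The point of the sketch is that the data itself is intact. Only the stage with the mismatched setting needs to be changed.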

     

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

    Could you maybe post an example?
    ~Martin

  • JEdward
    JEdward New Altair Community Member

    It could be that your original document isn't in UTF-8, but in another encoding.

    One way to be absolutely sure is to create a loop that uses macros to change the encoding parameter of your Process Documents operator, and then to look at all the resulting outputs. The one that looks 'right' tells you which encoding the document is actually in.
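
    The same idea can be tried quickly outside RapidMiner before building the loop. The following is a minimal Python sketch, not a RapidMiner process: the candidate list and the helper name `try_encodings` are assumptions for illustration, and the sample bytes are produced here by encoding the Russian test phrase from this thread as Windows-1251 to stand in for a document of unknown encoding.

```python
# Candidate encodings to try. This list is an assumption: adjust it to
# the encodings that are plausible for your own documents.
CANDIDATES = ["utf-8", "cp1251", "koi8_r", "iso8859_5", "cp866"]


def try_encodings(raw, candidates=CANDIDATES):
    """Decode the same bytes with every candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to
    None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Stand-in for the raw bytes of a document whose encoding is unknown.
# In practice you would read them with open(path, "rb").read().
sample = "Для поиска нажмите Ввод".encode("cp1251")

results = try_encodings(sample)
for name, text in results.items():
    print(f"{name:10} {text!r}")
```

    Several single-byte Cyrillic encodings will decode the same bytes without raising an error, so a successful decode alone proves nothing. As described above, you still have to look at the outputs and pick the one that reads correctly.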

     

     

  • sgenzer
    sgenzer
    Altair Employee

    Agreed. I just did a quick check and there's no problem with Cyrillic in UTF-8.

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
            <parameter key="text" value="Для поиска нажмите Ввод"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott

  • arunasethupathy
    arunasethupathy New Altair Community Member

    I want to use the Tamil language for text mining.

    Where did you change the UTF-8 option for this?

    I have tried at the process level but could not get it to work.

    Could anybody please help?

  • arunasethupathy
    arunasethupathy New Altair Community Member

    To change the encoding option to UTF-8 (for processing the Tamil language),

    I changed the encoding to UTF-8 in the RapidMiner Studio preferences.

    I then simply read the document using the Read Document operator from the Text Mining extension.

    But it is not working; a screenshot is attached (doc7.docx).

    Kindly help me sort out this problem.

    Thank you

     

     

  • sgenzer
    sgenzer
    Altair Employee

    Hello @arunasethupathy - so Tamil is not a language I have worked with before.  Could you please post your XML process AND your text document (in Tamil) so I can take a look?

     

    Thank you.

     

    Scott

     

     

  • arunasethupathy
    arunasethupathy New Altair Community Member

    Sir,

    Please find attached the sample Tamil text document.

  • sgenzer
    sgenzer
    Altair Employee

    Thank you, @arunasethupathy. Can you please also post your XML process?

     

    Scott

     

     
