Text mining in utf-8

i_anicka
i_anicka New Altair Community Member
edited November 5 in Community Q&A

Hello all,

 

I need to use RapidMiner for text mining in Cyrillic.
I tried setting the encoding to UTF-8. It gives me some results, but they are displayed as unreadable characters instead of Cyrillic words.

 

Thanks,

 

 

 

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

    could you maybe post an example?
    ~Martin

  • JEdward
    JEdward New Altair Community Member

    It could be that your original document isn't in UTF-8, but in another encoding. 

    One way to be absolutely sure is to create a loop which changes the encoding parameter in your process documents using macros, and to look at all the resulting outputs.  The one that looks 'right' is the encoding of your original document.
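As a rough illustration of the same idea outside RapidMiner, here is a minimal Python sketch that decodes the same raw bytes with several candidate encodings so the outputs can be compared by eye. The helper name `decode_with_candidates`, the candidate list, and the sample bytes are assumptions made for this example; they are not part of any RapidMiner process.

```python
# Hypothetical helper: try several candidate encodings on the same raw bytes.
CANDIDATES = ["utf-8", "cp1251", "koi8-r", "iso8859-5"]  # assumed candidate list


def decode_with_candidates(raw, encodings=CANDIDATES):
    """Return a dict mapping each encoding name to the decoded text, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all.
            results[enc] = None
    return results


# Sample bytes standing in for a file of unknown encoding (here: Windows-1251).
raw = "Для поиска нажмите Ввод".encode("cp1251")

for enc, text in decode_with_candidates(raw).items():
    print(f"{enc:10} -> {text!r}")
```

Only one of the candidates should print readable Cyrillic; the others either fail to decode or produce text that is clearly wrong.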

     

     

  • sgenzer
    sgenzer
    Altair Employee

    Agreed.  I just did a quick check and there's no problem with Cyrillic in UTF-8.

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
            <parameter key="text" value="Для поиска нажмите Ввод"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott

  • i_anicka
    i_anicka New Altair Community Member
    Answer ✓

    Hi guys,

     

    I have solved my problem.

    I had set the UTF-8 encoding everywhere except at the process level.

    I changed this and it works!

     

    Thank you all for your replies.

     

    Ana,
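To illustrate why a single stage with a different encoding is enough to garble the output, here is a minimal Python sketch. The two-stage `pipeline` function and its parameter names are hypothetical stand-ins for "the encoding used when the text is read" and "the encoding used at the process level"; this is not how RapidMiner is implemented internally.

```python
text = "Для поиска нажмите Ввод"


def pipeline(text, read_enc, process_enc):
    """Hypothetical two-stage pipeline: stage 1 produces bytes, stage 2 interprets them."""
    raw = text.encode(read_enc)        # stage 1: text stored as bytes in read_enc
    return raw.decode(process_enc)     # stage 2: bytes interpreted as process_enc


# Both stages agree on UTF-8: the text survives unchanged.
consistent = pipeline(text, "utf-8", "utf-8")

# Stage 2 assumes Latin-1 instead: every byte decodes, but the result is garbled.
mismatched = pipeline(text, "utf-8", "latin-1")

print(consistent)
print(mismatched)
```

In this sketch the garbled string still holds the original bytes, so re-encoding it as Latin-1 and decoding it as UTF-8 recovers the text. That is why the symptom usually points to a mismatched setting and not to damaged data.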

     

  • arunasethupathy
    arunasethupathy New Altair Community Member

    I want to use the Tamil language for text mining.

    Where did you change the UTF-8 option for this?

    I have tried at the process level, but it did not work.

    Could anybody please give me the answer?

  • arunasethupathy
    arunasethupathy New Altair Community Member

    To change the encoding to UTF-8 (for processing the Tamil language),

    I changed the encoding to UTF-8 in the RapidMiner Studio preferences.

    I then simply read the document using the Read Document operator in the text mining extension.

    But it is not working; the screenshot is attached (doc7.docx).

    Kindly help me to sort out this problem.

    Thank you

     

     

  • sgenzer
    sgenzer
    Altair Employee

    Hello @arunasethupathy - so Tamil is not a language I have worked with before.  Could you please post your XML process AND your text document (in Tamil) so I can take a look?

     

    Thank you.

     

    Scott

     

     

  • arunasethupathy
    arunasethupathy New Altair Community Member

    Sir,

    Kindly find attached the sample Tamil text document.

  • sgenzer
    sgenzer
    Altair Employee

    Thank you @arunasethupathy.  Can you please also post your XML process?

     

    Scott