How to correct the wrong words?

jozeftomas_2020
jozeftomas_2020 New Altair Community Member
edited November 5 in Community Q&A


Hello,
How can I correct misspelled words in RapidMiner?
For example:

meeseg -> message
or
veeeery gooood -> very good

Does anyone know how to do this?

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member

    I think I answered this same question in another thread.  If you generate a wordlist first and you compile a list of substitutions you want to make, then you can use the "Replace Tokens" operator.  If you are looking for an automated way to do this (i.e., RapidMiner identifies misspellings and replaces them automatically), there isn't a built-in solution for that.  There might be some third party software you could access via an API though.
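The manual approach described above (a curated list of substitutions applied to tokens) can be sketched outside RapidMiner in a few lines of Python. This only illustrates the idea and is not the Replace Tokens operator itself; the `SUBSTITUTIONS` table and the `replace_tokens` helper are hypothetical names, and the entries are taken from the examples in the question.

```python
import re

# Hypothetical substitution list, built by hand after inspecting the word list.
SUBSTITUTIONS = {
    "meeseg": "message",
    "veeeery": "very",
    "gooood": "good",
}


def replace_tokens(text, substitutions=SUBSTITUTIONS):
    """Replace each whole word that appears in the substitution table."""
    def swap(match):
        word = match.group(0)
        # Look up the lowercased token; keep the word unchanged if it is unknown.
        return substitutions.get(word.lower(), word)

    # \w+ matches runs of letters, digits and underscores (one token each).
    return re.sub(r"\w+", swap, text)


print(replace_tokens("veeeery gooood meeseg"))
```

The drawback is the one already mentioned: every misspelling has to be listed by hand.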

     

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @jozeftomas_2020,

     

    As @Telcontar120 said, there isn't a built-in solution for performing what you want to do.

    So I suggest using a Python script based on the textblob library. Here are some results:

    Spelling_Correction.png

     

    However, when the words are too badly misspelled (like the examples you gave), the script is not able to correct them:

     

    Spelling_Correction_2.png
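Elongated words such as "veeeery gooood" are a special case: the extra letters push the word far away from any dictionary entry. One common workaround, which is not part of the shared process, is to shorten runs of repeated letters before spelling correction. A minimal sketch, assuming a hypothetical helper `squeeze_repeats` and a cap of two identical letters in a row:

```python
import re


def squeeze_repeats(text, max_run=2):
    """Shorten any run of the same letter to at most `max_run` characters."""
    # ([A-Za-z]) captures one letter; \1{max_run,} matches max_run or more
    # further copies of it, so the pattern fires only on runs longer than max_run.
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


print(squeeze_repeats("veeeery gooood"))  # veery good
```

After this step "gooood" is already "good", and "veery" is a single deletion away from "very", which is a much easier case for an edit-distance-based corrector.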

     

    FYI, spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate.
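To make the approach concrete, here is a compact sketch in the spirit of Norvig's corrector: generate every string within one or two edits of the input and pick the known word with the highest count. The tiny `WORD_COUNTS` table is invented for illustration (a real corrector derives its counts from a large corpus), and this is not the code that textblob actually runs.

```python
import string

# Toy frequency table, invented for illustration only.
WORD_COUNTS = {"hello": 5, "held": 50, "help": 40, "message": 30,
               "meet": 60, "very": 80, "good": 90}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the candidates that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Return the most frequent known word within two edits, else the word itself."""
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS.get(w, 0))


print(correct("helo"))   # held: the most frequent of hello, held, help in the toy table
print(correct("meseg"))  # meet: "message" is more than two edits away
```

The sketch shows the two limits of the method: it never looks further than two edits, and among equally close candidates the more frequent word wins, whether or not it is the intended one.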

     

    The process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.1.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Id,text&#10;1,meeseg&#10;2,veeeery gooood"/>
    </operator>
    <operator activated="true" class="set_macros" compatibility="8.2.000" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
    <list key="macros">
    <parameter key="textAttribute" value="'text'"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
    <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(text) : &#10; &#10; b = TextBlob(text)&#10; return b.correct()&#10;&#10;&#10;def rm_main(data):&#10;&#10;&#10; data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;&#10; return data"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
    <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    To execute this process, you have to:

     - Install Python on your computer

     - Install the textblob library

     - Install the Python Scripting extension (which provides the Execute Python operator) from the Marketplace

     - Set the name of your text attribute, in quotes (for example 'text'), in the Set Macros operator

    Spelling_Correction_3.png

     

    I hope it helps,

     

    Regards,

     

    Lionel

     

     

     

  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member

    Hello,
    Thank you :heart:

    I did not understand the last part you mentioned, the macro setup.
    What exactly should I do?

    I would also like to use this R code:

    https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html
    But I do not know how to run it on my data in RapidMiner.
    Could you help me?

    I installed Anaconda, but I do not know how to install textblob and use it in RapidMiner. :smileysad:
    Can someone help?
    Thank you

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @jozeftomas_2020,

     

    1. To set up the Set Macros operator:

    Have you tried importing the process I shared? In the parameters of this operator, enter (in the "value" column) the name of the attribute that contains the misspelled words.

     

    2. To install textblob:

     a. Press Win + R to open the Run dialog

     b. Type "cmd" and then click OK

     c. Type "pip install textblob" and press Enter

    textblob will then be installed on your computer.
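After the installation, it can be worth confirming that the Python interpreter in use can actually see the package. A minimal sketch using only the standard library; `is_installed` is a hypothetical helper name. Running `python -m pip install textblob` instead of a bare `pip` is a common way to make sure the package goes to that same interpreter.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if this interpreter can find the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running, then whether it can import textblob.
print("Interpreter:", sys.executable)
print("textblob available:", is_installed("textblob"))

# If it is missing, this is the command to run in a terminal for this interpreter.
print("Install with:", sys.executable, "-m pip install textblob")
```

If several Python installations exist on the machine, the interpreter printed here should be the one configured for the Execute Python operator.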

     

    Regards,

     

    Lionel

     

     

  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member

    Hi dear friend,
    I followed all the steps.
    I want to correct spelling mistakes in my data, which has a text column.

    I loaded the data, selected my text column with the Select Attributes operator, and then connected it to the Execute Python operator.
    p2.JPG
    The name of the column I want to correct is 'text'.

    But when I run the process, I get this error:
    p1.JPG
    I do not know how to solve it.
    Can you help me once more?

    Thanks a lot

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi

    Have you set the name of your text attribute (text) in the Set Macros operator with quotes (value = 'text')?
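The quotes matter because RapidMiner replaces `%{textAttribute}` with the macro's value as plain text before the script is run. The sketch below imitates that substitution in plain Python to show the difference; `expand_macros` is a hypothetical stand-in for what RapidMiner does internally, not a RapidMiner function.

```python
SCRIPT_LINE = "Text_Attribute = %{textAttribute}"


def expand_macros(script, macros):
    """Textually replace each %{name} placeholder with its macro value."""
    for name, value in macros.items():
        script = script.replace("%{" + name + "}", value)
    return script


def run(script):
    """Execute the expanded line and report what Text_Attribute ends up as."""
    namespace = {}
    try:
        exec(script, namespace)
    except NameError as error:
        return "NameError: " + str(error)
    return namespace["Text_Attribute"]


# Macro value with quotes: the line becomes  Text_Attribute = 'text'
print(run(expand_macros(SCRIPT_LINE, {"textAttribute": "'text'"})))

# Macro value without quotes: the line becomes  Text_Attribute = text
# and Python looks for a variable called text, which does not exist.
print(run(expand_macros(SCRIPT_LINE, {"textAttribute": "text"})))
```

With the quotes, Python sees a string literal holding the column name; without them, it sees an undefined variable.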

    Regards,

    Lionel
  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member

    Hello,
    Yes, I did that,
    but it still gives an error.
    Look:

    mm.JPG
    mm1.JPG
    mm2.JPG

     

    Could you help me? May I send a photo or a sample process?
    Thanks a lot
    With respect

  • student_compute
    student_compute New Altair Community Member

    Hi, I followed the same steps to install textblob, but I get the error shown below.

    What should I do?


    "
    2. To install textblob:

     
    a. Type Win + R to open a window

     
    b. Type "cmd" and then click OK

     
    c. Type "pip install textblob" and click enter

    textblob will be automatically installed on your computer.

    "

    Capture.JPG

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Jozeftomas, can you share your process and your dataset? Tomorrow, I will try to find and fix the bug you mentioned.

    Regards.

    Lionel
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi Student_compute,

    The 'pip' command is installed together with Python.
    So first install Python (Python 3.x) via Anaconda.

    Regards,

    Lionel.
  • student_compute
    student_compute New Altair Community Member

    Hello,
    But I had already installed Python first.
    What should I do now?
    Thank you, my friend

     

  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member
    Hello, thank you very much for your response and kindness.
    I get the data from Twitter. As shown in the photo above,
    I have a Search Twitter operator before Nominal to Text.
    Can you tell what the problem is?
    And how can I run the preprocessing code from this article on my tweets in RapidMiner?
    https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html
    Thank you in advance for any guidance.
    With respect and dedication
  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @jozeftomas_2020,

     

    It will be very hard for us to understand your bug without your process. Can you share it?

    And what do you ultimately want to do: correct the misspelled tweets?

     

    Regards,

     

    Lionel

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @student_compute,

     

    If you have indeed installed Python, 'pip' should be installed too. So I see only one solution:

    You have to update your "environment variables":

    1/

     - Search for the pip.exe file on your computer. By default it is located in C:\Users\username\Anacondax\Scripts or C:\Users\username\Pythonx\Scripts (where x = 2 or 3, depending on the version of Python you installed).

    or

     - Type 'pip.exe' (with quotes) in the Windows 10 search bar (bottom-left), then right-click the result and select "Open file location".

     

     

    2/ Then (here on Windows 10): 

     - open an explorer window

     

    Pip_Installation.png

    then click on properties

     

    Pip_Installation_2.png

     

    then

     

    Pip_Installation_3.png

     

    then


    Pip_Installation_4.png

     

    then

    Pip_Installation_5.png
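Before editing the environment variables by hand, a short script can show where pip lives and whether its folder is already on `PATH`. This sketch uses only the standard library and assumes it is run with the Python installation in question (for example from the Anaconda Prompt); `scripts_dir_on_path` is a hypothetical helper.

```python
import os
import shutil
import sysconfig


def scripts_dir_on_path(scripts_dir, path_value):
    """Return True if scripts_dir is one of the folders listed in path_value."""
    wanted = os.path.normcase(os.path.normpath(scripts_dir))
    folders = [os.path.normcase(os.path.normpath(p))
               for p in path_value.split(os.pathsep) if p]
    return wanted in folders


# Folder where this Python installation puts pip (Scripts on Windows, bin elsewhere).
scripts_dir = sysconfig.get_path("scripts")
print("Scripts folder:", scripts_dir)

# shutil.which returns the full path of pip if the shell can find it, else None.
print("pip found on PATH:", shutil.which("pip"))

# If this prints False, scripts_dir is the folder to add to the PATH variable.
print("Scripts folder on PATH:", scripts_dir_on_path(scripts_dir, os.environ.get("PATH", "")))
```

A command window that was already open has to be closed and reopened after changing `PATH`, otherwise it keeps the old value.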

    I hope it helps,

     

    Regards,

     

    Lionel

     

     

     

     

     

     

     

  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member

    Hello,
    This is my process.
    I want to correct the spelling mistakes in all the tweets, and then run k-means clustering. But I am new to Python,
    and I do not know how to write the Python code in RapidMiner to achieve this.
    Please help me if possible, dear friend.
    With respect,
    I will be grateful. I look forward to your help.

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @jozeftomas_2020,

     

    Here is a working process to correct misspelled tweets:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="8.0.010" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="136">
    <parameter key="connection" value="dkk"/>
    <parameter key="query" value="iphone"/>
    <parameter key="limit" value="10"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Text"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="2.0"/>
    <parameter key="prune_above_percent" value="70.0"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
    <operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="581" y="34"/>
    <connect from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
    <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="514" y="238">
    <list key="macros">
    <parameter key="textAttribute" value="'text'"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="581" y="136">
    <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(txt) : &#10; &#10; b = TextBlob(txt)&#10; return b.correct()&#10;&#10;&#10;def rm_main(data):&#10;&#10;&#10; data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;&#10; return data"/>
    </operator>
    <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Text Atribute" to_port="through 1"/>
    <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Note that, depending on the number of tweets, the correction may take several minutes.
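Tweets on one topic share a lot of vocabulary, so a per-word corrector ends up correcting the same words many times. One possible speed-up, which is not part of the process above, is to correct word by word and cache the result for each distinct word. In this sketch `slow_correct` is a hypothetical placeholder for the real per-word corrector (its two-entry lookup table is made up), and `functools.lru_cache` does the caching.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive corrector really runs


def slow_correct(word):
    """Hypothetical placeholder for an expensive per-word spelling corrector."""
    global calls
    calls += 1
    return {"helo": "hello", "meseg": "message"}.get(word, word)


@lru_cache(maxsize=None)
def cached_correct(word):
    """Correct a word once; later calls with the same word hit the cache."""
    return slow_correct(word)


def correct_text(text):
    """Correct each whitespace-separated word, reusing cached results."""
    return " ".join(cached_correct(word) for word in text.split())


tweets = ["helo iphone", "iphone meseg", "helo helo iphone"]
print([correct_text(t) for t in tweets])
print("corrector calls:", calls)  # 3 distinct words, so 3 calls for 7 tokens
```

Splitting on whitespace is a simplification; punctuation attached to a word would need to be handled before the lookup.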

     

    Regards,

     

    Lionel

     

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @lionelderkrikor this is quite handy, thank you for this!

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi,

     

    You're welcome, @Thomas_Ott.

     

    Happy corrections !

     

    Regards,

     

    Lionel

  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member
    Hello,
    Thank you so much.
    Your code really surprised me.
    I do not know how to thank you.
    However,
    in one comment I typed a misspelled word and ran the process, and the word was not corrected.
    Could you check?
    For example:
    iphon worst phone appl made helo meseg
    After running:
    iphon worst phone appl made helo meseg
    I wanted the two words helo and meseg to be corrected to hello and message.
    Thank you

     
  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @jozeftomas_2020,

     

    I executed the script with your examples and here is what I get (I don't know why no correction is performed in your case):

    Spelling_Correction_4.png

     

    That's not what you were expecting, but the spelling corrector tries to find the nearest correct word to the misspelled word.

    So:

     - it considers "held" a closer match to "helo" than "hello".

     - it considers "meet" a closer match to "meseg" than "message".

     

    I think it will be very difficult to do better.
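The distances involved can be checked directly. The sketch below computes the plain Levenshtein distance, where each insertion, deletion or substitution costs 1; this is a simplification, since Norvig's article also counts a swap of two adjacent letters as a single edit. By this measure "held" and "hello" are both one edit away from "helo", so the choice between them presumably comes down to which word the corrector considers more frequent, while "meet" is two edits from "meseg" and "message" is three.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


for wrong, candidate in [("helo", "held"), ("helo", "hello"),
                         ("meseg", "meet"), ("meseg", "message")]:
    print(wrong, "->", candidate, ":", levenshtein(wrong, candidate))
```

Since Norvig's approach only considers candidates within two edits, "message" is never even proposed for "meseg".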

     

    Regards,

     

    Lionel

     

  • jozeftomas_2020
    jozeftomas_2020 New Altair Community Member

    Hello.
    Yes, you are right.
    Thanks again. :heart:
    Could you send the XML file of your last example?
    Thank you

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @jozeftomas_2020,

     

    Here is the last process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Id,text&#10;1,helo&#10;2,meseg&#10;3,iphon worst phone appl made helo meseg"/>
    </operator>
    <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
    <list key="macros">
    <parameter key="textAttribute" value="'text'"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
    <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(text) : &#10; &#10; b = TextBlob(text)&#10; return b.correct()&#10;&#10;&#10;def rm_main(data):&#10;&#10;&#10; data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;&#10; return data"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
    <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel