I think I answered this same question in another thread. If you generate a wordlist first and you compile a list of substitutions you want to make, then you can use the "Replace Tokens" operator. If you are looking for an automated way to do this (i.e., RapidMiner identifies misspellings and replaces them automatically), there isn't a built-in solution for that. There might be some third party software you could access via an API though.
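
For anyone who wants to see the substitution-list idea outside RapidMiner, here is a minimal Python sketch. It is not the "Replace Tokens" operator itself, only the same principle of mapping known misspellings to fixed replacements. The `SUBSTITUTIONS` table and the `replace_tokens` function are names made up for this illustration.

```python
import re

# Hypothetical substitution list: misspelling -> replacement.
SUBSTITUTIONS = {
    "helo": "hello",
    "meseg": "message",
    "gooood": "good",
}

# One pattern that matches any listed misspelling as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in SUBSTITUTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def replace_tokens(text):
    """Replace every listed misspelling in text with its substitution."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)


print(replace_tokens("helo, did you get my meseg?"))
```

Words that are not in the list are left untouched, so the quality of the result depends entirely on how complete the substitution list is.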

lionelderkrikor

New Altair Community Member

Jun 22, 2018

Hi @jozeftomas_2020,

As @Telcontar120 said, there isn't a built-in solution for performing what you want to do.

So I suggest using a Python script with the textblob library. Here are some results:

However, when words are too badly misspelled, the script cannot correct them properly (as with the examples you gave):

FYI, spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate.
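
To give an idea of how this kind of corrector works, here is a minimal standard-library sketch in the spirit of Norvig's essay. It is not textblob's actual code. The word counts in `WORDS` are invented for the illustration; a real corrector learns them from a large text corpus.

```python
from collections import Counter

# Toy word frequencies, invented for illustration only.
WORDS = Counter({"message": 50, "very": 80, "good": 120, "hello": 30, "held": 60})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the candidates that appear in the dictionary."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Pick the most frequent known word within two edits, else keep the word."""
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=lambda w: WORDS[w])


for w in ["gooood", "veeeery", "helo"]:
    print(w, "->", correct(w))
```

With these toy counts, "gooood" is two deletions from "good" and gets corrected, while "veeeery" is three deletions from "very" and is left unchanged, because the search stops at two edits.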

The process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.1.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Id,text&#10;1,meeseg&#10;2,veeeery gooood"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.000" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
        <list key="macros">
          <parameter key="textAttribute" value="'text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
        <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(text) : &#10;  &#10;  b = TextBlob(text)&#10;  return b.correct()&#10;&#10;&#10;def rm_main(data):&#10;&#10;&#10;  data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;&#10;  return data"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
      <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

To execute this process, you have to:

- Install Python on your computer

- Install the textblob library

- Install the Execute Python operator from the marketplace

- Set the name of your text attribute in the Set Macros operator

I hope it helps,

Regards,

Lionel

Spelling_Correction.png

Spelling_Correction_2.png

jozeftomas_2020

Banned

Jun 23, 2018

Hello
Thank you

I did not understand the last part you mentioned, the macro setup.
What exactly should I do?

I would also like to use this R code:

https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html
But I do not know how to run it in RapidMiner on my data.
Could you help me?

I installed Anaconda, but I do not know how to install textblob and use it in RapidMiner.
Can someone help?
Thank you

lionelderkrikor

New Altair Community Member

Jun 23, 2018

Hi @jozeftomas_2020,

1. To set up the Set Macros operator:

Have you tried importing the process I shared? In the parameters of this operator (in the "values" column), you have to enter the name of the attribute that contains the misspelled words.

2. To install textblob:

a. Press Win + R to open the Run window

b. Type "cmd" and then click OK

c. Type "pip install textblob" and press Enter

textblob will be automatically installed on your computer.
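
If you prefer to check from Python itself, here is a small sketch that tests whether a module is visible to the current interpreter and installs it through that same interpreter if it is missing. The helper names (`is_installed`, `pip_install_command`, `ensure_installed`) are made up for this example. It assumes that the interpreter running the script is the one RapidMiner is configured to use.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if this Python interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(package, module_name=None):
    """Install the package with pip only if its module cannot be found."""
    if not is_installed(module_name or package):
        # Calling pip through sys.executable avoids depending on PATH.
        subprocess.check_call(pip_install_command(package))
```

Calling `ensure_installed("textblob")` once should be enough. Because pip is run as `python -m pip`, this also works when the `pip` command itself is not recognized in the command prompt.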

Regards,

Lionel

jozeftomas_2020

Banned

Jul 4, 2018

Hi dear friend
I did all the steps
I want to correct spelling mistakes in my data, which has a text column

I loaded the data and then with the 'select attribute' operator I chose my text column and then I connected to the 'execute python' operator.

The column name I want to correct is 'text'.

But when I run it, I get this error.

I do not know how to solve it.
Can you help me once more?

Thanks a lot

p1.JPG

p2.JPG

lionelderkrikor

New Altair Community Member

Jul 4, 2018

Hi

Have you set the name of your text attribute (text) in the Set Macros operator, with quotes (value = 'text')?
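
The quotes matter because the macro is pasted into the script as text before Python runs it. Here is a small simulation of that, assuming the expansion is a plain text replacement. The `expand_macro` and `run` helpers are invented for this demonstration and are not part of RapidMiner.

```python
SCRIPT_TEMPLATE = "Text_Attribute = %{textAttribute}"


def expand_macro(template, value):
    """Mimic macro expansion as plain text replacement (an assumption)."""
    return template.replace("%{textAttribute}", value)


def run(line):
    """Execute one line of Python and return the value of Text_Attribute."""
    namespace = {}
    exec(line, namespace)
    return namespace["Text_Attribute"]


# With quotes, the expanded line is: Text_Attribute = 'text'
print(run(expand_macro(SCRIPT_TEMPLATE, "'text'")))

# Without quotes, the expanded line is: Text_Attribute = text
try:
    run(expand_macro(SCRIPT_TEMPLATE, "text"))
except NameError as err:
    print("NameError:", err)
```

With the quotes, Python sees a string holding the column name. Without them, it looks for a variable called `text`, which does not exist, and the script fails.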

Regards,

Lionel

jozeftomas_2020

Banned

Jul 5, 2018

Hello
Yes, I did that.
But there is still an error, look:

Could you help me? May I send a photo or a sample process?
Thanks a lot
With respect

mm1.JPG

mm.JPG

mm2.JPG

student_compute

New Altair Community Member

Jul 6, 2018

Hi, I followed the same steps to install textblob, but I get this error.

What should I do?

"2. To install textblob:

a. Type Win + R to open a window

b. Type "cmd" and then click OK

c. Type "pip install textblob" and click enter

textblob will be automatically installed on your computer."

Capture.JPG

lionelderkrikor

New Altair Community Member

Jul 6, 2018

Jozeftomas, can you share your process and your dataset? Tomorrow, I will try to find and fix the bug you mentioned.

Regards.

Lionel

lionelderkrikor

New Altair Community Member

Jul 6, 2018

Hi Student_compute,

The 'pip' command is installed with Python.
So first install Python (Python 3.x) via Anaconda.

Regards,

Lionel.

student_compute

New Altair Community Member

Jul 7, 2018

Hello
But I installed Python first.
What should I do now?
Thank you my friend

jozeftomas_2020

Banned

Jul 7, 2018

Hello, thank you very much for your response and kindness
I got the data from Twitter, as shown in the photo above.
I have a Search Twitter operator before Nominal to Text.
Can you tell what the problem is?
And how can I run this preprocessing code on my tweets in RapidMiner?
https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html
Thanks if you get started
With respect and dedication

lionelderkrikor

New Altair Community Member

Jul 7, 2018

Hi @jozeftomas_2020,

It will be very hard for us to understand your bug without your process. Can you share it?

And what do you want to do in the end? Correct the misspelled tweets?

Regards,

Lionel

lionelderkrikor

New Altair Community Member

Jul 7, 2018

Hi @student_compute,

If you have indeed installed Python, 'pip' must be installed too. So I see only one solution:

You have to update your "environment variables":

1/ Locate the pip.exe file on your computer:

- By default, it is located in C:\Users\username\Anacondax\Scripts or C:\Users\username\Pythonx\Scripts (where x = 2 or 3 according to the version of Python you installed).

- Otherwise, type 'pip.exe' (with quotes) in the Windows 10 search bar (bottom left), then right-click on the result and select the option to open the file location.

2/ Then (here on Windows 10):

- open an Explorer window

- then click on Properties

- then follow the steps shown in the attached screenshots
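
If Python itself starts (for example from the Anaconda Prompt), a short script can also tell you where its Scripts folder is and whether `pip` is reachable through PATH. This is only a diagnostic sketch. The function name `describe_pip_location` is made up, and the folder it reports is the one to add to PATH, assuming pip was installed into that same Python installation.

```python
import shutil
import sys
import sysconfig


def describe_pip_location():
    """Collect where this interpreter lives and whether pip is on PATH."""
    return {
        "interpreter": sys.executable,
        # Folder where this installation puts scripts such as pip(.exe)
        "scripts_dir": sysconfig.get_path("scripts"),
        # Full path of the pip found through PATH, or None if not found
        "pip_on_path": shutil.which("pip"),
    }


for key, value in describe_pip_location().items():
    print(f"{key}: {value}")
```

If `pip_on_path` is printed as `None`, the command prompt cannot find pip, and `scripts_dir` is the folder that is missing from the PATH variable.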

I hope it helps,

Regards,

Lionel

Pip_Installation.png

Pip_Installation_2.png

Pip_Installation_3.png

Pip_Installation_4.png

Pip_Installation_5.png

jozeftomas_2020

Banned

Jul 8, 2018

Hello
This is my process
I want to correct the spelling mistakes in all the tweets, and then do k-means clustering. But I'm new to Python,
and I do not know how to write the Python code in RapidMiner to achieve this goal.
If possible, please help me, dear friend.
With respect,
I will be grateful. I'm waiting for your help.

py.zip

lionelderkrikor

New Altair Community Member

Jul 8, 2018

Hi @jozeftomas_2020,

Here is a working process to correct misspelled tweets:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="8.0.010" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="136">
        <parameter key="connection" value="dkk"/>
        <parameter key="query" value="iphone"/>
        <parameter key="limit" value="10"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="136">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="2.0"/>
        <parameter key="prune_above_percent" value="70.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="581" y="34"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="514" y="238">
        <list key="macros">
          <parameter key="textAttribute" value="'text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="581" y="136">
        <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(txt) : &#10;  &#10;  b = TextBlob(txt)&#10;  return b.correct()&#10;&#10;&#10;def rm_main(data):&#10;&#10;&#10;  data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;&#10;  return data"/>
      </operator>
      <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Text Atribute" to_port="through 1"/>
      <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Note that, depending on the number of tweets, the correction may take several minutes.
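
One possible way to save time, if many tweets are identical (retweets, for example), is to correct each distinct text only once. Here is a sketch using `functools.lru_cache`. The `make_cached_corrector` wrapper and the `slow_corrector` stand-in are invented for this example; in the process above the wrapped function would be the textblob-based one.

```python
from functools import lru_cache


def make_cached_corrector(correct_fn, maxsize=None):
    """Wrap a text -> text corrector so each distinct text is corrected once."""
    @lru_cache(maxsize=maxsize)
    def cached(text):
        return correct_fn(text)
    return cached


# Stand-in corrector, used only to demonstrate the caching. In the process
# above it would be something like: lambda t: str(TextBlob(t).correct())
calls = []


def slow_corrector(text):
    calls.append(text)
    return text.replace("helo", "hello")


corrector = make_cached_corrector(slow_corrector)
tweets = ["helo world", "nice phone", "helo world"]
corrected = [corrector(t) for t in tweets]
print(corrected)
print("corrector really called", len(calls), "times")
```

This does not make a single correction faster. It only helps when the same text appears several times in the data.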

Regards,

Lionel

Thomas_Ott

New Altair Community Member

Jul 9, 2018

@lionelderkrikor this is quite handy, thank you for this!

lionelderkrikor

New Altair Community Member

Jul 9, 2018

Hi,

You're welcome, @Thomas_Ott.

Happy corrections !

Regards,

Lionel

jozeftomas_2020

Banned

Jul 10, 2018

Hello
Thank you so much
Your code really impresses me.
I do not know how to thank you.
However, in one comment I typed a misspelled word and ran the program, and the word was not corrected.
Could you check? For example, the input was:
iphon worst phone appl made helo meseg
After running, the output was:
iphon worst phone appl made helo meseg
I wanted the two words helo and meseg to be corrected to hello and message.
Thank you

lionelderkrikor

New Altair Community Member

Jul 10, 2018

Hi @jozeftomas_2020,

I executed the script with your examples and here is what I get (in your case, I don't know why, no correction is performed):

That's not what you were expecting, but the spelling corrector tries to find the correct word nearest to the misspelled word.

So:

- "held" is ranked ahead of "hello" for "helo" (both are one edit away from it).

- "meet" is nearer to "meseg" than "message".

I think it will be very difficult to do better.
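
To make "nearest" concrete, here is a small sketch that counts the number of single-character edits between two words (plain Levenshtein distance). It is not the code textblob uses internally, only a way to measure the distances for these examples.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


for wrong, right in [("helo", "held"), ("helo", "hello"),
                     ("meseg", "meet"), ("meseg", "message")]:
    print(wrong, "->", right, ":", levenshtein(wrong, right))
```

The distances are 1, 1, 2 and 3. Norvig's approach only considers candidates up to two edits away, which would explain why "message" (three edits from "meseg") is never proposed. As far as I understand, when several candidates are equally near, as with "held" and "hello", the corrector picks the word it considers more frequent.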

Regards,

Lionel

Spelling_Correction_4.png

jozeftomas_2020

Banned

Jul 11, 2018

Hello.
Yes you are right.
Thanks again.
Is it possible just to send your last example xml file?
Thankful

lionelderkrikor

New Altair Community Member

Jul 11, 2018

Hi @jozeftomas_2020,

Here is the last process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Id,text&#10;1,helo&#10;2,meseg&#10;3,iphon worst phone appl made helo meseg"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
        <list key="macros">
          <parameter key="textAttribute" value="'text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
        <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(text) : &#10;  &#10;  b = TextBlob(text)&#10;  return b.correct()&#10;&#10;&#10;def rm_main(data):&#10;&#10;&#10;  data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;&#10;  return data"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
      <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,

Lionel
