How to correct misspelled words?
Hello,
How can I correct the spelling of words in RapidMiner?
For example:
meeseg -> message
or
veeeery gooood -> very good
Does anyone know?
Answers
-
I think I answered this same question in another thread. If you generate a wordlist first and compile a list of the substitutions you want to make, then you can use the "Replace Tokens" operator. If you are looking for an automated way to do this (i.e., RapidMiner identifies misspellings and replaces them automatically), there isn't a built-in solution for that. There might be some third-party software you could access via an API, though.
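As a rough illustration of the substitution-list idea (outside RapidMiner, in plain Python), the sketch below replaces whole tokens using a hand-built mapping. The `SUBSTITUTIONS` dictionary and the `replace_tokens` helper are hypothetical names chosen for this example; this is not how the "Replace Tokens" operator is implemented.

```python
# Hand-compiled substitution list (illustrative entries taken from the question).
SUBSTITUTIONS = {
    "meeseg": "message",
    "veeeery": "very",
    "gooood": "good",
}


def replace_tokens(text, substitutions=SUBSTITUTIONS):
    """Replace each whitespace-separated token that appears in the mapping."""
    tokens = text.split()
    # Look tokens up in lowercase; unknown tokens are kept unchanged.
    return " ".join(substitutions.get(token.lower(), token) for token in tokens)


if __name__ == "__main__":
    print(replace_tokens("veeeery gooood"))
```

The obvious limitation is that every misspelling has to be listed by hand, which is why this only covers the non-automated case.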
0 -
Hi @jozeftomas_2020,
As @Telcontar120 said, there isn't a built-in solution for performing what you want to do.
So I propose using a Python script with the textblob library.
However, when the words are too badly misspelled (like the examples you gave), the script is not able to correct them.
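One possible workaround for stretched words such as "veeeery gooood" is to shorten long runs of a repeated letter before running any spelling corrector. The sketch below is an assumption on my side (it is not part of textblob), and `squeeze_repeats` is a hypothetical helper name.

```python
import re


def squeeze_repeats(text, max_run=2):
    """Shorten any run of the same character longer than max_run to max_run."""
    # (.)\1{max_run,} matches a character followed by at least max_run copies of itself.
    pattern = re.compile(r"(.)\1{" + str(max_run) + r",}")
    return pattern.sub(lambda match: match.group(1) * max_run, text)


if __name__ == "__main__":
    print(squeeze_repeats("veeeery gooood"))
```

Keeping two letters leaves legitimate doubles such as the "oo" in "good" intact, at the price of still needing a corrector afterwards for results like "veery". It does not help with a word like "meeseg", which has no run longer than two.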
FYI, spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate.
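To give an idea of how this kind of corrector works, here is a minimal sketch in the spirit of Norvig's approach: generate every string within one or two edits of the word and keep the known word with the highest count. The tiny `WORD_COUNTS` vocabulary is made up for illustration; the real implementation relies on a large word-frequency list.

```python
from collections import Counter

# Toy vocabulary with made-up counts (an assumption for this example only).
WORD_COUNTS = Counter({"message": 50, "very": 200, "good": 150, "spelling": 20, "the": 500})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings that are two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Return the most frequent known word within two edits, else the word itself."""
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


if __name__ == "__main__":
    for sample in ("mesage", "speling", "meeseg", "veeeery"):
        print(sample, "->", correct(sample))
```

With this toy vocabulary, "mesage" and "speling" are one edit away from a known word and get fixed, while "meeseg" and "veeeery" are more than two edits away from "message" and "very" and are returned unchanged, which matches the behaviour described above.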
The process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.1.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="Id,text&#10;1,meeseg&#10;2,veeeery gooood"/>
</operator>
<operator activated="true" class="set_macros" compatibility="8.2.000" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
<list key="macros">
<parameter key="textAttribute" value="'text'"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
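<parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, injected by the textAttribute macro&#10;Text_Attribute = %{textAttribute}&#10;&#10;def spellingCorrection(text):&#10;    b = TextBlob(text)&#10;    # Convert the corrected TextBlob back to a plain string&#10;    return str(b.correct())&#10;&#10;def rm_main(data):&#10;    data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;    return data"/>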
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
<connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
To execute this process, you have to:
- Install Python on your computer
- Install the textblob library
- Install the Execute Python operator from the marketplace
- Set the name of your text attribute in the Set Macros operator
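Before running the process, it may help to check that the required libraries are visible to Python. The small sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, and the check is only meaningful if it is run with the same Python installation that RapidMiner is configured to use.

```python
import importlib.util


def missing_packages(names=("pandas", "textblob")):
    """Return the top-level packages from names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```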
I hope it helps,
Regards,
Lionel
3 -
Hello,
Thank you. I did not understand the last part you mentioned, the macro setup.
What exactly should I do?
I would like to use this R code:
https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html
But I do not know how to run it in RapidMiner on my data.
Could you help me?
I installed Anaconda, but I do not know how to install textblob and use it in RapidMiner.
Can someone help?
Thank you
0 -
Hi @jozeftomas_2020,
1. To set up the Set Macros operator:
Have you tried importing the process I shared? In the parameters of this operator, enter (in the "values" column) the name of the attribute that contains the misspelled words.
2. To install textblob:
a. Press Win + R to open the Run window
b. Type "cmd" and then click OK
c. Type "pip install textblob" and press Enter
textblob will be automatically installed on your computer.
Regards,
Lionel
0 -
Hi dear friend
I did all the steps
I want to correct spelling mistakes in my data, which has a text column
I loaded the data and then with the 'select attribute' operator I chose my text column and then I connected to the 'execute python' operator.
The column name I want to correct is 'text'.
But when I run it, I get this error.
I do not know how to solve it.
Can you help me once more?
Thanks a lot
0 -
Hi
Have you set the name of your text attribute (text) in the Set Macros operator with quotes (value = 'text')?
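The quotes matter because, as far as I understand it, the macro is pasted as plain text into the script before Python runs it. The sketch below only mimics that behaviour with a string replacement (this is an assumption for illustration, not RapidMiner's actual code), and `expand_macro` and `run_line` are hypothetical helpers.

```python
SCRIPT_LINE = "Text_Attribute = %{textAttribute}"


def expand_macro(template, name, value):
    """Mimic a purely textual macro replacement."""
    return template.replace("%{" + name + "}", value)


def run_line(line):
    """Execute one line of Python and return the resulting Text_Attribute."""
    namespace = {}
    exec(line, namespace)
    return namespace["Text_Attribute"]


if __name__ == "__main__":
    # With quotes the line becomes: Text_Attribute = 'text'
    print(run_line(expand_macro(SCRIPT_LINE, "textAttribute", "'text'")))
    # Without quotes it becomes: Text_Attribute = text, and text is an undefined name
    try:
        run_line(expand_macro(SCRIPT_LINE, "textAttribute", "text"))
    except NameError as error:
        print("NameError:", error)
```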
Regards,
Lionel
0 -
-
Hi, I followed the same steps to install textblob, but I get this error.
What should I do?
"
2. To install textblob:
a. Type Win + R to open a window
b. Type "cmd" and then click OK
c. Type "pip install textblob" and click enter
textblob will be automatically installed on your computer."
0 -
Jozeftomas, can you share your process and your dataset? Tomorrow, I will try to find and fix the bug you mentioned.
Regards,
Lionel
0 -
Hi Student_compute,
The 'pip' command is installed with Python.
So first install Python (Python 3.x) via Anaconda.
Regards,
Lionel
0 -
Hello
But I installed Python first.
What should I do now?
Thank you, my friend
0 -
Hello, thank you very much for your response and kindness.
I got the data from Twitter, as shown in the photo above.
I have a Search Twitter operator before Nominal to Text.
Can you tell what the problem is?
And how can I run this preprocessing code on my tweets in RapidMiner?
https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html
Thank you for any pointers to get started.
With respect
0 -
Hi @jozeftomas_2020,
It will be very hard for us to understand your bug without your process. Can you share it?
And what do you ultimately want to do? Correct the misspelled tweets?
Regards,
Lionel
0 -
Hi @student_compute,
If you have indeed installed Python, 'pip' must be installed too. So I see only one solution:
You have to update your "environment variables":
1/
- Search for the pip.exe file on your computer. By default it is located in C:\Users\username\Anacondax\Scripts or C:\Users\username\Pythonx\Scripts (where x = 2 or 3 according to the version of Python you installed).
or
- Type 'pip.exe' (with quotes) in the Windows 10 search bar (bottom-left), then right-click on the result and select "Open file location".
2/ Then (here on Windows 10):
- Open an Explorer window, then click on Properties and continue to the environment variables settings, where the folder containing pip.exe can be added to the Path variable.
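As a complement to the steps above, here is a small sketch (standard library only) that checks whether `pip` can be found on the PATH and builds an alternative command that calls pip through the Python interpreter itself, which does not depend on the PATH. `find_pip` and `pip_install_command` are hypothetical helper names; the script only prints the command and does not install anything.

```python
import shutil
import sys


def find_pip():
    """Return the full path of pip if it is on the PATH, otherwise None."""
    return shutil.which("pip")


def pip_install_command(package):
    """Build a command that runs pip via the current interpreter."""
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    print("pip found on PATH:", find_pip())
    print("Alternative command:", " ".join(pip_install_command("textblob")))
```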
I hope it helps,
Regards,
Lionel
0 -
Hello
This is my process
I want to correct the spelling mistakes in the tweets, and then run k-means clustering. But I'm new to Python,
and I do not know how to write Python code in RapidMiner to achieve this goal.
Please help, dear friend, if possible.
I will be grateful. I'm waiting for your help.
With respect
0 -
Hi @jozeftomas_2020,
Here is the working process to correct misspelled tweets:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="8.0.010" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="136">
<parameter key="connection" value="dkk"/>
<parameter key="query" value="iphone"/>
<parameter key="limit" value="10"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Text"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="2.0"/>
<parameter key="prune_above_percent" value="70.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
<operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="581" y="34"/>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="514" y="238">
<list key="macros">
<parameter key="textAttribute" value="'text'"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="581" y="136">
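<parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, injected by the textAttribute macro&#10;Text_Attribute = %{textAttribute}&#10;&#10;def spellingCorrection(txt):&#10;    b = TextBlob(txt)&#10;    # Convert the corrected TextBlob back to a plain string&#10;    return str(b.correct())&#10;&#10;def rm_main(data):&#10;    data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;    return data"/>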
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Set Text Atribute" to_port="through 1"/>
<connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Note that, depending on the number of tweets, the correction may take several minutes.
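If the run time becomes a problem, one option is to correct each distinct word only once and reuse the result, since tweets on the same topic repeat many words. The sketch below assumes that word-by-word correction is acceptable; `make_cached_corrector` is a hypothetical helper, and `correct_word` stands for any function that corrects a single word (a stub is used here so the example runs without textblob).

```python
from functools import lru_cache


def make_cached_corrector(correct_word):
    """Wrap a word-level corrector so each distinct word is corrected only once."""
    cached = lru_cache(maxsize=None)(correct_word)

    def correct_text(text):
        # Split on whitespace, correct word by word, and join with single spaces.
        return " ".join(cached(word) for word in text.split())

    return correct_text


if __name__ == "__main__":
    calls = []

    def stub_corrector(word):
        """Stand-in for a real spelling corrector; it only records its calls."""
        calls.append(word)
        return word.upper()

    correct_text = make_cached_corrector(stub_corrector)
    print(correct_text("iphone is good iphone is great"))
    print("words actually corrected:", len(calls))
```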
Regards,
Lionel
1 -
@lionelderkrikor this is quite handy, thank you for this!
0 -
Hello,
Thank you so much.
Your code really surprised me; I do not know how to thank you.
But in one comment I typed a misspelled word and ran the program, and the word was not corrected.
Could you check? For example:
iphon worst phone appl made helo meseg
After running:
iphon worst phone appl made helo meseg
I wanted the two words helo and meseg to be corrected to hello and message.
Thank you
0 -
Hi @jozeftomas_2020,
I executed the script with your examples and here is what I get (in your case, I don't know why, no correction is performed):
That's not what you were expecting, but the spelling corrector tries to find the nearest correct word to the misspelled word.
So:
- "held" is at least as near to "helo" as "hello" (one edit each), so the corrector picks "held".
- "meet" is nearer to "meseg" than "message".
I think it will be very difficult to do better.
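To make "nearest" concrete, here is a small sketch that counts the single-character edits between two words using the classic Levenshtein distance. Note that this is a simplification: Norvig's corrector also counts a swap of two adjacent letters as one edit, which does not change the result for these examples.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    pairs = [("helo", "held"), ("helo", "hello"), ("meseg", "meet"), ("meseg", "message")]
    for wrong, candidate in pairs:
        print(wrong, "->", candidate, ":", levenshtein(wrong, candidate))
```

Under this measure "held" and "hello" are both one edit away from "helo", so the tie has to be broken some other way (Norvig's approach uses word frequency), while "meet" is two edits away from "meseg" and "message" is three.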
Regards,
Lionel
0 -
Hello.
Yes you are right.
Thanks again.
Is it possible to just send the XML file of your last example?
Thank you
0 -
Hi @jozeftomas_2020,
Here is the last process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="Id,text&#10;1,helo&#10;2,meseg&#10;3,iphon worst phone appl made helo meseg"/>
</operator>
<operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
<list key="macros">
<parameter key="textAttribute" value="'text'"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
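<parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, injected by the textAttribute macro&#10;Text_Attribute = %{textAttribute}&#10;&#10;def spellingCorrection(text):&#10;    b = TextBlob(text)&#10;    # Convert the corrected TextBlob back to a plain string&#10;    return str(b.correct())&#10;&#10;def rm_main(data):&#10;    data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;    return data"/>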
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
<connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Lionel
0