Replacing whole words with a dictionary
EL75
New Altair Community Member
Hi RapidMiner community,
I can't find a way to replace whole words after a "Read Excel" operator. If I use a "Replace (Dictionary)" operator linked to an Excel file, words are only partially substituted, because the text is not tokenized: sometimes part of a word is substituted and then glued onto the rest of the word. For instance, if my dictionary has many entries for misspelled forms of the word « application » (e.g. app, apple, etc.), the result can be « applicationlicationncation »... My data set contains many misspelled terms, so I'd like to use such a process to substitute the common misspellings.
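For anyone who wants to reproduce the symptom outside RapidMiner, here is a minimal Python sketch. It assumes the dictionary replacement behaves like plain sequential substring substitution, which is only an assumption about the operator and is not confirmed in this thread. The function `naive_replace` and the sample dictionary are hypothetical.

```python
def naive_replace(text, dictionary):
    """Apply each dictionary entry as a plain substring replacement, in order."""
    for wrong, right in dictionary.items():
        text = text.replace(wrong, right)
    return text


# Hypothetical misspelling dictionary: wrong form -> correct form.
misspellings = {"app": "application", "aplication": "application"}

# "app" also occurs inside the correctly spelled word, so it is expanded there too.
garbled = naive_replace("my application", misspellings)
print(garbled)
```

The correctly spelled word contains the short entry as a substring, so it gets rewritten even though it was never misspelled.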
Inside the « text processing » operator, I could do it after tokenization, but there's no operator that handles this with a dictionary (as far as I've seen). The « Replace Tokens » operator could do the job, but I would have to enter, one by one, all the entries that I currently have in my misspelling dictionary.
Thanks for your help!
0
Best Answer
-
Did you tick the regex box on the replace operator? You also do not need the \b in your "with" part, only in your "substitute" part.
Extending the match range is a matter of changing what you allow at the boundaries, as in the attached simple example.
Input: my andoid aplication isn't an androit app.
Output: my android application isn't an android application.
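The same idea can be sketched in Python for readers without the attachment. This is not the attached RapidMiner process. It is a hedged illustration that wraps every dictionary entry in \b word boundaries and leaves the replacement values plain. The dictionary contents and the names `corrections`, `whole_word` and `fix_spelling` are assumptions made up for the example.

```python
import re

# Hypothetical dictionary: misspelled form -> correct form.
corrections = {
    "andoid": "android",
    "androit": "android",
    "aplication": "application",
    "app": "application",
}

# One alternation wrapped in \b...\b so only whole words can match.
# Longer entries are listed first so they are tried before shorter ones.
entries = sorted(corrections, key=len, reverse=True)
whole_word = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in entries) + r")\b")


def fix_spelling(text):
    """Replace each whole-word match with its dictionary value in a single pass."""
    return whole_word.sub(lambda m: corrections[m.group(0)], text)


fixed = fix_spelling("my andoid aplication isn't an androit app.")
print(fixed)
```

Because the text is scanned in a single pass, an already corrected word is never fed back into the dictionary, which avoids the chained expansions described in the question.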
5
Answers
-
Use regex word boundaries. For instance, \bapp\b will only match the exact word app, whether it appears at the beginning, in the middle, or at the end of a sentence.
1
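A quick way to see the effect of \b is to count matches with and without it. The snippet below uses Python's `re` module purely as an illustration, and the sample text is made up.

```python
import re

text = "app, apple, my app. webapp"

# Without boundaries, "app" is also found inside "apple" and "webapp".
loose = re.findall(r"app", text)

# With \b on both sides, only the standalone word "app" matches.
strict = re.findall(r"\bapp\b", text)

print(len(loose), len(strict))
```

The unbounded pattern matches four times, while the bounded one matches only the two standalone occurrences.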
-
Thanks, Kayman, for your response. I've tried it: I duplicated my Excel sheet (see the enclosed file), but the Replace operator treats \b as part of the word and not as a regex, so the operator simply doesn't find the word and replaces nothing. And as I have many misspelled forms of "application" ...
Best regards
0
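The symptom described here, where \b is treated as part of the word, is what one would expect if the pattern is matched as literal text instead of as a regular expression. The Python sketch below models the two modes with `str.replace` and `re.sub`. It is only an analogy for the operator's regex option, not a description of its internals.

```python
import re

text = "my app is an app"
pattern = r"\bapp\b"

# Literal mode: the backslashes are ordinary characters, and the text
# contains no literal "\bapp\b", so nothing is replaced.
literal_result = text.replace(pattern, "application")

# Regex mode: \b is a zero-width word boundary, so whole words match.
regex_result = re.sub(pattern, "application", text)

print(literal_result)
print(regex_result)
```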