Basic Question - replacing words in a document

apaul (New Altair Community Member)
edited November 5 in Community Q&A

Hi Experts,

 

I have a set of documents and would like to replace some word pairs with a single word before tokenizing.

 

ex. follow up --> follow-up

 Set up --> Setup

 

How do I do this?

 

Thanks,

Aji

Answers

  • Telcontar120 (New Altair Community Member)

    I don't know an easy way to do this in RapidMiner before you process the document without a lot of very complicated regular expression matching. But it is easy to do while you process the document: just add Generate n-Grams (with a length of 2) to your process, and then use Replace Tokens after that to substitute the ones you want.
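
    For illustration only (this is a plain Python sketch, not the RapidMiner operators named above, and the phrase list is hypothetical), the same idea of merging selected 2-grams back into single tokens after tokenization could look roughly like this:

        # Sketch: join selected bigrams into one token after tokenization.
        replacements = {("follow", "up"): "follow-up", ("set", "up"): "setup"}

        def merge_bigrams(tokens, replacements):
            out, i = [], 0
            while i < len(tokens):
                pair = tuple(t.lower() for t in tokens[i:i + 2])
                if len(pair) == 2 and pair in replacements:
                    out.append(replacements[pair])  # merge the matched pair
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            return out

        print(merge_bigrams("please follow up on the set up".split(), replacements))
        # ['please', 'follow-up', 'on', 'the', 'setup']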

     

  • kayman (New Altair Community Member)

    I would suggest using the Replace (Dictionary) operator. You first create a simple CSV with 2 columns, the first containing your current word(s) and the second your replacement word. If there are not too many words to replace you don't have to worry too much about regular expressions; just include all the variations. That's still manageable. Just take care of partial replacements, so choose your words wisely. You could use a very basic regex construct, the word boundary (\b when used in an operator, \\b when used in a text file), to ensure the words are matched as a whole.

     

    this would be something like :

     

    From                          To

    \\bMy word\\b             my-word

    \\bsample-data\\b      sample data

     

    Next, add this CSV to the operator, set the from and to columns, check the 'use regular expressions' box, and off you go.

     

    If you want to cover all of the possible typos etc., then more complex regular expressions can be an option, but if not, keep it simple and use the boundary character to keep your words complete.
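
    For illustration only (a minimal Python sketch outside RapidMiner; the phrase pairs below are just examples, not an actual dictionary file), the same dictionary-driven replacement with word boundaries could look like this:

        import re

        # Sketch: apply a from -> to dictionary with \b word boundaries so
        # only whole phrases are replaced, not parts of longer words.
        dictionary = {r"\bfollow up\b": "follow-up", r"\bset up\b": "setup"}

        def replace_with_dictionary(text, dictionary):
            for pattern, replacement in dictionary.items():
                text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
            return text

        print(replace_with_dictionary("Please follow up after the set up call.", dictionary))
        # Please follow-up after the setup call.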

  • apaul (New Altair Community Member)

    [Attachment: Rapidminer_SO.PNG]

    Thanks kayman!

    Nice suggestion, but it's not working as expected, meaning the words are not being replaced. Also, how do I tokenize an example set?

     

     


    @kayman wrote:

    I would suggest using the Replace (Dictionary) operator. You first create a simple CSV with 2 columns, the first containing your current word(s) and the second your replacement word. If there are not too many words to replace you don't have to worry too much about regular expressions; just include all the variations. That's still manageable. Just take care of partial replacements, so choose your words wisely. You could use a very basic regex construct, the word boundary (\b when used in an operator, \\b when used in a text file), to ensure the words are matched as a whole.

     

    this would be something like :

     

    From                          To

    \\bMy word\\b             my-word

    \\bsample-data\\b      sample data

     

    Next, add this CSV to the operator, set the from and to columns, check the 'use regular expressions' box, and off you go.

     

    If you want to cover all of the possible typos etc., then more complex regular expressions can be an option, but if not, keep it simple and use the boundary character to keep your words complete.




  • kayman (New Altair Community Member)

    What input does your Process Documents operator get? It seems like something is missing there.

    Apart from that, good point, as I overlooked the fact that you are working with documents and not data. Unfortunately the text operators don't have a real dictionary-driven replace option for this, though you could abuse Stem (Dictionary) for that to some extent.

     

    One option could be to use the Documents to Data operator; this will convert your full document to a data value, and that way you can use the Replace (Dictionary) option. Next you can use Process Documents from Data to process your cleaned content. It is important to ensure your data field is defined as text and not as nominal, as otherwise the process will happily ignore your content.

     

    A typical tokenization workflow would be something like transform cases -> tokenize (on words, spaces, line breaks, or whatever makes sense) -> stopword filtering -> stemming (if readability is less important) inside the Process Documents from Data operator, and prune using some trial-and-error settings.
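
    For illustration only (a rough Python analogue of that workflow, not the RapidMiner operators themselves; the stopword list is a tiny made-up subset and stemming is left out), transform cases -> tokenize -> filter stopwords could look like this:

        import re

        # Sketch: lowercase, tokenize on non-letter characters, drop stopwords.
        STOPWORDS = {"the", "a", "an", "on", "and", "of", "to"}

        def preprocess(document):
            text = document.lower()                            # transform cases
            tokens = re.findall(r"[a-z0-9\-]+", text)          # tokenize
            return [t for t in tokens if t not in STOPWORDS]   # filter stopwords

        print(preprocess("Schedule the follow-up after the setup call."))
        # ['schedule', 'follow-up', 'after', 'setup', 'call']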