"Text pattern identification"

ratheesan New Altair Community Member
edited November 5 in Community Q&A
Hello,
I have a text document related to insurance. The data contains phrases such as "No alcohol content" and "alcohol content". When I process these documents, RapidMiner counts all occurrences of "alcohol" together. How can I count only the occurrences of "alcohol" that have the neighboring term "no"?

Thanks
Ratheesan

Answers

  • RalfKlinkenberg New Altair Community Member
    Hello Ratheesan,

    You can use the RapidMiner text preprocessing operator TermNGramGenerator to count not only individual words but also word pairs and other multi-word terms. Alternatively, or in addition, you can place a TokenReplace operator before the StringTokenizer to map multi-word terms like "no alcohol" to single-word tokens:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
            </list>
            <list key="namespaces">
            </list>
            <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
            </operator>
            <operator name="Replace 'no alcohol' by 'noalcohol' to count it as one new word" class="TokenReplace">
                <list key="replace_dictionary">
                  <parameter key="no alcohol" value="noalcohol"/>
                </list>
            </operator>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="Consider pairs of words in addition to individual words" class="TermNGramGenerator">
            </operator>
        </operator>
    </operator>
    Cheers,
    Ralf
  • ratheesan New Altair Community Member
    Hello Ralf,
    I really appreciate your help. It is working fine. I now get all combinations of words: single words, 2-word terms, 3-word terms, and so on. However, the operator only lets me control the maximum number of words. I need to extract only combinations of 3 or more words. How can I achieve this?

    Thanks
    Ratheesan
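
For readers who want to check the two ideas from this thread outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation and does not use any RapidMiner API. The helper names (merge_phrases, tokenize, ngrams), the "_" joiner, and the sample sentence are all assumptions made for illustration. It shows (1) merging "no alcohol" into one token before tokenizing, so it is counted separately from "alcohol", and (2) generating n-grams with a minimum length as well as a maximum, so that only terms of 3 or more words are kept.

```python
import re
from collections import Counter


def merge_phrases(text, replacements):
    """Lowercase the text and replace each multi-word phrase with one token."""
    text = text.lower()
    for phrase, token in replacements.items():
        # \b restricts the match to whole words, so "casino alcohol" is untouched.
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", token, text)
    return text


def tokenize(text):
    """Return the runs of lowercase letters and digits found in the text."""
    return re.findall(r"[a-z0-9]+", text)


def ngrams(tokens, min_len, max_len):
    """Return every run of min_len..max_len consecutive tokens, joined by '_'."""
    result = []
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            result.append("_".join(tokens[i:i + n]))
    return result


# Hypothetical sample text, not taken from the original data.
text = "No alcohol content was found. Alcohol content was reported later."

# Idea 1: merge "no alcohol" into a single token before tokenizing.
counts = Counter(tokenize(merge_phrases(text, {"no alcohol": "noalcohol"})))
print(counts["noalcohol"], counts["alcohol"])  # prints: 1 1

# Idea 2: keep only n-grams of 3 or more words (here 3 and 4).
long_terms = ngrams(tokenize(text.lower()), min_len=3, max_len=4)
print(long_terms[0])  # prints: no_alcohol_content
```

If the n-gram operator in your RapidMiner version only exposes a maximum length, the same effect can be approximated by filtering the generated terms afterwards on the number of words they contain, which is what the min_len bound does in this sketch.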