Words/String Matching Producing true or false

asn4293
asn4293 New Altair Community Member
edited November 5 in Community Q&A

I have a data set, for example:

 

Internal Experience Functional Area
Marketing & Sales
Marketing & Sales
Controlling/Accounting
Marketing & Sales|Marketing & Sales
General Management
Marketing & Sales
Logistics|Logistics|Logistics|Logistics
Logistics
Marketing & Sales

I want to match it against my requirements xlsx file, which contains this column:

Match words
sales

 

The matching is on strings and should not be case sensitive, meaning it should work whether the letters are lowercase or uppercase. After matching, it should give me a result of true or false (or 1 or 0). The result should look like this:

Internal Experience Functional Area       Matching result
Marketing & Sales                         TRUE
Marketing & Sales                         TRUE
Controlling/Accounting                    FALSE
Marketing & Sales|Marketing & Sales       TRUE
General Management                        FALSE
Marketing & Sales                         TRUE
Logistics|Logistics|Logistics|Logistics   FALSE
Logistics                                 FALSE
Marketing & Sales                         TRUE


I don't know how this can be done. Please help.

Answers

  • kypexin
    kypexin New Altair Community Member

    Hi @asn4293

     

    Let's assume that 'Area' is a short name for the attribute containing strings. 

    Use the 'Generate Attributes' operator to create a new attribute named 'MatchingResult', with the following parameters:

     

    attribute name: MatchingResult

    function expressions: contains(lower([Area]), "sales")

     

    This generates 'true' when the lowercased 'Area' value contains the substring 'sales', and 'false' otherwise.

     

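For readers who want to see the logic of that expression outside RapidMiner, here is a minimal Python sketch of the same case-insensitive substring test. The sample values come from the question; the function name `contains_keyword` is made up for illustration and is not part of RapidMiner.

```python
# Sample values taken from the question; "sales" is the single keyword.
areas = [
    "Marketing & Sales",
    "Controlling/Accounting",
    "Marketing & Sales|Marketing & Sales",
    "General Management",
    "Logistics|Logistics|Logistics|Logistics",
]


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive substring test (hypothetical helper name)."""
    return keyword.lower() in text.lower()


# One boolean per row, analogous to the 'MatchingResult' attribute.
matching_result = [contains_keyword(area, "sales") for area in areas]

for area, matched in zip(areas, matching_result):
    print(f"{area}\t{matched}")
```

Because both sides are lowercased before the comparison, "Sales", "SALES" and "sales" are all treated the same.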

  • asn4293
    asn4293 New Altair Community Member

    @kypexin

    Thank you for your feedback, but this is only practical when there is a single search term and we can write the expression by hand each time. I have approximately 1000 terms to match against a large data set, so this approach would not be suitable.

    I want to specify a column that contains the words to be matched against the data.


  • kypexin
    kypexin New Altair Community Member

    Hi @asn4293

     

    So the task becomes much more general: you have to fuzzy match two columns of text attributes, which technically makes it a many-to-many matching. This sounds like a rather tricky task to accomplish with RapidMiner; at least I cannot come up with an easy solution off the top of my head... However, my suggestions are:

     

    1. Have a look at a very interesting trick from @BalazsBarany's website on how to perform generic joins in RM; maybe it can give you some inspiration: https://datascientist.at/2016/06/generic-joins-in-rapidminer/#english
    2. Maybe you should also consider a Python script to accomplish this task, which in the end might be much faster and simpler to implement.

    If you could share the actual files you need to match, we could probably try to play around with them to get a faster solution with RM.
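As a rough idea of what the Python route in suggestion 2 could look like, here is a minimal sketch that flags each value containing at least one of the match words, ignoring case. It is only an illustration under assumptions: the function name `match_any` and the inline sample lists are hypothetical, and loading the two columns from the actual xlsx files is left out.

```python
from typing import Iterable, List


def match_any(texts: Iterable[str], keywords: Iterable[str]) -> List[bool]:
    """Return True for each text containing at least one keyword (case-insensitive)."""
    # Normalise the keywords once; skip empty cells so "" never matches every row.
    lowered = [k.strip().lower() for k in keywords if k and k.strip()]
    results = []
    for text in texts:
        haystack = text.lower()
        results.append(any(k in haystack for k in lowered))
    return results


# Hypothetical sample data; in practice both lists would be read from the files.
areas = [
    "Marketing & Sales",
    "Controlling/Accounting",
    "General Management",
    "Logistics|Logistics",
]
match_words = ["sales", "logistics"]

flags = match_any(areas, match_words)
for area, flag in zip(areas, flags):
    print(f"{area}\t{flag}")
```

This is a plain nested loop, so the work grows with the number of rows times the number of keywords. With around 1000 keywords, one option would be to combine them into a single compiled regular expression (escaping each word with `re.escape`) and search each row once.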

  • SGolbert
    SGolbert New Altair Community Member

    Hi @asn4293,

     

    I may have found a solution by playing around with Process Documents from Data (from the Text Processing extension):

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="r_scripting:execute_r" compatibility="8.1.000" expanded="true" height="103" name="Execute R" width="90" x="112" y="34">
    <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;rm_main = function()&#10;{&#10; &#10; data2 &lt;- data.table(Area = c(&quot;MArketing &amp; SaleS&quot;, &quot;Controlling/Accounting&quot;,&#10; &#9;&#9;&#9;&#9;&#9;&#9;&quot;Logistics&quot;))&#10;&#10; words = data.table(Match = c(&quot;sales&quot;, &quot;logistics&quot;))&#10; &#10; # connect 2 output ports to see the results&#10; return(list(data2, words))&#10;}&#10;"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="238">
    <list key="function_descriptions">
    <parameter key="OriginalText" value="Area"/>
    </list>
    </operator>
    <operator activated="true" class="remember" compatibility="8.1.003" expanded="true" height="68" name="Remember" width="90" x="380" y="34">
    <parameter key="name" value="match_words"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="238">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="vector_creation" value="Binary Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="Area" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34"/>
    <operator activated="true" class="recall" compatibility="8.1.003" expanded="true" height="68" name="Recall" width="90" x="380" y="187">
    <parameter key="name" value="match_words"/>
    <parameter key="remove_from_store" value="false"/>
    </operator>
    <operator activated="true" class="operator_toolbox:filter_tokens_using_exampleset" compatibility="1.0.000" expanded="true" height="82" name="Filter Tokens Using ExampleSet" width="90" x="648" y="34">
    <parameter key="attribute" value="Match"/>
    <parameter key="invert_filter" value="true"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens Using ExampleSet" to_port="document"/>
    <connect from_op="Recall" from_port="result" to_op="Filter Tokens Using ExampleSet" to_port="example set"/>
    <connect from_op="Filter Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="715" y="238">
    <list key="function_descriptions">
    <parameter key="Match" value="if(text == &quot;&quot;, &quot;False&quot;, &quot;True&quot;)"/>
    </list>
    </operator>
    <connect from_op="Execute R" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Execute R" from_port="output 2" to_op="Remember" to_port="store"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Remember" from_port="stored" to_port="result 2"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Note that I generated a couple of test example sets with R, but that's only for my convenience (R is not at all necessary). The idea is to tokenize the string, keep only the tokens that match the keywords, and then check whether the resulting string is empty.
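The same idea (tokenize, lowercase, keep only the tokens found in the keyword list, then test for emptiness) can be sketched in a few lines of Python. This is an illustration of the logic only, not a translation of the RapidMiner operators: the helper names are hypothetical, and splitting on non-letter characters is an assumption about how the tokenizer behaves.

```python
import re
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Split into runs of letters (assumed to mimic a non-letter tokenizer)."""
    return re.findall(r"[^\W\d_]+", text)


def matched_tokens(text: str, keywords: Iterable[str]) -> List[str]:
    """Lowercase the tokens and keep only those that appear in the keyword list."""
    wanted = {k.lower() for k in keywords}
    return [tok.lower() for tok in tokenize(text) if tok.lower() in wanted]


# Same test data as in the R script of the process above.
areas = ["MArketing & SaleS", "Controlling/Accounting", "Logistics"]
match_words = ["sales", "logistics"]

# A row matches when at least one token survives the filter.
results = [(area, bool(matched_tokens(area, match_words))) for area in areas]
for area, matched in results:
    print(f"{area}\t{matched}")
```

Note that this compares whole tokens, so a keyword such as "sale" would not match "Sales", whereas a substring test would.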

     

    I leave it up to you to refactor this "quick and dirty" solution XD

     

    Kind regards,

    Sebastian

  • kypexin
    kypexin New Altair Community Member

    @SGolbert pretty neat! 

  • asn4293
    asn4293 New Altair Community Member

    I don't know how to put the R code in. Can you please help me rectify it, @SGolbert? One file has the data in it; the second file is the one it is getting data from.

    Data file
    This file is the drop down which is data to look into