Entity extraction - matching strings.

zacev
zacev New Altair Community Member
edited November 2024 in Community Q&A

Hello,

I just started out with RapidMiner. I am interested in mining text documents concerning security vulnerabilities and exposures.

For instance, reports concerning exposures always contain a list of affected products. Is it possible to match a phrase against a specific string? For instance, the section titled AFFECTED PRODUCTS has the following description:

The following Philips XperIM Connect versions are affected:
- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.

I have successfully learned the basics of text processing and would like to move on to solving this problem. As a result, I would like to print out the names of the affected products and series.

 

Thanks for any possible hints.

Answers

  • IngoRM
    IngoRM New Altair Community Member
    Answer ✓

    Hi,

     

    I am not 100% sure I understood you correctly, but do you want to extract the product names and make them available as an extra attribute? And do the sentences all follow the same pattern of "...are affected: (product name).."?

     

    If I got you right, then the Replace operator with regular expressions and capturing groups will be the solution. Regular expressions are a somewhat complex topic, but they are worth learning if you are serious about text analytics, and also more generally for complex data preparation tasks.

     

    There are some online tutorials. A quick search brought up this one, which looked decent at first sight: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
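
To make the capturing-group idea concrete, here is a minimal sketch in Python rather than RapidMiner (the choice of Python and the `extract_product` helper name are assumptions made for illustration only). RapidMiner's Replace operator uses Java regular expression syntax, where a captured group is referenced as `$1`; Python's `re.sub` writes the same reference as `\1`.

```python
import re

# Everything between "The following " and " versions are affected:" is
# captured as group 1; the trailing ".*" consumes the rest of the line.
PATTERN = re.compile(r"The following (.*) versions are affected:.*")


def extract_product(text: str) -> str:
    """Replace the whole sentence with the captured product name.

    If the pattern does not match, the text is returned unchanged.
    """
    return PATTERN.sub(r"\1", text)


sentence = (
    "The following Philips XperIM Connect versions are affected: "
    "- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions."
)
print(extract_product(sentence))  # prints "Philips XperIM Connect"
```

Because a non-matching text comes back unchanged, it is worth checking for rows where the new value still equals the original sentence; those are the reports that do not follow the expected pattern.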

     

    Cheers,

    Ingo

  • zacev
    zacev New Altair Community Member

    Hi,

    More precisely, I would like to extract information that has value for the end user. Instead of reading each whole document, the user should be able to mine several reports and get the affected product names, as you mentioned. Could you expand a little on the possible solution in RapidMiner? Is it possible to get results using only RM?

     

    Edit: I've just discovered a plugin for RM called Information Extraction. There are several articles about it; maybe that would be an interesting solution too?

  • IngoRM
    IngoRM New Altair Community Member
    Answer ✓

    Hi,

     

    Yes, this might be a possible solution.  You might also want to check out the extension from Aylien to see if this helps.

     

    My suggestion was actually much simpler than training an entity extraction model (which might indeed be necessary). I was just suggesting that, if the texts all follow the same structure, regular expressions and Replace could already do the trick.

     

    Here is a process to show you what I mean:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
            <list key="attribute_values">
              <parameter key="Text" value="&quot;The following Philips XperIM Connect versions are affected: - XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_copy" compatibility="7.2.000" expanded="true" height="82" name="Generate Copy" width="90" x="246" y="34">
            <parameter key="attribute_name" value="Text"/>
            <parameter key="new_name" value="Product"/>
          </operator>
          <operator activated="true" class="replace" compatibility="7.2.000" expanded="true" height="82" name="Replace" width="90" x="380" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Product"/>
            <parameter key="replace_what" value="The following (.*) versions are affected:.*"/>
            <parameter key="replace_by" value="$1"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Copy" to_port="example set input"/>
          <connect from_op="Generate Copy" from_port="example set output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
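
The process above copies the Text attribute to Product and then rewrites Product with a regular expression. A rough Python equivalent is sketched below, purely as an illustration: the `add_product_column` helper and the list-of-dicts layout are hypothetical, not part of RapidMiner. It also assumes the report text keeps its original line break after "are affected:", which the one-line sample in the process does not have. Since `.` does not match a newline by default, the sketch adds the `(?s)` flag so that the trailing `.*` also removes the bullet list on the following lines.

```python
import re

# (?s) lets "." match newlines, so the trailing ".*" also swallows the
# bullet list on the following lines. The lazy "(.*?)" stops at the first
# occurrence of " versions are affected:".
PRODUCT_RE = re.compile(r"(?s)The following (.*?) versions are affected:.*")


def add_product_column(rows):
    """Mimic Generate Copy + Replace: copy Text to Product, then rewrite it."""
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        new_row["Product"] = PRODUCT_RE.sub(r"\1", row["Text"])
        result.append(new_row)
    return result


# Sample text from the question, with the line break preserved.
rows = [
    {
        "Text": "The following Philips XperIM Connect versions are affected:\n"
                "- XperIM Connect system running Windows XP, "
                "Version 1.5.12 and prior versions."
    },
]

for row in add_product_column(rows):
    print(row["Product"])  # prints "Philips XperIM Connect"
```

Java regular expressions accept the same inline `(?s)` flag, so prefixing the `replace_what` value with it should carry over to the Replace operator if the real documents contain line breaks, though that is untested here.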

    So you have a couple of options to explore now.

     

    Cheers,

    Ingo

  • Robin1992
    Robin1992 New Altair Community Member
    Hi, I have a similar problem and am still looking for a solution. Do you have the final model that you produced in RapidMiner?
  • Telcontar120
    Telcontar120 New Altair Community Member
    I would recommend using Entity Extraction operators from either Rosette or Aylien.
  • sgenzer
    sgenzer
    Altair Employee
    I agree with @Telcontar120. I will say that I now guide everyone to Rosette rather than Aylien. Aylien no longer supports their extension, and it has a high error rate (i.e. numerous bugs, not user errors).

    Scott