Reg Exp Not Working as I Expected in Generate Attributes Operator

mikeb New Altair Community Member
edited November 5 in Community Q&A
I am trying to use a regular expression in the Generate Attributes operator to find a portion of text that contains a date.  I want to use the index() function to find the start of the text block containing the date.  The text block always looks something like this:
ad - 2013-03-21
and it always starts on a new line and the line ends right after the date.
index(text, "\nad ")
works to find the start of the text block.

However, I'd like to have something more robust that specifies the date format, to make sure I don't pick up any old line of text that starts with "ad ".  So I tried:
index(text,"ad.{3}20[0-1][0-9]-[0-9]{2}-[0-9]{2}")
and it finds no match in RapidMiner.  But if I use the same expression in Expresso, it does find a match in a text sample like:
Blah blah
Innovation Export
ad - 2013-03-21
pd - 2011-20-32
blah, blah
done

We also tried the same sort of regular expression with the Generate Extract operator, and that did not find the matching text either.

What am I doing wrong?

Answers

  • Hello

    Here's an example that uses Generate Extract. I used the '^' anchor in the regular expression to match the start of the string.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="30">
            <list key="attribute_values">
              <parameter key="text" value="&quot;ad - 2013-03-21&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="text" value="&quot;pd - 2013-03-21&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.008" expanded="true" height="60" name="Generate Data by User Specification (3)" width="90" x="45" y="210">
            <list key="attribute_values">
              <parameter key="text" value="&quot;blank&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.008" expanded="true" height="60" name="Generate Data by User Specification (4)" width="90" x="45" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;innovation&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="5.3.008" expanded="true" height="130" name="Append" width="90" x="246" y="120"/>
          <operator activated="true" class="text:generate_extract" compatibility="5.3.001" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="120">
            <parameter key="source_attribute" value="text"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="datePart" value="^[a-z]{2} - (20[0-1][0-9]-[0-9]{2}-[0-9]{2})"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
          <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append" to_port="example set 4"/>
          <connect from_op="Append" from_port="merged set" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards

    Andrew
  • Nils_Woehler
    Nils_Woehler New Altair Community Member
    The index() function in Generate Attributes uses the indexOf() method provided by Java's String class. indexOf() expects a plain string, not a regular expression, so the pattern is searched for literally.

    Here is the documentation from Java 1.7:

    Returns the index within this string of the first occurrence of the specified substring.

    The returned index is the smallest value k for which:

    this.startsWith(str, k)

    If no such value of k exists, then -1 is returned.
    Best,
    Nils
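
To make the point above concrete, here is a minimal Java sketch of the difference between a literal substring search and a regex search on the sample text from the question. The class and method names (IndexVsRegex, literalIndex, regexIndex) are made up for illustration and are not part of RapidMiner; the sketch only shows plain Java behaviour and assumes, as described above, that index() delegates to a literal search.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not RapidMiner code.
class IndexVsRegex {
    // Sample text from the question, with explicit newlines.
    static final String TEXT =
        "Blah blah\nInnovation Export\nad - 2013-03-21\npd - 2011-20-32\nblah, blah\ndone";

    // Literal substring search: the needle is matched character by character.
    static int literalIndex(String text, String needle) {
        return text.indexOf(needle);
    }

    // Regex search: start offset of the first match, or -1 if there is none.
    static int regexIndex(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String regex = "ad.{3}20[0-1][0-9]-[0-9]{2}-[0-9]{2}";
        System.out.println(literalIndex(TEXT, "\nad ")); // 27 (position of the newline)
        System.out.println(literalIndex(TEXT, regex));   // -1 (pattern treated as plain text)
        System.out.println(regexIndex(TEXT, regex));     // 28 (start of "ad - 2013-03-21")
    }
}
```

The pattern itself is fine as a regular expression; it only fails when it is handed to a function that searches for it as literal text.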

  • mikeb
    mikeb New Altair Community Member
    Thanks for the replies, Andrew and Nils.

    Andrew, unfortunately your example process does not seem to work for me if there is additional text in the text attribute besides the date strings you have in your first two examples.  So I'm still having some trouble, but will play around with it some more. 

    Nils, thanks for letting me know about the index() limitation.  It would be nice to add that limitation to the Generate Attributes help documentation.
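
One possible reason a '^'-anchored pattern stops matching once the attribute holds extra text: in Java regular expressions, '^' matches only at the very start of the input unless multiline mode is switched on, for example with the (?m) flag. The sketch below shows that behaviour in plain Java. The class and method names (CaretAnchorDemo, firstDate) are hypothetical, and I have not verified how Generate Extract applies the expression internally, so treat this as a hint rather than a confirmed diagnosis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not RapidMiner code.
class CaretAnchorDemo {
    // The expression from the example process above.
    static final String DATE_LINE = "^[a-z]{2} - (20[0-1][0-9]-[0-9]{2}-[0-9]{2})";

    // Returns the first captured date, or null when nothing matches.
    static String firstDate(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String oneLine = "ad - 2013-03-21";
        String multiLine = "Blah blah\nInnovation Export\nad - 2013-03-21\ndone";

        System.out.println(firstDate(oneLine, DATE_LINE));            // 2013-03-21
        System.out.println(firstDate(multiLine, DATE_LINE));          // null ('^' only matches at offset 0)
        System.out.println(firstDate(multiLine, "(?m)" + DATE_LINE)); // 2013-03-21 ('^' matches at each line start)
    }
}
```

If this is the cause, prefixing the expression with (?m) would be the first thing to try. Note that the digit pattern also accepts impossible dates such as 2011-20-32, so a stricter month and day range may be worth adding.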
  • Hello mikeb

    Could you post some more example data so we can fit some regular expressions to it? I'm not a Regular Expression Ninja, but I'm working on it.

    regards

    Andrew