Extract Information operator does... not extract the information!

User: "lionelderkrikor"
New Altair Community Member
Updated by Jocelyn
Hi all,

I'd like to report some odd behaviour of the Extract Information operator of the Text Processing extension.
I have a very simple dataset (Excel file attached):


From the last two lines, I want to extract:
 - the action following the word "perform", i.e. "BLUETOOTH_CONTROL_S4" and "BLUETOOTH_SOURCE_S4", into a new attribute called "action"
 - the number in the brackets, i.e. "2", into a new attribute called "number".

As a result, I obtain the following dataset (the "number" attribute is empty):


Note that the regex in my Extract Information parameters is correct:

So I have no problem with my first attribute, "action": that information is extracted correctly.
I also ran a test with the Generate Extract operator (see the 2nd screenshot), and with this operator the number "2" is correctly extracted...

How can this behaviour be explained?

Regards,

Lionel

NB: the process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Read Excel" width="90" x="45" y="187">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Log_extraction\log_extraction.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Att1.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="187">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Att1.contains.Event"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="split" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Split" width="90" x="313" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Att1"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="split_pattern" value="Event:"/>
        <parameter key="split_mode" value="ordered_split"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="187">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" breakpoints="after" class="text:generate_extract" compatibility="8.1.000" expanded="true" height="68" name="Generate Extract" width="90" x="581" y="187">
        <parameter key="source_attribute" value="Att1_2"/>
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <parameter key="attribute_type" value="Nominal"/>
        <list key="regular_expression_queries">
          <parameter key="number_generate_extract" value="\((.*?)\)"/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <parameter key="ignore_CDATA" value="true"/>
        <parameter key="assume_html" value="true"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="187">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="false" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="246" y="238">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Numerical"/>
            <list key="regular_expression_queries">
              <parameter key="number" value="\((.*?)\)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="179" y="34">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries">
              <parameter key="number" value="\((.*?)\)"/>
              <parameter key="action" value="(?&lt;=perform)(.*)(?==&gt;)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Extract" to_port="Example Set"/>
      <connect from_op="Generate Extract" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

    User: "Maerkli"
    New Altair Community Member
    If Lionel is asking for help, it must be damned complicated.
    Good luck, Lionel!
    Maerkli
    User: "Maerkli"
    New Altair Community Member
    Hello Lionel,
    If I understand correctly, the query expression \((.*?)\) should produce 2 as its result. I replaced it with \((.*?)\)S or \((.*?)\)W, and I get ? under "number" in the table. So the regex itself is not the issue. If you find the cause, could you share it with us?
    Thanks,
    Maerkli






    User: "gmeier"
    New Altair Community Member
    Accepted Answer
    Hi Lionel,
    the problem seems to be that the whole line
    controller -- Device::UIProfiler::_handleEvent() for _UI-PROFILER_:   1262488101.460: perform  BLUETOOTH_CONTROL_S4 => connectDevice: xx:xx:xx:xx:xx:xx (2)
    is passed to the regex, and the first match is the empty () from _handleEvent(). If you add a Replace operator before Process Documents with replace what = \(\) and replace by = _, then the 2 is extracted. With the Generate Extract operator, the _handleEvent() part is not present because the attribute Att1_2 is specified as the source.
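
The first-match effect described above can be reproduced outside RapidMiner. The following is a minimal sketch using Python's `re` module as a stand-in for the operator's regex engine (an assumption; it is not the operator's actual implementation). The helper `first_group` and the digits-only pattern `DIGITS` are hypothetical names introduced here for illustration and do not appear in the original process.

```python
import re
from typing import Optional

# Log line exactly as quoted in the answer above.
LINE = (
    "controller -- Device::UIProfiler::_handleEvent() for _UI-PROFILER_:   "
    "1262488101.460: perform  BLUETOOTH_CONTROL_S4 => connectDevice: "
    "xx:xx:xx:xx:xx:xx (2)"
)

LAZY = r"\((.*?)\)"    # pattern used in the original process
DIGITS = r"\((\d+)\)"  # hypothetical alternative: require digits inside the brackets


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# 1. On the whole line, the first match is the empty "()" of _handleEvent(),
#    so the captured group is an empty string.
whole_line = first_group(LAZY, LINE)

# 2. The suggested workaround: replace the literal "()" by "_" first,
#    so that "(2)" becomes the first bracketed group.
cleaned = re.sub(r"\(\)", "_", LINE)
after_replace = first_group(LAZY, cleaned)

# 3. Alternative sketch: a pattern that cannot match empty brackets.
digits_only = first_group(DIGITS, LINE)

print(repr(whole_line), repr(after_replace), repr(digits_only))
# prints: '' '2' '2'
```

If the digits-only pattern behaves the same way in the Extract Information operator, it would avoid the extra Replace step, but that has not been verified in this thread.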


    User: "lionelderkrikor"
    New Altair Community Member
    OP
    Hi @gmeier,

    OK, that makes total sense! I understand.
    Thank you for your explanation.

    Regards,

    Lionel