Extract Information operator does ... not extract the information!

lionelderkrikor
lionelderkrikor New Altair Community Member
edited November 2024 in Community Q&A
Hi all,

I'd like to report a weird behaviour of the Extract Information operator of the Text Processing extension:
I have a very simple dataset (Excel file attached):


From the last two lines, I want to extract:
 - the action following the word "perform", i.e. "BLUETOOTH_CONTROL_S4" and "BLUETOOTH_SOURCE_S4", into a new attribute called "action"
 - the number in the brackets, i.e. "2", into a new attribute called "number".

As a result, I obtain the following dataset (the "number" attribute is empty):


Note that my regex in the Extract Information parameters is correct:

In conclusion, I have no problem with my first attribute "action"; that information is correctly extracted.
I should add that I ran a test with the Generate Extract operator (see the 2nd screenshot), and with this operator the number "2" is correctly extracted...

How can this behaviour be explained?

Regards,

Lionel

NB: the process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Read Excel" width="90" x="45" y="187">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Log_extraction\log_extraction.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Att1.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="187">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Att1.contains.Event"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="split" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Split" width="90" x="313" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Att1"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="split_pattern" value="Event:"/>
        <parameter key="split_mode" value="ordered_split"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="187">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" breakpoints="after" class="text:generate_extract" compatibility="8.1.000" expanded="true" height="68" name="Generate Extract" width="90" x="581" y="187">
        <parameter key="source_attribute" value="Att1_2"/>
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <parameter key="attribute_type" value="Nominal"/>
        <list key="regular_expression_queries">
          <parameter key="number_generate_extract" value="\((.*?)\)"/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <parameter key="ignore_CDATA" value="true"/>
        <parameter key="assume_html" value="true"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="187">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="false" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="246" y="238">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Numerical"/>
            <list key="regular_expression_queries">
              <parameter key="number" value="\((.*?)\)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="179" y="34">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries">
              <parameter key="number" value="\((.*?)\)"/>
              <parameter key="action" value="(?&lt;=perform)(.*)(?==&gt;)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Extract" to_port="Example Set"/>
      <connect from_op="Generate Extract" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
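
For readers who want to try the "action" query outside RapidMiner: the XML-escaped value in the process above decodes to `(?<=perform)(.*)(?==>)`. The sketch below runs that pattern with Python's `re` module on the tail of the sample log line quoted in the best answer further down. It assumes Python's regex engine behaves like the one used by the operator for this pattern; the variable names are illustrative only.

```python
import re

# Tail of the sample log line quoted in the best answer of this thread.
fragment = "perform  BLUETOOTH_CONTROL_S4 => connectDevice: xx:xx:xx:xx:xx:xx (2)"

# The "action" query from the process, with &lt; and &gt; decoded.
ACTION = re.compile(r"(?<=perform)(.*)(?==>)")

match = ACTION.search(fragment)

# Group 1 is everything between "perform" and "=>", including the
# surrounding spaces, because the lookarounds consume no characters.
raw_action = match.group(1)

# Trimming the whitespace yields the bare action name.
action = raw_action.strip()

print(repr(raw_action))
print(repr(action))
```

Since `(.*)` is greedy, a line containing several `=>` would capture up to the last one; a lazy `(.*?)` would stop at the first.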



 


Best Answer

  • gmeier
    gmeier New Altair Community Member
    Answer ✓
    Hi Lionel,
    the problem seems to be that the regex is applied to the whole line
    controller -- Device::UIProfiler::_handleEvent() for _UI-PROFILER_:   1262488101.460: perform  BLUETOOTH_CONTROL_S4 => connectDevice: xx:xx:xx:xx:xx:xx (2)
    and the first match is the empty () from _handleEvent(), so the captured group is empty. If you add a Replace operator with replace what = \(\)  and replace by = _  before Process Documents, then the 2 is extracted. With the Generate Extract operator the _handleEvent() is not present, because only the attribute Att1_2 is specified as the source.
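
The first-match behaviour described above can be reproduced outside RapidMiner. This is a minimal sketch using Python's `re` module, assuming it treats these simple patterns the same way as the operator's regex engine. The helper `first_group` and the stricter pattern `\((\d+)\)` are illustrative additions, not something proposed in the thread.

```python
import re

# Sample line quoted in the answer above.
line = ("controller -- Device::UIProfiler::_handleEvent() for _UI-PROFILER_:   "
        "1262488101.460: perform  BLUETOOTH_CONTROL_S4 => connectDevice: "
        "xx:xx:xx:xx:xx:xx (2)")

# The "number" query from the Extract Information operator.
NUMBER = re.compile(r"\((.*?)\)")


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


# 1) Whole line: the first match is the empty "()" of _handleEvent(),
#    so the captured group is an empty string.
whole_line = first_group(NUMBER, line)

# 2) Workaround from the answer: replace "()" by "_" first
#    (mimics a Replace operator placed before Process Documents).
cleaned = re.sub(r"\(\)", "_", line)
after_replace = first_group(NUMBER, cleaned)

# 3) Hypothetical alternative: require at least one digit inside the
#    brackets, so the empty "()" can never match.
strict = first_group(re.compile(r"\((\d+)\)"), line)

print(repr(whole_line), repr(after_replace), repr(strict))
```

The third variant avoids the extra Replace step, but it only fits if the bracketed value is always numeric.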


Answers

  • Maerkli
    Maerkli New Altair Community Member
    If Lionel is asking for help, it must be damned complicated.
    Good luck, Lionel!
    Maerkli
  • Maerkli
    Maerkli New Altair Community Member
    Hello Lionel,
    If I understand correctly, the query expression \((.*?)\) should return 2. I replaced it with \((.*?)\)S or \((.*?)\)W and I get ? under number in the table, so the regex does not seem to be the issue. If you find the cause, could you share it with us?
    Thanks,
    Maerkli







  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @gmeier,

    OK, that's totally logical! I understand.
    Thank you for your explanation.

    Regards,

    Lionel

