Using "Cut Document" Operator neglects numbers and punctuation in HTML text

Limegreenman900 New Altair Community Member
edited November 5 in Community Q&A
Hi everyone,

I am currently using the "Cut Document" operator with query type "Regular Region" to extract specific text from locally stored HTML files.
This works well so far, but all numbers in the text seem to be dropped.

For example, the original text:
<td style=" width:100.00%; text-align:justify; " class="ta_10"><span class="ta_10">Companies Act 2006. Our audit work has been undertaken so that we might state to the company's members those</span></td>
<td style=" width:100.00%; text-align:justify; " class="ta_10"><span class="ta_10">concerning the cost of the fixed asset investment, stated at £51,925 in note 6  to the financial statements.</span></td>

Text after extraction:
Companies Act Our audit work has been undertaken so that we might state to the company s members those
concerning the cost of the fixed asset investment stated at  in note to the financial statements

Punctuation characters such as , and . are dropped as well. Does anyone know whether there is a setting that keeps both punctuation characters and numbers?

My code right now looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="112" y="30">
       <parameter key="file" value="C:\Users\Independent Auditors Report\Prod224_0010_00178176_20131231.html"/>
       <parameter key="extract_text_only" value="false"/>
     </operator>
     <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
       <parameter key="query_type" value="Regular Region"/>
       <list key="string_machting_queries"/>
       <list key="regular_expression_queries"/>
       <list key="regular_region_queries">
         <parameter key="Independent Report" value="(?i)(&gt;[^&gt;]+Independent Auditors(')? to[^&lt;]+&lt;).name=&quot;[^&quot;]+NameSeniorStatutoryAuditor&quot;"/>
       </list>
       <list key="xpath_queries"/>
       <list key="namespaces"/>
       <list key="index_queries"/>
       <process expanded="true">
         <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content (2)" width="90" x="112" y="30">
           <parameter key="minimum_text_block_length" value="3"/>
         </operator>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30"/>
         <operator activated="true" class="text:extract_token_number" compatibility="5.3.002" expanded="true" height="60" name="Extract Token Number" width="90" x="514" y="30"/>
         <connect from_port="segment" to_op="Extract Content (2)" to_port="document"/>
         <connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
         <connect from_op="Tokenize (2)" from_port="document" to_op="Extract Token Number" to_port="document"/>
         <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
         <portSpacing port="source_segment" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="120">
       <list key="text_directories">
         <parameter key="test" value="C:\Users\ndependent Auditors Report\Teil 1"/>
       </list>
       <process expanded="true">
         <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
         <connect from_port="document" to_op="Extract Content" to_port="document"/>
         <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
     <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Answers

  • Limegreenman900 New Altair Community Member
    OK, it looks like this was caused by the "Tokenize" operator I used inside "Cut Document". If I run my process without it, I get plain text with punctuation and numbers.

    If I use "linguistic tokens - english" as the setting in the "Tokenize" operator, it works perfectly.
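
For anyone hitting the same issue, here is a minimal sketch of what that setting might look like in the process XML. It reuses the "Tokenize (2)" operator from the question. The parameter keys `mode` and `language` and their values are assumptions based on how the Text Processing extension usually names them, so check them against the operator's parameter panel in your version. As far as I know, the default mode splits on every non-letter character, which would explain why "2006", "£51,925", the apostrophe and the commas disappear in the extracted text above.

```xml
<!-- Sketch only: replaces the parameterless "Tokenize (2)" operator inside "Cut Document". -->
<!-- Assumption: the keys "mode" and "language" and their values may differ in your version. -->
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30">
  <!-- Tokenize linguistically instead of splitting on non-letter characters -->
  <parameter key="mode" value="linguistic tokens"/>
  <parameter key="language" value="English"/>
</operator>
```

The same change would presumably be needed for the "Tokenize" operator inside "Process Documents from Files" if numbers and punctuation should be kept there too.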