Removing HTTP Headers

User: "hawle087"
Updated by Jocelyn
I'm trying to do some text analytics on a set of pre-downloaded HTML files, but unfortunately they also include the HTTP headers (e.g. Content-type: text/html). I've tried using Remove Document Parts with regular expressions to strip out the headers before passing the document to Extract Content, but for some reason the Extract Content operator ignores the removals.

To test this, I set up a simple process that takes as input a text file containing the words "one two three". Remove Document Parts removes the word "one" (checked via breakpoint), but the final output still includes it. Can anyone help me understand why Extract Content is ignoring the prior removal, or suggest workarounds or alternative methods of removing HTTP headers from files? The test process is below.

Thanks.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <process expanded="true" height="460" width="899">
     <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
       <list key="text_directories">
         <parameter key="test" value="C:\Users\XXX\test_files"/>
       </list>
       <process expanded="true" height="460" width="899">
         <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="45" y="30">
           <parameter key="deletion_regex" value="one"/>
         </operator>
         <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
           <parameter key="minimum_text_block_length" value="3"/>
         </operator>
         <connect from_port="document" to_op="RM One" to_port="document"/>
         <connect from_op="RM One" from_port="document" to_op="Extract Content" to_port="document"/>
         <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
     <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>
Update:

As a workaround, I used Replace Tokens after the Extract Content operator instead of Remove Document Parts before it. This is less than ideal for pattern matching.
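
Another option would be to strip the headers outside RapidMiner, before the files are loaded. The following is only a minimal Python sketch of that idea, not something tested against the actual files. It assumes each file starts with an optional status line and `Name: value` header lines, followed by a blank line and then the HTML body. The function name `strip_http_headers` and both regular expressions are hypothetical, and folded (multi-line) headers are not handled.

```python
import re

# Assumed shapes of the leading lines; adjust if the real files differ.
STATUS_LINE = re.compile(r"^HTTP/\d(\.\d)?\s+\d{3}")
HEADER_LINE = re.compile(r"^[A-Za-z0-9-]+:\s?.*$")


def strip_http_headers(raw: str) -> str:
    """Return the text after the header block, or raw unchanged if none is found."""
    # Normalise CRLF so the header/body separator is always a blank line ("\n\n").
    text = raw.replace("\r\n", "\n")
    head, sep, body = text.partition("\n\n")
    if not sep:
        # No blank line at all, so there is no header block to remove.
        return raw
    # Only strip when every line before the blank line looks like a header.
    if all(STATUS_LINE.match(line) or HEADER_LINE.match(line)
           for line in head.split("\n")):
        return body
    return raw
```

The cleaned text could then be written to a separate folder and that folder used as the input directory, so Extract Content only ever sees the HTML. Files that do not begin with a header block are returned unchanged.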