Text Processing with Document type (how to use modified output?)

colo New Altair Community Member
edited November 5 in Community Q&A
Hello everybody,

I want to do some text processing with the Document type. In a simple example I use "Read Document" to load a previously crawled and stored web page (an HTML file). The content is then to be filtered and inspected with regular expressions. To start with, I added only the "Keep document parts" operator to discard everything except the <body>...</body> part. The Document output shows the desired modified content in the upper window. This is the part I need for further text processing, but some operators seem to always work on the original document. For example, a subsequent "Extract information" with the regex "<head>" still finds a match, even though that part should have been discarded. Conversely, content that only becomes available through filtering and transformation (left out of the simple example above) is never found. "Write Document" also writes out the original text, ignoring all changes made to the Document in my operator chain.

This results in my simple but important question: how to work with the modified document?

Thanks in advance!

Answers

  • haddock New Altair Community Member

    We meet again! Sailing similar waters I suspect..
    "how to work with the modified document?"
    It comes down to whether Rapido is passing its normal Data input/output, or Documents, which are a special type of I/O object. Sometimes you have an example set which needs handling by document handlers, in which case you use the 'Process Documents from Data' operator, and so on. Here's a Beeb news title grabber (not swift):
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="339" width="795">
          <operator activated="true" class="web:read_rss" expanded="true" height="60" name="Read RSS Feed" width="90" x="112" y="210">
            <parameter key="url" value="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml"/>
            <parameter key="random_user_agent" value="true"/>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" expanded="true" height="60" name="Get Pages" width="90" x="313" y="120">
            <parameter key="link_attribute" value="Link"/>
            <parameter key="page_attribute" value="text"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="488" y="119">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true" height="353" width="808">
              <operator activated="true" class="text:extract_length" expanded="true" height="60" name="Extract Length" width="90" x="188" y="47"/>
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="386" y="45">
                <parameter key="query_type" value="Regular Region"/>
                <list key="string_machting_queries">
                  <parameter key="Title" value="&lt;Body&gt;.&lt;/Body&gt;"/>
                </list>
                <list key="regular_expression_queries">
                  <parameter key="Title" value="&lt;%title%[^&gt;]*&gt;(.*?)&lt;/%title%&gt;"/>
                </list>
                <list key="regular_region_queries">
                  <parameter key="Title" value="&lt;title&gt;.&lt;/title&gt;"/>
                </list>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Length" to_port="document"/>
              <connect from_op="Extract Length" from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="623" y="115">
            <parameter key="name" value="text"/>
          </operator>
          <connect from_op="Read RSS Feed" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    PS Probably best to post code as above when it gets down to the detail.
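
    For anyone who wants to see the idea outside Rapido, here is a rough Python sketch of what the process above is after: take the text between a start pattern and an end pattern (as the "Regular Region" query with <title> and </title> does), and run later checks on the filtered text rather than on the original page. The helper names extract_region and keep_body and the sample page are made up for illustration. Whether the real operators include or drop the delimiters is not something this sketch tries to reproduce.

    ```python
    import re


    def extract_region(text, start_pattern, end_pattern):
        """Return the text between the first start match and the next end match, or None."""
        start = re.search(start_pattern, text, flags=re.IGNORECASE)
        if start is None:
            return None
        # Look for the end pattern only after the start match.
        end = re.compile(end_pattern, re.IGNORECASE).search(text, start.end())
        if end is None:
            return None
        return text[start.end():end.start()]


    def keep_body(html):
        """Hypothetical stand-in for a 'keep only the body' filter step."""
        body = extract_region(html, r"<body[^>]*>", r"</body>")
        # Fall back to the unmodified input if there is no body element.
        return body if body is not None else html


    # Made-up sample page, not taken from the feed used above.
    page = "<html><head><title>World news</title></head><body><p>Hello</p></body></html>"

    title = extract_region(page, r"<title>", r"</title>")
    body_only = keep_body(page)

    # Later steps must receive body_only, not page, to see the modification.
    head_found = re.search(r"<head>", body_only) is not None
    ```

    The point is the last line: the search runs on the filtered result, so "<head>" is no longer found. In the process above, the same thing is achieved by wiring each operator's document output port into the next operator's document input port.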