
XPath with "Cut Document" or "Extract Information" with "?"-result

User: "miner"

Dear RM experts,

 

I'm struggling to extract certain information from the websites I crawled.

My process is as follows:

 

I have a "Crawl web" operator crawling websites in a loop. This process works fine (tested with up to 17 iterations).

The crawled web pages are stored as HTML files (one file for each site).
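A quick way to sanity-check the crawl output outside RM is to list the stored files, for example with a small Python sketch (the directory is a placeholder for the anonymized output path in the process below; a local Python installation is assumed):

# Sketch: list the stored HTML files to confirm the crawler wrote one file per page.
# The directory below is a placeholder for the anonymized output_dir in the process.
from pathlib import Path

site_dir = Path(r"\\server\share\Zollabwicklung\Sites")  # placeholder path
for html_file in sorted(site_dir.glob("*.html")):
    print(html_file.name, html_file.stat().st_size, "bytes")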

 

Now I want to extract one specific piece of information from these pages, for which I have an XPath statement that works fine in a Google spreadsheet but not in RM. I tried the process both with the recommended "Cut Document" operator and with the "Extract Information" operator inside a "Process Documents from Files" process.

I have already searched the forum and tried every variant of the "//h:" prefix and the "assume html" option - knowing that the XPath syntax in RM is slightly different - but with no success.

Is anybody out there with a solution for this issue?

Here is my current process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <process expanded="true">
      <operator activated="false" class="concurrency:loop" compatibility="7.5.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
        <parameter key="number_of_iterations" value="2"/>
        <parameter key="reuse_results" value="true"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="34">
            <parameter key="url" value="https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&amp;page=%{iteration}#ms-jobs-result-list"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+standard.+"/>
              <parameter key="follow_link_with_matching_url" value=".+standard.*"/>
            </list>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="write_pages_to_disk" value="true"/>
            <parameter key="output_dir" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
            <parameter key="output_file_extension" value="%{iteration}.html"/>
            <parameter key="max_pages" value="20"/>
            <parameter key="max_page_size" value="100"/>
            <parameter key="delay" value="1000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"/>
          </operator>
          <connect from_op="Crawl Web" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="246" y="136">
        <list key="text_directories">
          <parameter key="all" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="content_type" value="html"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Branche" value="//*[@id=&quot;ms-maincontent&quot;]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
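For comparison, the "Branche" XPath from the process above can also be evaluated offline against one of the stored HTML files, for example with Python and lxml (a sketch; lxml is assumed to be installed and the file path is a placeholder). lxml's HTML parser does not use the XHTML namespace, so this only tests whether the expression matches the saved markup at all, independent of RM's "assume html" / "//h:" handling:

# Sketch: run the "Branche" XPath against one crawled file outside RM.
# Assumes lxml is installed (pip install lxml); the file path is a placeholder.
from lxml import html

doc = html.parse(r"\\server\share\Zollabwicklung\Sites\page1.html")  # placeholder path
branche = doc.xpath(
    "//*[@id='ms-maincontent']/div[1]/div[1]/div/div"
    "//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"
)
print(branche)  # an empty list means the XPath itself does not match the stored HTML

If this returns the expected text, the XPath itself is fine and the "?" result in RM most likely comes from the namespace handling (the h: prefix together with "assume html") rather than from the query.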

Thanks for your support.
