XPath with "Cut Document" or "Extract Information" with "?"-result
Dear RM-experts,
I'm struggling to extract certain information from websites I have crawled.
My process is as follows:
I have a "Crawl web" operator crawling websites in a loop. This process works fine (tested with up to 17 iterations).
The crawled web pages are stored as html-files (one file for each site).
Now I want to extract one specific piece of information from these websites. I have an XPath statement for it that works fine in Google Sheets, but not in RM. I tried the process both with the recommended "Cut Document" operator and with the "Extract Information" operator inside a "Process Documents from Files" process.
I already searched the forum and tried every variant of "//h:" and "assume html" I could find, knowing that the XPath syntax in RM is slightly different, but with no success.
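To rule out the query itself, here is a minimal sketch for testing the namespace-qualified form outside RapidMiner. It assumes that RM converts the crawled HTML to XHTML before applying the XPath, so every element lands in the XHTML namespace and needs the "h:" prefix; the function name and the sample document structure in the test are hypothetical and only mirror the path from my original query.

```python
# Sketch: run the namespace-qualified XPath with lxml, the way RM is
# assumed to evaluate it after its HTML-to-XHTML conversion.
from lxml import etree

# The "h" prefix must be bound to the XHTML namespace, matching what
# RM's "namespaces" parameter list is expected to contain.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Namespace-qualified version of the original query (hypothetical:
# every element step gets an "h:" prefix; "*" needs none).
QUERY = (
    '//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div'
    "//h:h4[contains(text(),'Arbeitgeber')]"
    "/following-sibling::h:p[2]/text()"
)

def extract_branche(xhtml_bytes: bytes) -> list:
    """Apply the namespace-aware query to an XHTML document."""
    root = etree.fromstring(xhtml_bytes)
    return root.xpath(QUERY, namespaces=XHTML_NS)
```

If this returns the expected text when run against one of the stored .html files (tidied to XHTML), the query itself is fine and the problem would be the missing `h` prefix registration on the RM side.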
Does anybody out there have a solution for this issue?
Here is my current process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<parameter key="logverbosity" value="all"/>
<process expanded="true">
<operator activated="false" class="concurrency:loop" compatibility="7.5.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
<parameter key="number_of_iterations" value="2"/>
<parameter key="reuse_results" value="true"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="34">
<parameter key="url" value="https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&amp;page=%{iteration}#ms-jobs-result-list"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+standard.+"/>
<parameter key="follow_link_with_matching_url" value=".+standard.*"/>
</list>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
<parameter key="output_file_extension" value="%{iteration}.html"/>
<parameter key="max_pages" value="20"/>
<parameter key="max_page_size" value="100"/>
<parameter key="delay" value="1000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"/>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="246" y="136">
<list key="text_directories">
<parameter key="all" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="use_file_extension_as_type" value="false"/>
<parameter key="content_type" value="html"/>
<parameter key="encoding" value="UTF-8"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Branche" value="//*[@id=&quot;ms-maincontent&quot;]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"/>
</list>
<list key="namespaces"/>
<parameter key="assume_html" value="false"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Thanks for your support.