
XPath with "Cut Document" or "Extract Information" with "?"-result

User: "miner"

Dear RM experts,

 

I'm struggling to extract certain information from the websites I crawled.

My process is as follows:

 

I have a "Crawl web" operator crawling websites in a loop. This process works fine (tested with up to 17 iterations).

The crawled web pages are stored as HTML files (one file for each site).
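A quick way to sanity-check the crawl output outside RM is to list the stored files, for example with a small Python sketch (the directory is a placeholder for the anonymized output path in the process below; a local Python installation is assumed):

# Sketch: list the stored HTML files to confirm the crawler wrote one file per page.
# The directory below is a placeholder for the anonymized output_dir in the process.
from pathlib import Path

site_dir = Path(r"\\server\share\Zollabwicklung\Sites")  # placeholder path
for html_file in sorted(site_dir.glob("*.html")):
    print(html_file.name, html_file.stat().st_size, "bytes")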

 

Now I want to extract one specific piece of information from these pages, for which I have an XPath statement that works fine in a Google spreadsheet but not in RM. I tried the process both with the recommended "Cut Document" operator and with the "Extract Information" operator inside a "Process Documents from Files" process.

I have already searched the forum and tried every variant of the "//h:" prefix and the "assume html" option - knowing that the XPath syntax in RM is slightly different - but with no success.

Is anybody out there with a solution for this issue?

Here is my current process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <process expanded="true">
      <operator activated="false" class="concurrency:loop" compatibility="7.5.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
        <parameter key="number_of_iterations" value="2"/>
        <parameter key="reuse_results" value="true"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="34">
            <parameter key="url" value="https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&amp;page=%{iteration}#ms-jobs-result-list"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+standard.+"/>
              <parameter key="follow_link_with_matching_url" value=".+standard.*"/>
            </list>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="write_pages_to_disk" value="true"/>
            <parameter key="output_dir" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
            <parameter key="output_file_extension" value="%{iteration}.html"/>
            <parameter key="max_pages" value="20"/>
            <parameter key="max_page_size" value="100"/>
            <parameter key="delay" value="1000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"/>
          </operator>
          <connect from_op="Crawl Web" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="246" y="136">
        <list key="text_directories">
          <parameter key="all" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="content_type" value="html"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Branche" value="//*[@id=&quot;ms-maincontent&quot;]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
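For comparison, the "Branche" XPath from the process above can also be evaluated offline against one of the stored HTML files, for example with Python and lxml (a sketch; lxml is assumed to be installed and the file path is a placeholder). lxml's HTML parser does not use the XHTML namespace, so this only tests whether the expression matches the saved markup at all, independent of RM's "assume html" / "//h:" handling:

# Sketch: run the "Branche" XPath against one crawled file outside RM.
# Assumes lxml is installed (pip install lxml); the file path is a placeholder.
from lxml import html

doc = html.parse(r"\\server\share\Zollabwicklung\Sites\page1.html")  # placeholder path
branche = doc.xpath(
    "//*[@id='ms-maincontent']/div[1]/div[1]/div/div"
    "//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"
)
print(branche)  # an empty list means the XPath itself does not match the stored HTML

If this returns the expected text, the XPath itself is fine and the "?" result in RM most likely comes from the namespace handling (the h: prefix together with "assume html") rather than from the query.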

Thanks for your support.
