[Solved]Syntax Xpath
I can't find the right syntax for Xpath tot extract data.
Right now I'm experimenting in google docs to find the richt syntax. I'm trying to pull the review text from the following url: http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS
With this syntax I get one specific review: //*[@id="review_155685828"]/text()
I want to extract all re reviews on that page, but I can't find the right syntax. Does anabody knows what synatax I have to use to retreive all the review text from that page?
Next step is to use the Xpath in rapidminer.
Thanxs, Arno
Right now I'm experimenting in google docs to find the richt syntax. I'm trying to pull the review text from the following url: http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS
With this syntax I get one specific review: //*[@id="review_155685828"]/text()
I want to extract all re reviews on that page, but I can't find the right syntax. Does anabody knows what synatax I have to use to retreive all the review text from that page?
Next step is to use the Xpath in rapidminer.
Thanxs, Arno
Find more posts tagged with
Sort by:
1 - 6 of
61
There is a X-Path function called "starts-with" with which you can get all paragraphs which start with 'review_'.
@id="REVIEWS"]//h
[starts-with(@id, "review_")]/text()"
P.S.: Do not forget that you have to use the 'h' namespace in RapidMiner, i.e. "//h:div[
Hi Marcin,
This is what I was looking for but couldn't figure out myself. So thank you very much. I tried to use it in Rapidminer but i don;'t get results. Do you know what I'm doing wrong?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
<list key="text_directories">
<parameter key="All" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
</list>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="id="REVIEWS"" value="//h:div[@id=&quot;REVIEWS"]//h:p[starts-with(@id, "review_")]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
P.S I added h:in Rapidminer
Thanks, Arno
This is what I was looking for but couldn't figure out myself. So thank you very much. I tried to use it in Rapidminer but i don;'t get results. Do you know what I'm doing wrong?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
<list key="text_directories">
<parameter key="All" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
</list>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="id="REVIEWS"" value="//h:div[@id=&quot;REVIEWS"]//h:p[starts-with(@id, "review_")]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
P.S I added h:in Rapidminer
Thanks, Arno
The problem is that you have one document and the "Extract Information" operator will take the first match it finds for every document. So you have to cut the document by your XPath (without text()) with the "Cut Document" operator. This will create several document for which you can extract the content separately. See the process below.
BTW: With the "Web Mining" extension you can crawl and process the HTML within a single RapidMiner process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
<list key="text_directories">
<parameter key="All" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" compatibility="5.3.000" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="review" value="//h:div[@id=&quot;REVIEWS"]//h:p[starts-with(@id, "review_")]"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="review" value="//h:p[starts-with(@id, "review_")]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="segment" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.3.009" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="review"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.3.009" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
<parameter key="attribute_name" value="review"/>
<list key="set_additional_roles"/>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>