[Solved] Syntax XPath

I can't find the right XPath syntax to extract data.

Right now I'm experimenting in Google Docs to find the right syntax. I'm trying to pull the review text from the following URL: http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS

With this syntax I get one specific review: //*[@id="review_155685828"]/text()

I want to extract all the reviews on that page, but I can't find the right syntax. Does anybody know what syntax I have to use to retrieve all the review text from that page?

The next step is to use the XPath in RapidMiner.

Thanks, Arno

Skirzynski

There is an XPath function called "starts-with" with which you can get all paragraphs whose id starts with 'review_'.


//div[@id="REVIEWS"]//p[starts-with(@id, "review_")]/text()

P.S.: Do not forget that you have to use the 'h' namespace in RapidMiner, i.e. "//h:div[@id="REVIEWS"]//h:p[starts-with(@id, "review_")]/text()"
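
For anyone who wants to check the idea outside RapidMiner, here is a minimal Python sketch. The sample page is made up and heavily simplified, and I am assuming the 'h' prefix stands for the XHTML namespace. Python's built-in ElementTree only supports a subset of XPath and has no starts-with(), so the prefix test on the id is done in Python instead; the selection is meant to be equivalent, not identical to the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the review page (XHTML namespace).
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="REVIEWS">
      <p id="review_155685828">Small but clean room.</p>
      <p id="review_155000001">Friendly staff, noisy street.</p>
      <p id="intro">Not a review.</p>
    </div>
    <p id="review_outside">Outside the REVIEWS container.</p>
  </body>
</html>"""

# Assumption: the 'h' prefix is bound to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}


def review_texts(xml_source):
    """Return the text of every <p> inside div#REVIEWS whose id starts with 'review_'."""
    root = ET.fromstring(xml_source)
    # ElementTree has no starts-with(), so only the structural part is done via the path.
    paragraphs = root.findall(".//h:div[@id='REVIEWS']//h:p", NS)
    return [p.text for p in paragraphs if p.get("id", "").startswith("review_")]


reviews = review_texts(PAGE)
print(reviews)
```

Without the namespace mapping the path would match nothing here, because every element of the sample page lives in the XHTML namespace.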

ArnoG

Hi Marcin,
This is what I was looking for but couldn't figure out myself, so thank you very much. I tried to use it in RapidMiner but I don't get any results. Do you know what I'm doing wrong?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
<list key="text_directories">
<parameter key="All" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
</list>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="id=&quot;REVIEWS&quot;" value="//h:div[@id=&quot;REVIEWS&quot;]//h:p[starts-with(@id, &quot;review_&quot;)]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

P.S. I added h: in RapidMiner.

Thanks, Arno

Skirzynski

Deactivate the "extract_text_only" parameter in the "Process Documents from Files" operator. Otherwise all HTML tags will be removed before extraction, which means that XPath is worthless.
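
To illustrate why, here is a small Python sketch with a made-up fragment. The text-only step is only a rough imitation of what the parameter does, not RapidMiner's actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the crawled page.
PAGE = '<div id="REVIEWS"><p id="review_1">Great stay.</p><p id="review_2">Too loud.</p></div>'

root = ET.fromstring(PAGE)

# With the markup intact, the paragraphs can be addressed by element name.
with_markup = [p.text for p in root.findall("./p")]

# Rough imitation of text-only extraction: keep the character data, drop the tags.
text_only = "".join(root.itertext())

# The flattened string has no elements left, so it cannot even be parsed as XML.
try:
    ET.fromstring(text_only)
    parses_as_xml = True
except ET.ParseError:
    parses_as_xml = False

print(with_markup)
print(repr(text_only))
print(parses_as_xml)
```

Once the tags are gone, the review boundaries are gone too, so there is nothing left for a path expression to select.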

ArnoG

Thanks Marcin,
Much better. The only thing is that with the XPath in RapidMiner I get 1 review, while the same syntax in Google Docs gives me all 6 reviews. Do you know how that is possible?

Thanks, Arno

Skirzynski

The problem is that you have one document, and the "Extract Information" operator takes only the first match it finds in each document. So you have to cut the document by your XPath (without text()) with the "Cut Document" operator. This will create several documents, one per review, from which you can extract the content separately. See the process below.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="All" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" compatibility="5.3.000" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="review" value="//h:div[@id=&quot;REVIEWS&quot;]//h:p[starts-with(@id, &quot;review_&quot;)]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="review" value="//h:p[starts-with(@id, &quot;review_&quot;)]/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.3.009" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="review"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.009" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
        <parameter key="attribute_name" value="review"/>
        <list key="set_additional_roles"/>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
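
The cut-then-extract idea behind the process above can be sketched in a few lines of Python. This only mimics the behaviour described (first match per document, then one document per review); the function names and sample fragment are made up, and namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with three reviews.
PAGE = """<div id="REVIEWS">
  <p id="review_1">Great stay.</p>
  <p id="review_2">Too loud.</p>
  <p id="review_3">Good value.</p>
</div>"""


def is_review(p):
    """True if the element's id starts with 'review_'."""
    return p.get("id", "").startswith("review_")


def extract_first(document):
    """Mimic 'one value per document': return only the first matching text."""
    root = ET.fromstring(document)
    for p in root.iter("p"):
        if is_review(p):
            return p.text
    return None


def cut_document(document):
    """Split one document into one small document per matching element."""
    root = ET.fromstring(document)
    return [ET.tostring(p, encoding="unicode") for p in root.iter("p") if is_review(p)]


single = extract_first(PAGE)                      # whole page: first review only
segments = cut_document(PAGE)                     # one segment per review
per_segment = [extract_first(s) for s in segments]

print(single)
print(per_segment)
```

Extracting from the whole page yields a single value, while extracting from each segment yields one value per review, which is the same reason the process needs "Cut Document" around "Extract Information".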

BTW: With the "Web Mining" extension you can crawl and process the HTML within a single RapidMiner process.

ArnoG

Hi Marcin, thanks a lot. That works fine!