[Solved] XPath queries are empty
Legacy User
New Altair Community Member
Hi there, I am trying to extract text from http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html using the Get Page and Process Documents operators, with Extract Information as the subprocess.
However, the query result is empty no matter what I try. Does anyone have an idea?
Here is the process code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
<parameter key="url" value="http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="380" y="30">
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="xpath1" value="//div[@class='postBody']"/>
<parameter key="xpath2" value="//div[@class='postBody']/text()"/>
<parameter key="xpath3" value="//div[@class='postBody']/p[not(*)][text()]"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Thank you very much in advance. ;D
Answers
-
Does anyone have an idea, please? I have the feeling I am very close to the solution, but I am missing something.
My problem seems to be quite similar to the one discussed here: http://rapid-i.com/rapidforum/index.php/topic,7753.0.html but I just can't get it working for me.
-
Thank you very much for your reply.
It seems like I am getting closer to my goal.
Now I think the only remaining problem is that my XPath query is not completely correct.
With the query //h:div[@class='postBody'][not(contains(.,'http://www.'))] I get the output shown below.
This is already a very good result. But how do I get rid of the last bits of HTML tags? And why exactly do I have to add the namespace prefix?
<div xmlns="http://www.w3.org/1999/xhtml" class="postBody">
<div id="pst_adm_9020974" />
<div id="top_adm_1487815" />
<div id="usr_adm_tgienger" />
<p>Just a quick comment on Alamo car rentals. Was just out there last week and had a convertible from Alamo. Got it thru priceline and was really concerned, based on all the "bad press" that Alamo has gotten on here. To my surprise, had zero problems with Alamo. Got a nearly-new Sebring conv, with 3000 miles on it.</p>
<p />
<p />
<p />
<p>Dreaded the waiting-in-line, but had no problems there either. In and out in short-order. Probably less than 10 minutes either day.</p>
<p />
<p />
<p />
<p>I did have a problem with the Sebring, but it had nothing to do with Alamo. Seems that Chrysler, in their infinite wisdom, decided to have the conv top take up space in the trunk. That works fine when the trunk is empty. But one night we forgot the beach chairs in the trunk. And when we put the top down the next morning, it shattered the back glass! Seems to me that Chrysler could have done a better job designing the conv!</p>
<p />
<p />
<p />
<p>Still waiting to hear back from Alamo what that's going to cost me (and my insurance)...</p>
</div>
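On the namespace question: the `xmlns="http://www.w3.org/1999/xhtml"` attribute in the output above suggests that the page is converted to XHTML before the query runs, so every element lives in the XHTML namespace. In XPath 1.0 an unprefixed name such as `div` only matches elements in no namespace, which would explain why the earlier unprefixed queries came back empty. The sketch below reproduces that effect outside RapidMiner with Python's standard `xml.etree.ElementTree`. The sample document and the helper name `count_post_bodies` are made up for illustration, and the `h` prefix is bound by hand here (presumably RapidMiner supplies that binding itself, since the process leaves the `namespaces` list empty).

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the XHTML that the query runs against.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="postBody"><p>Just a quick comment on Alamo car rentals.</p></div>
    <div class="other"><p>Unrelated block.</p></div>
  </body>
</html>"""

# Prefix-to-URI binding; the prefix name "h" is arbitrary, only the URI matters.
NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}


def count_post_bodies(xml_text, path, namespaces=None):
    """Return how many elements the given path matches in xml_text."""
    root = ET.fromstring(xml_text)
    return len(root.findall(path, namespaces))


# Unprefixed name: looks for <div> in no namespace, so nothing matches.
print(count_post_bodies(XHTML, ".//div[@class='postBody']"))
# Prefixed name: looks for <div> in the XHTML namespace and finds the post.
print(count_post_bodies(XHTML, ".//h:div[@class='postBody']", NAMESPACES))
```

The first call reports no matches and the second reports one, even though both paths look for the same `class` attribute; only the namespace of the element name differs.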
The XML now is:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
<parameter key="url" value="http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="76" name="Multiply" width="90" x="179" y="75"/>
<operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents (2)" width="90" x="380" y="75">
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information (2)" width="90" x="380" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="extract" value="&lt;p&gt;+.&lt;/p&gt;"/>
</list>
<list key="xpath_queries">
<parameter key="xpath1" value="//h:div[@class='postBody'][not(contains(.,'http://www.'))]"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Again, thank you very much for your help.
-
I just found the solution!
Thank you for your help.
The XPath query has to be: string(//h:div[@class='postBody'][not(contains(.,'http://www.'))])
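As for why wrapping the query in `string()` helps: in XPath 1.0, `string()` applied to a node-set returns the string value of its first node, and for an element that is the concatenation of all its descendant text nodes, so the markup drops out. Below is a rough Python equivalent, as a sketch only. `xml.etree.ElementTree` supports neither `string()` nor `contains()` in its XPath subset, so `itertext()` and a plain substring test stand in for them. The sample markup and the function name `first_post_text` are made up for illustration.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample: one post containing a link, one plain post.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="postBody"><p>See http://www.example.com for details.</p></div>
<div class="postBody"><div id="pst_adm_1" /><p>Had zero problems.</p><p /><p>In and out in short order.</p></div>
</body></html>"""


def first_post_text(xml_text):
    """Mimic string(//h:div[@class='postBody'][not(contains(., 'http://www.'))])."""
    root = ET.fromstring(xml_text)
    for div in root.findall(".//h:div[@class='postBody']", NAMESPACES):
        # String value of an element: all descendant text, tags dropped.
        text = "".join(div.itertext())
        # Stand-in for the not(contains(., 'http://www.')) predicate.
        if "http://www." not in text:
            return text
    # string() of an empty node-set is the empty string.
    return ""


print(first_post_text(XHTML))
```

Note that only the first matching post is returned, which mirrors how `string()` treats a node-set with several nodes. Paragraphs are joined with whatever whitespace sits between the tags in the source, so some cleanup of the extracted text may still be needed.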