Problem with Xpath query? Processing documents from web
pix123
New Altair Community Member
Hi there,
I am trying to extract documents from a movie review site. When I run the process below I get 0 results and can't figure out why. Can anyone help? Thanks.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="313" y="238">
<parameter key="number_of_iterations" value="10"/>
<process expanded="true">
<operator activated="true" class="web:process_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Process Documents from Web" width="90" x="179" y="85">
<parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/"/>
<list key="crawling_rules"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="seg" value="//h:table[@class='table table-striped']/h:tr"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="text" value="//h:p/text|)"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Web" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Loop" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Best Answer
-
It's a bit hard to explain, but here is what your process does:
You select the reviews with loop logic; the XPath you use translates to something like 'give me the text of every review that has the class the_review'.
But then you take the XPath of the first match for each attribute. This doesn't give the right result, because every review has this data at a different location relative to the actual review, so the values are never mapped together. With the current structure you lose all relation between the data. What you need is more like:
For every div containing a review, get me the parameters (reviewer, date etc.) that are part of this div.
(Told you it was hard to explain.)
Long story short, I'm not sure you can get this with the typical XPath extractor, but you can use XPath directly with the XSLT operators.
I've attached an example; it's a bit more complex but still relatively easy to adapt.
The logic is to create proper XML from the HTML first (the page source is not XHTML) and then apply dedicated XPath; this returns a nice table with your data ready to use.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="179" y="34">
<parameter key="number_of_iterations" value="10"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="289">
<parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:html_to_xml" compatibility="8.1.000" expanded="true" height="68" name="HTML to XML" width="90" x="179" y="289"/>
<operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="313" y="289">
<list key="replace_dictionary">
<parameter key="(?s)^.*?&lt;html.*?&gt;" value="&lt;html&gt;"/>
</list>
</operator>
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="646">
<parameter key="text" value="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;&gt;
	&lt;xsl:output method=&quot;xml&quot; version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt;
	&lt;xsl:template match=&quot;/&quot;&gt;
		&lt;root&gt;
			&lt;xsl:for-each select=&quot;//div[@class='row review_table_row']&quot;&gt;
				&lt;!--&lt;xsl:copy-of select=&quot;.&quot;/&gt;--&gt;
				&lt;row
				critic=&quot;{normalize-space(.//div[contains(@class,'critic_name')]/a[1])}&quot;
				publisher=&quot;{normalize-space(.//div[contains(@class,'critic_name')]/a[2]/em)}&quot;
				date=&quot;{normalize-space(.//div[contains(@class,'review_date')])}&quot;
				review=&quot;{normalize-space(.//div[@class='the_review'])}&quot;
				score=&quot;{normalize-space(.//div[@class='small subtle'][contains(.,'Original Score')])}&quot;/&gt;
			&lt;/xsl:for-each&gt;
		&lt;/root&gt;
	&lt;/xsl:template&gt;
&lt;/xsl:stylesheet&gt;"/>
</operator>
<operator activated="true" class="text:process_xslt" compatibility="8.1.000" expanded="true" height="82" name="Process XSLT" width="90" x="313" y="544"/>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="544">
<parameter key="query_type" value="Regular Region"/>
<list key="string_machting_queries">
<parameter key="review" value="&lt;row./&gt;"/>
</list>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="review" value="&lt;row./&gt;"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Critic Name" value="//@critic"/>
<parameter key="Reviews" value="//@review"/>
<parameter key="Date Posted" value="//@date"/>
<parameter key="Publisher" value="//@publisher"/>
<parameter key="Score" value="//@score"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<parameter key="assume_html" value="false"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="segment" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="544">
<parameter key="text_attribute" value="content"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="391">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="content|query_key"/>
<parameter key="invert_selection" value="true"/>
</operator>
<connect from_op="Get Page" from_port="output" to_op="HTML to XML" to_port="document"/>
<connect from_op="HTML to XML" from_port="document" to_op="Replace Tokens" to_port="document"/>
<connect from_op="Replace Tokens" from_port="document" to_op="Process XSLT" to_port="document"/>
<connect from_op="Create Document" from_port="output" to_op="Process XSLT" to_port="xslt document"/>
<connect from_op="Process XSLT" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="9.0.003" expanded="true" height="68" name="Store" width="90" x="313" y="34">
<parameter key="repository_entry" value="New Output of Web Pages/RT Reviews"/>
</operator>
<connect from_op="Loop" from_port="output 1" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
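The "for every div containing a review, read the fields inside that div" idea can also be sketched outside RapidMiner. The following is only an illustration in Python's standard library, not part of the attached process: the sample markup, the critic names and the helper `text_of` are invented for the example, and ElementTree's limited XPath has no `contains()`, so exact class matches are used instead.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that mimics the structure targeted by the XSLT
# (class names taken from the process, the values are made up).
page = """<html><body>
<div class="row review_table_row">
  <div class="critic_name"><a>Critic A</a><a><em>Paper A</em></a></div>
  <div class="review_date">May 1, 2014</div>
  <div class="the_review">Lovely and funny.</div>
  <div class="small subtle">Original Score: 3/4</div>
</div>
<div class="row review_table_row">
  <div class="critic_name"><a>Critic B</a><a><em>Paper B</em></a></div>
  <div class="review_date">May 2, 2014</div>
  <div class="the_review">Too long.</div>
</div>
</body></html>"""

root = ET.fromstring(page)


def text_of(row, path):
    """Return the whitespace-normalized text under `path`, relative to `row`."""
    node = row.find(path)
    if node is None:
        return ""  # field missing for this review, e.g. no score given
    return " ".join("".join(node.itertext()).split())


rows = []
for row in root.findall(".//div[@class='row review_table_row']"):
    # Every path starts with "." so it is evaluated inside the current row only.
    rows.append({
        "critic": text_of(row, ".//div[@class='critic_name']/a[1]"),
        "publisher": text_of(row, ".//div[@class='critic_name']/a[2]/em"),
        "date": text_of(row, ".//div[@class='review_date']"),
        "review": text_of(row, ".//div[@class='the_review']"),
        "score": text_of(row, ".//div[@class='small subtle']"),
    })

for r in rows:
    print(r)
```

Because each lookup is relative to its own row, the second review simply gets an empty score instead of borrowing the score of the first one, which is what a document-wide "first match" query would return.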
5
Answers
-
Are you sure about the page or the XPath you used?
The reviews are not in a table but in a div. Your query looks for a table that does not exist (table-striped cannot be found in the source code).
This is how a review is stored, using a div with class 'the_review':
<div class="the_review">
This is a lovely, funny, wonderfully acted film. The big problem is, it's an 80-minute movie that takes two hours. By the time you get to the real story, you're out of gas.
</div>
So try with:
<parameter key="seg" value="//h:div[@class='the_review']"/>
It's untested, so don't take it for granted :-)
What could have happened is that you tested the site during an A/B test, or that the page code differs depending on the user agent sent by RapidMiner.
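To see why the table query yields 0 results, here is a rough sketch using Python's standard library on a shortened copy of the snippet above. It is an illustration only: the `<body>` wrapper is added for the example, and there is no `h:` prefix because this small sample carries no namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the review markup, wrapped in <body> for the example.
snippet = """<body>
<div class="the_review">
This is a lovely, funny, wonderfully acted film.
</div>
</body>"""

root = ET.fromstring(snippet)

# The original query looks for table rows, which this markup does not contain.
table_rows = root.findall(".//table[@class='table table-striped']/tr")

# The suggested query looks for the div that actually holds the review.
review_divs = root.findall(".//div[@class='the_review']")

print(len(table_rows), len(review_divs))
print(review_divs[0].text.strip())
```

A query that matches nothing is not an error, so the process runs through and just returns an empty result.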
0 -
@kayman Thank you, this helped a lot; it had been a while since I had worked with the process. I've now got most of the XPath attributes up and running, but I cannot retrieve the score for each review. I get a question mark when I run the process; all other attributes are OK. Any ideas?
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="179" y="34">
<parameter key="number_of_iterations" value="10"/>
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
<parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Review" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div[1]/text() "/>
<parameter key="Date Posted" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[1]/text()"/>
<parameter key="Publisher" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[1]/h:div[3]/h:a[2]/h:em/text()"/>
<parameter key="Score" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div[2]/text"/>
<parameter key="Critic Name" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[1]/h:div[3]/h:a[1]/text() "/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Feedback_text" value="//h:div[@class='the_review']/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="9.0.003" expanded="true" height="68" name="Store" width="90" x="1251" y="85">
<parameter key="repository_entry" value="New Output of Web Pages/RT Reviews"/>
</operator>
<connect from_op="Loop" from_port="output 1" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0 -
I really wonder: can't I apply Cut Document (using XPath) directly to the output of Get Page? Why do I need all the hassle? I was trying to apply an XPath selector to a webpage and couldn't get it working until I tried your solution, but I want to learn the logic. Should I convert all HTML files to XML first? Isn't there a way to select part of a website directly?
-
XPath expects proper XML to work with. If your source code (your Get Page output) is proper XHTML and doesn't use too many namespaces, you can do this directly. But in reality most websites deal with XHTML very loosely and have doctypes in all the wrong places, so it is always safer to do some cleaning in advance. From experience, only a small number of websites have truly valid XML in their source.
Now, if you are pretty familiar with XPath and XSLT, I'd suggest using the Process XSLT operator instead. Just insert your XSLT (v1.0) in a document and convert your page any way you like, as a pro...
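To make the "XPath expects proper XML" point concrete, here is a small sketch in Python's standard library. It is an assumption-laden illustration, not what the HTML to XML operator does internally: the `TreeBuilder` class, the `VOID` set and the sample page are made up for the example, and the cleanup is far cruder than a real HTML tidy step.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world HTML: the <br> is never closed, so it is not valid XML.
html = ('<!DOCTYPE html><html><body><div class="the_review">'
        'Lovely film.<br>Too long.</div></body></html>')

try:
    ET.fromstring(html)
    strict_parse_ok = True
except ET.ParseError:
    strict_parse_ok = False  # a strict XML parser rejects the page

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without a closing tag


class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds a well-formed element tree from loose HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element becomes that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


builder = TreeBuilder()
builder.feed(html)
builder.close()

reviews = builder.root.findall(".//div[@class='the_review']")
review_text = "".join(reviews[0].itertext())
print(strict_parse_ok, review_text)
```

The same page that the strict parser refuses becomes queryable once it has been turned into a well-formed tree, which is the reason for the cleaning step before any XPath or XSLT is applied.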