How to use a Macro on Extract Information
Marco_Barradas
Altair Employee
Hi! I'm looking for some help with the Extract Information operator combined with macros. I have built a crawling web service with RapidMiner Server which extracts product prices from different pages.
The layout is simple.
The only thing that changes is the RegEx used to extract the information from each page.
I tried to create an ExampleSet with the domain and the rules for each field, to keep it simple to add new domains to crawl, but when I try to use a macro in the query expression nothing happens.
Has anybody tried this approach? How could I use the Set Parameters from ExampleSet with the Extract Information operator?
<?xml version="1.0" encoding="UTF-8"?><process version="9.4.001"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.elpalaciodehierro.com/charm-chile-39861202.html</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34"> <parameter key="url" value="%{url}"/> <parameter key="random_user_agent" value="true"/> <parameter key="connection_timeout" value="10000"/> <parameter key="read_timeout" value="10000"/> <parameter key="follow_redirects" value="true"/> <parameter key="accept_cookies" value="all"/> <parameter key="cookie_scope" value="thread"/> <parameter key="request_method" value="GET"/> <list key="query_parameters"/> <list key="request_properties"/> <parameter key="override_encoding" value="false"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="text:process_documents" compatibility="8.2.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34"> <parameter key="create_word_vector" value="false"/> <parameter key="vector_creation" value="TF-IDF"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="false"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="elpalacio (2)" width="90" x="313" y="34"> <parameter key="query_type" value="Regular Expression"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"> <parameter key="nombre" value="<div class="product-name ">\s{1,} <span class="h1" >(.*)</span>\s{1,}</div>"/> <parameter key="precio_n" value="<span class="price">[$]\S([0-9,.]{1,})</span>"/> <parameter key="precio_d" value=" <span class="ls-price-now-price price".*">\s{1,}[$]\S([0-9,.]{1,})\s{1,} </span>"/> <parameter key="antes" value="<span class="ls-price-bef-price price"\sid="old-price-[0-9]{1,}">\s{1,}[$]\S([0-9,.]{1,})\s{1,}</span>\s{1,}</p>"/> </list> <list key="regular_region_queries"/> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> </operator> <connect from_port="document" to_op="elpalacio (2)" to_port="document"/> <connect from_op="elpalacio (2)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="generate_attributes" compatibility="9.4.001" expanded="true" 
height="82" name="Generate Attributes" width="90" x="380" y="34"> <list key="function_descriptions"> <parameter key="precio_n" value="if(missing(precio_n),precio_d,precio_n)"/> <parameter key="precio_d" value="if(missing(precio_d),precio_n,precio_d)"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="parse_numbers" compatibility="9.4.001" expanded="true" height="82" name="Parse Numbers" width="90" x="514" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="precio_n|precio_d"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="true"/> <parameter key="decimal_character" value="."/> <parameter key="grouped_digits" value="true"/> <parameter key="grouping_character" value=","/> <parameter key="infinity_representation" value=""/> <parameter key="unparsable_value_handling" value="fail"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.4.001" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="precio_d|precio_n|nombre"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="true"/> </operator> <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/> <connect from_op="Process Documents" from_port="example set" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/> <connect from_op="Parse Numbers" from_port="example set output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Answers
No real answer to your specific question, but our approach to crawling hundreds of sites without too many process changes is to use an XSLT stylesheet before the extract operator, so we do not need to modify anything in this part. It allows us to use a 'template process' for any site we crawl.
All of the logic is in two places: the stylesheet that gets the content from a page, and at the end some transformation to get all extracted data into the same format (like price or review data).
I've attached an example using your page, but it can be easily modified to whatever other page, but you need to know some (basic) XPath.<?xml version="1.0" encoding="UTF-8"?><process version="9.4.001"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.elpalaciodehierro.com/charm-chile-39861202.html</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34"> <parameter key="url" value="%{url}"/> <parameter key="random_user_agent" value="true"/> <parameter key="connection_timeout" value="10000"/> <parameter key="read_timeout" value="10000"/> <parameter key="follow_redirects" value="true"/> <parameter key="accept_cookies" value="all"/> <parameter key="cookie_scope" value="thread"/> <parameter key="request_method" value="GET"/> <list key="query_parameters"/> <list key="request_properties"/> <parameter key="override_encoding" value="false"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="subprocess" compatibility="9.4.001" expanded="true" height="82" name="preclean" width="90" x="246" y="34"> <process expanded="true"> <operator activated="true" class="text:replace_tokens" compatibility="8.2.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="45" y="34"> <list key="replace_dictionary"> <parameter key="(?is)^.*?<html[^>]+>" value="<html>"/> </list> <description align="center" color="transparent" colored="false" width="126">clean header for badly constructed sites</description> </operator> <operator activated="true" class="text:html_to_xml" compatibility="8.2.000" expanded="true" height="68" name="Html To Xml (2)" width="90" x="179" y="34"/> <operator activated="true" class="text:replace_tokens" compatibility="8.2.000" expanded="true" height="68" name="Replace Tokens (3)" width="90" x="313" y="34"> <list key="replace_dictionary"> <parameter key="(?is)^.*?<html[^>]+>" value="<html>"/> </list> <description align="center" color="transparent" colored="false" width="126">get rid of xml namespaces</description> </operator> <connect from_port="in 1" to_op="Replace Tokens (2)" to_port="document"/> <connect from_op="Replace Tokens (2)" from_port="document" to_op="Html To Xml (2)" to_port="document"/> <connect from_op="Html To Xml (2)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/> <connect from_op="Replace Tokens (3)" from_port="document" to_port="out 1"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">HTML to XHTML so we can use XPath</description> </operator> <operator activated="true" class="text:create_document" compatibility="8.2.000" expanded="true" height="68" name="xslt" width="90" x="380" y="136"> <parameter key="text" value="<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 	
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> 	<xsl:template match="/"> 		<root> <xsl:for-each select="//div[@class='product-shop']"> <row product_name="{.//span[@class='h1']}" product_price="{.//span[@class='price']}" /> </xsl:for-each> 		</root> 	</xsl:template> </xsl:stylesheet> "/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> <description align="center" color="transparent" colored="false" width="126">Stylesheet for this specific page, but it is a template for any other page</description> </operator> <operator activated="true" class="subprocess" compatibility="9.4.001" expanded="true" height="103" name="XML 2 Data" width="90" x="514" y="34"> <process expanded="true"> <operator activated="true" class="text:process_xslt" compatibility="8.2.000" expanded="true" height="82" name="Process Xslt (10)" width="90" x="45" y="34"/> <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Cut Document (11)" width="90" x="179" y="34"> <parameter key="query_type" value="Regular Region"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"> <parameter key="row" value="<row./>"/> </list> <list key="xpath_queries"> <parameter key="model" value="//model"/> </list> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="Extract Information (11)" width="90" x="112" y="34"> <parameter key="query_type" value="XPath"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"/> <list key="xpath_queries"> <parameter key="Product.Name" value="//@product_name"/> <parameter key="Product.Code" value="//@product_code"/> <parameter key="Product.Image" value="//@product_image"/> <parameter key="Product.ReviewQty" value="//@product_review_qty"/> <parameter key="Product.ReviewAvg" value="//@product_review_avg"/> <parameter key="Product.StockStatus" value="//@product_stockstatus"/> <parameter key="Product.Price" value="//@product_price"/> </list> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="false"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <description align="center" color="transparent" colored="false" width="126">extract all matching attributes. 
Not all sites have all the same content but having a set of uniform labels allows for reusing</description> </operator> <connect from_port="segment" to_op="Extract Information (11)" to_port="document"/> <connect from_op="Extract Information (11)" from_port="document" to_port="document 1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:documents_to_data" compatibility="8.2.000" expanded="true" height="82" name="Documents to Data (15)" width="90" x="313" y="34"> <parameter key="text_attribute" value="src"/> <parameter key="add_meta_information" value="true"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <parameter key="use_processed_text" value="false"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.4.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="query_key|src"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="trim" compatibility="9.4.001" expanded="true" height="82" name="Trim (4)" width="90" x="581" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="operator_toolbox:filter_missing_attributes" compatibility="2.2.000" expanded="true" height="82" name="Filter Attributes with Missing Values" width="90" x="715" y="34"> <parameter key="filter_method" value="one or more non-missing"/> <parameter key="maximum_number_of_missings" value="100"/> <parameter key="maximum_relative_number_of_missings" value="0.1"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <description align="center" color="transparent" colored="false" width="126">simply remove all empty (non existing) attributes</description> </operator> <connect from_port="in 1" to_op="Process Xslt (10)" to_port="document"/> <connect from_port="in 2" to_op="Process Xslt (10)" to_port="xslt document"/> <connect from_op="Process Xslt (10)" from_port="document" to_op="Cut Document (11)" to_port="document"/> <connect from_op="Cut Document (11)" from_port="documents" to_op="Documents to Data (15)" to_port="documents 1"/> <connect from_op="Documents to 
Data (15)" from_port="example set" to_op="Select Attributes (2)" to_port="example set input"/> <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Trim (4)" to_port="example set input"/> <connect from_op="Trim (4)" from_port="example set output" to_op="Filter Attributes with Missing Values" to_port="example set"/> <connect from_op="Filter Attributes with Missing Values" from_port="filtered example set" to_port="out 1"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="source_in 3" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">Always same process, only stylesheet differs</description> </operator> <connect from_op="Get Page" from_port="output" to_op="preclean" to_port="in 1"/> <connect from_op="preclean" from_port="out 1" to_op="XML 2 Data" to_port="in 1"/> <connect from_op="xslt" from_port="output" to_op="XML 2 Data" to_port="in 2"/> <connect from_op="XML 2 Data" from_port="out 1" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process>
Hi @kayman, as you mention, this approach is another way of solving my current task, but it trades RegEx for XPath and may or may not be flexible enough in some of the cases I've seen in my web crawling so far. Sometimes the price of the product is listed under a class that seems to be dynamic, like class=article-abcha67323876-asdalji followed by a lot of attributes or text and then >$ 1,000</span>.
That's why I took the RegEx approach: it seemed simple to store the RegEx rules in a table, so that I would only need to add a new row to the data table to create a new rule to extract data from a certain domain.
I'll wait to see if anybody else gives us a solution before marking your answer as the solution.
But thanks for the process and the approach; it helps me see other ways of solving my task.
Yeah, tell me about it... We crawl 200 sites at the moment and they all have their own specific way of making it complex. In the end it's all about flexibility, and for us putting the logic in XPath worked out best.
We haven't found a single site yet where we couldn't get the data with XPath, whereas regex would have been more challenging. One way to deal with dynamic attributes is to look at the surrounding tags (product listings are typically in a list, for instance), or you could use something like 'a span whose text contains a dollar sign', which usually works out fine as well.
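For example, XPath expressions along these lines (just an illustration, the exact paths depend on the site):
//span[contains(text(), '$')]
//li[contains(@class, 'product')]//span[contains(text(), '$')]
The first one grabs any span whose text contains a dollar sign; the second narrows it down to spans inside a list item whose class mentions 'product'.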
Now, I actually like your idea but will probably use it on our XSLT template rather than on the extract operator.
In essence our XPath is always the same, as we are looking for around 10 different items (price, availability, image used, etc.), and whether they all exist or not doesn't matter. The only difference from site to site is the path to a value, and if it's not there, it just returns an empty attribute.
So I am going to use your idea to dynamically inject these paths into my XSLT. I can then indeed create a reference file that contains the XPaths per site and loop it through one template instead of using a template per site.
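As a rough sketch, the row template in the stylesheet would then look something like this (the macro names are placeholders, and it assumes the macros get resolved inside the Create Document text parameter that holds the stylesheet):
<xsl:for-each select="%{product_block_path}">
  <row product_name="{%{name_path}}" product_price="{%{price_path}}"/>
</xsl:for-each>
After macro resolution this turns into the same kind of stylesheet as in my example above, just with the site-specific paths filled in from the reference file.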
Yes, you could create a dataset with your XSLT configuration and query it to obtain a macro that could be used in your operator. This way you only have one process to maintain, and if for some reason you need to extract another value, it would be really easy to make the change for all 200 sites instead of going over 200 processes to make this little change.
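A rough sketch of how the macro could be pulled from the configuration table, assuming it has already been filtered down to the current site and has a column holding the path (operator parameters and names here are just placeholders, not tested):
<operator activated="true" class="extract_macro" compatibility="9.4.001" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
  <parameter key="macro" value="price_path"/>
  <parameter key="macro_type" value="data_value"/>
  <parameter key="attribute_name" value="price_path"/>
  <parameter key="example_index" value="1"/>
  <list key="additional_macros"/>
</operator>
The %{price_path} macro would then be referenced in the Create Document text that holds the XSLT.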