XPath Queries in RapidMiner
nfoMagic
New Altair Community Member
Hey all,
I have spent many hours with RapidMiner trying to extract clean text from an HTML document using XPath.
I built some XPath queries that work fine in Google Docs (Spreadsheets), but no matter what I do, they won't work properly in RapidMiner.
The query "//div[@id='review']/div/div/div[2]/div[2]" (website: "http://www.holidaycheck.de/hotelbewertung-ferienbauernhof+arnoldgut+familie+mayrhofer+unser+erster+super+toller+bauernhofurlaub-ch_hb-id_7215281.html") returns exactly the text I want in Google Docs. When I run the same query in RapidMiner, the attribute generated by the "Extract Information" operator is empty.
I've tested several queries that all work in Google Docs, but only some of them work in RM.
The query "//h:div[@id='reviewTypeLong']" works in RM, and the returned text contains all the information I need. The problem is that I haven't found a way to remove the HTML tags yet. I've tried the "Cut Document" and "Remove Document Parts" operators with the regex <[^>]*>, but it doesn't do what it should. I also don't know how to apply the "Extract Content" operator to attributes, so that I could remove the HTML tags after extracting the useful parts of the website.
This is driving me crazy, and before I spend several more hours on it, I hope some experienced "Rapid-Miners" can help me.
Many thanks in advance for any help!
-----------------------------------------------------------------------------------------------------------
My RapidMiner process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="550" width="882">
      <operator activated="true" class="web:get_webpage" compatibility="5.2.003" expanded="true" height="60" name="Get Page" width="90" x="179" y="75">
        <parameter key="url" value="http://www.holidaycheck.de/hotelbewertung-ferienbauernhof+arnoldgut+familie+mayrhofer+unser+erster+super+toller+bauernhofurlaub-ch_hb-id_7215281.html"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="accept_cookies" value="all"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="380" y="30">
        <process expanded="true" height="607" width="935">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information (2)" width="90" x="112" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="possibility1" value="//h:div[@id='reviewTypeLong']"/>
              <parameter key="possibility2" value="//h:div[@id='review']/div/div/div[2]/div[2]"/>
              <parameter key="possibility1TEXT" value="//h:div[@id='reviewTypeLong']//text()"/>
              <parameter key="possibility2TEXT" value="//h:div[@id='review']/div/div/div[2]/div[2]//text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <operator activated="false" class="text:remove_document_parts" compatibility="5.2.004" expanded="true" height="60" name="Remove Document Parts" width="90" x="514" y="120">
            <parameter key="deletion_regex" value=" &lt;[^&gt;]*&gt;"/>
          </operator>
          <operator activated="false" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="313" y="120">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="text" value=" &lt;[^&gt;]*&gt;"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
          <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
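
One difference between the queries in the process stands out and may be worth testing. This is a guess, not a confirmed diagnosis. The query that works ("possibility1") has a single element step, and that step carries the h: prefix. The query that returns nothing ("possibility2") prefixes only the first div and leaves the following steps unprefixed. Assuming RapidMiner evaluates XPath against a namespaced XHTML version of the page, where h: stands for the XHTML namespace, an unprefixed div would match nothing. Below is a sketch of the xpath_queries list with every step prefixed. The key names possibility2_prefixed and possibility2TEXT_prefixed are made up for illustration, and the fragment has not been verified against the live page:

```xml
<list key="xpath_queries">
  <!-- Assumption: every element step needs the h: prefix, not only the first one. -->
  <!-- Hypothetical attribute names; the path itself is taken from "possibility2" above. -->
  <parameter key="possibility2_prefixed" value="//h:div[@id='review']/h:div/h:div/h:div[2]/h:div[2]"/>
  <!-- Same path, selecting only the text nodes below the matched element. -->
  <parameter key="possibility2TEXT_prefixed" value="//h:div[@id='review']/h:div/h:div/h:div[2]/h:div[2]//text()"/>
</list>
```

If the prefixed queries still return nothing, the namespace is probably not the cause. In that case the page structure seen by RapidMiner may differ from the one Google Docs sees.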