Replace content inside an HTML tag with new content
Aj
New Altair Community Member
Hello,
I am trying to extract content from a URL. I use XPath with the "Cut Document" operator to extract the values into a vector, and then do some post-processing to associate them with other content that I extracted separately.
Sometimes the website has missing values. Because of this, I cannot correctly associate the different pieces of content that were extracted with separate XPath queries: the vectors end up with different lengths and no longer line up row by row.
To overcome this problem, I tried to replace the missing value in the content with "-1.00" using "Replace Tokens". When I put a breakpoint after this operator, the output is divided into two parts: the upper part has the replacement, with -1.00 between the corresponding tags as desired, but the lower part does not (I am guessing it is the original content). Also, "Cut Document" does not show any -1.00 among the parsed values.
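To make the intended replacement concrete outside RapidMiner, here is a minimal Python sketch of the same idea. The HTML fragment, the `onclick` attribute, and the function names `fill_missing` and `extract_odds` are assumptions for illustration; the real page markup may differ, and this says nothing about how "Replace Tokens" behaves internally.

```python
import re
from typing import List

# Hypothetical fragment modelled on the odds cells described above;
# the real page markup may differ.
html = (
    '<td class="nowrp center table-odds"><a href="#" onclick="f(this)">1.01</a></td>'
    '<td class="nowrp center table-odds"><a href="#" onclick="f(this)"></a></td>'
    '<td class="nowrp center table-odds"><a href="#" onclick="f(this)">51.00</a></td>'
)

# An opening <a ...> tag immediately followed by </a>, i.e. an empty anchor.
EMPTY_ANCHOR = re.compile(r'(<a\b[^>]*>)(</a>)')


def fill_missing(markup: str, placeholder: str = "-1.00") -> str:
    """Insert the placeholder between the tags of every empty anchor."""
    return EMPTY_ANCHOR.sub(
        lambda m: m.group(1) + placeholder + m.group(2), markup
    )


def extract_odds(markup: str) -> List[str]:
    """Return the non-empty text of each anchor, similar to a text() query."""
    return re.findall(r'<a\b[^>]*>([^<]+)</a>', markup)


before = extract_odds(html)   # the empty cell is silently skipped
odds = extract_odds(fill_missing(html))
print(before)
print(odds)
```

Without the placeholder the extracted list is one element short, which is exactly the misalignment described above; with it, every cell contributes one value.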
It is difficult to explain the whole problem, but it is easy to understand if you import the process XML below into RapidMiner. I am trying to extract the contents of the following URL:
http://www.oddsportal.com/matches/tennis/20111211/
Please note that the value corresponding to 51.00 is missing in two of the rows.
Please help, as I have already spent a lot of time on this. It is also a basic problem, since many websites will have missing values like this, so I need to know how to resolve it.
One more question: how can I extract the result into a RapidMiner variable if the result of an XPath query is an n×n matrix, where n>1, rather than an n×1 vector? "Cut Document" does not seem to work in that situation.
Thanks,
Aj
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="386" width="614">
<operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page (5)" width="90" x="45" y="30">
<parameter key="url" value="http://www.oddsportal.com/matches/tennis/20111211/"/>
<parameter key="user_agent" value="Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10"/>
<parameter key="accept_cookies" value="all"/>
<list key="query_parameters">
<parameter key="TimeZone" value="66"/>
</list>
</operator>
<operator activated="true" class="text:replace_tokens" compatibility="5.1.001" expanded="true" height="60" name="Replace Tokens" width="90" x="179" y="30">
<list key="replace_dictionary">
<parameter key="(this\)&quot;&gt;&lt;/a&gt;)" value="this\)&quot;&gt;-1.00&lt;/a&gt;"/>
</list>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document (8)" width="90" x="313" y="75">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="odds" value="//h:td[contains(@class,'nowrp center table-odds')]/h:a/text()"/>
</list>
<list key="namespaces">
<parameter key="xx" value="xml"/>
</list>
<list key="index_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data (3)" width="90" x="447" y="75">
<parameter key="text_attribute" value="odds"/>
<parameter key="label_attribute" value="abc"/>
<parameter key="add_meta_information" value="false"/>
</operator>
<connect from_op="Get Page (5)" from_port="output" to_op="Replace Tokens" to_port="document"/>
<connect from_op="Replace Tokens" from_port="document" to_op="Cut Document (8)" to_port="document"/>
<connect from_op="Cut Document (8)" from_port="documents" to_op="Documents to Data (3)" to_port="documents 1"/>
<connect from_op="Documents to Data (3)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
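On the matrix question, one way to keep values aligned is to walk the table row by row and read every cell of a row together, instead of running one query per column. The following is a minimal Python sketch of that idea, not a RapidMiner recipe: the table fragment, the class names, and the function `rows_to_matrix` are assumptions for illustration. In RapidMiner terms this might correspond to cutting the document per row and extracting the fields inside each segment, but that mapping is an assumption too.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table fragment; the real page markup may differ.
snippet = """
<table>
  <tr><td class="name">Player A - Player B</td><td class="odds"><a>1.01</a></td><td class="odds"><a>51.00</a></td></tr>
  <tr><td class="name">Player C - Player D</td><td class="odds"><a>1.50</a></td><td class="odds"><a/></td></tr>
</table>
"""


def rows_to_matrix(markup, missing="-1.00"):
    """Return one list per <tr>, with one entry per <td> in that row."""
    root = ET.fromstring(markup)
    matrix = []
    for tr in root.iter("tr"):
        row = []
        for td in tr.findall("td"):
            # itertext() also collects text from nested elements such as <a>.
            text = "".join(td.itertext()).strip()
            # An empty cell still yields an entry, so columns stay aligned.
            row.append(text if text else missing)
        matrix.append(row)
    return matrix


matrix = rows_to_matrix(snippet)
for row in matrix:
    print(row)
```

Because each row is processed as a unit, a missing cell becomes a placeholder in its own position rather than shifting the values that follow it.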
Answers
Fwiw, I'm also experiencing this issue.
The best way I can describe it is that the Replace Tokens operator seems to do -nothing-. I've tried it in every way I could think of. The debug view shows what you describe, and there the top part shows the tokens properly replaced, but at the next step it's right back to the way it was.
I'm very curious if anyone in the community has experienced this problem and maybe has a fix.
Alex0