"web mining - replacing HTML tag elements with values from another attribute"
d1m0s
New Altair Community Member
Hi guys, I'm facing the following problem: I have a set of web pages for multiple countries. Each page covers one country and contains one HTML table with some data that I need to extract.
I retrieve the web pages for my URL list and extract the tables from those pages. After that I need to replace the "<table>" tags in the HTML attribute with "<table><caption>" + the country name from the Country attribute + "</caption>", and this is where I got stuck. I use the Replace operator.
How can I replace a text fragment in one attribute with another attribute's value? It seems to be a trivial task, but I was not able to find a way to do it.
Thanks a lot in advance for your help.
Answers
Hello
By a strange coincidence I had to do something similar and I even wrote some notes to help me in the future to remember...
http://rapidminernotes.blogspot.com/2011/07/using-regular-expressions-with-replace.html
regards
Andrew
Hi,
you can use the macro system to solve this task. Use "Loop Examples" to do the replacement line by line and put the following operators inside:
"Extract Macro" with macro type data_value, extracting from attribute country at index %{example} (this is the default counting macro for the loop). Then append your "Replace" operator inside the loop. For the "replace by" string you can then use the previously extracted macro via %{macro_name}.
Just another remark: be careful with <table[^>]+>. This will only work if the table has whitespace or some attributes following the element's name. A plain <table> will not be detected. It is probably better to use the asterisk instead: <table[^>]*>.
Regards
Matthias
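The remark about the plus sign versus the asterisk can be checked with a small sketch. It uses Python's `re` module rather than RapidMiner's Java regular expressions; for patterns this simple the two should behave the same, but that is an assumption. The sample tags are hypothetical and not taken from the pages discussed in this thread.

```python
import re

# Hypothetical sample tags for illustration only.
samples = ['<table>', '<table >', '<table border="1">']

# "+" requires at least one character between "<table" and ">".
plus_pattern = re.compile(r'<table[^>]+>')
# "*" also accepts zero characters, so a plain "<table>" matches too.
star_pattern = re.compile(r'<table[^>]*>')

for tag in samples:
    matched_plus = plus_pattern.search(tag) is not None
    matched_star = star_pattern.search(tag) is not None
    print(f"{tag!r}: plus={matched_plus}, star={matched_star}")
```

Only the plain `<table>` tag is treated differently by the two patterns; tags with whitespace or attributes match both.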
Hi Matthias, thanks for your help. Something goes wrong: the macro takes the very first example's value from my Country attribute and applies it to all examples, so all my tables end up tagged Afghanistan.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="341" width="547">
<operator activated="true" class="read_excel" compatibility="5.1.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="C:\Users\Dmitry\Desktop\web_extract.xls"/>
<parameter key="imported_cell_range" value="A1:B110"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="country.true.text.attribute"/>
<parameter key="1" value="link.true.text.attribute"/>
</list>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.1.000" expanded="true" height="60" name="Get Pages" width="90" x="200" y="32">
<parameter key="link_attribute" value="link"/>
<parameter key="page_attribute" value="html"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.1.006" expanded="true" height="76" name="Select Attributes" width="90" x="246" y="210">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|country|html"/>
</operator>
<operator activated="true" class="loop_examples" compatibility="5.1.006" expanded="true" height="76" name="Loop Examples" width="90" x="380" y="210">
<parameter key="parallelize_example_process" value="true"/>
<process expanded="true" height="587" width="911">
<operator activated="true" class="extract_macro" compatibility="5.1.006" expanded="true" height="60" name="Extract Macro" width="90" x="112" y="30">
<parameter key="macro" value="country"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="country"/>
<parameter key="example_index" value="%{example}"/>
</operator>
<operator activated="true" class="replace" compatibility="5.1.006" expanded="true" height="76" name="Replace" width="90" x="246" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="html"/>
<parameter key="replace_what" value="&lt;table[^&gt;]+&gt;"/>
<parameter key="replace_by" value="&lt;table&gt;&lt;caption&gt;%{country}&lt;/caption&gt;"/>
</operator>
<connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
</process>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
<connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
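One possible explanation for the "everything is tagged Afghanistan" result, offered as a hedged sketch rather than a confirmed diagnosis: it assumes that the Replace operator inside Loop Examples rewrites the html attribute of every example on each iteration, not just the current one. Under that assumption the first iteration captions all tables with the first country, and afterwards the pattern <table[^>]+> no longer matches the now plain <table> tags. The Python below only simulates that behavior; the rows and the function name are hypothetical.

```python
import re

# Hypothetical stand-in for the example set: one dict per example.
rows = [
    {"country": "Afghanistan", "html": '<table class="data"><tr><td>1</td></tr></table>'},
    {"country": "Albania", "html": '<table class="data"><tr><td>2</td></tr></table>'},
]

pattern = re.compile(r"<table[^>]+>")

def loop_with_global_replace(example_set):
    """Simulate a loop whose inner replace touches every row on each pass."""
    result = [dict(r) for r in example_set]  # copy so the input stays unchanged
    for current in result:
        # The "macro": the country of the current example.
        replacement = "<table><caption>" + current["country"] + "</caption>"
        # Assumption: the replacement is applied to ALL rows, not only the current one.
        for r in result:
            r["html"] = pattern.sub(lambda m: replacement, r["html"])
    return result

result = loop_with_global_replace(rows)
for r in result:
    print(r["country"], "->", r["html"])
```

In this simulation the second pass changes nothing, because `<table>` without attributes is not matched by the plus-sign pattern, so both rows keep the caption from the first pass.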
Thanks Andrew, I'll try your method now.
The Generate Attributes operator did the job. Thanks to everyone for the ideas.
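The thread does not show the expression that was used in Generate Attributes, so the following is only a sketch of the row-wise idea in Python: each row's html value is rebuilt from that same row's country value, which avoids the need for a loop and a macro. The function name `add_caption` and the sample rows are hypothetical.

```python
import re

def add_caption(html, country):
    """Replace every opening <table ...> tag with <table><caption>country</caption>."""
    replacement = "<table><caption>" + country + "</caption>"
    # The asterisk lets the pattern match a plain <table> as well.
    # A function is used as the replacement so that backslashes in the
    # country name are never interpreted as regex escapes.
    return re.sub(r"<table[^>]*>", lambda m: replacement, html)

# Hypothetical rows standing in for the example set.
rows = [
    {"country": "Afghanistan", "html": '<table class="data"><tr><td>1</td></tr></table>'},
    {"country": "Albania", "html": "<table><tr><td>2</td></tr></table>"},
]

# Row-wise: each row uses its own country value.
for row in rows:
    row["html"] = add_caption(row["html"], row["country"])
    print(row["html"])
```

As in the original Replace setup, any attributes on the opening table tag are dropped by the replacement.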