"web mining - replacing HTML tag elements with values from another attribute"

d1m0s New Altair Community Member
edited November 5 in Community Q&A
Hi guys, I'm facing the following problem: I have a set of web pages for multiple countries, one page - one country - one HTML table with some data that I need to extract.

I retrieve the web pages for my URL list and extract the tables from those pages. After that I need to replace the "<table>" tags in the HTML attribute with "<table><caption>" + the country name from the Country attribute + "</caption>", and this is where I got stuck. I am using the Replace operator.

How can I replace a text fragment in one attribute with the value of another attribute? It seems like a trivial task, but I have not been able to find a way to do it.

Thanks a lot in advance for help.


Answers

  • Hello

    By a strange coincidence I had to do something similar, and I even wrote some notes to help me remember it in the future:

    http://rapidminernotes.blogspot.com/2011/07/using-regular-expressions-with-replace.html

    regards

    Andrew
  • colo
    colo New Altair Community Member
    Hi,

    You can use the macro system to solve this task. Use "Loop Examples" to do the replacement one example (row) at a time, and put the following operators inside:

    "Extract Macro" with macro type data_value, extracting from attribute country at index %{example} (this is the default counting macro for the loop). Then append your "Replace" operator inside the loop. In the "replace by" string you can then use the extracted macro as %{macro_name}.

    One more remark: be careful with <table[^>]+>. It only matches if the table tag has whitespace or some attributes following the element's name; a plain <table> will not be detected. It is probably better to use the asterisk instead: <table[^>]*>.

    Regards
    Matthias
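
The difference between the plus and the asterisk can be checked outside RapidMiner. The following is a minimal sketch in Python, whose `re` module treats these two patterns the same way as the Java regex engine does; the sample HTML strings and the helper name `opening_tag` are made up for illustration.

```python
import re

# Pattern from the thread: needs at least one character between "table" and ">".
PLUS = re.compile(r"<table[^>]+>")
# Variant with the asterisk: zero or more characters, so a bare <table> matches too.
STAR = re.compile(r"<table[^>]*>")

# Hypothetical sample inputs, not taken from the pages in the thread.
plain = "<table><tr><td>1</td></tr></table>"
with_attrs = '<table class="data"><tr><td>1</td></tr></table>'


def opening_tag(pattern, html):
    """Return the first opening-tag match, or None if nothing matches."""
    match = pattern.search(html)
    return match.group(0) if match else None


print(opening_tag(PLUS, plain))       # no match for a bare <table>
print(opening_tag(STAR, plain))       # matches the bare tag
print(opening_tag(PLUS, with_attrs))  # matches the tag with attributes
```

Note that the asterisk variant would also match a tag whose name merely starts with "table"; whether that matters depends on the pages being processed.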
  • d1m0s
    d1m0s New Altair Community Member
    Hi Matthias, thanks for your help. Something goes wrong, though: the macro takes the value of the very first example from my Country attribute and applies it to all examples, so all my tables end up tagged Afghanistan. Here is my process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <process expanded="true" height="341" width="547">
          <operator activated="true" class="read_excel" compatibility="5.1.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\Users\Dmitry\Desktop\web_extract.xls"/>
            <parameter key="imported_cell_range" value="A1:B110"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="country.true.text.attribute"/>
              <parameter key="1" value="link.true.text.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="5.1.000" expanded="true" height="60" name="Get Pages" width="90" x="200" y="32">
            <parameter key="link_attribute" value="link"/>
            <parameter key="page_attribute" value="html"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.1.006" expanded="true" height="76" name="Select Attributes" width="90" x="246" y="210">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="|country|html"/>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="5.1.006" expanded="true" height="76" name="Loop Examples" width="90" x="380" y="210">
            <parameter key="parallelize_example_process" value="true"/>
            <process expanded="true" height="587" width="911">
              <operator activated="true" class="extract_macro" compatibility="5.1.006" expanded="true" height="60" name="Extract Macro" width="90" x="112" y="30">
                <parameter key="macro" value="country"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="country"/>
                <parameter key="example_index" value="%{example}"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.1.006" expanded="true" height="76" name="Replace" width="90" x="246" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="html"/>
                <parameter key="replace_what" value="&lt;table[^&gt;]+&gt;"/>
                <parameter key="replace_by" value="&lt;table&gt;&lt;caption&gt;%{country}&lt;/caption&gt;"/>
              </operator>
              <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Replace" to_port="example set input"/>
              <connect from_op="Replace" from_port="example set output" to_port="example set"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
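
One plausible explanation for the "everything is Afghanistan" result, not confirmed anywhere in the thread, is that Replace inside the loop rewrites the html attribute of every example on each iteration, not only the current one. The first iteration would then caption all tables with the first country, and because the replacement leaves a bare <table>, the pattern <table[^>]+> no longer matches on later iterations. The Python sketch below only simulates that assumed behaviour; the two sample rows and the function name are hypothetical.

```python
import re

PATTERN = re.compile(r"<table[^>]+>")

# Hypothetical stand-in for the example set (country, html).
rows = [
    {"country": "Afghanistan", "html": '<table border="1"><tr></tr></table>'},
    {"country": "Albania", "html": '<table border="1"><tr></tr></table>'},
]


def loop_with_columnwide_replace(rows):
    """Simulate a loop whose inner replace touches every row (assumed behaviour)."""
    rows = [dict(r) for r in rows]  # work on a copy
    for current in range(len(rows)):
        # Analogue of Extract Macro: read the country of the current example.
        macro = rows[current]["country"]
        replacement = f"<table><caption>{macro}</caption>"
        # Analogue of Replace: applied to the whole column, not just one row.
        for r in rows:
            r["html"] = PATTERN.sub(lambda m: replacement, r["html"])
    return rows


result = loop_with_columnwide_replace(rows)
for r in result:
    print(r["country"], "->", r["html"])
```

In this simulation both rows end up with the Afghanistan caption, which matches the symptom described above.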
  • d1m0s
    d1m0s New Altair Community Member
    Thanks, Andrew. I'll try your method now.
  • d1m0s
    d1m0s New Altair Community Member
    The Generate Attributes operator did the job. Thanks to everyone for the ideas.
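
The thread does not show the expression that was used in Generate Attributes. The general idea of a row-wise replacement, where each row's replacement text is built from that same row's country value, can be sketched in Python as follows. This is an illustration of the technique only, not the RapidMiner expression; the sample rows and the function name `add_caption` are assumptions.

```python
import re

# Asterisk variant so that a bare <table> is matched as well.
TABLE_OPEN = re.compile(r"<table[^>]*>")


def add_caption(html, country):
    """Replace each opening table tag with <table><caption>country</caption>."""
    caption = f"<table><caption>{country}</caption>"
    # A function replacement avoids backslash handling in the country name.
    return TABLE_OPEN.sub(lambda m: caption, html)


# Hypothetical stand-in for the example set (country, html).
rows = [
    {"country": "Afghanistan", "html": '<table border="1"><tr></tr></table>'},
    {"country": "Albania", "html": "<table><tr></tr></table>"},
]

# Each row is rewritten using only its own country value.
tagged = [{**r, "html": add_caption(r["html"], r["country"])} for r in rows]
for r in tagged:
    print(r["html"])
```

Because the replacement is computed per row, no loop counter or macro is needed, and the first country can never leak into the other rows.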