(Solved) Removing tags from extracted data

User: "jx820"
New Altair Community Member
Updated by Jocelyn
I'm very new to this and just getting started with a scraping process. It doesn't serve a real purpose; I'm just playing around to learn. My process was originally based on Neil McGuigan's tutorials on the Vancouver Data Blog, but it has grown a bit as I try new things.

Currently I'm crawling with the Process Documents from Web operator, with Extract Information as a subprocess. I'm querying 9 attributes with XPath queries. Finally, I use Write Excel to output the data into a spreadsheet. All of that works fine.

The problem is that the extracted information contains HTML tags, specifically H1 and TD tags, and I can't find a way to remove them. I've tried the Extract Content, Remove Document Parts, and Replace operators. So far nothing has worked.

This is what a typical result looks like:
<td xmlns="http://www.w3.org/1999/xhtml" colspan="1" rowspan="1">33</td>
But all I need is the 33.



Here's the XML behind my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
    <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
    <process expanded="true" height="620" width="435">
      <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
        </list>
        <parameter key="max_pages" value="6"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="5000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
        <parameter key="parallelize_process_webpage" value="true"/>
        <process expanded="true" height="620" width="433">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1"/>
              <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]"/>
              <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]"/>
              <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]"/>
              <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]"/>
              <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]"/>
              <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]"/>
              <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]"/>
              <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
        <parameter key="excel_file" value="C:\Users\Public\Documents\Rapidminer Repository\Results\Results.xls"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="18"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I checked the FAQ and the tutorials and searched the forums, but I haven't found anything. Any suggestions?

    User: "Nils_Woehler"
    New Altair Community Member
    Hi,

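    you can append the XPath text() node test to each query, so that only the text inside the matched element is returned instead of the element itself: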


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
        <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
        <process expanded="true" height="620" width="435">
          <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
            <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
            </list>
            <parameter key="max_pages" value="6"/>
            <parameter key="max_depth" value="4"/>
            <parameter key="domain" value="server"/>
            <parameter key="delay" value="5000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
            <process expanded="true" height="620" width="433">
              <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1/text()"/>
                  <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]/text()"/>
                  <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]/text()"/>
                  <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]/text()"/>
                  <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]/text()"/>
                  <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]/text()"/>
                  <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]/text()"/>
                  <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]/text()"/>
                  <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
            <parameter key="excel_file" value="C:\Users\nwoehler\Desktop\Results.xls"/>
          </operator>
          <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
          <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="18"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
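
    As a minimal sketch, the only difference from the original process is the /text() suffix on each XPath query. The Age query is shown here before and after; the other eight queries follow the same pattern. The comments are added for illustration and are not part of the exported process. In XPath, text() selects only the text nodes that are direct children of the matched element, so this assumes the value sits directly inside the td, as in the 33 example.

    ```xml
    <!-- Before: selects the whole td element, so the result includes its markup -->
    <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]"/>

    <!-- After: /text() selects only the text node inside that td -->
    <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]/text()"/>
    ```
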
    Best,
    Nils
    User: "jx820"
    New Altair Community Member
    OP
    That worked perfectly, and it was much easier than expected. Thank you.