XPath Queries in RapidMiner

User: "nfoMagic"
Updated by Jocelyn
Hey all,

In the meantime I have spent many hours with RapidMiner trying to get clean text out of an HTML document using XPath.
I built some XPath queries that worked fine in Google Docs (Spreadsheets), but no matter what I do, they won't work properly in RapidMiner :(

The query "//div[@id='review']/div/div/div[2]/div[2]" (on the website "http://www.holidaycheck.de/hotelbewertung-ferienbauernhof+arnoldgut+familie+mayrhofer+unser+erster+super+toller+bauernhofurlaub-ch_hb-id_7215281.html") returns exactly the text I want in Google Docs. When I run the same query in RapidMiner, the attribute generated by the "Extract Information" operator is empty.
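
One possible cause, sketched outside RapidMiner with Python's standard library: if the page is parsed as XHTML, every element lives in the XHTML namespace, so a path step without a prefix matches nothing. The miniature document below is hypothetical (it is not the real HolidayCheck markup), and the assumption that RapidMiner binds the "h" prefix to the XHTML namespace is inferred from the query that does work, not confirmed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; all elements are in the XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="review"><div><div>
  <div>header</div>
  <div><div>rating</div><div>The review text.</div></div>
</div></div></div>
</body></html>"""

# Assumption: "h" is bound to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)

# Prefix on the first step only: the later "div" steps look for elements
# without a namespace, and there are none.
mixed = root.findall(".//h:div[@id='review']/div/div/div[2]/div[2]", NS)

# Prefix on every step: each step now names an XHTML element.
full = root.findall(
    ".//h:div[@id='review']/h:div/h:div/h:div[2]/h:div[2]", NS
)

print(len(mixed), [e.text for e in full])
```

If the same rule applies in RapidMiner, the longer query would need the h: prefix on each step, not just the first one.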

I've tested different queries, all of which worked in Google Docs, but only some of them work in RM.
The query "//h:div[@id='reviewTypeLong']" works in RM, and the returned text contains all the information I need. The problem here is that I haven't found a way to remove the HTML tags yet. I've tried the "Cut Document" and "Remove Document Parts" operators with the RegEx  <[^>]*>  but it doesn't do what it should. Furthermore, I don't know how to use the "Extract Content" operator on attributes, which would let me remove the HTML tags after extracting the useful parts of the website.
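
To check the regular expression itself, independent of any RapidMiner operator, here is a minimal Python sketch. The HTML fragment and the helper name strip_tags are made up for illustration. It also compares the pattern with a variant that has a leading space, since the regex values in the process below start with a space; whether that matters inside RapidMiner is an assumption.

```python
import re
from html import unescape

# Hypothetical fragment, standing in for the extracted attribute value.
fragment = (
    '<div id="reviewTypeLong"><p>Toller <b>Bauernhof</b> '
    "&amp; nette Familie</p></div>"
)

TAG = re.compile(r"<[^>]*>")              # the pattern from the post
TAG_LEADING_SPACE = re.compile(r" <[^>]*>")  # same pattern with a leading space


def strip_tags(html_text):
    """Replace tags with spaces, decode entities, collapse whitespace."""
    no_tags = TAG.sub(" ", html_text)
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()


clean = strip_tags(fragment)
all_tags = len(TAG.findall(fragment))
space_tags = len(TAG_LEADING_SPACE.findall(fragment))

print(clean)
print(all_tags, space_tags)
```

In this fragment the plain pattern finds every tag, while the leading-space variant only finds tags that directly follow a space, so most tags would be left in place.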

This is really driving me crazy, and before I spend several more hours on it, I hope that some experienced "Rapid-Miners" can help me.

Many thanks in advance for any help!


-----------------------------------------------------------------------------------------------------------
My RapidMiner process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="550" width="882">
      <operator activated="true" class="web:get_webpage" compatibility="5.2.003" expanded="true" height="60" name="Get Page" width="90" x="179" y="75">
        <parameter key="url" value="http://www.holidaycheck.de/hotelbewertung-ferienbauernhof+arnoldgut+familie+mayrhofer+unser+erster+super+toller+bauernhofurlaub-ch_hb-id_7215281.html"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="accept_cookies" value="all"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="380" y="30">
        <process expanded="true" height="607" width="935">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information (2)" width="90" x="112" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="possibility1" value="//h:div[@id='reviewTypeLong']"/>
              <parameter key="possibility2" value="//h:div[@id='review']/div/div/div[2]/div[2]"/>
              <parameter key="possibility1TEXT" value="//h:div[@id='reviewTypeLong']//text()"/>
              <parameter key="possibility2TEXT" value="//h:div[@id='review']/div/div/div[2]/div[2]//text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <operator activated="false" class="text:remove_document_parts" compatibility="5.2.004" expanded="true" height="60" name="Remove Document Parts" width="90" x="514" y="120">
            <parameter key="deletion_regex" value=" &lt;[^&gt;]*&gt;"/>
          </operator>
          <operator activated="false" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="313" y="120">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="text" value=" &lt;[^&gt;]*&gt;"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
          <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
