Problem extracting data

Hello I am new to rapidminer. I started out with a simple craiglist scrape. However, I do not get any data back. Can some one please advise?



no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="-20" width="-50">
      <operator activated="true" class="web:process_web" compatibility="5.2.003" expanded="true" height="60" name="Process Documents from Web" width="90" x="36" y="46">
        <parameter key="url" value="http://tampa.craigslist.org/cto"/>
        <list key="crawling_rules"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="domain" value="subtree"/>
        <parameter key="max_page_size" value="10000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
        <process expanded="true" height="171" width="738">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="114" y="24">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Binominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="link" value="//*[@id=&amp;quot;toc_rows&quot;]/p"/>
              <parameter key="price" value="//*[@id=&amp;quot;toc_rows&quot;]/p[2]/span"/>
              <parameter key="location" value="//*[@id=&amp;quot;toc_rows&quot;]/p[2]/span[6]/font"/>
              <parameter key="title" value="/html/body/article/section/h2"/>
              <parameter key="ad body" value="//*[@id=&amp;quot;userbody&quot;]"/>
              <parameter key="postingid" value="/html/body/article/section/p"/>
              <parameter key="email" value="/html/body/article/section/section[1]/small/a"/>
            </list>
            <list key="namespaces">
              <parameter key="postingtitle" value="*[local-name(.) = 'postingtitle']"/>
              <parameter key="body" value="*[local-name(.) = 'userbody']"/>
              <parameter key="email" value="*[local-name(.) = 'small']"/>
            </list>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Find more posts tagged with

AI Studio

Accepted answers

All comments

Hi,

in Extract Information you have ticked "assume html". This is usually a good idea. However, that also means that all html tags are in the h: namespace. If you adjust your XPaths to match for e.g. "h:p" and "h:span" etc. instead of using only "p" and "span" you will get results.

Next time, please be more careful when posting the process xml and follow the instructions in the post linked from my signature - I had a hard time copying it into my RapidMiner instance.

Best regards,
Marius

Quick Links