XPath returns no results

mimesis (New Altair Community Member)
edited November 5 in Community Q&A
Hi,
I am trying to grab the text of abstracts from a journal using XPath in the Cut Document operator. I downloaded a test set, saved it as HTML, and am using Process Documents from Files with a Cut Document operator nested inside. The site I am testing is here:

http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9221.2010.00797.x/abstract

Using Firebug in Firefox, I inspected the element and determined that the XPath can be written as either:

/html/body/div[3]/div/div[5]/div[4]/div[3]/div/div[2]/p
and,
//div[@class='para']

I simplified the first one to /div/div/div/div/div/div/div[2]/p. I tested both XPath queries online using Google Docs, and the extraction worked fine. However, I have not been able to successfully replicate the result in RapidMiner. Am I missing something in the namespace? I have tried various versions of the XPath syntax and the namespace settings. Note that I have run an Extract Content sequence in parallel with a port multiplier and have had no problems getting the text tokenized, turned into word vectors, etc. Here is the XML for just a simple Cut Document nested inside a Process Documents from Files chain.

My XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="655" width="918">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="149" y="116">
        <list key="text_directories">
          <parameter key="all-pp" value="/Users/williamfchiu/Desktop/politicalpsych_test"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <process expanded="true" height="637" width="867">
          <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="112" y="210">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="fulltext" value="//div/div/div/div/div/div/div[2]/p"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="655" width="919">
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

William

Answers

  • awchisholm (New Altair Community Member)
    Hello,

    Try this: //h:div[2]/h:p

    The h: is the HTML namespace prefix.
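
    For example, applying the same prefixing to the full path from the original post would give something like this (an illustration of the prefixing rule only, not a tested query):

    /h:html/h:body/h:div[3]/h:div/h:div[5]/h:div[4]/h:div[3]/h:div/h:div[2]/h:p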

    regards

    Andrew
  • Miguel_B_scher (New Altair Community Member)
    Hello mimesis.

    awchisholm is right: you have to use the h: namespace prefix.

    Try this XPath command:

    //h:div[@class="para"]/text()

    This should give you the text in the abstract box of your site.
    Remember that you can't just use the XPath expressions that Firebug generates in Firefox as-is; you need to add some things like the namespace prefix.
    Now you should be able to get some results using our examples.
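
    In the process XML above, the query would go into the xpath_queries list roughly like this (a sketch based on the posted process; everything else stays as it is):

    <list key="xpath_queries">
      <parameter key="fulltext" value="//h:div[@class='para']/text()"/>
    </list>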

    Greetings
    Miguel
  • mimesis (New Altair Community Member)
    Greetings Andrew and Miguel,

    I tried Andrew's suggestion and my first few tries didn't seem to work. I will try the second version and see what happens.

    Also, do you (or does anyone) happen to know why I can't connect the output of Cut Document to the Extract Content or Tokenize operators? The output port seems to be "doc", but the latter two report an error saying that an IOObjectCollection was delivered rather than a Document.

    William
  • colo (New Altair Community Member)
    Hi William,

    if you left the process as you posted it above, there will be no results even with correct XPath queries: you forgot to connect the inner ports of the "Cut Document" operator. Once you connect them, every document part selected by the XPath query will be delivered to the results collection. Since every part is treated as a single document, the overall result of "Cut Document" is a collection of documents. Operators like "Extract Content" or "Tokenize" only work on a single document, so to use them you can either place them inside the "Cut Document" operator to process every document part, or loop over the elements of the collection afterwards. A sketch of the missing connection follows below.
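
    For reference, the corrected inner process of "Cut Document" would look roughly like this (a sketch assuming the inner port names from the posted XML, i.e. the segment source and the document 1 sink):

    <process expanded="true">
      <connect from_port="segment" to_port="document 1"/>
      <portSpacing port="source_segment" spacing="0"/>
      <portSpacing port="sink_document 1" spacing="0"/>
    </process>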

    Regards
    Matthias
  • mimesis (New Altair Community Member)
    Hi Matthias,

    I actually discovered what you said on my own and have been successful in extracting content and performing text processing. In fact, last night I successfully ran the process (modified from what I posted at the beginning) on a dataset of 500 records. I think I am ready to scale it up some more. Thanks to all for your help.

    Best regards,

    William