XPATH commands working in Google docs, but not in Rapidminer

Aj New Altair Community Member
edited November 5 in Community Q&A
Hello

I am trying to extract text in Rapidminer using XPATH.

I am not able to extract the text "(6:4,6:7,3:6)" from the following HTML code.
<div id="event-status-name">
<p id="event-status" class="result">
  <span class="bold" id="event-status-finished">Final result 1:2</span>    (6:4,6<sup>3</sup>:7,3:6)</p></div>

I am able to extract the text "Final result 1:2" because it sits alone inside the span with class="bold" and id="event-status-finished". But the text I want has no tag of its own: it sits inside the div with id="event-status-name", next to the span tag.

I tried many versions to make it work. Some of them are

//div[@id='event-status-name'] and the corresponding rapidminer command
//h:div[@id='event-status-name']/text()

It works fine with XPATH in Google docs, but not in Rapidminer. Also, when I remove "text()" in Rapidminer, the result does not contain the text that I am looking for, i.e. the text area is blank. Any ideas as to why that is?

In the above code, the string "Final Result ..." along with score is displayed. In the following code,

//p[@id='event-status'][@class='result']/text()
only score related text is displayed correctly in XPATH. But, the corresponding rapidminer command
//h:p[@id='event-status'][@class='result']/text()
displays blank space

I even tried the normalize-space function to remove white space:
//fn:normalize-space[//h:p[@id='event-status']/text()]

I tried different other versions like

//span[@class='bold'][@id='event-status-finished']/../text()
->
//h:span[@class='bold'][@id='event-status-finished']/../text()

All of them work fine in Google docs, but not in Rapidminer, i.e. they just display blank space in the corresponding column.

Could someone please help me with this?

Thanks
Aj

Answers

  • Andrew2
    Andrew2 New Altair Community Member
    Hello

    There's a check box called "assume html" that needs to be checked. It's on by default I think so that should be OK but you never know...

    Otherwise it's best to post the XML of your process.

    regards

    Andrew
  • Aj
    Aj New Altair Community Member
    Hi Andrew,

    Thanks a lot for replying.

    I had checked the "assume html" box. Just to be sure, I verified it again and ran the simulation again. But I am getting the same result as described in my previous post. Also, in my previous post, I forgot to mention that the other XPATH commands are working fine both in Google docs and in Rapidminer. Only the attribute "fullScore" is not working. I am guessing that this is because there is another tag, "span", within the div tag, apart from the text I want to extract.

    Below, I am copying the XML that I used. I am still in the process of getting the most important things sorted out, so I did not tidy it up. To make this XML work, please follow these steps. First, enable the element "Crawl Web" and connect it to result. Next, in this element, change the directory to where you want the files to be downloaded. Then, disable the element "Process Documents from Files". After this, run the process and the files will be downloaded.

    In the download directory, remove "0.html" (you can keep it too, but you will get "?" corresponding to null. I just don't want you to think that it is the problem, as that file does not contain the data I want and I am going to filter it out later). After this, disable the "Crawl Web" element, enable the "Process Documents from Files" element and connect it to result. Change the directory to where you have downloaded the files. Run the process. You will see three correct results and one wrong result (blank), corresponding to the attribute "fullScore".

    Thanks a lot for your help,
    Ajay


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <process expanded="true" height="251" width="346">
          <operator activated="false" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="30">
            <parameter key="url" value="http://www.oddsportal.com/search/results/sharapova/"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value=".+sharapova-.+"/>
            </list>
            <parameter key="output_dir" value="/home/Ajay/learnRapidMiner/downloadWebData"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="4"/>
            <parameter key="max_depth" value="1"/>
            <parameter key="delay" value="500"/>
            <parameter key="max_page_size" value="250"/>
            <parameter key="user_agent" value="Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10"/>
          </operator>
          <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="165">
            <list key="text_directories">
              <parameter key="learn" value="/home/Ajay/learnRapidMiner/downloadWebData"/>
            </list>
            <parameter key="file_pattern" value="*.html"/>
            <parameter key="extract_text_only" value="false"/>
            <parameter key="create_word_vector" value="false"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="1"/>
            <parameter key="prune_above_absolute" value="100"/>
            <process expanded="true" height="433" width="490">
              <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="playerNames" value="//h:h1/text()"/>
                  <parameter key="date" value="//h:p[contains(@class,'date')]/text()"/>
                  <parameter key="finalScore" value="//h:span[contains(@id,'event-status-finished')]/text()"/>
                  <parameter key="fullScore" value="//h:span[@class='bold'][@id='event-status-finished']/../text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Aj
    Aj New Altair Community Member
    Hi Andrew,

    I have one more question related to the above described problem.

    Regarding the text "Final result 1:2 (6:4,63:7,3:6)" in the file 1.html: how can I get rid of the superscript tag and its related text using "not" or some other command in XPATH, when it is embedded as part of the remaining text?

    Regards,
    Ajay
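
One XPath property bears on the superscript question: `text()` selects only the text nodes that are direct children of the context element, so the `3` inside `<sup>` is not among the results of `//h:p[@id='event-status']/text()`. The sketch below illustrates the difference using only the Python standard library. The trimmed markup and the helper name `own_text` are illustrative assumptions, and ElementTree stands in for RapidMiner's XPath engine, which may behave differently.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the <p> element from the question (assumed well-formed).
P_HTML = (
    '<p id="event-status" class="result">'
    '<span class="bold" id="event-status-finished">Final result 1:2</span>'
    ' (6:4,6<sup>3</sup>:7,3:6)</p>'
)


def own_text(elem):
    """Join the text nodes that are direct children of elem.

    This mirrors what the XPath step text() selects: the element's leading
    text plus the text that follows each child (ElementTree calls it "tail").
    Text inside the children themselves is skipped.
    """
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    # Collapse runs of white space, similar in spirit to normalize-space().
    return " ".join("".join(parts).split())


p = ET.fromstring(P_HTML)

# String value of the whole element: includes the span text and the "3" in <sup>.
full_text = " ".join("".join(p.itertext()).split())

# Direct text children only: the superscript content is gone.
score_only = own_text(p)

print(full_text)
print(score_only)
```

In XPath 2.0 the same join can be written as `string-join(//h:p[@id='event-status']/text(), '')`; whether the Extract Information operator accepts XPath 2.0 functions is not confirmed in this thread.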
  • Andrew2
    Andrew2 New Altair Community Member
    Hello

    Quite hard questions and I can't guarantee to be able to answer them but I will have a look later :)

    regards

    Andrew
  • Andrew2
    Andrew2 New Altair Community Member
    Hello Ajay

    I'm not quite sure if I am getting the same results as you are. However, if you uncheck the "use file extension as file type" on the "process documents" operator and explicitly select "txt" in the "content type" drop down, you might find that the behaviour changes.

    My hypothesis is that the html content is being processed so the XPath is working on something slightly different thereby leading to unexpected results.

    regards

    Andrew
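
One way to make this hypothesis concrete: in XPath 1.0, converting a node-set to a string uses only the first node in document order, and in the snippet from the question the first text child of `<p>` is the white space before `<span>`. If the extraction step keeps only that first node (an assumption about RapidMiner, not something confirmed in this thread), a blank result would follow. The sketch below lists the direct text nodes of `<p>` with the Python standard library. The helper name `direct_text_nodes` is hypothetical, and the markup RapidMiner actually sees after HTML cleanup may differ.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, including the line break and indentation
# between <p> and <span>.
SNIPPET = (
    '<div id="event-status-name">\n'
    '<p id="event-status" class="result">\n'
    '  <span class="bold" id="event-status-finished">Final result 1:2</span>'
    '    (6:4,6<sup>3</sup>:7,3:6)</p></div>'
)


def direct_text_nodes(elem):
    """Return the text nodes that are direct children of elem, in document order."""
    nodes = []
    if elem.text is not None:
        nodes.append(elem.text)        # text before the first child element
    for child in elem:
        if child.tail is not None:
            nodes.append(child.tail)   # text that follows each child element
    return nodes


root = ET.fromstring(SNIPPET)
p = root.find("./p[@id='event-status']")
nodes = direct_text_nodes(p)

# Show each text node with its position; repr() makes white space visible.
for i, node in enumerate(nodes):
    print(i, repr(node))
```

The first node printed contains only white space, and the score is split across the second and third nodes, so a tool that returns a single node would not return the full score.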
  • Aj
    Aj New Altair Community Member
    Hi Andrew,

    That works!! Thank you so much. I would never have guessed it and had already spent a lot of time on it.

    Best Regards,
    Ajay
  • Andrew2
    Andrew2 New Altair Community Member
    Good :)

    I think I will submit a feature request to get the raw XML that is being input to the extract operator to be dumped out for inspection.

    Andrew
  • Miguel_B_scher
    Miguel_B_scher New Altair Community Member
    I think I am a little bit late. But have you tried other XPath commands?

    Like:
    //h:p[@class='result']/text()

    or
    //h:p[@id='event-status']/text()

    Because if you use an XPath command that selects the div, you also have to select its child nodes to get your "real" text.
    Something like this:

    //h:div[@id='event-status-name']/h:p[@id='event-status']/text()
    or
    //h:div[@id='event-status-name']/h:p[@class='result']/text()


    I did not test these XPath commands because I don't have your project ;)
    In my last RM project I worked a lot with XPath and didn't have such a problem.
    Perhaps you could send the XML file of your project that uses the XPath command, so we can get a better look.
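
The element paths suggested above can be checked outside RapidMiner with a short sketch. It assumes that the `h:` prefix is bound to the XHTML namespace `http://www.w3.org/1999/xhtml` and that the page is well-formed XHTML; both are assumptions about how RapidMiner prepares the document. Python's ElementTree does not support the trailing `/text()` step, so the queries stop at the `<p>` element.

```python
import xml.etree.ElementTree as ET

# Assumed binding for the "h" prefix used in the RapidMiner queries.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Minimal XHTML page built around the snippet from the question.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="event-status-name">'
    '<p id="event-status" class="result">'
    '<span class="bold" id="event-status-finished">Final result 1:2</span>'
    ' (6:4,6<sup>3</sup>:7,3:6)</p></div>'
    '</body></html>'
)

# The suggested paths, without the final /text() step.
QUERIES = [
    ".//h:p[@class='result']",
    ".//h:p[@id='event-status']",
    ".//h:div[@id='event-status-name']/h:p[@id='event-status']",
    ".//h:div[@id='event-status-name']/h:p[@class='result']",
]

root = ET.fromstring(PAGE)

# Each prefixed query should locate the same <p> element.
matches = [root.find(query, XHTML_NS) for query in QUERIES]

# Without the prefix the element is not found, because it lives in a namespace.
unprefixed = root.find(".//p[@id='event-status']")

for query, match in zip(QUERIES, matches):
    print(query, "->", None if match is None else match.get("id"))
print("unprefixed ->", unprefixed)
```

In this sketch all four paths reach the same `<p>`, so the choice between them does not change which element is selected. The unprefixed lookup returning nothing shows why expressions that work in Google docs need the `h:` prefix once the document is namespaced XHTML.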