"crawler and emailextractor"

xtraplus
xtraplus New Altair Community Member
edited November 5 in Community Q&A
Hi,

My program reads a list of urls from excel crawls these and should extract something. But whatever X-Query I try nothing gets displayed in the results. The log says that the results are saved, but there is anything.

This is my code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
   <process expanded="true" height="415" width="815">
     <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="112" y="75">
       <parameter key="excel_file" value="C:\Dokumente und Einstellungen\Home\Eigene Dateien\Rapidminer\test.xls"/>
       <parameter key="imported_cell_range" value="A1:A3"/>
       <parameter key="first_row_as_names" value="false"/>
       <list key="annotations">
         <parameter key="0" value="Name"/>
       </list>
       <list key="data_set_meta_data_information">
         <parameter key="0" value="url.true.nominal.attribute"/>
       </list>
       <parameter key="read_not_matching_values_as_missings" value="false"/>
     </operator>
     <operator activated="true" class="loop_examples" compatibility="5.2.006" expanded="true" height="112" name="Loop Examples" width="90" x="581" y="75">
       <process expanded="true" height="460" width="709">
         <operator activated="true" class="extract_macro" compatibility="5.2.006" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
           <parameter key="macro" value="weburl"/>
           <parameter key="macro_type" value="data_value"/>
           <parameter key="attribute_name" value="url"/>
           <parameter key="example_index" value="%{example}"/>
         </operator>
         <operator activated="true" class="web:process_web" compatibility="5.2.000" expanded="true" height="60" name="Process Documents from Web" width="90" x="179" y="120">
           <parameter key="url" value="%{weburl}"/>
           <list key="crawling_rules">
             <parameter key="follow_link_with_matching_url" value=".+onta.+|.+about.+|.+info.+|.+suppo.+|.+impre.+"/>
           </list>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="max_depth" value="3"/>
           <parameter key="delay" value="500"/>
           <parameter key="max_threads" value="5"/>
           <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0 "/>
           <process expanded="true" height="605" width="974">
             <operator activated="true" class="text:extract_information" compatibility="5.2.002" expanded="true" height="60" name="Extract Information" width="90" x="380" y="255">
               <parameter key="query_type" value="XPath"/>
               <list key="string_machting_queries"/>
               <list key="regular_expression_queries"/>
               <list key="regular_region_queries"/>
               <list key="xpath_queries">
                 <parameter key="mail" value="//h:@href"/&gt;
               </list>
               <list key="namespaces"/>
               <list key="index_queries"/>
             </operator>
             <connect from_port="document" to_op="Extract Information" to_port="document"/>
             <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
         <connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
         <connect from_op="Process Documents from Web" from_port="example set" to_port="output 2"/>
         <portSpacing port="source_example set" spacing="0"/>
         <portSpacing port="sink_example set" spacing="0"/>
         <portSpacing port="sink_output 1" spacing="72"/>
         <portSpacing port="sink_output 2" spacing="0"/>
         <portSpacing port="sink_output 3" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Read Excel" from_port="output" to_op="Loop Examples" to_port="example set"/>
     <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
Everything work, but the extraction

Can you help me please?

Regards
Ben

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi Ben,

    is the Extract Information operator executed at all (try setting a breakpoint there)? Your Process Documents operator only has a follow rule, for the pages to be processed you probably also have to specify a store-rule. Additionally, you have connected the second output of the Loop Operator to the process output, however on the inside of the loop you connected the results to the 1st and 3rd output.

    Best,
    Marius
  • xtraplus
    xtraplus New Altair Community Member
    Hi Marius,

    thank you for your reply. I corrected the connections, add a store rule and tried the breakpoints at the extractor. It seems to be reached.

    I get results for default XPath without any query specified.

    When I try as query 
    //h:@href
    it does not work and the operator gets yellow

    whats wrong with my query?

    I realized in the results display each source ulr is its own example set.

    So I think I would have problems storing results of several urls in one file.

    Do you know what I mean?

    I think I have to somehow merge the results so that they get stored as one example set

    But how?
  • MariusHelf
    MariusHelf New Altair Community Member
    If your XPath should extract the href attribute from the a tags, it should look similar to this:
    //h:a/@href
    if I got the syntax correctly on my mind.
  • xtraplus
    xtraplus New Altair Community Member
    thanks

    what do you recommend for merging the results in 1 data view list