No results with "Process Documents from Web"

tu_162092
tu_162092 New Altair Community Member
edited November 5 in Community Q&A
Hello,

I have a problem with the "Process Documents from Web" operator. No matter how I configure it, no URLs are ever found, even though the process still worked a few months ago and the URL structure hasn't changed.

I tried it with different domains, but unfortunately RapidMiner never finds any URLs.

What could be the reason? It would be great if someone could help me :)!

Greetings
Tim

Best Answer

  • kayman
    kayman New Altair Community Member
    Answer ✓
    You could use the Loop operator for this specific case.
    Your site has 45 listing pages (670 links, 15 per page), so set up a loop with 45 iterations and do the work once per iteration.

    The attached example loads the next page on each iteration (url/s[page_number]), gets all the links (<a> tags) and keeps only the ones needed.
    This way you can build up a list of all the links, and that list can then be used to start crawling the other pages.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
            <parameter key="number_of_iterations" value="45"/>
            <parameter key="iteration_macro" value="page"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="313" y="34">
                <parameter key="url" value="https://www.gelbeseiten.de/reisebueros/berlin/s%{page}"/>
                <list key="query_parameters"/>
                <list key="request_properties"/>
              </operator>
              <operator activated="true" breakpoints="after" class="subprocess" compatibility="9.0.003" expanded="true" height="82" name="Extract Links" width="90" x="447" y="34">
                <process expanded="true">
                  <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="45" y="136">
                    <list key="replace_dictionary">
                      <parameter key="\r?\n" value=" "/>
                      <parameter key="[ ]+" value=" "/>
                    </list>
                  </operator>
                  <operator activated="true" class="text:keep_document_parts" compatibility="8.1.000" expanded="true" height="68" name="Keep Document Parts (2)" width="90" x="45" y="34">
                    <parameter key="extraction_regex" value="(?s)&lt;a .*?&gt;"/>
                  </operator>
                  <operator activated="true" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="179" y="34"/>
                  <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="313" y="34">
                    <parameter key="query_type" value="Regular Region"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries"/>
                    <list key="regular_region_queries">
                      <parameter key="link" value="&lt;.&gt;"/>
                    </list>
                    <list key="xpath_queries"/>
                    <list key="namespaces"/>
                    <list key="index_queries"/>
                    <list key="jsonpath_queries"/>
                    <process expanded="true">
                      <connect from_port="segment" to_port="document 1"/>
                      <portSpacing port="source_segment" spacing="0"/>
                      <portSpacing port="sink_document 1" spacing="0"/>
                      <portSpacing port="sink_document 2" spacing="0"/>
                    </process>
                  </operator>
                  <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="447" y="34">
                    <parameter key="text_attribute" value="link"/>
                  </operator>
                  <operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
                    <parameter key="attribute_filter_type" value="single"/>
                    <parameter key="attribute" value="link"/>
                  </operator>
                  <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">
                    <list key="filters_list">
                      <parameter key="filters_entry_key" value="link.does_not_contain.www\.gelbeseiten\.de"/>
                      <parameter key="filters_entry_key" value="link.contains.class=&quot;link&quot;"/>
                    </list>
                  </operator>
                  <operator activated="true" class="replace" compatibility="9.0.003" expanded="true" height="82" name="Replace" width="90" x="849" y="34">
                    <parameter key="attribute_filter_type" value="single"/>
                    <parameter key="attribute" value="link"/>
                    <parameter key="replace_what" value="^.*?href=&quot;(.*?)&quot;.*"/>
                    <parameter key="replace_by" value="$1"/>
                  </operator>
                  <connect from_port="in 1" to_op="Replace Tokens" to_port="document"/>
                  <connect from_op="Replace Tokens" from_port="document" to_op="Keep Document Parts (2)" to_port="document"/>
                  <connect from_op="Keep Document Parts (2)" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
                  <connect from_op="Combine Documents" from_port="document" to_op="Cut Document" to_port="document"/>
                  <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
                  <connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
                  <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
                  <connect from_op="Filter Examples" from_port="example set output" to_op="Replace" to_port="example set input"/>
                  <connect from_op="Replace" from_port="example set output" to_port="out 1"/>
                  <portSpacing port="source_in 1" spacing="0"/>
                  <portSpacing port="source_in 2" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                  <portSpacing port="sink_out 2" spacing="0"/>
                </process>
              </operator>
              <connect from_op="Get Page" from_port="output" to_op="Extract Links" to_port="in 1"/>
              <connect from_op="Extract Links" from_port="out 1" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
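    For anyone who wants to sanity-check the same logic outside RapidMiner, here is a rough Python sketch of what the process above does. The URL pattern, the <a> tag extraction and the two link filters are taken from the XML; the use of the requests library and the simplification of the filters to plain substring checks are assumptions of the sketch, not part of the original process.

    import re
    import requests  # assumed available; any HTTP client would do

    # Same pattern as the %{page} macro in the Get Page operator above
    BASE = "https://www.gelbeseiten.de/reisebueros/berlin/s{page}"

    def collect_profile_links(pages=45):
        """Mimic the Loop + Get Page + Extract Links subprocess: one request per listing page."""
        links = []
        for page in range(1, pages + 1):
            html = requests.get(BASE.format(page=page), timeout=30).text
            # Keep Document Parts: grab every opening <a ...> tag
            for tag in re.findall(r"(?s)<a .*?>", html):
                # Filter Examples: keep anchors with class="link" that do not repeat the own domain
                if 'class="link"' in tag and "www.gelbeseiten.de" not in tag:
                    # Replace: reduce the anchor tag to its href value
                    match = re.search(r'href="(.*?)"', tag)
                    if match:
                        links.append(match.group(1))
        return links

    if __name__ == "__main__":
        print(collect_profile_links(pages=1))  # try a single listing page first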
    


Answers

  • kayman
    kayman New Altair Community Member
    Are you behind a firewall or proxy?
    As far as I know the operator still works as before (at least for me :-)), so unless your network has changed it should be OK.

    Can you still access the Marketplace? That is usually a good indication that you can at least reach the internet through RapidMiner. If not, check Preferences -> Proxy.

    Another possible scenario is that your site changed protocol and no longer uses http but https. So while the URL might still look the same at first glance, your request might get blocked.
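    To rule out the proxy / protocol angle outside of RapidMiner, a quick check along these lines can help. This is a minimal sketch assuming Python with the requests library is available; the test URL is only an example.

    import requests  # assumed available

    url = "http://www.gelbeseiten.de/reisebueros/berlin"  # example URL, deliberately plain http
    try:
        resp = requests.get(url, timeout=15, allow_redirects=True)
        print(resp.status_code, resp.url)          # final status and URL after redirects
        for hop in resp.history:                   # an http -> https redirect shows up here
            print("redirected:", hop.status_code, hop.url)
    except requests.RequestException as exc:
        print("request failed:", exc)              # hints at proxy / firewall / TLS problems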
  • tu_162092
    tu_162092 New Altair Community Member

    Thanks for your answer.

    I am not using a proxy, and the connection to the Marketplace works. But no matter which URL I try to crawl, no URLs are ever found.

    If I check the URL structure against my old processes and it hasn't changed, they should still work, right?

    I really can't explain why.
  • kayman
    kayman New Altair Community Member
    Could you share your process? There is indeed no reason why it shouldn't work, but without more details it's hard to know where to look.
  • tu_162092
    tu_162092 New Altair Community Member
    I can't post pictures or links because I'm still new in the community. Can I send you pictures by e-mail?
  • tu_162092
    tu_162092 New Altair Community Member
    Ok here are the pictures....
  • kayman
    kayman New Altair Community Member
    Could you try using .* as the pattern?
    Your current expression is /*, which in a regex just means "zero or more slashes", so it won't match whatever comes after the slash.

    Using /.* (dot star) you say "match anything that follows the slash, however much there is."

    One thing I always recommend is to first fetch at least the main page, or one of the links directly, before trying the crawl logic. That way you know up front that you can retrieve the page at all.
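    A quick way to see the difference between the two patterns (plain Python re, only to illustrate the pattern semantics, not how RapidMiner evaluates crawling rules internally; the sample URLs come from this thread and the exact rule in the screenshots is not shown here):

    import re

    listing = "https://www.gelbeseiten.de/reisebueros/berlin"
    profile = "https://www.gelbeseiten.de/gsbiz/some-profile"  # illustrative path, real IDs differ

    # "/*" only matches zero or more slashes, so nothing after the slash is covered
    print(re.fullmatch(r".*reisebueros/*", listing))   # match: the URL ends right after "reisebueros"
    print(re.fullmatch(r".*gsbiz/*", profile))         # None: "some-profile" follows the slash

    # "/.*" matches a slash followed by anything, which is what the crawl rule needs
    print(re.fullmatch(r".*gsbiz/.*", profile))        # match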
  • tu_162092
    tu_162092 New Altair Community Member
    URL: https://www.gelbeseiten.de/reisebueros/berlin
    URL i want to crawl: https://www.gelbeseiten.de/gsbiz/

    I can't post pictures because I'm still new in the community, so here is a Google Drive link with pictures of the process.

    https://drive.google.com/drive/folders/1PWt9zS2azBoR5DAhwI8Y17zetBTauUJ1?usp=sharing
  • sgenzer
    sgenzer
    Altair Employee
    Hi @tu_162092, I am sorry about the permissions, but we are getting an increasing number of spammers here and this is the only way to block them. If you have more issues, please send me a DM.

    Scott

  • kayman
    kayman New Altair Community Member
    Hi @tu_162092, did you try my suggestion (using .* instead of *)? Your screenshots suggest otherwise.
  • tu_162092
    tu_162092 New Altair Community Member
    Okay, this is embarrassing :D . You were right, with .* it works. Many thanks for your help!!!!

    But now I have another question that you can help me with.

    How do I adjust the crawling rules so that it follows the links on the entry page, then goes to the next page and does the same there again?

    Greetings
    Tim
  • kayman
    kayman New Altair Community Member
    No problem, it happens to me on a regular basis too :-)

    As for the rules, if I recall correctly this is handled by setting the max crawl depth; try changing it to 3 or more.
    With a depth of 2 it takes the main page and the next level of links, with 3 it also follows the level after that, and so on.
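    As a rough illustration of what crawl depth means, here is a generic depth-limited crawl sketch. This is not RapidMiner's internal logic; the requests library is assumed and the link extraction is deliberately naive.

    import re
    from collections import deque
    import requests  # assumed available

    def crawl(start_url, max_depth=3):
        """Breadth-first crawl: depth 1 is the start page, 2 its links, 3 their links, and so on."""
        seen = {start_url}
        frontier = deque([(start_url, 1)])
        pages = []
        while frontier:
            url, depth = frontier.popleft()
            html = requests.get(url, timeout=30).text
            pages.append((depth, url))
            if depth < max_depth:                       # stop following links once max depth is reached
                for link in re.findall(r'href="(https?://[^"]+)"', html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append((link, depth + 1))
        return pages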

  • tu_162092
    tu_162092 New Altair Community Member
    @kayman Thanks again for your tip. I will test it :).

    You can ignore the posts with the screenshots! They were blocked at first and have now all been unlocked.

    @sgenzer No problem!
  • Telcontar120
    Telcontar120 New Altair Community Member
    As @kayman says, the crawl depth determines how many consecutive pages will be followed. But be careful with this, because it can greatly increase the number of results returned, and this operator can be quite slow. Making the crawling rule more page-specific sometimes helps. You should also decide whether you really need both rules (follow and store; you have both in the screenshot above): typically both are not needed. Storing is useful if you want to keep all the raw HTML files, but if you are processing everything in RapidMiner and converting it into an example set, you usually don't need both.

  • tu_162092
    tu_162092 New Altair Community Member
    @Telcontar120 Thank you for your help.

    I would like RapidMiner to open each profile (e.g. https://www.gelbeseiten.de/gsbiz/f0c65462-3748-48d8-85be-8635269ca1fd) in the Yellow Pages listing (e.g. https://www.gelbeseiten.de/reisebueros/frankfurt-am-main) and pull the information from there. Once it has opened all profiles on one page, it should go to the next listing page and open all profiles there again, until I have the information from all profiles on all pages.

    With the above process I can either open all profiles on one page or open all listing pages. I can't do both.

    Unfortunately I am a real beginner. Can I solve the problem with this operator or do I need other operators?
  • Telcontar120
    Telcontar120 New Altair Community Member
    Did you see the suggested solution that @kayman provided? It seems like it does what you are asking for (or could be adapted pretty easily). Using the Loop with the page number in the URL is definitely a workable approach; I have used it several times in the past myself.

  • tu_162092
    tu_162092 New Altair Community Member

    I'm really sorry I didn't get back to you until now. I have tested @kayman's process and it works. Thank you very much for your help! This is a very cool community.

    I already wish you a Merry Christmas :)!
  • sgenzer
    sgenzer
    Altair Employee
    Very glad to have you here, @tu_162092. Happy holidays to you as well!

    Scott