"Crawling Google search results"

geschwader · October 2011

Hi. This forum helped me with my previous task (http://rapid-i.com/rapidforum/index.php/topic,3446), so I hope you can help me now too.

Here is what I want to do. I have Google search results for the query "Putin" (just as an example), restricted with the option "show results of the past 24 hours":
http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713
Now I want to retrieve all the results with the Crawl Web operator. The task looked quite simple to me, but what I tried hasn't worked :-[
So, I put http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713 as the starting URL. Then I entered several crawling rules:

follow_link_with_matching_url http://www.google.com.ua/url?sa=t&source=web&cd= (as it's the constant part of all individual result links)
follow_link_with_matching_url http://www.google.com.ua/search?q=Putin&hl=en&safe=off&biw=1280&bih=713&tbs=qdr:d&prmd=imvnsul&ei=n7GMTpquCIeb-gaHlPDhCg&start= (as it's the constant part of all result list pages)
store_with_matching_content Putin (to avoid pages with no relevant content)
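
One thing worth checking with rules like these: if Crawl Web reads each rule value as a regular expression that has to match the whole URL (an assumption here, not something confirmed in this thread), then the unescaped `?` and `.` and the missing trailing wildcard could keep the rules from matching any real link. Below is a minimal Python sketch of that check. The sample link and the helper name `rule_matches` are made up for illustration; only the prefix up to `cd=` comes from the rule above.

```python
import re

# Hypothetical result link; only the prefix up to "cd=" is taken from the post.
sample_link = (
    "http://www.google.com.ua/url?sa=t&source=web&cd=1"
    "&url=http%3A%2F%2Fexample.com%2F"
)

# Rule value exactly as entered in the process (a bare prefix, nothing escaped).
raw_rule = "http://www.google.com.ua/url?sa=t&source=web&cd="

# Same prefix with regex metacharacters escaped, plus ".*" for the variable tail.
escaped_rule = re.escape(raw_rule) + ".*"


def rule_matches(rule: str, url: str) -> bool:
    """Return True if the rule, read as a regex, matches the entire URL."""
    return re.fullmatch(rule, url) is not None


print(rule_matches(raw_rule, sample_link))      # raw prefix used as a full-match regex
print(rule_matches(escaped_rule, sample_link))  # escaped prefix followed by .*
```

If the two results differ, it may be worth trying escaped rule values (for example `\?` and `\.`, with `.*` at the end) in the crawling rules.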

The complete process XML is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="-20" width="-50">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="108" y="135">
        <parameter key="url" value="http://www.google.com.ua/search?q=Putin&amp;hl=en&amp;safe=off&amp;prmd=imvnsul&amp;source=lnt&amp;tbs=qdr:d&amp;sa=X&amp;ei=iLGMTr-wN6On0QXioM3pBQ&amp;ved=0CA0QpwUoAg&amp;biw=1280&amp;bih=713"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www.google.com.ua/url?sa=t&amp;source=web&amp;cd="/>
          <parameter key="store_with_matching_url" value="http://www.google.com.ua/search?q=Putin&amp;hl=en&amp;safe=off&amp;biw=1280&amp;bih=713&amp;tbs=qdr:d&amp;prmd=imvnsul&amp;ei=n7GMTpquCIeb-gaHlPDhCg&amp;start="/>
          <parameter key="store_with_matching_content" value="Putin"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="max_depth" value="100"/>
        <parameter key="delay" value="1000000"/>
        <parameter key="max_page_size" value="1000000"/>
        <parameter key="user_agent" value="Opera"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

A couple of seconds pass, and no pages are retrieved. What's the problem?

geschwader · October 2011

Bump! I think the problem is that Google blocks requests from RapidMiner. Is there any way to avoid this?

jforr · July 2012

Has anyone found a solution to this problem? Alternatively, is there a way to extract the link URLs from a Google search by some other means?
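
On extracting the link URLs by other means: one option might be to save the results page and pull the targets out of the redirect links offline. Here is a rough Python sketch using only the standard library. It assumes the results use redirect links of the form `/url?...`, as in the rule quoted earlier in the thread, and that the real target sits in a `url` or `q` query parameter (both parameter names are assumptions). The HTML snippet is made up and stands in for a saved page.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ResultLinkParser(HTMLParser):
    """Collect target URLs from anchors that point at a /url redirect."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if parsed.path != "/url":
            return  # not a redirect link, e.g. a "next page" link
        query = parse_qs(parsed.query)
        # Assumed parameter names for the real target; adjust to the saved page.
        for key in ("url", "q"):
            if key in query:
                self.targets.append(query[key][0])
                break


def extract_targets(html: str) -> list:
    """Return the decoded target URLs found in the given HTML text."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.targets


# Made-up snippet standing in for a saved results page.
saved_page = (
    '<a href="http://www.google.com.ua/url?sa=t&amp;source=web&amp;cd=1'
    '&amp;url=http%3A%2F%2Fexample.com%2Fnews">result</a>'
    '<a href="/search?q=Putin&amp;start=10">next</a>'
)
print(extract_targets(saved_page))
```

The resulting list could then be fed to a page-retrieval step instead of crawling the result pages directly.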
