"Crawling Google search results"

geschwader
geschwader New Altair Community Member
edited November 5 in Community Q&A
Hi. This forum helped me with my previous task (http://rapid-i.com/rapidforum/index.php/topic,3446), so I hope you can help me now too.

Here is what I want to do. I have Google search results for the query "Putin" (just as an example) with the option "show results of the past 24 hours":
http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713
Now I want to retrieve all the results with the Crawl Web operator. The task looked quite simple to me, but what I did hasn't worked  :-[
So I put http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713 as the starting URL. Then I entered several crawling rules:
  • follow_link_with_matching_url http://www.google.com.ua/url?sa=t&source=web&cd= (as it's the constant part of all individual result links)
  • follow_link_with_matching_url http://www.google.com.ua/search?q=Putin&hl=en&safe=off&biw=1280&bih=713&tbs=qdr:d&prmd=imvnsul&ei=n7GMTpquCIeb-gaHlPDhCg&start= (as it's the constant part of all result list pages)
  • store_with_matching_content Putin (to avoid pages with no relevant content)
The whole process code is the following:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
   <process expanded="true" height="-20" width="-50">
     <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="108" y="135">
       <parameter key="url" value="http://www.google.com.ua/search?q=Putin&amp;hl=en&amp;safe=off&amp;prmd=imvnsul&amp;source=lnt&amp;tbs=qdr:d&amp;sa=X&amp;ei=iLGMTr-wN6On0QXioM3pBQ&amp;ved=0CA0QpwUoAg&amp;biw=1280&amp;bih=713"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_url" value="http://www.google.com.ua/url?sa=t&amp;source=web&amp;cd="/>
         <parameter key="store_with_matching_url" value="http://www.google.com.ua/search?q=Putin&amp;hl=en&amp;safe=off&amp;biw=1280&amp;bih=713&amp;tbs=qdr:d&amp;prmd=imvnsul&amp;ei=n7GMTpquCIeb-gaHlPDhCg&amp;start="/>
         <parameter key="store_with_matching_content" value="Putin"/>
       </list>
       <parameter key="write_pages_into_files" value="false"/>
       <parameter key="add_pages_as_attribute" value="true"/>
       <parameter key="max_depth" value="100"/>
       <parameter key="delay" value="1000000"/>
       <parameter key="max_page_size" value="1000000"/>
       <parameter key="user_agent" value="Opera"/>
       <parameter key="obey_robot_exclusion" value="false"/>
       <parameter key="really_ignore_exclusion" value="true"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
A couple of seconds passed, and no pages were retrieved. What's the problem?

Answers

  • geschwader
    geschwader New Altair Community Member
    UP! I think the problem is that Google blocks requests from RapidMiner. Is there any way to avoid this?
  • jforr
    jforr New Altair Community Member
    Has anyone found a solution to this problem? Alternatively, is there a way to extract the link URLs from a Google search by some other means?
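One thing that may be worth checking, separately from whether Google blocks the crawler: as far as I know, the values of the Crawl Web crawling rules are treated as regular expressions that have to match the whole URL, not as literal URL prefixes. If that holds, the rules in the posted process can never match, because "?" and "." are regex metacharacters and nothing in the pattern covers the rest of the URL after "cd=" or "start=". Note also that the posted XML uses store_with_matching_url for the second pattern, while the description says follow_link_with_matching_url. The fragment below is only a sketch of how the two URL rules might look under that assumption. The patterns are my own guesses, not tested against Google, and they deliberately skip the ei value, which looks session-specific. The content rule is left out of the sketch, and the XML comments are only there for explanation and can be dropped.

```xml
<list key="crawling_rules">
  <!-- Assumption: values are regular expressions matched against the full URL. -->
  <!-- Individual result links: escape "." and "?", then allow any remainder. -->
  <parameter key="follow_link_with_matching_url" value="http://www\.google\.com\.ua/url\?.*"/>
  <!-- Further result list pages: any search URL that carries a start= offset. -->
  <parameter key="follow_link_with_matching_url" value="http://www\.google\.com\.ua/search\?.*start=.*"/>
</list>
```

This would replace the first two entries of the crawling_rules list in the process above. It does nothing about blocking on Google's side, so an empty result is still possible even with matching rules.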