"Crawling Google search results"
geschwader
New Altair Community Member
Hi. This forum helped me with my previous task (http://rapid-i.com/rapidforum/index.php/topic,3446), so I hope you can help me now too.
Here is what I want to do. I have Google search results for the query "Putin" (just as an example) with the option "show results from the past 24 hours":
http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713
Now I want to retrieve all the results with the Crawl Web operator. The task looked quite simple to me, but what I did hasn't worked :-[
So I put http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713 as the starting URL, then entered several crawling rules:
- follow_link_with_matching_url http://www.google.com.ua/url?sa=t&source=web&cd= (as it's the unchanging part of every individual result link)
- follow_link_with_matching_url http://www.google.com.ua/search?q=Putin&hl=en&safe=off&biw=1280&bih=713&tbs=qdr:d&prmd=imvnsul&ei=n7GMTpquCIeb-gaHlPDhCg&start= (as it's the unchanging part of every result-list page)
- store_with_matching_content Putin (to avoid pages with no relevant content)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>Couple of seconds passed... And I have no pages retrieved. What's the problem?
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="-20" width="-50">
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="108" y="135">
<parameter key="url" value="http://www.google.com.ua/search?q=Putin&hl=en&safe=off&prmd=imvnsul&source=lnt&tbs=qdr:d&sa=X&ei=iLGMTr-wN6On0QXioM3pBQ&ved=0CA0QpwUoAg&biw=1280&bih=713"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="http://www.google.com.ua/url?sa=t&source=web&cd="/>
<parameter key="store_with_matching_url" value="http://www.google.com.ua/search?q=Putin&hl=en&safe=off&biw=1280&bih=713&tbs=qdr:d&prmd=imvnsul&ei=n7GMTpquCIeb-gaHlPDhCg&start="/>
<parameter key="store_with_matching_content" value="Putin"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="max_depth" value="100"/>
<parameter key="delay" value="1000000"/>
<parameter key="max_page_size" value="1000000"/>
<parameter key="user_agent" value="Opera"/>
<parameter key="obey_robot_exclusion" value="false"/>
<parameter key="really_ignore_exclusion" value="true"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
UP! I think the problem is that Google blocks requests from RapidMiner. Are there any ways to avoid this?
Has anyone found a solution to this problem? Alternatively, is there a way to extract the link URLs from a Google search by some other means?