Webcrawling - problem with storing sites

Hey,

I'm working on a web crawling project to analyse various crowdfunding sites' projects via text mining in RapidMiner 5. I have already built a working text analyser, but I'm stuck at the web crawling part. The crawler does visit the requested sites, but it doesn't store them. I have tried experimenting with page size, depth and the like, but the program still just skips those sites. The problem is probably with my storing rules. This is what they look like when I try to crawl Kickstarter's pages:

Follow with matching URL:

.+kickstarter.+

Store with matching URL:

https://www\.kickstarter\.com\/projects.+
http://www\.kickstarter\.com\/projects.+
(?i)http.*://www\.kickstarter\.com\/projects.+

An example URL that would need to be stored is:

http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight

(no advertising intended)
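One way to rule the patterns in or out is to test them against the example URL outside of RapidMiner. The following is a minimal sketch in Python, not RapidMiner's own matching code. It assumes that a rule has to match the whole URL (full-match semantics), which is an assumption about the operator, and the helper name `matching_rules` is made up for illustration. The three patterns happen to use only syntax that Python's `re` module accepts unchanged.

```python
import re

# Store rules copied verbatim from the post.
STORE_RULES = [
    r"https://www\.kickstarter\.com\/projects.+",
    r"http://www\.kickstarter\.com\/projects.+",
    r"(?i)http.*://www\.kickstarter\.com\/projects.+",
]
# Follow rule copied verbatim from the post.
FOLLOW_RULE = r".+kickstarter.+"


def matching_rules(url, rules):
    """Return the rules that match the entire URL.

    Assumption: the crawler requires a full match, so re.fullmatch is used.
    """
    return [rule for rule in rules if re.fullmatch(rule, url)]


example = (
    "http://www.kickstarter.com/projects/corvuse/"
    "bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight"
)
tumblr = "http://kickstarter.tumblr.com/tagged/bhaloidam"

# Which store rules accept the URL that should be stored?
print(matching_rules(example, STORE_RULES))
# The tumblr page should be followed but not stored.
print(matching_rules(tumblr, STORE_RULES))
print(bool(re.fullmatch(FOLLOW_RULE, tumblr)))
```

Under these assumptions the second and third store rules accept the example URL, and the first does not only because the URL uses http rather than https. If the same holds inside RapidMiner, the patterns themselves are probably not what prevents the pages from being stored.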

And the log looks like the following:

Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.

As you can see, it follows the links and then just skips them. It doesn't even report that a page was discarded because it doesn't match the filter rules, so I'm not sure the program compares these links to the rules at all. I see a lot of lines in the log starting with "Following link" but very few starting with "Discarded page". Does this mean that it only checks a few pages, or just that it doesn't notify me about every discarded page?
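To put numbers on the "Following link" versus "Discarded page" impression, the log can be tallied with a short script. This is only a sketch: it assumes the log has been saved as text in the format shown above, it knows only the two message types that appear in this post, and the names `LOG`, `LINE` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# The four log lines quoted in the post.
LOG = """\
Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.
"""

# Captures the message type and the URL (with or without surrounding quotes).
LINE = re.compile(r'INFO: (Following link|Discarded page) "?([^"\s]+)"?')


def tally(log_text):
    """Count messages per type and list followed URLs with no discard entry."""
    counts = Counter()
    followed, discarded = [], set()
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        kind, url = match.groups()
        counts[kind] += 1
        if kind == "Following link":
            followed.append(url)
        else:
            discarded.add(url)
    unaccounted = [url for url in followed if url not in discarded]
    return counts, unaccounted


counts, unaccounted = tally(LOG)
print(counts)
for url in unaccounted:
    print("followed, never reported as discarded:", url)
```

Run on a full log, the list of followed URLs that never show up as discarded is the set of pages worth checking by hand against the store rules.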

Thanks in advance!
Cheers

All comments

I have had the same issue.

My parameters:
url: http://connect.jems.com/profiles/blog/list?tag=EMS
store with url, follow with url: .+blog.+
output directory: C:\Program Files\Rapid-I\myfiles\webcrawl
extension: html
max pages: 20
max depth: 20
domain: web
delay: 500
max threads: 2
max page size: 500
obey robot exclusion: T

The output contains only the first page. This is contrary to the SimaFore instructions and example
http://www.simafore.com/blog/bid/112223/text-mining-how-to-fine-tune-job-searches-using-web-crawling
which state that the last file is stored.

I also tried to follow vancouver blog spot
https://www.youtube.com/watch?v=zMyrw0HsREg#t=13
and duplicate the result. In all of my runs, the output shows only the first page, although the log shows that the crawler obeys the follow link rule and follows the links.

Any help would be greatly appreciated! I am getting really frustrated with this.

My code is below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
        <parameter key="url" value="http://connect.jems.com/profiles/blog/list?tag=EMS"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+blog.+"/>
          <parameter key="follow_link_with_matching_url" value=".+blog.+"/>
        </list>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="output_dir" value="C:\Program Files\Rapid-I\myfiles\webcrawl"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="20"/>
        <parameter key="max_depth" value="20"/>
        <parameter key="delay" value="500"/>
        <parameter key="max_threads" value="2"/>
        <parameter key="max_page_size" value="500"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I was able to resolve my issue (and you may be able to resolve yours) by changing the user agent string. The site recommended by vancouver data (whatismyuseragent) reported a long string with punctuation (parentheses, semicolons, etc.). I revised it to just text with periods, and it worked fine after that.
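For anyone wanting to try the same fix, here is one possible way to reduce a user agent string to "just text with periods". The exact string that ended up working is not given in this thread, so this is a guess at the transformation rather than the value actually used. The function name `simplify_user_agent` is made up, and the input is the user_agent value from the process XML above.

```python
import re


def simplify_user_agent(user_agent):
    """Keep letters, digits and periods; turn everything else into single spaces.

    This is an assumed reading of "just text with periods", not a known
    requirement of the Crawl Web operator.
    """
    cleaned = re.sub(r"[^A-Za-z0-9.]+", " ", user_agent)
    return cleaned.strip()


# user_agent value copied from the process XML in this thread.
ORIGINAL = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
)

print(simplify_user_agent(ORIGINAL))
# prints: Mozilla 5.0 Windows NT 6.1 WOW64 AppleWebKit 537.36 KHTML like Gecko Chrome 35.0.1916.153 Safari 537.36
```

The result can be pasted into the user agent parameter of the Crawl Web operator in place of the original value.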