Webcrawling - problem with storing sites

User: "blint"
Updated by Jocelyn
Hey,

I'm working on a web crawling project to analyse projects on various crowdfunding sites via text mining in RapidMiner 5. I have already built a working text analyser, but I'm stuck on the web crawling part. The crawler does crawl through the requested pages, but it doesn't store them. I have tried experimenting with the page size, depth and similar settings, but the program still skips those pages. The problem is probably in my storing rules. When trying to crawl Kickstarter's pages, they look like this:

Follow with matching URL:
.+kickstarter.+
Store with matching URL:
An example URL that would need to be stored is:
(no advertising intended)

And the log looks like the following:
Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.
As you can see, the crawler follows these links but then just skips them. For most of them it doesn't even report that the page was discarded for not matching the filter rules, so I'm not sure the program compares these links to the rules at all. I see a lot of log lines starting with "Following link" but very few starting with "Discarded page". Does this mean that it only checks a few pages, or just that it doesn't notify me about every discarded page?
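
One way to narrow this down is to test the rules against the logged URLs outside RapidMiner. Below is a minimal sketch in Python. It assumes the crawler requires a rule to match the entire URL (as Java's `String.matches` would), which I have not confirmed. The follow rule is the one quoted above. `STORE_RULE` is purely hypothetical, since the actual store rule is not shown, and `classify` is just an illustrative helper name.

```python
import re

# Follow rule copied from the post.
FOLLOW_RULE = r".+kickstarter.+"
# HYPOTHETICAL store rule, for illustration only; substitute the real one.
STORE_RULE = r".+kickstarter\.com/projects/.+"

# URLs taken from the log excerpt in the post.
LOG_URLS = [
    "http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight",
    "http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie",
    "http://kickstarter.tumblr.com/tagged/bhaloidam",
    "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions",
]


def classify(url, follow_rule=FOLLOW_RULE, store_rule=STORE_RULE):
    """Return (followed, stored) for a URL, assuming whole-string matching."""
    followed = re.fullmatch(follow_rule, url) is not None
    stored = re.fullmatch(store_rule, url) is not None
    return followed, stored


if __name__ == "__main__":
    for url in LOG_URLS:
        followed, stored = classify(url)
        print(f"follow={followed!s:5} store={stored!s:5} {url}")
```

With this hypothetical store rule, the Tumblr pages pass the follow rule but not the store rule, which would at least be consistent with the single "Discarded page" line in the log. It does not explain why the project page itself is not stored, so the real store rule is still the first thing to check.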

Thanks in advance!
Cheers
