Webcrawling - problem with storing sites
Hey,
I'm working on a web crawling project to analyse projects on various crowdfunding sites via text mining in RapidMiner 5. I have already built a working text analyser, but I'm stuck at the web crawling part. The crawler does crawl through the requested pages, but doesn't store them. I have tried experimenting with page size, depth and the like, but the program still just skips those pages. The problem is probably in my store rules. When trying to crawl Kickstarter's pages, they look like this:
Follow with matching URL:
.+kickstarter.+
Store with matching URL:
https://www\.kickstarter\.com\/projects.+
http://www\.kickstarter\.com\/projects.+
(?i)http.*://www\.kickstarter\.com\/projects.+
An example URL that would need to be stored is:
(no advertising intended)
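One way to sanity-check the rules outside RapidMiner is to run them against URLs taken from the crawler log. The following is only a minimal sketch in Python: it assumes the crawler applies each rule to the whole URL (hence `re.fullmatch`), and that Python's `re` module treats these particular patterns the same way as the Java regex engine RapidMiner is built on. The helper name `matching_rules` is made up for illustration.

```python
import re

# Rules copied from the post.
FOLLOW_RULE = r".+kickstarter.+"
STORE_RULES = [
    r"https://www\.kickstarter\.com\/projects.+",
    r"http://www\.kickstarter\.com\/projects.+",
    r"(?i)http.*://www\.kickstarter\.com\/projects.+",
]

# URLs taken from the crawler log in the post.
URLS = [
    "http://www.kickstarter.com/projects/corvuse/"
    "bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight",
    "http://kickstarter.tumblr.com/tagged/bhaloidam",
]


def matching_rules(url, rules):
    """Return the rules that match the entire URL.

    Whole-string matching is an assumption about how the crawler
    applies its rules, not something confirmed by the log.
    """
    return [rule for rule in rules if re.fullmatch(rule, url)]


for url in URLS:
    print(url)
    print("  matches follow rule:", bool(re.fullmatch(FOLLOW_RULE, url)))
    print("  store rules matched:", matching_rules(url, STORE_RULES))
```

Under these assumptions the project URL is matched by the second and third store rule (the first requires `https`), while the tumblr URL is matched only by the follow rule. If RapidMiner matches in the same way, that would point away from the patterns themselves and towards some other setting, but this check cannot confirm that.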
And the log looks like the following:
Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.
As you can see, the crawler follows these links but then just skips them. It doesn't even report that they were discarded for not matching the filter rules, so I'm not sure it compares these links to the rules at all. I see a lot of lines in the log starting with "Following link" but very few starting with "Discarded page". Does this mean that it only checks a few pages, or just that it doesn't notify me about every discarded page?
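To put numbers on the "Following link" versus "Discarded page" question, the log can be tallied with a short script. This is a rough sketch that only knows the two message shapes visible in the excerpt above; the `tally` function and the `LINE` pattern are illustrative names, and a real log may contain other message types that this ignores.

```python
import re
from collections import Counter

# The four log lines quoted in the post, used here as sample input.
LOG = """\
Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.
"""

# Captures the action and the URL; the URL may or may not be quoted.
LINE = re.compile(r'INFO: (Following link|Discarded page) "?([^"\s]+)"?')


def tally(log_text):
    """Count log actions and list followed URLs with no discard entry."""
    counts = Counter()
    followed, discarded = set(), set()
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match is None:
            continue  # skip any message type this sketch does not know
        action, url = match.groups()
        counts[action] += 1
        if action == "Following link":
            followed.add(url)
        else:
            discarded.add(url)
    return counts, followed - discarded


counts, unaccounted = tally(LOG)
print(dict(counts))
for url in sorted(unaccounted):
    print("followed, never reported as discarded:", url)
```

Run over the full log, the second part of the output would list every followed link that was never reported as discarded, which is the set of pages the question is about.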
Thanks in advance!
Cheers