"Web mining - crawling rules"
gingernissan
New Altair Community Member
Hi, I am new to RapidMiner. I have a site I want to crawl and extract/download pages from. The pages I am interested in share a common URL format (http://items.mywebsite.ie/for-sale/laptops/3254621). The starting URL I am using is the site search page, which contains the links to the relevant pages (http://items.mywebsite.ie/find/for-sale/laptops/).
My overall goal is to pull a list of, say, 20 pages in this format. The number at the end is the page ID; it is assigned site-wide, so it is not specific to the laptop section.
I have tried several variations of the store_with_matching_url and follow_link_with_matching_url rules, in an attempt to follow links containing the word "laptops" and then store the ones that end in a 7-digit number:
"http://items.mywebsitel.ie\for-sale\laptops\.+[0-9]"
'http://items.mywebsite.ie\for-sale\laptops\.+[0-9]'
(^)http://items.mywebsite.ie\for-sale\laptops\.+[0-9]($)
.+[0-9]
.+laptops.+
.+laptops.+|.+[0-9]
.[0-9][0-9][0-9][0-9][0-9][0-9][0-9]
Can anyone help me out or point me in the right direction?
Any help would be greatly appreciated. Thanks.
Answers
So, after several more persistent hours, I managed to figure it out using:
store .+for-sale/Laptops/.+
follow .+Laptops.+
It's so obvious now, I should have got it earlier!
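
For anyone hitting the same problem, here is a rough sketch of why some of the earlier attempts could not match. It assumes the crawler rules must match the whole URL (as Java's String.matches does), which I have not confirmed for RapidMiner, and it uses Python's re.fullmatch as a stand-in. The patterns are lower-cased to fit the example URLs in the question (the working rules above use "Laptops", presumably the real site's casing), and the "stricter store rule" is a hypothetical addition, not something from the thread.

```python
import re

# Example URLs taken from the question.
ITEM_URL = "http://items.mywebsite.ie/for-sale/laptops/3254621"
SEARCH_URL = "http://items.mywebsite.ie/find/for-sale/laptops/"


def matches_whole_url(pattern: str, url: str) -> bool:
    """Return True only if the pattern matches the entire URL."""
    try:
        return re.fullmatch(pattern, url) is not None
    except re.error:
        # An invalid pattern (for example a bad escape such as \l) never matches.
        return False


# Two attempts from the question, the two working rules (lower-cased),
# and one hypothetical stricter store rule.
PATTERNS = {
    "backslash attempt": r"http://items.mywebsite.ie\for-sale\laptops\.+[0-9]",
    "seven digits only": r".[0-9][0-9][0-9][0-9][0-9][0-9][0-9]",
    "follow rule": r".+laptops.+",
    "store rule": r".+for-sale/laptops/.+",
    "stricter store rule (hypothetical)": r".+/for-sale/laptops/[0-9]{7}",
}

if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        on_item = matches_whole_url(pattern, ITEM_URL)
        on_search = matches_whole_url(pattern, SEARCH_URL)
        print(f"{name}: item page={on_item}, search page={on_search}")
```

Under these assumptions, the backslash version fails because "\" starts a regex escape rather than matching "/", and the seven-digit pattern fails because nothing in it covers the rest of the URL. The follow rule matches both the search page and the item pages, while the store rule matches only the item pages.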