"Web mining - crawling rules"

gingernissan
gingernissan New Altair Community Member
edited November 5 in Community Q&A
Hi, I am new to RapidMiner. I have a site I want to crawl and extract/download pages from. The pages I am interested in share a common URL format (http://items.mywebsite.ie/for-sale/laptops/3254621). The starting URL I am using is the site's search page, which contains the links to the relevant pages (http://items.mywebsite.ie/find/for-sale/laptops/).
My overall goal is to pull a list of, say, 20 pages in this format. The number at the end is the page ID; it is site-wide rather than specific to the laptop section.

I have tried several variations of the store_with_matching_url and follow_link_with_matching_url rules in an attempt to follow links containing the word laptop and then store the ones that have a 7-digit number at the end:

"http://items.mywebsitel.ie\for-sale\laptops\.+[0-9]"
'http://items.mywebsite.ie\for-sale\laptops\.+[0-9]'
(^)http://items.mywebsite.ie\for-sale\laptops\.+[0-9]($)
.+[0-9]
.+laptops.+
.+laptops.+|.+[0-9]
.[0-9][0-9][0-9][0-9][0-9][0-9][0-9]
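
For anyone comparing these attempts, here is a rough Python sketch of how such a rule could be checked outside RapidMiner. It assumes a rule has to match the entire URL (hence `re.fullmatch`), which may not be exactly how the crawler evaluates rules. `store_rule` is a hypothetical pattern written for this illustration, not one of the rules tried above; it uses forward slashes, since a backslash in a regular expression escapes the next character rather than matching "/".

```python
import re

# Sample URLs taken from the question.
ITEM_URL = "http://items.mywebsite.ie/for-sale/laptops/3254621"
SEARCH_URL = "http://items.mywebsite.ie/find/for-sale/laptops/"


def matches(pattern, url):
    """Return True if the pattern matches the entire URL (assumed behaviour)."""
    return re.fullmatch(pattern, url) is not None


# Attempt from the list above: it describes only 8 characters,
# so it cannot cover a whole URL under full-match semantics.
seven_digits_only = r".[0-9][0-9][0-9][0-9][0-9][0-9][0-9]"

# Hypothetical store rule: anything, then the laptops path written with
# forward slashes, then exactly seven digits at the end of the URL.
store_rule = r".+/for-sale/laptops/[0-9]{7}"

print(matches(seven_digits_only, ITEM_URL))
print(matches(store_rule, ITEM_URL))
print(matches(store_rule, SEARCH_URL))
```

Under that assumption, only the hypothetical rule accepts the item page, and it rejects the search page because that URL does not end in seven digits.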

Can anyone help me out or point me in the right direction?

Any help would be greatly appreciated. Thanks!

Answers

  • gingernissan
    gingernissan New Altair Community Member
    After several more persistent hours I managed to figure it out, using:
    store    .+for-sale/Laptops/.+
    follow    .+Laptops.+

    It's so obvious now; I should have got it earlier!
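
As a quick illustration of how the two rules divide the work, the sketch below applies them to a few links in Python. It assumes each rule must match the whole URL and that matching is case-sensitive; neither is confirmed in this thread. The links are hypothetical: they reuse the URL shapes from the question but with a capital "L" to agree with the rules (the question shows lowercase "laptops"), and the "Phones" link is invented purely as a non-matching case.

```python
import re

# Rules exactly as posted in the answer above.
STORE_RULE = r".+for-sale/Laptops/.+"
FOLLOW_RULE = r".+Laptops.+"

# Hypothetical links for illustration only.
links = [
    "http://items.mywebsite.ie/find/for-sale/Laptops/",
    "http://items.mywebsite.ie/for-sale/Laptops/3254621",
    "http://items.mywebsite.ie/for-sale/Phones/1234567",
]


def classify(url):
    """Return (follow, store) flags, assuming whole-URL, case-sensitive matching."""
    follow = re.fullmatch(FOLLOW_RULE, url) is not None
    store = re.fullmatch(STORE_RULE, url) is not None
    return follow, store


for url in links:
    print(url, classify(url))
```

In this sketch the search page is followed but not stored, because the store rule needs at least one character after "Laptops/", while the item page is both followed and stored. A lowercase "laptops" URL would match neither rule under these assumptions.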