I have been using RapidMiner for a while and have done some web crawling with it without major problems. But one new assignment has me puzzled.
The URLs look like this:
http:\\www.movilauto.com\toyota rav4 2012.html
http:\\www.movilauto.com\bmw 320 2013.html
I would normally use .+movilauto.+ to get these pages, and it would work pretty well. But apparently the spaces are a problem.
To complicate things further, the number of spaces is not fixed: sometimes there are two, as in the examples above, and sometimes there are three, as in the following example:
http:\\www.movilauto.com\toyota rav4 automatic 2012.html
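For what it is worth, here is a small check I would run outside RapidMiner, sketched in Python. It only tests the pattern itself, not the crawler. The forward-slash base URL and the names RULE, PAGES, BASE and encoded_url are my own choices for illustration. The idea is to see whether .+movilauto.+ copes with spaces (raw or percent-encoded as %20), whatever their number.

```python
import re
from urllib.parse import quote

# The crawl rule from this post. "." matches any character except a newline,
# so literal spaces are covered by ".+" as far as the regex is concerned.
RULE = re.compile(r".+movilauto.+")

# Page names taken from the example URLs above (two and three spaces).
PAGES = [
    "toyota rav4 2012.html",
    "bmw 320 2013.html",
    "toyota rav4 automatic 2012.html",
]

# Assumption: forward-slash form of the site address, used only for this test.
BASE = "http://www.movilauto.com/"


def encoded_url(page):
    """Build the URL with spaces percent-encoded as %20."""
    return BASE + quote(page)


raw_urls = [BASE + page for page in PAGES]
enc_urls = [encoded_url(page) for page in PAGES]

# Check the rule against both forms of every URL.
raw_matches = [bool(RULE.fullmatch(url)) for url in raw_urls]
enc_matches = [bool(RULE.fullmatch(url)) for url in enc_urls]

for url, ok in zip(raw_urls + enc_urls, raw_matches + enc_matches):
    print(ok, url)
```

If the pattern matches both forms here, my guess is that the problem lies in how the crawler reads or requests links containing spaces rather than in the expression, but I have not been able to confirm that inside RapidMiner.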
Any suggestions?