Hi,
I want to crawl the site:
https://www.bmwgroup.com/en.html
and the sites of some other companies.
Therefore I used the regex .+bmwgroup.+en.+
I use "en" because I only want to crawl the English-language pages, and intentionally not "/en" because some URLs contain "en" without a leading slash.
The problem is that the crawler follows all the social media share links, too. As a result the crawl takes practically forever, because the share links of Facebook and co. match the regex as well.
How can I exclude Facebook, LinkedIn, Twitter and similar sites?
I tried something like .+(?!facebook)bmwgroup.+en.+ but without success.
Do you have any ideas? I should add that I can't use a regex like https\:\/\/www\.bmwgroup.+en.+ to restrict the crawl to URLs starting with
https://www.bmwgroup, because other links on this site use plain http or begin with
http://w3.bmwgroup, so those pages would be ignored. I want to crawl all of these links, just not the social media links.
Could you please help?
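
For reference, here is a minimal Python sketch of the behaviour described above. The three sample URLs are made up for illustration, and the third pattern is only an assumed candidate: it moves the negative lookahead to the start of the string so that it scans the whole URL. Whether the crawler's regex engine supports lookaheads, and whether it matches the full URL the way `fullmatch` does here, is not confirmed.

```python
import re

# Hypothetical sample URLs, invented for this illustration only.
page_url = "https://www.bmwgroup.com/en/company.html"
alt_host_url = "http://w3.bmwgroup.com/en/news.html"
share_url = ("https://www.facebook.com/sharer.php"
             "?u=https://www.bmwgroup.com/en/company.html")

# The regex from the question.
original = re.compile(r".+bmwgroup.+en.+")

# The attempt from the question. The lookahead is checked at the position
# directly before "bmwgroup", where the next characters are never
# "facebook", so it does not reject anything.
attempt = re.compile(r".+(?!facebook)bmwgroup.+en.+")

# Assumed candidate: a lookahead anchored at the start that rejects the URL
# if one of the listed names appears anywhere in it.
candidate = re.compile(r"^(?!.*(?:facebook|linkedin|twitter)).+bmwgroup.+en.+")


def matches(pattern, url):
    """Return True if the whole URL matches the pattern."""
    return pattern.fullmatch(url) is not None


if __name__ == "__main__":
    for name, pattern in [("original", original),
                          ("attempt", attempt),
                          ("candidate", candidate)]:
        for url in (page_url, alt_host_url, share_url):
            print(name, matches(pattern, url), url)
```

In this sketch the share link still matches the first two patterns, while the candidate pattern rejects it and keeps both the https://www and the http://w3 URLs.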