"embedded crawler (websphinx) and RegEx"

New Altair Community Member

Nov 17, 2008

Updated Nov 5, 2024 by Jocelyn

(How) can I use RegEx within that crawler? It did not work...

I tried several variants, as follows (see also the attachment):
visit_content: ^water$
or
visit_content: \<water\>
or
visit_content: (?s)\<water\>
...

(I don't want waterfall...)

Please don't suggest HTTRACK. As far as I know, HTTRACK cannot filter on page content, only on URLs.

[attachment deleted by admin]


land

New Altair Community Member

Nov 25, 2008

Hi,
the crawler does not support regular expressions. These are the only condition types supported for specifying which links to follow:
follow_url: A link is followed only if the target URL contains all terms stated in this parameter.
link_text: A link is followed only if the link text contains all terms stated in this parameter.

The conditions that determine whether a page is stored allow the following expressions:
visit_url: A page is stored only if its URL contains all terms stated in this parameter.
visit_content: A page is stored only if its content contains all terms stated in this parameter.
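
Since the conditions only check for contained terms, one possible workaround is to crawl with the plain term (visit_content: water) and filter the stored pages afterwards with a whole-word regular expression. The following Python sketch illustrates the idea only; it is not part of the crawler. The helper names and sample pages are hypothetical, and it assumes that "contains all terms" means a plain substring match, which would explain why "waterfall" pages are stored too.

```python
import re

def contains_all_terms(text, terms):
    """Mimic the crawler's 'contains all terms' condition.

    Assumption: each term is matched as a plain substring.
    """
    return all(term in text for term in terms)

# Whole-word match: \b is a word boundary, so "waterfall" does not match.
WHOLE_WORD_WATER = re.compile(r"\bwater\b", re.IGNORECASE)

def has_whole_word(text, pattern=WHOLE_WORD_WATER):
    """Return True if the text contains the word on its own."""
    return pattern.search(text) is not None

# Hypothetical stored pages: file name -> extracted text.
pages = {
    "a.html": "The water level rose overnight.",
    "b.html": "We hiked to the waterfall.",
}

# Step 1 is what the crawler condition would keep (both pages);
# step 2 is the regex post-filter that drops the "waterfall" page.
kept = [
    name
    for name, text in pages.items()
    if contains_all_terms(text, ["water"]) and has_whole_word(text)
]
print(kept)
```

Note that \b is the word-boundary syntax in Python and Java regular expressions; the \< and \> form from the question is GNU-style and is not treated as a word boundary there.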

Further information can be found at http://nemoz.org/joomla/content/view/64/53/lang,de/

Greetings,
Sebastian
