"embedded crawler (websphinx) and RegEx"

(How) can I use RegEx within that crawler? It did not work...

I tried this several times as follows (see also attachment):
visit_content: ^water$
or
visit_content: \<water\>
or
visit_content: (?s)\<water\>
...

(I don't want waterfall...)

Please don't suggest HTTRACK. As far as I know, HTTRACK cannot filter pages by their content, only by URL.

[attachment deleted by admin]

Hi,
the crawler does not support regular expressions. These are the only condition types supported for specifying which links to follow:
follow_url: A link is followed only if the target URL contains all terms stated in this parameter.
link_text: A link is followed only if the link text contains all terms stated in this parameter.

The conditions that determine whether a page is stored accept the following parameters:
visit_url: A page is stored only if its URL contains all terms stated in this parameter.
visit_content: A page is stored only if its content contains all terms stated in this parameter.
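
Since the crawler only checks whether terms are contained, a page mentioning "waterfall" would also satisfy visit_content: water. One possible workaround is to let the crawler store those pages and then filter the stored texts with a real regular expression in a separate step. The following is only a rough sketch in Python, outside the crawler: the function names and sample pages are made up for illustration, and the plain case-sensitive substring check is an assumption about how "contains all terms" behaves, not a description of the crawler's actual implementation.

```python
import re

# Hypothetical stand-in for the crawler's "contains all terms" condition.
# Assumption: a plain, case-sensitive substring check for each term.
def contains_all_terms(text, terms):
    return all(term in text for term in terms)

# Whole-word match: \b is a word boundary, so "waterfall" does not match.
WHOLE_WORD_WATER = re.compile(r"\bwater\b")

def has_whole_word_water(text):
    return WHOLE_WORD_WATER.search(text) is not None

# Made-up sample pages standing in for what the crawler stored.
pages = {
    "a.html": "The waterfall is tall.",
    "b.html": "Drink more water every day.",
}

# Step 1: what a term-based visit_content condition would keep.
stored = {name: text for name, text in pages.items()
          if contains_all_terms(text, ["water"])}

# Step 2: post-filter the stored pages with the regular expression.
kept = [name for name, text in stored.items() if has_whole_word_water(text)]
print(kept)
```

In this sketch both pages pass the term check, and only the second one survives the whole-word filter.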

Further information can be found at http://nemoz.org/joomla/content/view/64/53/lang,de/

Greetings,
Sebastian
