The Problem with RM Crawling Rules - how they are explained
leptserkhan
New Altair Community Member
The problem with RapidMiner crawling rules, and I think a big reason people are not getting the results they expect, is that the documentation of how the four rules work is minimal at best. Here are the explanations provided:
* store_with_matching_url: If the regular expression matches the url, this page will be stored in the resulting ExampleSet.
* store_with_matching_content: If the regular expression matches the page content, this page will be stored in the resulting ExampleSet.
* follow_link_with_matching_url: If the regular expression matches the url, the crawler will follow the link and load the url.
* follow_link_with_matching_text: If the regular expression matches the text of the hyperlink, the crawler will follow the link and load the according url.
There is absolutely no explanation of whether rule *precedence* matters. If it does, the order should be documented; if it doesn't, that should be stated, so people don't waste time reordering the rules to find out which precedence might work.
"follow_link_with_matching_text. . . . follow the link and load the according url." Besides the fact that ". . . load the according url." is bad grammar and only serves to confuse a proper English speaker, does this mean load the page from the URL which contained the original link that is being followed, or load the page for the page that is landed upon after following the link with the matching text?
You can see how just two poorly explained rules, and the possible permutations of them with the other rules, can lead to mayhem. And judging by the requests for help, that is apparently what is happening.
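To make the ambiguity concrete, here is a small Python sketch of the reading I would guess at. It is not RapidMiner's implementation, and everything in it is an assumption on my part: that "the according url" means the link's target, that rules of the same kind are OR-ed together, that rule order is irrelevant, and that a partial regex match is enough. The `SITE` dictionary and the `crawl` function are hypothetical stand-ins, not anything from the product.

```python
import re
from collections import deque

# Hypothetical in-memory "site": url -> (page content, [(link text, link url), ...])
SITE = {
    "http://example.com/": ("home", [("Products", "http://example.com/products"),
                                     ("About us", "http://example.com/about")]),
    "http://example.com/products": ("widget catalog", [("Home", "http://example.com/")]),
    "http://example.com/about": ("company history", []),
}

def crawl(site, start_url, rules):
    """Crawl `site` from `start_url` and return the URLs that would be stored.

    `rules` is a list of (rule_name, regex) pairs. ASSUMPTION: order is
    ignored, and a page or link qualifies if ANY rule of that kind matches.
    """
    def any_match(name, value):
        # ASSUMPTION: partial match (re.search); the real operator may differ.
        return any(re.search(rx, value) for n, rx in rules if n == name)

    stored, seen, queue = [], {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        content, links = site[url]
        # Store rules are checked against the page that was just loaded.
        if (any_match("store_with_matching_url", url)
                or any_match("store_with_matching_content", content)):
            stored.append(url)
        # Follow rules are checked against each outgoing link.
        # ASSUMPTION: "the according url" is the link's target page.
        for text, target in links:
            follow = (any_match("follow_link_with_matching_url", target)
                      or any_match("follow_link_with_matching_text", text))
            if follow and target in site and target not in seen:
                seen.add(target)
                queue.append(target)
    return stored

rules = [
    ("follow_link_with_matching_text", r"Products"),
    ("store_with_matching_content", r"catalog"),
]
print(crawl(SITE, "http://example.com/", rules))
```

Under this reading, the start page is loaded but not stored, the "Products" link is followed because its text matches, and the products page is stored because its content matches. If the real operator works differently on any of these points, that is exactly what the documentation should say.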
Please *clarify* how the rules work, and provide an easy-to-find link in the main dashboard to documentation that does exactly that, with examples.
Otherwise a great product.
Thank you.