How to use a Macro on Extract Information
Marco_Barradas
Altair Employee
Hi! I'm looking for some help with the Extract Information operator combined with macros. I have built a crawling web service with RapidMiner Server which extracts product prices from different pages.
The layout is simple.
The only thing that changes is the RegEx used to extract the information from each page.
I tried to create an ExampleSet with the domain and the rules for each field, to keep it simple to add new domains to crawl, but when I try to use a macro in the query expression nothing happens.
Has anybody tried this approach? How could I use the Set Parameters from ExampleSet with the Extract Information operator?
<?xml version="1.0" encoding="UTF-8"?><process version="9.4.001"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.elpalaciodehierro.com/charm-chile-39861202.html</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34"> <parameter key="url" value="%{url}"/> <parameter key="random_user_agent" value="true"/> <parameter key="connection_timeout" value="10000"/> <parameter key="read_timeout" value="10000"/> <parameter key="follow_redirects" value="true"/> <parameter key="accept_cookies" value="all"/> <parameter key="cookie_scope" value="thread"/> <parameter key="request_method" value="GET"/> <list key="query_parameters"/> <list key="request_properties"/> <parameter key="override_encoding" value="false"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="text:process_documents" compatibility="8.2.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34"> <parameter key="create_word_vector" value="false"/> <parameter key="vector_creation" value="TF-IDF"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="false"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="elpalacio (2)" width="90" x="313" y="34"> <parameter key="query_type" value="Regular Expression"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"> <parameter key="nombre" value="<div class="product-name ">\s{1,} <span class="h1" >(.*)</span>\s{1,}</div>"/> <parameter key="precio_n" value="<span class="price">[$]\S([0-9,.]{1,})</span>"/> <parameter key="precio_d" value=" <span class="ls-price-now-price price".*">\s{1,}[$]\S([0-9,.]{1,})\s{1,} </span>"/> <parameter key="antes" value="<span class="ls-price-bef-price price"\sid="old-price-[0-9]{1,}">\s{1,}[$]\S([0-9,.]{1,})\s{1,}</span>\s{1,}</p>"/> </list> <list key="regular_region_queries"/> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> </operator> <connect from_port="document" to_op="elpalacio (2)" to_port="document"/> <connect from_op="elpalacio (2)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="generate_attributes" compatibility="9.4.001" expanded="true" 
height="82" name="Generate Attributes" width="90" x="380" y="34"> <list key="function_descriptions"> <parameter key="precio_n" value="if(missing(precio_n),precio_d,precio_n)"/> <parameter key="precio_d" value="if(missing(precio_d),precio_n,precio_d)"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="parse_numbers" compatibility="9.4.001" expanded="true" height="82" name="Parse Numbers" width="90" x="514" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="precio_n|precio_d"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="true"/> <parameter key="decimal_character" value="."/> <parameter key="grouped_digits" value="true"/> <parameter key="grouping_character" value=","/> <parameter key="infinity_representation" value=""/> <parameter key="unparsable_value_handling" value="fail"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.4.001" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="precio_d|precio_n|nombre"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="true"/> </operator> <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/> <connect from_op="Process Documents" from_port="example set" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/> <connect from_op="Parse Numbers" from_port="example set output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Answers
No real answer to your specific question, but our approach to crawling hundreds of sites without too many process changes is to use an XSLT stylesheet before the extract operator, so we do not need to modify anything in this part. It allows us to use a 'template process' for any site we crawl.
All of the logic is in two places: the stylesheet that gets the content from a page, and at the end some transformation to get all extracted data into the same format (like price or review data).
I've attached an example using your page, but it can be easily modified to whatever other page, but you need to know some (basic) XPath.<?xml version="1.0" encoding="UTF-8"?><process version="9.4.001"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.elpalaciodehierro.com/charm-chile-39861202.html</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34"> <parameter key="url" value="%{url}"/> <parameter key="random_user_agent" value="true"/> <parameter key="connection_timeout" value="10000"/> <parameter key="read_timeout" value="10000"/> <parameter key="follow_redirects" value="true"/> <parameter key="accept_cookies" value="all"/> <parameter key="cookie_scope" value="thread"/> <parameter key="request_method" value="GET"/> <list key="query_parameters"/> <list key="request_properties"/> <parameter key="override_encoding" value="false"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="subprocess" compatibility="9.4.001" expanded="true" height="82" name="preclean" width="90" x="246" y="34"> <process expanded="true"> <operator activated="true" class="text:replace_tokens" compatibility="8.2.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="45" y="34"> <list key="replace_dictionary"> <parameter key="(?is)^.*?<html[^>]+>" value="<html>"/> </list> <description align="center" color="transparent" colored="false" width="126">clean header for badly constructed sites</description> </operator> <operator activated="true" class="text:html_to_xml" compatibility="8.2.000" expanded="true" height="68" name="Html To Xml (2)" width="90" x="179" y="34"/> <operator activated="true" class="text:replace_tokens" compatibility="8.2.000" expanded="true" height="68" name="Replace Tokens (3)" width="90" x="313" y="34"> <list key="replace_dictionary"> <parameter key="(?is)^.*?<html[^>]+>" value="<html>"/> </list> <description align="center" color="transparent" colored="false" width="126">get rid of xml namespaces</description> </operator> <connect from_port="in 1" to_op="Replace Tokens (2)" to_port="document"/> <connect from_op="Replace Tokens (2)" from_port="document" to_op="Html To Xml (2)" to_port="document"/> <connect from_op="Html To Xml (2)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/> <connect from_op="Replace Tokens (3)" from_port="document" to_port="out 1"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">HTML to XHTML so we can use XPath</description> </operator> <operator activated="true" class="text:create_document" compatibility="8.2.000" expanded="true" height="68" name="xslt" width="90" x="380" y="136"> <parameter key="text" value="<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 	
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> 	<xsl:template match="/"> 		<root> <xsl:for-each select="//div[@class='product-shop']"> <row product_name="{.//span[@class='h1']}" product_price="{.//span[@class='price']}" /> </xsl:for-each> 		</root> 	</xsl:template> </xsl:stylesheet> "/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> <description align="center" color="transparent" colored="false" width="126">Stylesheet for this specific page, but it is a template for any other page</description> </operator> <operator activated="true" class="subprocess" compatibility="9.4.001" expanded="true" height="103" name="XML 2 Data" width="90" x="514" y="34"> <process expanded="true"> <operator activated="true" class="text:process_xslt" compatibility="8.2.000" expanded="true" height="82" name="Process Xslt (10)" width="90" x="45" y="34"/> <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Cut Document (11)" width="90" x="179" y="34"> <parameter key="query_type" value="Regular Region"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"> <parameter key="row" value="<row./>"/> </list> <list key="xpath_queries"> <parameter key="model" value="//model"/> </list> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="Extract Information (11)" width="90" x="112" y="34"> <parameter key="query_type" value="XPath"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"/> <list key="xpath_queries"> <parameter key="Product.Name" value="//@product_name"/> <parameter key="Product.Code" value="//@product_code"/> <parameter key="Product.Image" value="//@product_image"/> <parameter key="Product.ReviewQty" value="//@product_review_qty"/> <parameter key="Product.ReviewAvg" value="//@product_review_avg"/> <parameter key="Product.StockStatus" value="//@product_stockstatus"/> <parameter key="Product.Price" value="//@product_price"/> </list> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="false"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <description align="center" color="transparent" colored="false" width="126">extract all matching attributes. 
Not all sites have all the same content but having a set of uniform labels allows for reusing</description> </operator> <connect from_port="segment" to_op="Extract Information (11)" to_port="document"/> <connect from_op="Extract Information (11)" from_port="document" to_port="document 1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:documents_to_data" compatibility="8.2.000" expanded="true" height="82" name="Documents to Data (15)" width="90" x="313" y="34"> <parameter key="text_attribute" value="src"/> <parameter key="add_meta_information" value="true"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <parameter key="use_processed_text" value="false"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.4.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="query_key|src"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="trim" compatibility="9.4.001" expanded="true" height="82" name="Trim (4)" width="90" x="581" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="operator_toolbox:filter_missing_attributes" compatibility="2.2.000" expanded="true" height="82" name="Filter Attributes with Missing Values" width="90" x="715" y="34"> <parameter key="filter_method" value="one or more non-missing"/> <parameter key="maximum_number_of_missings" value="100"/> <parameter key="maximum_relative_number_of_missings" value="0.1"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <description align="center" color="transparent" colored="false" width="126">simply remove all empty (non existing) attributes</description> </operator> <connect from_port="in 1" to_op="Process Xslt (10)" to_port="document"/> <connect from_port="in 2" to_op="Process Xslt (10)" to_port="xslt document"/> <connect from_op="Process Xslt (10)" from_port="document" to_op="Cut Document (11)" to_port="document"/> <connect from_op="Cut Document (11)" from_port="documents" to_op="Documents to Data (15)" to_port="documents 1"/> <connect from_op="Documents to 
Data (15)" from_port="example set" to_op="Select Attributes (2)" to_port="example set input"/> <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Trim (4)" to_port="example set input"/> <connect from_op="Trim (4)" from_port="example set output" to_op="Filter Attributes with Missing Values" to_port="example set"/> <connect from_op="Filter Attributes with Missing Values" from_port="filtered example set" to_port="out 1"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="source_in 3" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">Always same process, only stylesheet differs</description> </operator> <connect from_op="Get Page" from_port="output" to_op="preclean" to_port="in 1"/> <connect from_op="preclean" from_port="out 1" to_op="XML 2 Data" to_port="in 1"/> <connect from_op="xslt" from_port="output" to_op="XML 2 Data" to_port="in 2"/> <connect from_op="XML 2 Data" from_port="out 1" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process>
Hi @kayman, as you mention, this approach is another way of solving my current task, but it trades RegEx for XPath and may or may not be flexible enough in some of the cases I've seen in my web crawling so far. Sometimes the price of the product is listed under a class that seems to be dynamic, like class=article-abcha67323876-asdalji followed by a lot of attributes or text and then >$ 1,000</span>.
That's why I took the RegEx approach: it seemed simple to store the RegEx rules in a table, so that I would only need to add a new row to the data table to create a new rule to extract data from a certain domain.
I'll wait to see if anybody else gives us a solution before marking your answer as the solution.
But thanks for the process and the approach; it helps me see other ways of solving my task.
Yeah, tell me about it... We crawl 200 sites at the moment and they all have their own specific way of making it complex. In the end it's all about flexibility, and for us putting the logic in XPath worked out best.
We haven't found a single site yet where we couldn't get the data with XPath, whereas regex would have been more challenging. One way to deal with dynamic attributes is to look at the surrounding tags (product listings are typically in a list, for instance), or you could use something like 'a span whose text contains a dollar sign', which usually works out fine as well.
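For example, XPath expressions along these lines (just an illustration, the exact paths depend on the site):
//span[contains(text(), '$')]
//li[contains(@class, 'product')]//span[contains(text(), '$')]
The first one grabs any span whose text contains a dollar sign; the second narrows it down to spans inside a list item whose class mentions 'product'.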
Now, I actually like your idea but will probably use it on our XSLT template rather than on the extract operator.
In essence our XPath is always the same, as we are looking for around 10 different items (price, availability, image used, etc.), and whether they all exist or not doesn't matter. The only difference from site to site is the path to a value, and if it's not there, it just returns an empty attribute.
So I am going to use your idea to dynamically inject these paths into my XSLT. I can then indeed create a reference file that contains the XPaths per site and loop it through one template instead of using a template per site.
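As a rough sketch, the row template in the stylesheet would then look something like this (the macro names are placeholders, and it assumes the macros get resolved inside the Create Document text parameter that holds the stylesheet):
<xsl:for-each select="%{product_block_path}">
  <row product_name="{%{name_path}}" product_price="{%{price_path}}"/>
</xsl:for-each>
After macro resolution this turns into the same kind of stylesheet as in my example above, just with the site-specific paths filled in from the reference file.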
Yes, you could create a dataset with your XSLT configuration and query it to obtain a macro that could be used in your operator. This way you only have one process to maintain, and if for some reason you need to extract another value, it would be really easy to make the change for all 200 sites instead of going over 200 processes to make this little change.
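A rough sketch of how the macro could be pulled from the configuration table, assuming it has already been filtered down to the current site and has a column holding the path (operator parameters and names here are just placeholders, not tested):
<operator activated="true" class="extract_macro" compatibility="9.4.001" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
  <parameter key="macro" value="price_path"/>
  <parameter key="macro_type" value="data_value"/>
  <parameter key="attribute_name" value="price_path"/>
  <parameter key="example_index" value="1"/>
  <list key="additional_macros"/>
</operator>
The %{price_path} macro would then be referenced in the Create Document text that holds the XSLT.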