Conditional Tag Search with RegEx Output

Question

Hi everyone, I am currently working on a big set of data (~4 million HTML files stored on my computer), and I am wondering if there is any search/parse function in RM that lets me search all documents for a unique tag and, IF the tag is found, THEN search the same string for a regular expression that matches a certain number. For example, I have a string like: 14,825. I want to search for the tag "AuditFeesExpenses", and IF it is found, RM should apply a regular expression that matches the number "14,825" (the RegEx is not my problem!). Does anyone have an idea whether this is possible in RM? Thanks! Flo

MartinLiebig · Answer

By the way, maybe this is also a use case for Apache Drill. It has a JDBC connector and might work. It would be AMAZING to see this working.

~Martin

Limegreenman900_1 · Answer

Perfect, I had a short glance at Solr and it seems to fit my needs!

Thanks for your code proposal, but I think it will take too long to convert every document into an XML file first before processing it ;)

JEdward · Answer

Yep, the Solr server sounds like the way to go with this one. It's designed for search and for solving exactly this type of problem, and you can install the extension from the marketplace.

But, assuming you don't want to use Solr (no... I really recommend you do for 4 million files), here is a way to do it. Judging from the file structure, I would also suggest that an XPath might work better than a regular expression. Here's a quick example based on yours. You can use XPath with the ReadXML operator, but for that many documents (if not using Solr) I would recommend using some Groovy script within your workflow to process them. In this example I convert from HTML to XML, but you might not need this if your documents are already well-formed XML. Give it a try on a couple of files.
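Separately from any RapidMiner process, the conditional check from the question (look for the tag first, and only then run the regular expression on the same string) can be sketched in a few lines of plain Python. This is only an illustration, not the Groovy/XPath approach: the function name `find_tag_value`, the number pattern, and the sample markup `<AuditFeesExpenses>14,825</AuditFeesExpenses>` are all assumptions, since the thread does not show the real element layout.

```python
import re
from typing import Optional

# Assumed number format: digits with optional thousands separators, e.g. "14,825".
NUMBER = r"\d+(?:,\d{3})*"


def find_tag_value(text: str, tag: str) -> Optional[str]:
    """Return the first number directly following `tag` in `text`, or None."""
    # Step 1: cheap substring test, so the regex only runs on documents
    # that actually contain the tag.
    if tag not in text:
        return None
    # Step 2: look for a number right after the end of the opening tag.
    # Assumes markup shaped like <AuditFeesExpenses ...>14,825</AuditFeesExpenses>.
    pattern = re.compile(re.escape(tag) + r"[^>]*>\s*(" + NUMBER + r")")
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical sample string; the real files may look different.
sample = "<td><AuditFeesExpenses>14,825</AuditFeesExpenses></td>"
print(find_tag_value(sample, "AuditFeesExpenses"))
```

The substring test keeps the per-file cost low when most files do not contain the tag. For real HTML with nested or irregular markup, an XPath query or a search index such as Solr is likely to be more robust than this regex.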