What is the "Extract Information" Operator capable of?
Thank you very much in advance!
Answers
-
@RaffiH
Hello
This operator extracts information from a document with structured content. The extracted information is added to the document as metadata and, if you wish, can be turned into an attribute later. There are several options for specifying which information should be extracted.
In String Matching mode you specify a start string and an end string; if both are found in the document, the characters between them are extracted. Regular Expression mode lets you specify any expression and uses the first matching group as the extraction. If it is too difficult to include the intermediate characters in the expression in a well-defined way, you might find Regular Region mode useful, where you define two regular expressions. As in String Matching mode, the first defines the start and the second the end, and everything in between is extracted.
The most sophisticated variant is XPath mode, where you can enter an arbitrary XPath expression. This proves useful especially when trying to extract information from a website. Since XPath expressions only work on XML, you have to make sure that the documents are well-formed XML. This can be ensured with the assume_html parameter of the Document Processing operator, which uses a special parser to correct errors in the HTML. It is also possible to extract information from a JSON document with a JSONPath expression. As with XPath mode, you have to make sure that the document provided is valid JSON.
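To make the modes more concrete, here is a rough Python sketch of the same ideas. This is not how RapidMiner implements the operator; the function names and the sample document are made up for illustration, and returning None when nothing is found is my own assumption. Note that Python's ElementTree only supports a subset of XPath and, like the operator, needs well-formed XML.

```python
import re
import xml.etree.ElementTree as ET


def extract_between(text, start, end):
    """String Matching idea: characters between a start and an end string."""
    i = text.find(start)
    if i == -1:
        return None  # assumption: nothing is extracted if start is missing
    i += len(start)
    j = text.find(end, i)  # assumption: the end is searched after the start
    if j == -1:
        return None
    return text[i:j]


def extract_regex(text, pattern):
    """Regular Expression idea: first capturing group of the first match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


def extract_region(text, start_pattern, end_pattern):
    """Regular Region idea: text between a start regex and an end regex."""
    s = re.search(start_pattern, text)
    if not s:
        return None
    e = re.compile(end_pattern).search(text, s.end())
    if not e:
        return None
    return text[s.end():e.start()]


def extract_xpath(xml_text, path):
    """XPath idea: text of the first element matching a (limited) XPath."""
    node = ET.fromstring(xml_text).find(path)
    return node.text if node is not None else None


# Hypothetical sample document, well-formed so the XML parser accepts it.
doc = "<html><body><h1>Price list</h1><p>Price: 42 EUR</p></body></html>"

print(extract_between(doc, "<h1>", "</h1>"))       # Price list
print(extract_regex(doc, r"Price:\s*(\d+)"))       # 42
print(extract_region(doc, r"<p[^>]*>", r"</p>"))   # Price: 42 EUR
print(extract_xpath(doc, "./body/p"))              # Price: 42 EUR
```

The region variant helps when the start marker varies (for example a tag with changing attributes), which is awkward to express with two fixed strings.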
regards
mbs
-
Hey,
Extract Information is actually one of the hidden gems, because it adds some tools you may want for advanced parsing. Most importantly, it offers you the option to use JSONPath.
Best,
Martin
-
@RaffiH
Hello
You can find more information in these two links:
https://community.rapidminer.com/discussion/55461/parsing-json-in-rapidminer-using-the-webautomation-extension-by-old-world-computing
https://community.rapidminer.com/discussion/55796/parsing-json-with-owcs-webautomation-extension-extracting-two-or-more-relational-example-sets
I hope this helps
mbs