What is the "Extract Information" Operator capable of?

User: "RaffiH"
New Altair Community Member
Updated by Jocelyn
I recently started using RapidMiner for my bachelor thesis. I want to use it to analyze the websites of specific companies, and it would be nice if someone could explain the "Extract Information" operator to me. I don't really understand in which cases I can use it.

Thank you very much in advance!

    User: "[Deleted User]"
    New Altair Community Member
    Updated by [Deleted User]
    @RaffiH
    Hello

    This operator extracts information from the structured content of a document.

    The extracted information is added as metadata to the document and, if desired, can be turned into an attribute later. There are several options for specifying which information should be extracted. In String Matching mode you specify a start string and an end string; if both are found in the document, the characters between them are extracted. Regular Expression mode lets you specify any expression and uses the first matching group as the extraction. If it is too difficult to include the intermediate characters in the expression in a well-defined way, you might find Regular Region mode useful, where you define two regular expressions. As in String Matching mode, the first defines the start and the second the end, and everything in between is extracted. The most sophisticated variant is XPath mode, where you can enter an arbitrary XPath expression. This proves useful especially when trying to extract information from a website. Since XPath expressions are only available for XML files, you have to make sure that the documents are well-formed XML. This can be ensured with the assume_html parameter of the Document Processing operator, which uses a special parser to correct errors in the HTML. It is also possible to extract information from a JSON document with a JSONPath expression. As with XPath mode, you have to make sure that the document provided is valid JSON.
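
To make the difference between the modes more concrete, here is a rough Python sketch of what each one does conceptually. This is not RapidMiner code and not how the operator is implemented; the sample document, the function names, and the patterns are all made up for illustration, and the XPath part only works because the sample is already well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumption, not taken from the thread).
DOC = '<html><body><h1>Acme GmbH</h1><p class="founded">Founded: 1999</p></body></html>'


def string_matching(text, start, end):
    """String Matching: characters between the first `start` and the next `end`."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


def regular_expression(text, pattern):
    """Regular Expression: first capture group of the first match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def regular_region(text, start_pattern, end_pattern):
    """Regular Region: text between a start match and the next end match."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    end = re.compile(end_pattern).search(text, start.end())
    return text[start.end():end.start()] if end else None


def xpath(text, expression):
    """XPath: text of the first matching node; the input must be well-formed XML."""
    node = ET.fromstring(text).find(expression)
    return node.text if node is not None else None


company = string_matching(DOC, "<h1>", "</h1>")
year = regular_expression(DOC, r"Founded:\s*(\d{4})")
region = regular_region(DOC, r"<p[^>]*>", r"</p>")
founded = xpath(DOC, ".//p[@class='founded']")
print(company, year, region, founded, sep=" | ")
```

Note that Python's built-in ElementTree only supports a subset of XPath and raises a parse error on HTML that is not well-formed, which mirrors the well-formed XML requirement described above.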


    regards
    mbs
    Hey,
    Extract Information is actually one of the hidden gems, because it adds tools you may want for advanced parsing. Most importantly, it offers you the option to use JSONPath.
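
For readers who have not seen JSONPath before, the following Python sketch shows the idea on a very small subset of the syntax (dotted keys and numeric indexes only). The `json_path` helper and the sample JSON are hypothetical, and real JSONPath, including whatever RapidMiner uses internally, supports far more than this.

```python
import json
import re

# Hypothetical sample JSON (assumption, not taken from the thread).
DOC = '{"company": {"name": "Acme GmbH", "offices": [{"city": "Berlin"}, {"city": "Munich"}]}}'


def json_path(document, path):
    """Resolve a simple path such as "$.company.offices[0].city".

    Only ".key" and "[index]" steps are handled; invalid JSON raises
    json.JSONDecodeError, so the input has to be a valid JSON document.
    """
    value = json.loads(document)
    # Each token is either a key (first group) or a list index (second group).
    for key, index in re.findall(r"\.([A-Za-z_]\w*)|\[(\d+)\]", path):
        value = value[key] if key else value[int(index)]
    return value


name = json_path(DOC, "$.company.name")
second_city = json_path(DOC, "$.company.offices[1].city")
print(name, second_city)
```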

    Best,
    Martin
    User: "RaffiH"
    New Altair Community Member
    OP
    @mschmitz
    What needs to be done to use those tools? 

    In my process I first use "Get Page" with a URL. Then I use "Extract Content". After filtering the stopwords I want to use "Extract Information", but I don't know how.

    Thank you very much for your answer!