Parsing JSON in RapidMiner using the WebAutomation Extension by Old World Computing

Jana_OWC
Jana_OWC New Altair Community Member
edited November 5 in Community Q&A
Hi there,

we're a consulting company based in Bochum, Germany, specialized on establishing data science and big data as strategic capability in companies. We also develop RapidMiner extensions, bringing a bunch of useful functions to RapidMiner! We thought you might be interested in learning about our extensions and how to use them, so we'll start here with a tutorial on using our WebAutomation extension, enabling you to easily parse JSON and integrate REST APIs.

We hope this is useful to you, please feel free to ask any questions below or find us on Twitter! We'll upload more tutorials explaining other functionalities of our various extensions soon.

The WebAutomation Extension by Old World Computing


The extension splits the process of parsing JSON in two steps. In the first step, the user applies special operators to define the format of the JSON data. This definition can be saved as regular object in the repository or generated on the fly. The second step is applying the specifications to parse JSON in to one or more data sets. Several operators exist for this: While reading data from a file or out of a data set is possible, there is also a handy request operator, fetching the data directly from a web service. Fed with the raw JSON data, all of these operators unpack the object’s structure and create one or more example sets to be used in further analysis.


Defining a format to extract a simple table from JSON data

So how exactly does it work? As an example, let’s say you are provided with data about books in JSON format. The data comes from an internal database where it is stored as a JSON string.

{
   "books":[
      {
         "title":"Java ist auch eine Insel",
         "subtitle":"Einführung, Ausbildung, Praxis",
         "authors":[
            {
               "first name":"Christian",
               "last name":"Ullenboom"
            }
         ],
         "language":"German",
         "edition":"13., aktualisierte und überarbeitete Auflage",
         "publication date":2018,
         "publishing company":"Bonn Rheinwerk Verlag",
         "keywords":[
            "Java Standard Edition 8",
            "Java Standard Edition 9",
            "Programming"
         ]
      },
      {
         "title":"The elements of statistical learning",
         "subtitle":"data mining, inference, and prediction",
         "authors":[
            {
               "first name":"Trevor",
               "last name":"Hastie"
            },
            {
               "first name":"Robert",
               "last name":"Tibshirani"
            },
            {
               "first name":"Jerome",
               "last name":"Friedman"
            }
         ],
         "language":"English",
         "edition":"2. ed., corr. at 5. print.",
         "publication date":2011,
         "publishing company":"NY Springer",
         "keywords":[
            "Machine Learning",
            "Statistics"
         ]
      },
      {
         "title":"Learning Spark",
         "subtitle":"lightning-fast data analysis",
         "authors":[
            {
               "first name":"Holden",
               "last name":"Karau"
            },
            {
               "first name":"Andy",
               "last name":"Konwinski"
            },
            {
               "last name":"Wendell"
            },
            {
               "first name":"Matei",
               "last name":"Zaharia"
            }
         ],
         "language":"English",
         "publication date":2015,
         "publishing company":"Beijing [u.a.] O'Reilly",
         "keywords":[
            "Machine Learning",
            "Cluster",
            "Big Data"
         ]
      }
   ]
}

Looking at the JSON, you will see that there are several relations within the data that cannot be expressed as a simple table. For example, there can be various author objects per book. For now however, we are only going to extract the tabular data, but the extension can easily extract these relations as separate, relational tables or denormalized with duplicated entries.

As can also be seen in the tree view to the right, the extension uses operators with subprocesses to model the JSON’s structure. There is an Array operator within the Process Object operator, which in turn houses Extract Properties and Commit Row operators.

Together, these operators define the structure of how data is extracted from the JSON file. Process Object and Process Array will tell the parser to enter the object or the array. As the JSON always starts with an object or an array, the outermost operator will not contain any parameters. However, note that inside of this outer operator, there can be further Process Array and Process Object operators which will have parameters, mirroring the JSON structure.

Let’s see how this looks for the inner Process Array operator:

Here, we enter the „books“ property and define it to be an array of objects. Whatever happens now inside the Process Array operator will react to the inner objects.

Taking a look at the inner process of Process Array, we see two operators: Extract Properties and Commit Row. Both define what is supposed to happen when the parser finds an object within the book array of a given JSON. In short, this process defines that the parser is supposed to extract some properties and store them as Attributes in the resulting example set and then finish a row in the resulting example set when the object is processed, meaning we can extract any information from properties of the current object that have scalar values like strings, numbers or booleans. Once we have done that, we can commit the row, so that each object is represented by a row in the resulting example set.

In this case, the objects in books are provided with titles, subtitles, and more detailed information such as language or publication date. To extract these properties, we configure the extract properties parameter in the Extract Properties operator: under “property”, we set the name of the property. In this case title, subtitle and so on. On the right under “Attribute name and type” we can freely assign a name of an attribute and its type. When the definition is used for parsing, the result will contain this attribute and the information will be stored in it.

Please note that the specification object needs to be passed to the outputs of the Process Object and Process Array operators. There is always one input and multiple outputs, but you at least have to provide one output. You can get the final specification on the outermost Process Object or Process Array operator.

Use a parser specification to parse JSON data

In the next paragraph, we will see how we can use the specification created in the last paragraph to actually parse a JSON. We will continue the scenario where we got the book data from a database table. All we need to do now is to add a Process JSON from Data operator, feed it with the data and the parser specification and tell it which Attribute contains the JSON data.

The parser will then take the input specification(s) and generate one data set per specification.


You can use the same specification with the various other parse operators. There’s one to parse JSON from files and file objects, coming in very handy when used in connection with the RapidMiner Scoring Agents or RapidMiner Server’s web services. You can also use JSON delivered by a macro, which finds its application mainly in the deployment as web service or in the web application building. And last but not least, the integration with the new operators to interact with RESTful web services allows to parse web service responses directly without overhead.



Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    This is a great extension!  It really makes working with JSON files much easier and is well worth the modest cost for anyone who works with JSON formats regularly.
  • Somnath_Das
    Somnath_Das New Altair Community Member
    The extension looks good, can we get a demo license to check out the operators?
    You can contact us in the mail id somnath@datafotis.com
  • Jana_OWC
    Jana_OWC New Altair Community Member
    Great that you like our extension, we have send you a demo license. You can also check out our other RapidMiner extension: https://oldworldcomputing.com/en/software-products/
    Our newest extension is the WebAppBuilder, we published an introduction here: https://youtu.be/1KRNEGjPw4E