"Extract Rows from Text Value (JSON Array)"

megmed
megmed New Altair Community Member
edited November 5 in Community Q&A
Hi,

Is there any solution to extract rows from text (JSON)? I couldn't find an appropriate rapid miner operater.

From a database table I get text data (JSON Array) from a column of the type BLOB.
Data like this:
[{"timestamp":1368525704,"event_type":"dom.movement","cursor_x":"671","cursor_y":"452"},
{"timestamp":1368525704,"event_type":"dom.movement","cursor_x":"657","cursor_y":"462"},
{...},
{...}]


I now want to extract each {...} as a row with columns split by the "," char. At the and I want to get a table like this:
timestampevent_typecursor_xcursor_y
1368525704dom.movement671452
1368525704dom.movement657462
............
How can I transform the JSON Array into rows?

I tried "Extract Document" -> "Cut Document" with RegEx \{(.*?)\} -> "Documents To Data". This works, but "Extract Document" extracts only one doc/row specified by the "example index" param.

Best,
Micha

Answers

  • JEdward
    JEdward New Altair Community Member
    Have you tried JSON to XML then connecting that to a Write Document operator & connecting the file output port to a Read XML operator? 

    That's my normal method. 
  • megmed
    megmed New Altair Community Member
    Thanks for your reply.

    I already tried this before but the problem is JSON to XML need a doc as an input. I have to start with Data Table type which is the output of the Read Database operator. Thus it doesn't work for me.
  • Andrew2
    Andrew2 New Altair Community Member
    Hello Micha

    After a bit of gymnastics I got the attached basically working. The input JSON is probably not quite what you read from your data - I had to add something around it - but it's a start and will give you some ideas.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.013" expanded="true" height="60" name="Generate Data by User Specification (3)" width="90" x="112" y="75">
            <list key="attribute_values">
              <parameter key="json" value="&quot;{\&quot;rows\&quot;: [ {\&quot;timestamp\&quot;:1368525704,\&quot;event_type\&quot;:\&quot;dom.movement\&quot;,\&quot;cursor_x\&quot;:\&quot;671\&quot;,\&quot;cursor_y\&quot;:\&quot;452\&quot;}, {\&quot;timestamp\&quot;:1368525704,\&quot;event_type\&quot;:\&quot;dom.movement\&quot;,\&quot;cursor_x\&quot;:\&quot;657\&quot;,\&quot;cursor_y\&quot;:\&quot;462\&quot;}, {\&quot;timestamp\&quot;:1368525705,\&quot;event_type\&quot;:\&quot;dom.movement\&quot;,\&quot;cursor_x\&quot;:\&quot;656\&quot;,\&quot;cursor_y\&quot;:\&quot;452\&quot;}, {\&quot;timestamp\&quot;:1368525706,\&quot;event_type\&quot;:\&quot;dom.movement\&quot;,\&quot;cursor_x\&quot;:\&quot;659\&quot;,\&quot;cursor_y\&quot;:\&quot;442\&quot;}] }&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.013" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="165"/>
          <operator activated="true" class="text:data_to_documents" compatibility="5.3.001" expanded="true" height="60" name="Data to Documents" width="90" x="246" y="75">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:combine_documents" compatibility="5.3.001" expanded="true" height="76" name="Combine Documents" width="90" x="246" y="165"/>
          <operator activated="true" class="web:json_to_xml" compatibility="5.3.001" expanded="true" height="60" name="Json to XML" width="90" x="380" y="75"/>
          <operator activated="true" class="text:cut_document" compatibility="5.3.001" expanded="true" height="60" name="Cut Document (2)" width="90" x="380" y="165">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="text" value="//rows"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" compatibility="5.3.001" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Numerical"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="time" value="fn:number(/rows//timestamp[1])"/>
                </list>
                <list key="namespaces"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <operator activated="true" class="text:extract_information" compatibility="5.3.001" expanded="true" height="60" name="Extract Information (2)" width="90" x="246" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Numerical"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="cursor_y" value="fn:number(/rows//cursor_y[1])"/>
                </list>
                <list key="namespaces"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <operator activated="true" class="text:extract_information" compatibility="5.3.001" expanded="true" height="60" name="Extract Information (3)" width="90" x="380" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Numerical"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="event_type" value="fn:string(/rows//event_type[1])"/>
                </list>
                <list key="namespaces"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <operator activated="true" class="text:extract_information" compatibility="5.3.001" expanded="true" height="60" name="Extract Information (4)" width="90" x="514" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Numerical"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="cursor_x" value="fn:number(/rows//cursor_x[1])"/>
                </list>
                <list key="namespaces"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_op="Extract Information (2)" to_port="document"/>
              <connect from_op="Extract Information (2)" from_port="document" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_op="Extract Information (4)" to_port="document"/>
              <connect from_op="Extract Information (4)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.3.001" expanded="true" height="76" name="Documents to Data" width="90" x="514" y="75">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.3.013" expanded="true" height="76" name="Select Attributes" width="90" x="514" y="165">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="|query_key|text"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
          <connect from_op="Combine Documents" from_port="document" to_op="Json to XML" to_port="document"/>
          <connect from_op="Json to XML" from_port="document" to_op="Cut Document (2)" to_port="document"/>
          <connect from_op="Cut Document (2)" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    The tricks are

    - Use Nominal to Text so that Data To Documents can be used
    - Combine Documents makes a single document for JSON to XML to work on
    - Use Cut Document to chop the XML document into chunks - the nifty bit is using Tokenize with the power of XPath to determine the number of rows automatically
    - Use various Extract Information operators with XPath inside the Cut Document operator to obtain values from the cut documents individually. This creates meta data in the document that will automatically become attributes in the final example set when the Documents to Data operator is used.

    You'll have to modify it for your needs. Hope it helps.

    regards

    Andrew
  • megmed
    megmed New Altair Community Member
    Hi Andrew,

    Thank you very much for your suggestion. And yes it works, but only with one JSON array of one column field of one row (like your generated data) in my database. Now I have the problem, that the "Combine Documents" operator cannot handle that much documents it is getting from the "Data to Documents" operator. Rapid Miner does not respond when I activate this operator.

    Let me say I have 5000 rows in my database, follows that I will have 5000 JSON arrays from the BLOB column, follows 5000 documents. This might be a problem for the "Combine Documents" operator?

    I would generate some data for you, but I don't know how to generate more than one row of user specific data with the generate data operator.

    Many thanks in advance
    Micha

  • Andrew2
    Andrew2 New Altair Community Member
    Hello Micha

    You can use the Append operator to join together the output from multiple Generate Data operators.

    I did this with the original process but the XML that gets created is not valid and because I don't understand the format of the data, it's difficult to know what to suggest.

    regards

    Andrew
  • rachel_lomasky
    rachel_lomasky New Altair Community Member

    Has there been any update to this method in the past years?  The JSON I have doesn't make properly formed XML (for example, some of the elements are numbers).  It would be nice to parse the JSON directly.