Webscraping JSON Content With RapidMiner

B00100719 New Altair Community Member
edited November 5 in Community Q&A

I am a student and a RapidMiner novice, and I want to scrape a site that publishes customer reviews, but I cannot get this to work in RapidMiner. Here’s an example of the first webpage:


https://www.unum.com/employees/benefits/disability-insurance/long-term-disability-insurance?bvstate=pg:1/ct:r


RapidMiner picks up everything at the top and bottom of the page, but the actual review text and its associated attributes are stored in JSON, which the RapidMiner operators do not pick up. Whether I use the ‘Get Page(s)’ or the ‘Crawl Web’ operator, that part of the page is not scraped. Have you ever dealt with this before?


The page seems to require a token. The JSON file seems to be dynamically created.  


How do I authenticate?

Where do I get a token? 

Where do I put it?  

How do I get the JSON content?


Please and thanks


A very simple example process follows:


<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\AHQ08\Desktop\Unum Reviews\MyLog.log"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="179" y="289">
        <parameter key="url" value="https://www.unum.com/employees/benefits/disability-insurance/long-term-disability-insurance?bvstate=pg:1/ct:r#"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="289">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
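
For comparison outside RapidMiner, here is a minimal Python sketch of the general idea, not a confirmed solution for this site. Content that is created dynamically is usually fetched by the page through a separate request, which can be seen in the browser's developer tools (network tab) while the reviews load. The sketch assumes that this request is a plain GET whose token travels as a query parameter. The endpoint URL, the parameter names (`token`, `limit`, `offset`) and the JSON keys (`Results`, `Rating`, `ReviewText`) are all hypothetical placeholders and would need to be replaced with whatever the real request shows.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: the real endpoint, parameter names and token must be
# read from the browser's network tab while the review page loads.
ENDPOINT = "https://example.invalid/reviews.json"
TOKEN_PARAM = "token"


def build_url(endpoint, token, page, page_size=10):
    """Append the token and paging parameters to the endpoint as a query string."""
    query = urllib.parse.urlencode({
        TOKEN_PARAM: token,
        "limit": page_size,
        "offset": (page - 1) * page_size,
    })
    return f"{endpoint}?{query}"


def fetch_json(url):
    """Download the URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_reviews(payload, list_key="Results", fields=("Rating", "ReviewText")):
    """Flatten the list stored under list_key, keeping only the chosen fields."""
    return [{f: item.get(f) for f in fields} for item in payload.get(list_key, [])]


# Offline demonstration on a made-up payload shaped like the assumed response.
sample = json.loads('{"Results": [{"Rating": 4, "ReviewText": "Helpful.", "Id": "1"}]}')
rows = extract_reviews(sample)
```

With a real endpoint and token, `extract_reviews(fetch_json(build_url(ENDPOINT, my_token, page=1)))` would return one dictionary per review. In RapidMiner the same URL could in principle be entered in the `url` parameter of ‘Get Page’, with the token supplied under `query_parameters`, again assuming the site accepts the token that way.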


Best Answer

Answers