Problems with processing the answer from a GET request

David_Bartholomew New Altair Community Member
edited November 5 in Community Q&A
Hi guys,
I want to mine performance data of footballers for an essay.

As a source I found Goaloo1 (I can't post links yet). The problem is that they don't provide the information as a file, so I want to use the Web Mining Extension instead.

I managed to identify the GET request URL that provides all the data for a given season of a given league (I can't post that either). The only problem is that the response is just one big string, which (with some minor regex replacements) can be turned into multiple CSVs. I could do that manually in VS Code, but I would rather learn to do it all properly in RapidMiner.
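
To illustrate what I mean by turning the string into multiple CSVs, here is a minimal Python sketch. The payload shape (`var name = [[...],[...]];` statements) and the names `arrTeam` and `arrPlayer` are assumptions made up for illustration; they are not the actual response.

```python
import ast
import csv
import io
import re

# Hypothetical payload: NOT the real response, only an assumed shape.
SAMPLE = (
    "var arrTeam = [[1,'Arsenal'],[2,'Chelsea']];"
    "var arrPlayer = [[10,'Smith',1,5],[11,'Jones',2,3]];"
)

# Matches statements of the form: var <name> = <array literal>;
ASSIGNMENT = re.compile(r"var\s+(\w+)\s*=\s*(\[.*?\]);", re.DOTALL)


def payload_to_csvs(payload: str) -> dict[str, str]:
    """Return one CSV string per `var name = [[...], ...];` statement."""
    csvs = {}
    for name, literal in ASSIGNMENT.findall(payload):
        # literal_eval only handles plain literals (numbers, quoted strings,
        # nested lists); anything more JavaScript-specific would need cleanup.
        rows = ast.literal_eval(literal)
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        csvs[name] = buffer.getvalue()
    return csvs


if __name__ == "__main__":
    for name, text in payload_to_csvs(SAMPLE).items():
        print(f"--- {name}.csv ---")
        print(text, end="")
```

The non-greedy pattern stops at the first `];`, so a string value containing `];` would break it. That is fine for a sketch but would need handling on real data.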

First things first, I couldn't get the GET (REST) operator to work (I got an "Error accessing REST Service"):
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crud_get" compatibility="9.7.000" expanded="true" height="68" name="GET (REST)" width="90" x="112" y="85">
        <parameter key="request_url" value="https://info.goaloo1.com/jsdata/count/2020-2021/playertech_36.js"/>
        <list key="request_headers"/>
        <parameter key="response_body_type" value="json"/>
        <parameter key="fail_on_endpoint_error" value="true"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="9.4.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="85">
        <parameter key="text_attribute" value="Test"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <connect from_op="GET (REST)" from_port="response" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
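
One thing I have not been able to rule out: the process above sets `response_body_type` to `json`, while the URL ends in `.js`. If the body is a JavaScript snippet rather than JSON, a strict JSON parser would reject it. Below is a minimal Python sketch of that check, using made-up bodies (not the real response). Whether this is what actually triggers the error in RapidMiner is only an assumption on my part.

```python
import json


def parse_body(body: str):
    """Return ("json", parsed) if body is valid JSON, else ("text", body)."""
    try:
        return "json", json.loads(body)
    except json.JSONDecodeError:
        # Not valid JSON: keep the raw text so it can still be processed.
        return "text", body


if __name__ == "__main__":
    # Hypothetical bodies, for illustration only.
    print(parse_body('{"goals": 3}')[0])
    print(parse_body("var arrPlayer = [[10,'Smith']];")[0])
```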

I did manage to get the document by using the "Get Page" operator, though. From what I gathered online, I now need to use the "Replace" operator on an ExampleSet, so I first need to transform the Document into an ExampleSet. I found two ways to do that, but I couldn't get either of them to work.

The first way was to use the "Documents to Data" operator. Although it does give me an ExampleSet that I can use the "Replace" operator on, it cuts off about 99% of the content of the original document:
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.7.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="85">
        <parameter key="url" value="https://info.goaloo1.com/jsdata/count/2020-2021/playertech_36.js"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
        <parameter key="override_encoding" value="false"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="keep_sensitive_headers" value="false"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="9.4.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="85">
        <parameter key="text_attribute" value="Test"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The second way I found was to use the "Process Documents" operator. Same problem:
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="9.4.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
        <parameter key="file" value="C:/Users/[Hidden]"/>
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.4.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="85">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Can anybody help me with my problem? Or should I maybe follow a different approach to mining the data altogether?

I'm very new to RapidMiner, so please excuse any newbie mistakes I make.


Best
David