Import list of URLs for "Process Documents from Web"

S1001108 New Altair Community Member
edited November 5 in Community Q&A

I would like to use a CSV with URLs in it as the starting point for "Process Documents from Web". So instead of defining a starting point where RapidMiner begins crawling, I would like to use the URLs from the CSV. However, the operator has no input port.

Hope this is not a stupid question, as I am an absolute RapidMiner beginner.

Regards, Roman

Best Answer

  • sgenzer
    Altair Employee
    Answer ✓

    hi Roman -

     

    So are you sure you want to use "Process Documents from Web"? This operator is rather specific - it is used exclusively to process the PDFs or text files that the crawler finds.

     

    In either case, sure, you can use your CSV of URLs, no problem. Something like this should get you going:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="9.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="concurrency:loop_values" compatibility="9.0.003" expanded="true" height="82" name="Loop Values" width="90" x="179" y="34">
            <parameter key="attribute" value="URL"/>
            <parameter key="iteration_macro" value="URL"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="web:process_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Process Documents from Web" width="90" x="179" y="34">
                <parameter key="url" value="%{URL}"/>
                <list key="crawling_rules"/>
                <process expanded="true">
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                </process>
              </operator>
              <connect from_op="Process Documents from Web" from_port="example set" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Loop Values" to_port="input 1"/>
          <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott

Answers

  • S1001108 New Altair Community Member

    Awesome help! Thank you, Roman

  • Telcontar120 New Altair Community Member

    The operator Get Pages also does exactly what you are asking: it retrieves a set of URLs from a list (the input is an ExampleSet, but if you already have the list in a CSV you can easily turn it into an ExampleSet with Read CSV first). You may find that Process Documents from Web has some "quirks" that make it better to retrieve the pages first separately and then process them with one of the other Process Documents operators.
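
    To illustrate that alternative, here is a minimal sketch of a Read CSV → Get Pages process. Note this is an assumption-laden sketch, not a tested process: the Get Pages operator class (`web:retrieve_webpages`), its parameter names (`link_attribute`, `page_attribute`), and its port names come from the Web Mining extension and may differ slightly between versions, and the CSV is assumed to have its URL column named "URL":

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
        <process expanded="true">
          <!-- Read the CSV that contains one URL per row in an attribute named "URL" -->
          <operator activated="true" class="read_csv" compatibility="9.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <!-- Get Pages fetches each URL and stores the page content in a new attribute -->
          <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="179" y="34">
            <parameter key="link_attribute" value="URL"/>
            <parameter key="page_attribute" value="page"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Get Pages" to_port="example set"/>
          <connect from_op="Get Pages" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>

    From there you can pass the resulting ExampleSet to a regular Process Documents (from Data) operator to tokenize and process the page text.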