extract URL from text
Dear community,
I am trying to extract a URL from a text. I want to parse not only Twitter posts for mentioned URLs but also other news content.
I then want to feed the Get Page operator with the URLs. I am fine with that part, but so far I have not managed to extract the URLs. I have already tried the Extract Information operator...
Help is much appreciated!
Thanks,
Julian
Answers
-
@julian_d This is pretty easy, and you were on the right track. You have to use the Regular Expression query type in the Extract Information operator.
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Links for later use" width="90" x="112" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="Tweet Links" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
-
The Aylien and Rosette "Extract Entity" operators within RapidMiner will also let you pull out URLs if you want to go down the third-party API route.
-
Thanks @Thomas_Ott and @Telcontar120 for your quick replies. However, it seems I cannot manage to feed the Extract Information operator with the right document type. I am trying to convert the content of the Twitter post into a document.
I created a sample process. Do you have any idea?
Thanks again
Julian
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<operator activated="true" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="85">
<list key="attribute_values">
<parameter key="testspalte" value="&quot;http://www.web.de&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="testspalte"/>
<parameter key="attributes" value="Text"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="85">
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="581" y="85">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Nominal"/>
<list key="regular_expression_queries">
<parameter key="link" value="http://.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="true"/>
<parameter key="assume_html" value="true"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
</process>
-
Hi @julian_d,
Your XML process is broken. To share your process properly:
1. However, I think you can use this simple process, which uses the Extract Entities operator of the Aylien extension (to install the latest version of this extension, go to the Marketplace), as indicated by @Telcontar120.
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
<parameter key="connection" value="dkk"/>
<parameter key="query" value="Tesla"/>
<parameter key="limit" value="5"/>
</operator>
<operator activated="true" class="com.aylien.textapi.rapidminer:aylien_entities" compatibility="0.2.000" expanded="true" height="68" name="Extract Entities" width="90" x="246" y="34">
<parameter key="connection" value="Aylien_dkk"/>
<parameter key="input_attribute" value="Text"/>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Extract Entities" to_port="Example Set"/>
<connect from_op="Extract Entities" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
2. Like you, I tried to feed the Extract Information operator from the Search Twitter operator in RapidMiner, but I get only one Tweet in the results.
So I am curious to see answers that solve this problem.
Here is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
<parameter key="connection" value="dkk"/>
<parameter key="query" value="Tesla"/>
<parameter key="limit" value="5"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="Text" value="1.0"/>
</list>
</operator>
<operator activated="true" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="313" y="34"/>
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="34">
<parameter key="text_attribute" value="text"/>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I hope this helps, and thank you.
Regards,
Lionel
-
@lionelderkrikor you're making it a bit hard on yourself. Try this process on for size:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
<parameter key="connection" value="Twitter - Studio Connection"/>
<parameter key="query" value="Tesla"/>
<parameter key="limit" value="5"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="238">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="238"/>
<operator activated="false" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="Text" value="1.0"/>
</list>
</operator>
<operator activated="false" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="313" y="34"/>
<operator activated="false" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="false" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="34">
<parameter key="text_attribute" value="text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="238">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="112" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Hi @Thomas_Ott,
Thank you for your solution.
Why didn't I think of Process Documents from Data? I don't know... Maybe because it's the end of the day here and it's time to sleep.
Thanks again,
Best regards,
Lionel
-
Thanks for your solutions and feedback @Thomas_Ott @lionelderkrikor
The Process Documents operator nearly gets me there. However, if the link is followed by further text, the operator does not seem able to keep only the link. I attached my sample process; I hope it works this time.
Furthermore, the Extract Information operator does not really isolate the URL. Since I want to feed a Get Page operator, I need to extract the URL only. It would be awesome if you had another hint.
Thanks
Julian
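A note on why the link is not isolated: in the pattern `http.*`, the `.*` part is greedy and keeps matching up to the end of the line, so everything after the link is captured too. The sketch below shows this in Python rather than in RapidMiner (RapidMiner expressions follow Java regex syntax as far as I know, and these two patterns should behave the same in both). The alternative pattern `https?://\S+` is only a suggestion, not something taken from the processes in this thread.

```python
import re

text = "This is the test link http://www.google.com. Let's see if it works"

# Pattern used in the thread: '.*' is greedy and runs to the end of the line.
greedy = re.search(r"http.*", text).group(0)

# Suggested alternative (an assumption): stop at the first whitespace character.
bounded = re.search(r"https?://\S+", text).group(0)

print(greedy)
print(bounded)
```

The bounded pattern stops at the first space, but it still keeps the sentence-ending period because that character is not whitespace.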
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
<list key="attribute_values">
<parameter key="testcolumn" value="&quot;This is the test link http://www.google.com. Let's see if it works&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="testcolumn"/>
<parameter key="attributes" value="Text"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="514" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="112" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="238">
<list key="attribute_values">
<parameter key="testcolumn" value="&quot;This is the test link http://www.google.com. Let's see if it works&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="246" y="238">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="380" y="238"/>
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="238">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="link" value="http://.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="648" y="238">
<parameter key="text_attribute" value="zero"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text (3)" to_port="example set input"/>
<connect from_op="Nominal to Text (3)" from_port="example set output" to_op="Process Documents from Data (3)" to_port="example set"/>
<connect from_op="Process Documents from Data (3)" from_port="example set" to_port="result 1"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
-
@julian_d it's kind of funny, but I'm working on something similar. I have something below that technically works, BUT it only works for http.*.com URLs, which limits it to .com addresses. Not great.
I think the trick here is to tokenize the text so that you don't destroy the full http://link.com, select it inside Process Documents, and set it as the URL attribute. Then, outside the Process Documents operator, you'll have to use Extract Macro and loop over a Get Page to pull in the URLs.
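The flow described here (one URL per row, then a loop that hands each URL to Get Page) might look roughly like this outside RapidMiner. This is a Python sketch under assumptions, not a RapidMiner process: `fetch_page` is a hypothetical stand-in for Get Page that returns a placeholder string and performs no request, and the pattern `https?://\S+` is my own choice so that the match is not tied to `.com`.

```python
import re

# Assumption: a URL is "http(s)://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def first_url(text):
    """Return the first URL-like token in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


def fetch_page(url):
    """Hypothetical stand-in for the Get Page operator; makes no request."""
    return f"<fetched {url}>"


def fetch_all(rows, fetch=fetch_page):
    """Loop over the rows, skip rows without a URL, and fetch the rest."""
    pages = []
    for text in rows:
        url = first_url(text)
        if url is not None:
            pages.append(fetch(url))
    return pages


rows = [
    "Reading https://example.org/post today",
    "No link in this one",
    "Docs at http://example.de/help and more text",
]
pages = fetch_all(rows)
print(pages)
```

Skipping rows without a match plays the same role as filtering out missing values before the loop, so the fetch step never receives an empty URL.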
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
<list key="attribute_values">
<parameter key="testcolumn" value="&quot;This is the test link http://www.google.com. Let's see if it works&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="testcolumn"/>
<parameter key="attributes" value="Text"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="380" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="112" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="(http://.*.com)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="extract_macro" compatibility="8.1.001" expanded="true" height="68" name="Extract Macro" width="90" x="514" y="34">
<parameter key="macro" value="linky"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="URL"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="false" breakpoints="after" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="238">
<list key="attribute_values">
<parameter key="testcolumn" value="&quot;This is the test link http://www.google.com. Let's see if it works&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="false" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="246" y="238">
<list key="specify_weights"/>
</operator>
<operator activated="false" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="380" y="238"/>
<operator activated="false" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="238">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="link" value="http://.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="false" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="648" y="238">
<parameter key="text_attribute" value="zero"/>
</operator>
<operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="715" y="34">
<parameter key="url" value="%{linky}"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text (3)" to_port="example set input"/>
<connect from_op="Nominal to Text (3)" from_port="example set output" to_op="Process Documents from Data (3)" to_port="example set"/>
<connect from_op="Process Documents from Data (3)" from_port="example set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Get Page" from_port="output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
@julian_d this is incredibly crude but fast. You'll have to tune how you want to tokenize and extract only the URLs.
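To illustrate the tokenize-then-filter idea in isolation, here is a small Python sketch. It is not the RapidMiner process itself: the function name `url_tokens` and the sample tweet (including its `t.co` link) are made up, and the separator characters, a space and `#`, are simply the ones this kind of crude tokenization might use.

```python
def url_tokens(text, separators=" #"):
    """Split text on the separator characters and keep tokens starting with http."""
    # Turn every separator into a space so that str.split() can do the cutting.
    for ch in separators:
        text = text.replace(ch, " ")
    return [token for token in text.split() if token.startswith("http")]


# Made-up sample text for illustration only.
tweet = "New release #rapidminer https://t.co/abc123 check it out http://example.com/page"
print(url_tokens(tweet))
```

One trade-off to keep in mind: treating `#` as a separator also cuts a URL at its fragment (for example `http://example.com/page#top`), so the separator set needs tuning for the data at hand.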
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
<list key="attribute_values">
<parameter key="testcolumn" value="&quot;This is the test link http://www.google.com. Let's see if it works&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="false" breakpoints="after" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="238">
<list key="attribute_values">
<parameter key="testcolumn" value="&quot;This is the test link http://www.google.com. Let's see if it works&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="false" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="246" y="238">
<list key="specify_weights"/>
</operator>
<operator activated="false" class="text:combine_documents" compatibility="8.1.000" expanded="true" height="82" name="Combine Documents" width="90" x="380" y="238"/>
<operator activated="false" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="238">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="link" value="http://.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="false" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="648" y="238">
<parameter key="text_attribute" value="zero"/>
</operator>
<operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="136">
<parameter key="connection" value="Twitter"/>
<parameter key="query" value="rapidminer"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
<parameter key="attributes" value="Text"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="380" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=" #"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="179" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="split" compatibility="8.1.001" expanded="true" height="82" name="Split" width="90" x="514" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="URL"/>
<parameter key="split_pattern" value=" "/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="URL_1"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="782" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="URL_1.is_not_missing."/>
</list>
</operator>
<operator activated="true" class="extract_macro" compatibility="8.1.001" expanded="true" height="68" name="Extract Macro (3)" width="90" x="916" y="34">
<parameter key="macro" value="numnum"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="concurrency:loop" compatibility="8.1.001" expanded="true" height="82" name="Loop" width="90" x="1050" y="34">
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="extract_macro" compatibility="8.1.001" expanded="true" height="68" name="Extract Macro (2)" width="90" x="112" y="34">
<parameter key="macro" value="linky"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="URL_1"/>
<parameter key="example_index" value="%{iteration}"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="313" y="34">
<parameter key="url" value="%{linky}"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
<connect from_op="Get Page" from_port="output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="extract_macro" compatibility="8.1.001" expanded="true" height="68" name="Extract Macro" width="90" x="715" y="136">
<parameter key="macro" value="linky"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="URL"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Search Twitter" from_port="output" to_op="Nominal to Text (3)" to_port="example set input"/>
<connect from_op="Nominal to Text (3)" from_port="example set output" to_op="Process Documents from Data (3)" to_port="example set"/>
<connect from_op="Process Documents from Data (3)" from_port="example set" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro (3)" to_port="example set"/>
<connect from_op="Extract Macro (3)" from_port="example set" to_op="Loop" to_port="input 1"/>
<connect from_op="Loop" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Thanks very much @Thomas_Ott for your immediate and awesome support. I managed to integrate it into my main process. Unfortunately, the Get Page operator seems to have an issue with redirects. Many of the gathered URLs are Twitter forwards to other pages and start with https://t.co/.. I have tried hard to build a workaround to reach the final page but have not succeeded yet. To demonstrate my problem, I extended your process a bit. Sorry to ask again, but... any idea?
Do you also know how I can strip dots and commas that directly follow a URL? Currently I get an error whenever somebody mentions a URL followed immediately by a comma or a dot, because the Process Documents operator treats the punctuation as part of the URL, and the Get Page operator then complains about a corrupt input.
Thank you!
Julian
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
<parameter key="connection" value="Twitter_DUJ"/>
<parameter key="query" value="rapidminer"/>
<parameter key="limit" value="10"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.1.003" expanded="true" height="103" name="Multiply" width="90" x="179" y="34"/>
<operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="313" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Id|Text|From-User"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.003" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
<parameter key="attributes" value="Text"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="447" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=" #"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="179" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="URL" value="http.*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="split" compatibility="8.1.003" expanded="true" height="82" name="Split" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="URL"/>
<parameter key="split_pattern" value=" "/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes" width="90" x="715" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="URL_1"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.1.003" expanded="true" height="103" name="Filter Examples" width="90" x="849" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="URL_1.is_not_missing."/>
</list>
</operator>
<operator activated="true" class="loop_examples" compatibility="8.1.003" expanded="true" height="103" name="Loop Examples (3)" width="90" x="983" y="34">
<parameter key="iteration_macro" value="interation"/>
<process expanded="true">
<operator activated="true" class="extract_macro" compatibility="8.1.003" expanded="true" height="68" name="Extract Macro (4)" width="90" x="45" y="34">
<parameter key="macro" value="link_macro"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="URL_1"/>
<parameter key="example_index" value="%{interation}"/>
<list key="additional_macros">
<parameter key="id_macro" value="Id"/>
<parameter key="followed_url_macro" value="URL_1"/>
</list>
</operator>
<operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page (3)" width="90" x="246" y="34">
<parameter key="url" value="%{link_macro}"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="380" y="34"/>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data (3)" width="90" x="514" y="34">
<parameter key="text_attribute" value="content_followed_url"/>
<parameter key="add_meta_information" value="false"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="648" y="34">
<list key="function_descriptions">
<parameter key="id" value="%{id_macro}"/>
<parameter key="followed_url" value="%{followed_url_macro}"/>
</list>
</operator>
<connect from_port="example set" to_op="Extract Macro (4)" to_port="example set"/>
<connect from_op="Get Page (3)" from_port="output" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Documents to Data (3)" to_port="documents 1"/>
<connect from_op="Documents to Data (3)" from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="8.1.003" expanded="true" height="82" name="Append" width="90" x="1117" y="34"/>
<operator activated="true" class="set_role" compatibility="8.1.003" expanded="true" height="82" name="Set Role" width="90" x="1251" y="34">
<parameter key="attribute_name" value="id"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles">
<parameter key="content_followed_url" value="regular"/>
</list>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.1.003" expanded="true" height="82" name="Join" width="90" x="1385" y="136">
<parameter key="join_type" value="outer"/>
<list key="key_attributes"/>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Nominal to Text (3)" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Nominal to Text (3)" from_port="example set output" to_op="Process Documents from Data (3)" to_port="example set"/>
<connect from_op="Process Documents from Data (3)" from_port="example set" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Loop Examples (3)" to_port="example set"/>
<connect from_op="Loop Examples (3)" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Yes, URLs can be tricky! Right now the regex is just http.*, which picks up everything from http onward and places no restriction on where the match ends, so a trailing comma or dot becomes part of the URL.
So you could reformulate your URL regex to end on something specific, like this: http.*\.[a-zA-Z]{3}
This ends the match at a dot followed by three letters, so it works for URLs that end in a three-letter TLD (.com, .net, .org, etc.) and leaves out trailing characters. Be aware that it also drops any path after the domain, and it will not match short links such as https://t.co/.. because they contain no dot followed by three letters. You can use similar logic to create other versions if you need to capture longer URLs or deal with TLDs that don't have 3 letters.
Or if all this regex starts to give you a headache, you could look at the Extract Entity operators from either MonkeyLearn or Rosette, since both support URL extraction.
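If it helps to try patterns outside of RapidMiner first, here is a minimal Python sketch of one possible variant. The pattern, the function name `extract_urls` and the sample tweet are my own assumptions for illustration and are not taken from the processes above. The idea is to accept any run of non-whitespace characters after http:// or https://, but to require that the last matched character is not common sentence punctuation. The constructs used here should behave the same way in the Java regex syntax that RapidMiner uses, but please test it in your own process.

```python
import re

# Assumed pattern (not from the thread): "http" or "https", then "://",
# then any non-whitespace characters, where the final character must not
# be whitespace or one of . , ; : ! ? ) ]
URL_PATTERN = re.compile(r"https?://\S*[^\s.,;:!?)\]]")


def extract_urls(text):
    """Return every URL found in text, without trailing punctuation."""
    return URL_PATTERN.findall(text)


# Hypothetical sample text with a comma and a dot directly after the URLs.
tweet = "New release: https://t.co/AbC123, docs at http://example.org/guide."

for url in extract_urls(tweet):
    print(url)

# For comparison: the three-letter-TLD pattern finds nothing in a t.co link.
print(re.search(r"http.*\.[a-zA-Z]{3}", "https://t.co/AbC123,"))
```

The greedy `\S*` first grabs the trailing comma or dot as well, then backtracks by one character because the final character class rejects it. A URL that legitimately ends in one of the excluded characters would be cut short, so adjust the class to your data.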