"Extract data from a website to Excel-file"
currant
New Altair Community Member
Hi!
I want to extract some data from the following website:
http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection
I want to generate a list in Excel with an ID-column and a pesticide name-column. Both should be extracted from the website, section "2) Select Pesticide".
As I am a RM beginner, I did not manage to cut the html-document and generate the two columns... The only thing I was able to do, was to get the web page on my computer.
Could anyone give me some hints I could start with?
Thanks in advance!
Currant
I want to extract some data from the following website:
http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection
I want to generate a list in Excel with an ID-column and a pesticide name-column. Both should be extracted from the website, section "2) Select Pesticide".
As I am a RM beginner, I did not manage to cut the html-document and generate the two columns... The only thing I was able to do, was to get the web page on my computer.
Could anyone give me some hints I could start with?
Thanks in advance!
Currant
0
Answers
-
Hu Currant,
this seems to be an easy task. After retrieving the website I would suggest to use XPath since the desired data is structured within a HTML select element. If you want to deal with multiple XPath or RegEx matches there is only one interesting operator: "Cut Document" (contained in the Text Processing extension).
Here a little working example. You just have to add the id and convert the collection into a table ("Documents to Data").<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Best regards
<process version="5.1.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
<process expanded="true" height="145" width="279">
<operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
<parameter key="url" value="http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="content" value="//h:select[@id='optionsList1']/h:option/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true" height="607" width="747">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Matthias0 -
Dear All,
Now, I managed to the html-files on my computer and I could extract the html-information from the file. See script below.
Code:<process version="5.1.006">
Now, my question is:
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="283" width="435">
<operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="C:\temp\Bentazon.htm"/>
<parameter key="extract_text_only" value="false"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="30">
<process expanded="true" height="296" width="429">
<operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Namen" value="//h:h1[@style='text-align:left']/text()"/>
<parameter key="Animal" value="//h:div[@class='col60']/h:ul/h:li[2]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<operator activated="true" class="store" compatibility="5.1.006" expanded="true" height="60" name="Store" width="90" x="179" y="30">
<parameter key="repository_entry" value="RapidMiner/RDs"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
How can I export the attributes "Namen" and "Animal" to Excel?
(It's probably an easy job ... but I could not work it out...
currant0 -
Hi,
you can select the desired attributes via the "Select Attributes" operator, just select subset as the attribute filter type and select the desired attributes. Then put in a "Write Excel" operator, specify the file, and voilà.
Your modified process (you might have to change the selected attributes):<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Regards,
<process version="5.1.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
<process expanded="true" height="283" width="547">
<operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="C:\temp\Bentazon.htm"/>
<parameter key="extract_text_only" value="false"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="30">
<process expanded="true" height="296" width="550">
<operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Namen" value="//h:h1[@style='text-align:left']/text()"/>
<parameter key="Animal" value="//h:div[@class='col60']/h:ul/h:li[2]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<operator activated="true" class="store" compatibility="5.1.008" expanded="true" height="60" name="Store" width="90" x="179" y="30">
<parameter key="repository_entry" value="RapidMiner/RDs"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|Namen|Animal"/>
</operator>
<operator activated="true" class="write_excel" compatibility="5.1.008" expanded="true" height="60" name="Write Excel" width="90" x="447" y="30">
<parameter key="excel_file" value="C:\export.xls"/>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
<connect from_op="Write Excel" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Marco0 -
Now it works!
Thanx a lot for your support!
currant0