Crawling a news site for a specific keyword with Crawl Web
Hi everyone
I am new here! I have a problem with the Crawl Web operator which I'm not able to solve; I have tried and googled for weeks now. (It seems pretty simple, but I just don't get it.)
I want to crawl a news site (here: http://www.bbc.com/) for a keyword (here: .*zuckerberg.*) and save 100 results as .txt files.
But it just doesn't work; I have tried everything and can't get it done.
I hope you can help me. Please see my process XML below.
Thank you very much in advance for your help!
<?xml version="1.0" encoding="UTF-8"?>
<process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
-<operator name="Process" expanded="true" compatibility="8.2.000" class="process" activated="true">
<parameter value="init" key="logverbosity"/>
<parameter value="2001" key="random_seed"/>
<parameter value="never" key="send_mail"/>
<parameter value="" key="notification_email"/>
<parameter value="30" key="process_duration_for_mail"/>
<parameter value="SYSTEM" key="encoding"/>
<process expanded="true">
<operator name="Crawl Web" expanded="true" compatibility="7.3.000" class="web:crawl_web" activated="true" y="34" x="112" width="90" height="68">
<parameter value="http://www.bbc.com/" key="url"/>
<list key="crawling_rules">
<parameter value=".*tech.*" key="follow_link_with_matching_url"/>
<parameter value=".*zuckerberg.*" key="store_with_matching_url"/>
<parameter value=".*news.*" key="follow_link_with_matching_url"/>
<parameter value=".*zuckerberg.*" key="store_with_matching_content"/>
</list>
<parameter value="false" key="write_pages_into_files"/>
<parameter value="true" key="add_pages_as_attribute"/>
<parameter value="txt" key="extension"/>
<parameter value="100" key="max_pages"/>
<parameter value="4" key="max_depth"/>
<parameter value="web" key="domain"/>
<parameter value="1000" key="delay"/>
<parameter value="2" key="max_threads"/>
<parameter value="10000" key="max_page_size"/>
<parameter value="rapid-miner-crawler" key="user_agent"/>
<parameter value="true" key="obey_robot_exclusion"/>
<parameter value="false" key="really_ignore_exclusion"/>
</operator>
-<operator name="Process Documents from Data" expanded="true" compatibility="8.1.000" class="text:process_document_from_data" activated="true" y="34" x="313" width="90" height="82">
<parameter value="false" key="create_word_vector"/>
<parameter value="TF-IDF" key="vector_creation"/>
<parameter value="true" key="add_meta_information"/>
<parameter value="true" key="keep_text"/>
<parameter value="none" key="prune_method"/>
<parameter value="3.0" key="prune_below_percent"/>
<parameter value="30.0" key="prune_above_percent"/>
<parameter value="0.05" key="prune_below_rank"/>
<parameter value="0.95" key="prune_above_rank"/>
<parameter value="double_sparse_array" key="datamanagement"/>
<parameter value="auto" key="data_management"/>
<parameter value="false" key="select_attributes_and_weights"/>
<list key="specify_weights"/>
<process expanded="true">
<operator name="Extract Content" expanded="true" compatibility="7.3.000" class="web:extract_html_text_content" activated="true" y="34" x="45" width="90" height="68">
<parameter value="true" key="extract_content"/>
<parameter value="5" key="minimum_text_block_length"/>
<parameter value="true" key="override_content_type_information"/>
<parameter value="true" key="neglegt_span_tags"/>
<parameter value="true" key="neglect_p_tags"/>
<parameter value="true" key="neglect_b_tags"/>
<parameter value="true" key="neglect_i_tags"/>
<parameter value="true" key="neglect_br_tags"/>
<parameter value="true" key="ignore_non_html_tags"/>
</operator>
<operator name="Unescape HTML Document" expanded="true" compatibility="7.3.000" class="web:unescape_html" activated="true" y="34" x="179" width="90" height="68"/>
-<operator name="Write Document" expanded="true" compatibility="8.1.000" class="text:write_document" activated="true" y="34" x="313" width="90" height="82">
<parameter value="true" key="overwrite"/>
<parameter value="SYSTEM" key="encoding"/>
</operator>
-<operator name="Write File" expanded="true" compatibility="8.2.000" class="write_file" activated="true" y="136" x="447" width="90" height="68">
<parameter value="file" key="resource_type"/>
<parameter value="C:\Users\Ittaj\Desktop\rapidminer\tests\%{t}-%{a}.txt" key="filename"/>
<parameter value="application/octet-stream" key="mime_type"/>
</operator>
<connect to_port="document" to_op="Extract Content" from_port="document"/>
<connect to_port="document" to_op="Unescape HTML Document" from_port="document" from_op="Extract Content"/>
<connect to_port="document" to_op="Write Document" from_port="document" from_op="Unescape HTML Document"/>
<connect to_port="document 1" from_port="document" from_op="Write Document"/>
<connect to_port="file" to_op="Write File" from_port="file" from_op="Write Document"/>
<portSpacing spacing="0" port="source_document"/>
<portSpacing spacing="0" port="sink_document 1"/>
<portSpacing spacing="0" port="sink_document 2"/>
</process>
</operator>
<connect to_port="example set" to_op="Process Documents from Data" from_port="Example Set" from_op="Crawl Web"/>
<connect to_port="result 1" from_port="example set" from_op="Process Documents from Data"/>
<portSpacing spacing="0" port="source_input 1"/>
<portSpacing spacing="0" port="sink_result 1"/>
<portSpacing spacing="0" port="sink_result 2"/>
</process>
</operator>
</process>
Answers
-
Hmm, I think your XML code is broken. Can you please go to the XML panel and copy and paste your process into this thread?
-
Thanks, I'll try it again:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
<parameter key="url" value="http://www.bbc.com/"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*tech.*"/>
<parameter key="follow_link_with_matching_url" value=".*news.*"/>
<parameter key="store_with_matching_url" value=".*zuckerberg.*"/>
<parameter key="store_with_matching_content" value=".*zuckerberg.*"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="max_pages" value="100"/>
<parameter key="max_depth" value="4"/>
<parameter key="max_threads" value="2"/>
<parameter key="max_page_size" value="10000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34"/>
<operator activated="true" class="web:unescape_html" compatibility="7.3.000" expanded="true" height="68" name="Unescape HTML Document" width="90" x="179" y="34"/>
<operator activated="true" class="text:write_document" compatibility="8.1.000" expanded="true" height="82" name="Write Document" width="90" x="313" y="34"/>
<operator activated="true" class="write_file" compatibility="8.2.000" expanded="true" height="68" name="Write File" width="90" x="447" y="136">
<parameter key="filename" value="C:\Users\Ittaj\Desktop\rapidminer\tests\%{t}-%{a}.txt"/>
</operator>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Unescape HTML Document" to_port="document"/>
<connect from_op="Unescape HTML Document" from_port="document" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="document 1"/>
<connect from_op="Write Document" from_port="file" to_op="Write File" to_port="file"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
This type of setting works for me, retrieving articles that mention Zuckerberg:
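For illustration, a rule set of that shape in the process XML; the regex values here are placeholders, not necessarily the exact settings I used:
<list key="crawling_rules">
  <!-- follow links whose URL looks like a news page -->
  <parameter key="follow_link_with_matching_url" value=".*news.*"/>
  <!-- store pages whose text mentions the keyword; the character class
       covers capitalization in case the matching is case-sensitive -->
  <parameter key="store_with_matching_content" value=".*[Zz]uckerberg.*"/>
</list>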
When you say "it doesn't work", what exactly do you mean? Does the process hang, or deliver wrong results?
-
Hi @kypexin,
I tried a lot of different variants (in rule application/value, and also depth and links).
Usually the process runs for a second and there are no results. Sometimes I got a few results (fewer than 20, but I need around 100).
I'm trying it right now with your rules; it has been running for 2 minutes, I will update soon.
-
So I tried it again with your rules, and I only got 8 results, with some duplicates.
Any idea how I can crawl a news site for Zuckerberg and get 100 results?
-
@ittaj_goldberge does the news site have more than 8 Zuckerberg articles? You might have to increase the depth parameter to dig deeper.
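For example, in the process XML (the value 8 is just an illustration; each extra level of depth can multiply the number of pages visited):
<parameter key="max_depth" value="8"/>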
-
Hi @Thomas_Ott,
When I go to the search bar on BBC and look for Zuckerberg, there are thousands of results:
https://www.bbc.co.uk/search?q=zuckerberg#page=5
-
@ittaj_goldberge I'm by no means a web crawling expert, but lately for some client work I was exposed to web browser automation. Websites have gotten smart: to prevent people from crawling them, they use various scripts that hide content which isn't on the first page or 'above the fold.'
I suspect that this is the case here. The link you shared is really a search query; it requires a browser to render and probably doesn't work with a web crawler like RapidMiner's. So that could be the problem.
-
If this is the case, as @Thomas_Ott mentioned, I would also expect that you could play around with the 'user agent' and 'obey robot exclusion' parameters of the Crawl Web operator (namely, change the user agent string, disable the checkbox, and then compare the results):
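For illustration, a sketch of the Crawl Web operator with those two parameters changed, reusing the parameter keys from the first XML paste (the user agent string is just an example of a browser-like value, and the regexes are placeholders):
<operator activated="true" class="web:crawl_web" compatibility="7.3.000" expanded="true" name="Crawl Web">
  <parameter key="url" value="http://www.bbc.com/"/>
  <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value=".*news.*"/>
    <parameter key="store_with_matching_content" value=".*[Zz]uckerberg.*"/>
  </list>
  <!-- a browser-like user agent instead of the default "rapid-miner-crawler" -->
  <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64)"/>
  <!-- skip robots.txt handling for this test; check the site's terms first -->
  <parameter key="obey_robot_exclusion" value="false"/>
  <parameter key="really_ignore_exclusion" value="true"/>
  <parameter key="max_pages" value="100"/>
  <parameter key="max_depth" value="4"/>
</operator>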
-