[SOLVED] Crawl Web and generate reporting

pemguinkpl
pemguinkpl New Altair Community Member
edited November 5 in Community Q&A
Hi,

I have tried the Crawl Web process, but the result shows that no documents were crawled. May I know what the problem is?
I followed the steps from the video below exactly, but still ran into this problem.

http://www.youtube.com/watch?v=zMyrw0HsREg

Any help please... :-\

Also, how do I use the Generate Report and Report operators in RapidMiner?
Does anyone know?

Thank You!

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    I didn't watch the video and don't have the time to.  Could you please post your process and describe more specifically what you are trying to do?


    Best regards,
    Marius
  • pemguinkpl
    pemguinkpl New Altair Community Member
    Hi Marius, thanks for replying,

    My initial research is to analyze H1N1 news, using a crawler to collect all the news about H1N1. This is the link I am trying to crawl:

    http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00%2B08:00&max-results=20

    But I can't get any documents.

    This is my process XML:

    <process version="5.1.014">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
        <process expanded="true" height="386" width="547">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
            <parameter key="url" value="http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00+08:00&amp;max-results=20"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+suiteid.+"/>
              <parameter key="follow_link_with_matching_url" value=".+pagenum.+|.+suiteid.+"/>
            </list>
            <parameter key="output_dir" value="D:\FYP\result\test\crawl"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_depth" value="1"/>
            <parameter key="delay" value="500"/>
            <parameter key="max_threads" value="4"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.27 Safari/532.0"/>
          </operator>
          <operator activated="false" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="165">
            <parameter key="repository_entry" value="../new"/>
          </operator>
          <operator activated="false" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="30">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="9999"/>
            <list key="specify_weights"/>
            <process expanded="true" height="396" width="709">
              <operator activated="false" class="text:extract_information" compatibility="5.1.004" expanded="true" height="60" name="Extract Information" width="90" x="113" y="89">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="h1n1" value="(h1n1\W+(?:\w+\W+){1,5}?influenzah1n1)"/>
                  <parameter key="influenza" value="(influenzah1n1\W+(?:\w+\W+){1,5}?)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="false" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="300">
            <parameter key="repository_entry" value="../new"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    May I know what the problem is? Thanks =)

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    the problem is that the page you are trying to crawl does not allow crawling, and of course RapidMiner obeys this exclusion by default. The Crawl Web operator has two options for ignoring the so-called robot exclusion, but as the documentation says, you are usually not allowed to disable it for pages that are not your own. These are the parameters:

    obey robot exclusion: Specifies whether the crawler obeys the rules that define which pages on a site may be visited by a robot. Disable only if you know what you are doing and if you are sure not to violate any existing laws by doing so. Range: boolean; default: true
    really ignore exclusion: Do you really want to ignore the robot exclusion? This might be illegal. Range: boolean; default: false
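
    For a site you own or are explicitly permitted to crawl, the two parameters above might be set in the process XML roughly as in the sketch below. The key names obey_robot_exclusion and really_ignore_exclusion are an assumption derived from the parameter names listed above, and both the URL and the crawling rules are placeholders, so check them against what your RapidMiner version writes out before relying on this.

    ```xml
    <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" name="Crawl Web">
      <!-- Placeholder URL: replace with a site you are allowed to crawl -->
      <parameter key="url" value="http://www.example.com/"/>
      <list key="crawling_rules">
        <!-- Placeholder rules: store and follow every URL -->
        <parameter key="store_with_matching_url" value=".+"/>
        <parameter key="follow_link_with_matching_url" value=".+"/>
      </list>
      <!-- Assumed key names; both must be changed to ignore the robot exclusion -->
      <parameter key="obey_robot_exclusion" value="false"/>
      <parameter key="really_ignore_exclusion" value="true"/>
    </operator>
    ```

    Leaving both parameters at their defaults (true and false) keeps the crawler obeying the site's exclusion rules.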

    Best,
    Marius
  • pemguinkpl
    pemguinkpl New Altair Community Member
    Hi Marius,

    thank you for the replies, that solved my problem  ;)