[ALMOST SOLVED] Web Crawling and Text Editing challenge

leon86it New Altair Community Member
edited November 5 in Community Q&A
Kind people of Rapid-I,
I'm new to RapidMiner and am working on a project that is turning out harder than expected. Maybe it's just that I'm still learning RapidMiner's tools and operators, but here's the situation:

I've got a website with news and articles (www.parolibero.it).

I would like to do three things:
1. Extract the articles from the website (as plain text or, even better, as XML that keeps tags such as title, subtitle, body...)
2. Create an Excel list of the articles with the title and URL of each article
3. Export the data in a graphical format that highlights some chosen differences: for example, a diagram showing how many articles were written in a specific year or by a specific journalist (how can I apply search filters once I have downloaded the data files?)

I've tried web crawling, but all I get is the home page in txt format and an Excel file with just one record.

Can you please help me? At the very least, I would like to know where I'm going wrong or which operators to use.

Thank you very much indeed for your help!
Leon

P.S. There is no copyright issue at all, as I am on the staff of that website.

Answers

  • Nils_Woehler
    Nils_Woehler New Altair Community Member
    Hi,

    to extract information from the site you can, for example, use the Get Page operator followed by Cut Document and Extract Information; see here:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.007">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.007" expanded="true" name="Process">
        <process expanded="true" height="280" width="480">
          <operator activated="true" class="web:get_webpage" compatibility="5.2.001" expanded="true" height="60" name="Get Page" width="90" x="224" y="187">
            <parameter key="url" value="http://www.parolibero.it/"/>
            <parameter key="random_user_agent" value="true"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="380" y="210">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="articels" value="//h:ul[@class=&quot;publicationsList&quot;]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="460" width="844">
              <operator activated="true" breakpoints="before" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    One thing to note is that in XPath every HTML element name must be prefixed with 'h:'. Otherwise the query won't work.
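
    As a rough sketch of how the still-empty Extract Information operator could be filled in to pull a title and URL out of each segment: the fragment below assumes that every segment contains a link (h:a) whose text is the article title, which has not been checked against the actual page markup. The query names "title" and "url" are only illustrative labels, and the fragment would take the place of the empty xpath_queries list inside Extract Information.

    ```xml
    <!-- Sketch only: parameters for the Extract Information operator. -->
    <!-- Assumes each segment holds an h:a link; adjust the XPath to the real markup. -->
    <parameter key="query_type" value="XPath"/>
    <list key="xpath_queries">
      <!-- "title" and "url" are hypothetical query names -->
      <parameter key="title" value="//h:a/text()"/>
      <parameter key="url" value="//h:a/@href"/>
    </list>
    ```

    If the articles turn out to be list items, cutting on something like //h:ul[@class="publicationsList"]/h:li in Cut Document (again an assumption about the markup) should give one segment per article instead of one per list.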

    Best,
    Nils