Web Mining

newbierapid
newbierapid New Altair Community Member
edited November 5 in Community Q&A
Hi all,

I am new to RM and am currently using RM 5.0. My goal is to crawl the web with the Crawl Web operator. By supplying regular-expression rules I can already save the URLs to an Excel file. The problem is that I cannot see the content behind each URL. Once I have the content, I need to strip the HTML from each page.

Can anyone suggest how to proceed? It would be great if someone could list the operator names in process order.

Thanks

Answers

  • colo
    colo New Altair Community Member
    Hi,

    I'm not really sure where the problem lies, since the description is a bit vague. You are using the "Crawl Web" operator and get URLs but no contents? Then use the "add pages as attribute" parameter and you will get both. But I have no clue how regular expressions and an Excel file relate to this... Perhaps you could provide some more details about what you have done (ideally post your process XML) and where you got stuck.

    Regards
    Matthias
  • newbierapid
    newbierapid New Altair Community Member
    Hi Matthias,

    Sorry for giving so little information. I used the Crawl Web operator to crawl a website. Following your suggestion, I can now get the URLs listed. I would like to see the content of each URL; please excuse me if this is a silly question. Once I have the content, I need to remove all tags from each page and do further processing. I am posting my process XML below.

    Thanks in advance

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
        <process expanded="true" height="605" width="692">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.003" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
            <parameter key="url" value="http://www.asklaila.com/search/Bangalore/-/shopping malls/?searchNearby=false&amp;ac=true"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_text" value=".*Shopping Malls.*"/>
            </list>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Sudheendra\Desktop\b"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="4"/>
            <parameter key="user_agent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 1.1.4322)"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • colo
    colo New Altair Community Member
    Hi,

    Sorry, I'm a bit confused... The link and the website content are already there (the latter is contained in the attribute Page). If you want to get rid of the HTML markup, you might do something like this:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
       <process expanded="true" height="605" width="692">
         <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
           <parameter key="url" value="http://www.asklaila.com/search/Bangalore/-/shopping malls/?searchNearby=false&amp;ac=true"/>
           <list key="crawling_rules">
             <parameter key="follow_link_with_matching_text" value=".*Shopping Malls.*"/>
           </list>
           <parameter key="write_pages_into_files" value="false"/>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="output_dir" value="C:\Documents and Settings\Sudheendra\Desktop\b"/>
           <parameter key="extension" value="html"/>
           <parameter key="max_pages" value="4"/>
           <parameter key="user_agent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 1.1.4322)"/>
         </operator>
         <operator activated="true" class="text:process_document_from_data" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="30">
           <parameter key="create_word_vector" value="false"/>
           <parameter key="keep_text" value="true"/>
           <list key="specify_weights"/>
           <process expanded="true" height="607" width="763">
             <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
             <connect from_port="document" to_op="Extract Content" to_port="document"/>
             <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
         <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Regards
    Matthias
  • newbierapid
    newbierapid New Altair Community Member
    Thanks, Matthias
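
For anyone wondering what the tag-removal step in the process above roughly amounts to, here is a minimal sketch in Python using only the standard library. It is not how the Extract Content operator is implemented, and it is far cruder than a real content extractor. The names `TextExtractor` and `strip_html` are made up for this illustration, and the sample page is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces and collapse runs of whitespace.
        # Note: a word split by inline tags (e.g. <b>Sho</b>pping) gets a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(page):
    """Return the text of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Malls</h1><p>Open <b>daily</b></p></body></html>"
    print(strip_html(sample))
```

In the RapidMiner process the equivalent input is the Page attribute produced by Crawl Web with "add pages as attribute" enabled; here the page is simply passed in as a string.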