Need to crawl webpages requiring login details

Vineet
Vineet New Altair Community Member
edited November 2024 in Community Q&A
Hello,
I need to crawl certain websites, but they require login details to be entered.
I cannot figure out how to provide my login details to get access to my homepage. Is there an operator or any other way to do that?
Please help!
Thanks and Regards,
Vineet

Answers

  • Skirzynski
    Skirzynski New Altair Community Member
    Hey,

    The Get Page operator has an option to accept cookies. Activate it and send your credentials (username and password) to the login page as POST parameters. A website usually stores your session in a cookie, so further requests from any Get Page or Get Pages operator are handled in the same session using the stored cookies. You are then logged in (if your credentials are correct) and can fetch the login-protected pages.

    Happy crawling!
      Marcin
  • Vineet
    Vineet New Altair Community Member
    Hello Marcin,
    Appreciate your quick reply. That is what I was trying to do, but somehow I cannot get it to work.
    Here is my code for your reference.
    Please tell me where I am going wrong.

    Thanks and Regards,
    Vineet
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="505" width="614">
          <operator activated="true" class="web:get_webpage" compatibility="5.2.003" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
            <parameter key="url" value="http://www.gmail.com"/>
            <parameter key="user_agent" value="  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.12 Safari/537.4 "/>
            <parameter key="read_timeout" value="1000"/>
            <parameter key="accept_cookies" value="all"/>
            <parameter key="request_method" value="POST"/>
            <list key="query_parameters">
              <parameter key="&amp;Email" value="infospace007@gmail.com"/>
              <parameter key="&amp;Passwd" value="infospace"/>
            </list>
            <list key="request_properties">
              <parameter key="&amp;Email" value="infospace007@gmail.com"/>
              <parameter key="&amp;Passwd" value="infospace"/>
            </list>
          </operator>
          <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="120">
            <parameter key="excel_file" value="F:\Try\AuthLinks.xlsx"/>
            <parameter key="imported_cell_range" value="A2:B3"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="A.true.integer.attribute"/>
              <parameter key="1" value="B.true.binominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="5.2.003" expanded="true" height="60" name="Get Pages" width="90" x="246" y="120">
            <parameter key="link_attribute" value="B"/>
            <parameter key="page_attribute" value="MyPage"/>
            <parameter key="user_agent" value="  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.12 Safari/537.4 "/>
            <parameter key="accept_cookies" value="all"/>
            <parameter key="request_method" value="POST"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_port="result 1"/>
          <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
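
The cookie-based login flow Marcin describes can be illustrated outside RapidMiner with a minimal Python sketch. It is not the Web Mining extension's implementation and does not reproduce any real site's login form. `DemoSite`, `fetch_after_login`, the `/login` and `/private` paths, and the demo credentials are all hypothetical; only the `Email` and `Passwd` field names are borrowed from the process above. The sketch assumes the credentials travel as plain form fields in the POST body and that one cookie store is shared by both requests.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials and session token, used only by this demo server.
USER, PASSWORD, TOKEN = "demo", "secret", "abc123"


class DemoSite(BaseHTTPRequestHandler):
    """Stand-in for a login-protected website (an assumption, not a real site)."""

    def do_POST(self):
        # Read the form body and check the submitted credentials.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = form.get("Email") == [USER] and form.get("Passwd") == [PASSWORD]
        self.send_response(200 if ok else 401)
        if ok:
            # A successful login stores the session in a cookie.
            self.send_header("Set-Cookie", f"session={TOKEN}; Path=/")
        self.end_headers()
        self.wfile.write(b"welcome" if ok else b"denied")

    def do_GET(self):
        # Protected page: served only when the session cookie is sent back.
        ok = f"session={TOKEN}" in self.headers.get("Cookie", "")
        self.send_response(200 if ok else 403)
        self.end_headers()
        self.wfile.write(b"private page" if ok else b"login required")

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_after_login(base_url, user, password):
    """POST the credentials, then GET a protected page in the same session."""
    jar = http.cookiejar.CookieJar()  # shared cookie store for both requests
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    body = urllib.parse.urlencode({"Email": user, "Passwd": password}).encode()
    opener.open(base_url + "/login", data=body).close()  # data= makes this a POST
    with opener.open(base_url + "/private") as response:  # stored cookie is sent
        return response.read().decode()


server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
page = fetch_after_login(f"http://127.0.0.1:{server.server_port}", USER, PASSWORD)
server.shutdown()
server.server_close()
print(page)
```

In this sketch the login request must complete before the protected page is requested, and both requests must go through the same cookie store. Real login forms often also expect hidden fields or tokens, which this demo does not model.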