Web Crawling for contact directory

Cash · New Altair Community Member
edited November 5 in Community Q&A
I'm trying to crawl this site to create an Excel document containing the names, locations, phone numbers, and specialty type of the individual practitioners on https://www.psychologytoday.com/us/therapists.

The link above has links underneath for each state, and each state has roughly 50 pages of contacts. I'm just trying to get the HTML pulled so I can extract the contact data later, likely with Tableau Prep. The CSS selectors I have from SelectorGadget are span, h1, and .location-address-phone.

This is the operator I'm using, and it's returning absolutely nothing. Can someone please help me figure this out? Thanks!

<?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
  <operator activated="true" class="web:crawl_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
    <parameter key="url" value="https://www.psychologytoday.com/us/therapists"/>
    <list key="crawling_rules">
      <parameter key="follow_link_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
      <parameter key="store_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
    </list>
    <parameter key="max_crawl_depth" value="52"/>
    <parameter key="retrieve_as_html" value="true"/>
    <parameter key="enable_basic_auth" value="false"/>
    <parameter key="add_content_as_attribute" value="false"/>
    <parameter key="write_pages_to_disk" value="true"/>
    <parameter key="include_binary_content" value="false"/>
    <parameter key="output_dir" value="/Users/ME/Desktop/Web Crawls"/>
    <parameter key="output_file_extension" value="html"/>
    <parameter key="max_pages" value="2500"/>
    <parameter key="max_page_size" value="10000"/>
    <parameter key="delay" value="500"/>
    <parameter key="max_concurrent_connections" value="100"/>
    <parameter key="max_connections_per_host" value="50"/>
    <parameter key="user_agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"/>
    <parameter key="ignore_robot_exclusion" value="false"/>
  </operator>
</process>

Best Answer

  • Telcontar120 · New Altair Community Member
    Answer ✓
    Unfortunately, the Crawl Web operator doesn't work with HTTPS pages (and it has several other known problems besides). You can replicate its functionality with Get Pages by preparing a CSV file of the page links you want to store. Since the page links seem to follow a regular pattern, you can easily create such a list in Excel or even in RapidMiner. That should let you store the data you want (assuming it isn't in violation of the site's T&C of use).
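
    In case it helps anyone following along, here is a minimal Python sketch of the list-building step. It assumes the listing pages paginate with a ?page=N query parameter and uses a few made-up state slugs (both are assumptions, not confirmed details of the site); verify the real URL pattern in a browser, then import the resulting CSV with Read CSV and feed it to Get Pages:

    import csv

    BASE = "https://www.psychologytoday.com/us/therapists"

    # Hypothetical state slugs -- extend to cover all the states you need.
    states = ["alabama", "alaska", "arizona"]

    # Roughly 50 listing pages per state, per the original post.
    pages_per_state = 50

    with open("pages.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Link"])  # point Get Pages' link attribute parameter at this column
        for state in states:
            for page in range(1, pages_per_state + 1):
                writer.writerow([f"{BASE}/{state}?page={page}"])

    Whether the pattern is ?page=N or something else, the idea is the same: enumerate the listing URLs up front so Get Pages can fetch and store each one instead of relying on Crawl Web to discover them.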

Answers

  • Cash · New Altair Community Member
    Thank you, Brian. That's disappointing to hear. I don't think I'll be able to do this in RM, and I don't really know how to do the process you're describing. I did verify in the T&Cs that scraping is okay. I was able to find different software that let me scrape the site very easily, so I now have the information I was looking for. Thank you again for the response!