Web Mining, Crawl Web crawling rules... please explain?

Cash
Cash New Altair Community Member
edited November 2024 in Community Q&A
I used RapidMiner in my MBA program, and it's been almost three years since I last touched it. I just started a position where I'll be using it again, and I'm a bit rusty. I'm trying to scrape a site for some data (names, phone numbers, addresses, etc.) and put it into an Excel file; however, I'm not able to figure out the parameters. I think my main issue is understanding what the crawling rules are. What do they mean? Which should I be applying? I've Googled this and searched here, but I only get instructions specific to other users' questions. Can anyone provide a definition of what these rules are and what they do?

Answers

  • [Deleted User]
    [Deleted User] New Altair Community Member
  • Cash
    Cash New Altair Community Member
    All I see is a brief description of Web Mining and the option to download it.
  • kayman
    kayman New Altair Community Member
Web crawling is such a wide field, and so dependent on the structure of a website/page, that it would help if you gave some examples of which sites you want to crawl and what you need from a page.

  • Cash
    Cash New Altair Community Member
    @kayman the site I'm trying to build a list from is here:  https://www.naadac.org/sap-directory?locsearch=22314&loccountry=US&locdistance=any&sortdir=distance-asc

I'm just trying to capture the names, locations, and phone numbers. I used Selector Gadget to help me figure out the CSS selectors I need, and this is what it has given me: .places-app-location-citystatezip , a , .places-app-location-street , .places-app-location-name
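
    For reference, this is roughly how those selectors would be applied to plain HTML with Python's requests and BeautifulSoup. On this particular page they may come back empty, or match only empty template nodes, because the listings are injected client-side (see the next reply):

    ```python
    import requests
    from bs4 import BeautifulSoup

    URL = ("https://www.naadac.org/sap-directory"
           "?locsearch=22314&loccountry=US&locdistance=any&sortdir=distance-asc")

    # Fetch the initial HTML as delivered by the server (before any JavaScript runs).
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    for selector in (".places-app-location-name",
                     ".places-app-location-street",
                     ".places-app-location-citystatezip"):
        # .select() returns every element matching the CSS selector.
        matches = soup.select(selector)
        print(selector, "->", len(matches), "matches")  # likely 0: content is dynamic
    ```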
  • kayman
    kayman New Altair Community Member
    edited February 2020 Answer ✓
    @Cash, the traditional components won't work here, as this is a dynamic page that loads a JSON file with all the locations separately. So what you will crawl and store is only the skeleton, containing the placeholders where the data is injected during rendering.

    So this requires a bit of reverse engineering. I'll give you some tips, but I have to state that this might be on the borderline of what is ethical crawling.

    If you load the page in, for instance, Firefox with the Inspect Element window open (shortcut Q on Windows) and select the Network tab, you can see where this page gets all its content from. This ranges from images to scripts, and one of the sources is a rather large JSON file called from an API that seems to hold all the locations.

    So, purely in theory, you can download this JSON file directly, if the site owner has no problems with this, and use JSON to Data to deal with it from there.
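
    A minimal sketch of that approach in Python, assuming the JSON endpoint URL has been copied from the Network tab. The URL and column names below are hypothetical placeholders, not the site's actual API; check the site's terms of use before downloading anything:

    ```python
    import requests
    import pandas as pd

    # Placeholder endpoint: substitute the real URL copied from the Network tab.
    API_URL = "https://example.com/path/to/locations.json"

    # Fetch and parse the JSON payload that the page loads at render time.
    locations = requests.get(API_URL, timeout=30).json()

    # Flatten the list of JSON objects into a table, roughly what
    # RapidMiner's "JSON to Data" operator does.
    df = pd.json_normalize(locations)

    # Keep only the columns of interest (hypothetical names) and write to Excel.
    # df = df[["name", "street", "city", "state", "zip", "phone"]]
    df.to_excel("sap_directory.xlsx", index=False)  # needs the openpyxl package
    ```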
  • Cash
    Cash New Altair Community Member
    @kayman Thanks! That seems like a solid solution, albeit a bit outside my abilities. If it borders on unethical crawling, it's something I'd rather stay away from, since this is for work and I don't want to do anything that might be questionable under our company policy. I'll see about getting the data another way... or even just copying it manually. Thanks again!