Can I get a solution for job-related web crawling from Seek and Indeed?

Tirth
Tirth New Altair Community Member
edited November 5 in Community Q&A
Hi,
I am researching job skills assessment for an academic project.
I am looking for a web crawling script or solution to extract job post details, such as job roles, location, skills and knowledge, from job portal websites like Indeed and Seek. Kindly help me with this.

Answers

  • rfuentealba
    rfuentealba New Altair Community Member
    Hi Tirth,

    There are plenty of things you can do:
    • Use the Get Pages operator from RapidMiner.
    • Use the Python Extension and program your own script with Scrapy (it's easier than you think).
    • Use the Python Extension and program your own script with Selenium and BeautifulSoup (it's harder to do and requires some more software, but gives better results if your pages are generated with JavaScript).
    • Use a tool named "SiteSucker" to download the pages as files. Then you can load and analyze those files inside RapidMiner.
    This is what I could come up with.

    All the best,

    Rod.
  • Tirth
    Tirth New Altair Community Member
    Many thanks for your answer. Actually, I want to use the co-association rule for job data extraction. I am new to RapidMiner. Can you help me further?
  • Tirth
    Tirth New Altair Community Member
    Many thanks! I need to retrieve job market data for the building information modeling sector, such as job role, location (New Zealand only), and requirements like skills, knowledge and experience, using the co-association rule. I want to perform several analyses on the extracted information, such as co-occurrences.
    I really appreciate your help.
  • rfuentealba
    rfuentealba New Altair Community Member
    First things first:

    Have you downloaded the pages you want to scrape? And do you have some HTML knowledge? Let's build your database first. I already gave you several options you can use to retrieve pages. Then we will move on to the other processes.

    What will you do to download your data?

    All the best,

    Rod.
  • Tirth
    Tirth New Altair Community Member
    Yes, I downloaded the pages that are to be scraped. What do you recommend doing first? Can you please explain?
  • rfuentealba
    rfuentealba New Altair Community Member
    Hi @Tirth,

    If you have your webpages downloaded already, do you have these as files inside of a directory, files inside many directories, or as entries in a database?

    The first thing we need to do is to make these look like entries in a database (or in a RapidMiner Studio ExampleSet). For that, you need to do the following (let's use just one file to build our process, then we will use loops to open all files, OK?).

    First, pick a file, open it with your browser, read the code and identify the HTML structure. The "Inspect Element" feature of Firefox and Chrome can help with this. Are you able to identify, inside an HTML file, how the job offers are marked up? An example:

    <div class="jo" id="14342">
        <h1>Data Scientist</h1>
        <h2>Boston, MA</h2>
        <p>RapidMiner, Inc., requires a Data Scientist with skills in a, b, c and d. For more information, contact Scott Genzer at the following e-mail.</p>
    </div>

    You then know that if you read all the <div> elements with class jo, you get all the divs that contain job offers, which is what we are looking for.
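
    To make that idea concrete, here is a minimal Python sketch that reads every <div class="jo"> block into one record. It uses only the standard library's html.parser rather than BeautifulSoup, so it runs without extra packages. The names JobOfferParser, extract_job_offers and extract_from_directory are made up for this illustration, and the mapping of <h1>, <h2> and <p> to title, location and description is an assumption taken from the sample above; real Seek or Indeed pages will use different markup.

```python
from html.parser import HTMLParser
from pathlib import Path


class JobOfferParser(HTMLParser):
    """Collect one record per <div class="jo"> block (hypothetical markup)."""

    # Assumed mapping, based only on the sample markup in this thread.
    FIELDS = {"h1": "title", "h2": "location", "p": "description"}

    def __init__(self):
        super().__init__()
        self.offers = []     # finished records
        self._offer = None   # record being filled, or None outside a job div
        self._depth = 0      # <div> nesting depth inside the current job div
        self._field = None   # field currently receiving text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._offer is not None:
                self._depth += 1  # nested div inside a job offer
            elif "jo" in (attrs.get("class") or "").split():
                self._offer = {"id": attrs.get("id") or "",
                               "title": "", "location": "", "description": ""}
                self._depth = 1
        elif self._offer is not None and tag in self.FIELDS:
            self._field = self.FIELDS[tag]

    def handle_endtag(self, tag):
        if self._offer is None:
            return
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:  # the job div itself has closed
                self.offers.append(
                    {key: value.strip() for key, value in self._offer.items()})
                self._offer = None
        elif tag in self.FIELDS:
            self._field = None

    def handle_data(self, data):
        if self._offer is not None and self._field is not None:
            self._offer[self._field] += data


def extract_job_offers(html):
    """Return a list of dicts, one per job offer found in the HTML text."""
    parser = JobOfferParser()
    parser.feed(html)
    parser.close()
    return parser.offers


def extract_from_directory(directory):
    """Loop over all .html files under a directory and collect their offers."""
    offers = []
    for path in sorted(Path(directory).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        for offer in extract_job_offers(html):
            offer["source_file"] = str(path)
            offers.append(offer)
    return offers
```

    Start with extract_job_offers on a single file until the fields look right, then switch to extract_from_directory to loop over all the downloaded pages. The resulting list of records could be written to a CSV file and read into RapidMiner from there.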

    BTW, I forgot: did you ask the website owners for permission to do this? Some of them don't really like users crawling their webpages.
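
    As a first technical check, many sites publish crawling rules in a robots.txt file. Below is a small sketch using Python's standard urllib.robotparser to test a URL against such rules. The helper name is_allowed, the example rules and the example.com URLs are all made up for illustration, and passing this check is not the same as having permission: the site's terms of use still apply.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not copied from any real site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /jobs/
"""

if __name__ == "__main__":
    for url in ("https://example.com/jobs/123", "https://example.com/about"):
        print(url, is_allowed(EXAMPLE_ROBOTS, url))
```

    For a live site you would download its robots.txt first (RobotFileParser also offers set_url and read for that) and pass the user agent string your crawler actually sends.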

    All the best,

    Rod.