Webmining: need help for webcrawling with

Hello community members,

I am looking for a way to do web crawling. I have read in the forums that https websites cannot easily be crawled with the "Crawl Web" operator. Instead you would have to use a combination of "Get Pages" and "Loop", as described by Telcontar120, but I haven't found any details about this approach yet.

I will briefly explain what I want to crawl. I would like to crawl the properties listed on a German real estate website (immowelt.de).

Typically, a search is reached via a link that encodes the location, the minimum and maximum number of rooms, buy or rent, and the sort order:

immowelt.de/liste/muenchen/wohnungen/kaufen?roomi=2&rooma=2&sort=relevanz

The matching properties are then listed. The link to each one is made up of the constant "expose" and the ID of the offer, see below:

immowelt.de/projekte/expose/k2rb332

With the "Crawl Web" operator it would be easy: one would simply give the string "expose" as a parameter for the crawl.

How would this work with "Get Pages" and "Loop"? The ID doesn't count up. I would be very grateful if you could help me.
I wish you and your families a nice weekend.

Regards

TB161

kayman

A typical workflow could look like this:

Crawl the first page and extract, alongside your regular content, the indicator for the number of results.

For your example this would be

8 Objekte zum Kauf (insgesamt 141 Wohneinheiten im Projekt)

So we know there are 8 in total, and the site shows 6 per page, so we can create a macro that stores the number of pages (the ceiling of 8 divided by 6 gives 2 pages).
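
The page count arithmetic above can be sketched outside RapidMiner as well. This is only a rough Python illustration, not the macro itself: the function name `page_count` is made up, and it assumes the first integer in the summary line is the total number of results (a count written with a thousands separator would need extra handling).

```python
import math
import re


def page_count(summary: str, per_page: int) -> int:
    """Return the number of result pages implied by a summary line."""
    # Assumption: the first integer in the text is the total result count.
    match = re.search(r"\d+", summary)
    if match is None:
        raise ValueError("no result count found in summary text")
    total = int(match.group())
    # Round up so that a partially filled last page is still counted.
    return math.ceil(total / per_page)


summary = "8 Objekte zum Kauf (insgesamt 141 Wohneinheiten im Projekt)"
pages = page_count(summary, per_page=6)
print(pages)
```

The result would then be stored in a macro and used as the number of iterations for the loop.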

Next you need to do some reverse engineering to understand how the website moves from one page to another. If you are lucky it's something like mysite.com/page?nextpage=2, so you create a loop where you crawl the page but increment the page parameter each time, like so:

mysite.com/page?nextpage=3
mysite.com/page?nextpage=4
...

until you reach the last page you need.
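
To make the loop idea concrete, here is a small Python sketch that builds the list of page URLs. The base URL and the `nextpage` parameter come from the made-up example above; the `https://` scheme, the function name `page_urls`, and the assumption that page 1 accepts the same parameter are mine, so check them against the real site.

```python
from urllib.parse import urlencode


def page_urls(base_url: str, last_page: int, param: str = "nextpage") -> list[str]:
    """Build one URL per result page by incrementing the page parameter."""
    # Assumption: pages are numbered 1..last_page and page 1 accepts the parameter too.
    return [
        f"{base_url}?{urlencode({param: page})}"
        for page in range(1, last_page + 1)
    ]


urls = page_urls("https://mysite.com/page", last_page=4)
for url in urls:
    print(url)
```

A list like this could be written to a file and fed to "Get Pages", or the same counter could drive a loop macro directly.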

Now, your page seems to load dynamically (not moving to a new page but just appending to the previous load), so it's not straightforward in this case. You'll probably need to look at the page load sequence (using Chrome's Inspect tool, Network tab) to see which page is loaded behind the scenes.

Hope this gets you started

TB161

Hello Kayman,

Thank you for your suggestions. I tried this over the last few days, but unfortunately my experience is limited.
Therefore I will use ParseHub for the crawling and do the rest in RapidMiner.

Thanks for your support!

regards TB

Telcontar120

You could also store the html of the original page with your query results, and then extract all the links out of that page (using regular expressions) and put them in a csv file, and then use the "Get Pages" operator instead. Either way some creative workarounds are needed here. How I wish RapidMiner would fix the https issue for the Crawl Web operator!

TB161

Hello Brian,

Good idea, this could fly. But isn't it the case that the html contains only the "first" page?

When the results span several pages, I don't know how to crawl them.

Regards

TB

Telcontar120

My suggestion assumes that the html links for all the pages with the property ids must be embedded in the raw html of the page somewhere (don't you click on a specific property to view it?). So you can save that raw html as a document, then use document processing to extract all the links, then put those links in a file to use as the input to the Get Pages operator.
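
As a rough illustration of the link extraction step, here is a Python sketch rather than a RapidMiner process. It assumes the links appear in the raw html as plain `href="..."` attributes containing `/expose/`; the sample html, the second ID `abc123`, and the function names are invented for the example, so the pattern would need adjusting to the real page source.

```python
import csv
import io
import re
from urllib.parse import urljoin

# Assumption: expose links are ordinary double-quoted href attributes.
EXPOSE_LINK = re.compile(r'href="([^"]*/expose/[^"]+)"')


def extract_expose_links(html: str, base_url: str) -> list[str]:
    """Return absolute expose URLs in page order, without duplicates."""
    seen = {}
    for href in EXPOSE_LINK.findall(html):
        # urljoin turns relative links into absolute ones.
        seen.setdefault(urljoin(base_url, href), None)
    return list(seen)


def links_to_csv(links: list[str]) -> str:
    """Render the links as CSV text with a single 'url' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)
    return buffer.getvalue()


# Made-up html fragment standing in for the saved result page.
sample_html = (
    '<a href="/projekte/expose/k2rb332">Offer 1</a>'
    '<a href="/projekte/expose/abc123">Offer 2</a>'
    '<a href="/projekte/expose/k2rb332">Offer 1 again</a>'
    '<a href="/impressum">Imprint</a>'
)
links = extract_expose_links(sample_html, "https://www.immowelt.de/")
csv_text = links_to_csv(links)
print(csv_text)
```

The resulting CSV text could be saved to a file and read back in as the link list for "Get Pages".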