Webmining: need help for webcrawling with
TB161
New Altair Community Member
Hello community members,
I am looking for a way to do web crawling. I have read in the forums that https websites cannot easily be crawled using the "Crawl Web" operator; you would have to use a combination of "Get Pages" and "Loop", as described by Telconstar, but I haven't found anything further about this approach yet.
I will briefly explain what I want to crawl. I would like to crawl the properties displayed on a German real estate website (immowelt.de).
Typically, the result list can be reached via a link that encodes the location, the minimum and maximum number of rooms, buy or rent, and the sort order:
immowelt.de/liste/muenchen/wohnungen/kaufen?roomi=2&rooma=2&sort=relevanz
The displayed properties are then listed; each link is made up of the constant part "expose" and the ID of the offer, see below:
immowelt.de/projekte/expose/k2rb332
With the "web crawl" operator it would be easy, one would simply give the statement "expose" as a parameter for the crawl
How about "get pages" and "loop"? The ID doesn't count up, I would be very grateful if you could help me.
I wish you and your families a nice weekend
Regards
TB161
Answers
-
A typical workflow could look like this:
Crawl the first page and, next to your regular content, also extract the indicator for the total number of results.
For your example this would be "8 Objekte zum Kauf (insgesamt 141 Wohneinheiten im Projekt)", i.e. 8 objects for sale (141 residential units in the project in total).
So we know there are 8 results in total, and the site shows 6 per page, so we can create a macro that stores the number of pages (the ceiling of 8 divided by 6 gives 2 pages).
Next you need to do some reverse engineering to understand how the website moves from one page to another. If you are lucky it's something like mysite.com/page?nextpage=2, so you create a loop where you crawl the page but increment the page parameter each time, like
mysite.com/page?nextpage=3
mysite.com/page?nextpage=4
...
until the last page you need.
Now, your page seems to load dynamically (not moving to a new page but adding to the previous load), so it's not straightforward in this case. You'll probably need to look at the page load sequence (using the browser's inspect tool, Network tab) to see which request is made behind the scenes.
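Just as an illustration of the page-increment idea (outside RapidMiner), here is a minimal Python sketch. The results-per-page value and the "cp" page parameter are assumptions that have to be confirmed by your own reverse engineering of the site:
```python
import math
import re
import requests

BASE_URL = "https://www.immowelt.de/liste/muenchen/wohnungen/kaufen"
PARAMS = {"roomi": 2, "rooma": 2, "sort": "relevanz"}
RESULTS_PER_PAGE = 6  # assumption: taken from counting items on the first page

# Fetch the first page and read the total number of results from the
# "... Objekte zum Kauf ..." indicator in the raw HTML.
first = requests.get(BASE_URL, params=PARAMS, timeout=30)
match = re.search(r"(\d+)\s+Objekte zum Kauf", first.text)
total = int(match.group(1)) if match else 0

# Ceiling division gives the number of result pages.
pages = math.ceil(total / RESULTS_PER_PAGE)

html_pages = [first.text]
for page in range(2, pages + 1):
    # assumption: the site accepts a simple page parameter like "?cp=2";
    # the real parameter name must come from inspecting the network traffic.
    resp = requests.get(BASE_URL, params={**PARAMS, "cp": page}, timeout=30)
    html_pages.append(resp.text)

print(f"Fetched {len(html_pages)} page(s) for {total} results")
```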
Hope this gets you started!
-
Hello Kayman,
thank you for your suggestions. I tried it over the last few days, but unfortunately my experience is limited.
Therefore I will use ParseHub for the crawling and do the rest in RapidMiner.
Thanks for your support!
Regards, TB
-
You could also store the HTML of the original page with your query results, extract all the links out of that page (using regular expressions), put them in a CSV file, and then use the "Get Pages" operator instead. Either way, some creative workarounds are needed here. How I wish RapidMiner would fix the https issue for the Crawl Web operator!
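Not a RapidMiner process, but a rough Python sketch of that idea, assuming the raw HTML of the result page has already been saved; the file names and the ID pattern in the regular expression are placeholders based on the example URL above:
```python
import csv
import re

# Read the raw HTML that was stored together with the query results.
with open("result_page.html", encoding="utf-8") as f:
    html = f.read()

# Pull out every expose link with a regular expression; the ID pattern
# (letters and digits after /expose/) is an assumption from the example URL.
links = re.findall(r'href="(/(?:projekte/)?expose/[A-Za-z0-9]+)"', html)

# Write the absolute URLs to a CSV file that the "Get Pages" operator can read.
with open("expose_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    for link in sorted(set(links)):
        writer.writerow(["https://www.immowelt.de" + link])
```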
-
Hello Brian,
good idea, this could fly. But doesn't the stored HTML contain only the "first" page?
When the results span several pages, I don't know how to crawl them.
Regards
TB
-
My suggestion assumes that the HTML links for all the pages with the property IDs must be embedded in the raw HTML of the page somewhere (don't you click on a specific property to view it?). So you can save that raw HTML as a document, then use document processing to extract all the links, and then put those links in a file to use as the input to the Get Pages operator.
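For the multi-page case, the two ideas can be combined: fetch each result page as sketched earlier, extract the expose links from every page, and collect them in one file for "Get Pages". A hedged Python outline, where the page parameter and the number of result pages are again assumptions:
```python
import csv
import re
import requests

BASE_URL = "https://www.immowelt.de/liste/muenchen/wohnungen/kaufen"
PARAMS = {"roomi": 2, "rooma": 2, "sort": "relevanz"}

all_links = set()
for page in range(1, 3):  # assumption: 2 result pages, see the page-count sketch above
    # assumption: "cp" is the page parameter; verify via the browser's Network tab
    resp = requests.get(BASE_URL, params={**PARAMS, "cp": page}, timeout=30)
    all_links.update(
        re.findall(r'href="(/(?:projekte/)?expose/[A-Za-z0-9]+)"', resp.text)
    )

# One combined URL list that the "Get Pages" operator can consume.
with open("expose_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    for link in sorted(all_links):
        writer.writerow(["https://www.immowelt.de" + link])
```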