"Crawl Web with follow_link_with_matching_url returning empty"

User: "alanbontempo"
New Altair Community Member
Updated by Jocelyn

Hi, I discovered RapidMiner recently and I am impressed with its usability.

I am crawling the following website and downloading some information from it (the site is in Portuguese):

 

http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=1

 

I have a rule that downloads all pages whose URL matches the following regular expression:

 

.+Servidor-DetalhaServidor.+|.+Servidor-DetalhaRemuneracao.+

 

And it is working great. But on this page there is a "next" button that shows more entries. The "next" button sends me to the following URL:

 

http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=2

 

So I inserted a "follow_link_with_matching_url" rule with the following regular expression:

 

.+Servidor-ListaServidores.+

 

But when I insert this rule I get empty results. Why is this happening?

 

Best Regards

Alan

    User: "Edin_Klapic"
    New Altair Community Member
    Accepted Answer

    Hi Alan,

     

    as @Thomas_Ott pointed out, I strongly suspect the problem is that the pagination on this site is generated by JavaScript. Unfortunately, the Web crawler cannot access content produced that way.
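
As a quick sanity check that the regular expression itself is not the culprit, here is a minimal sketch in Python (not RapidMiner) that tests both patterns from the question against the "next" page URL. The helper `rule_matches` is hypothetical, and it assumes the crawler rule requires the whole URL to match the pattern.

```python
import re

# Patterns and URL quoted from the question.
DETAIL_PATTERN = r".+Servidor-DetalhaServidor.+|.+Servidor-DetalhaRemuneracao.+"
LIST_PATTERN = r".+Servidor-ListaServidores.+"
NEXT_PAGE_URL = (
    "http://www.portaldatransparencia.gov.br/servidores/"
    "Servidor-ListaServidores.asp?bogus=1&Pagina=2"
)


def rule_matches(pattern: str, url: str) -> bool:
    """Hypothetical helper: True if the whole URL matches the pattern.

    Full-string matching is an assumption about how the crawler rule behaves.
    """
    return re.fullmatch(pattern, url) is not None


list_rule_hits = rule_matches(LIST_PATTERN, NEXT_PAGE_URL)
detail_rule_hits = rule_matches(DETAIL_PATTERN, NEXT_PAGE_URL)
print("follow rule matches next page:", list_rule_hits)
print("download rule matches next page:", detail_rule_hits)
```

If the follow pattern matches the URL here, the empty result is more likely caused by the crawler never seeing the link than by the expression.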

    If I understand you correctly, you want to extract the data from every page. The website seems to have a fairly simple structure, so you could, for example, access every page incrementally.

    In this case you could use the Loop operator and set its number of iterations parameter to the maximum number of pages.

    Then you could directly access each page using the URL

    http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=%{iteration}

    Here the macro iteration holds the number of the current page.

    Please turn off parallel execution for the Loop operator; otherwise your IP might get blacklisted.
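
Outside RapidMiner, the same idea (one request per page number, strictly one after another) could be sketched in Python roughly as follows. The names `page_urls`, `fetch_pages`, `max_pages` and `delay_seconds` are hypothetical, `{page}` stands in for the `%{iteration}` macro, and the page numbering is assumed to start at 1 as in the URLs from the question.

```python
import time
import urllib.request
from typing import Iterator, List

# URL from the answer, with "{page}" in place of RapidMiner's %{iteration} macro.
URL_TEMPLATE = (
    "http://www.portaldatransparencia.gov.br/servidores/"
    "Servidor-ListaServidores.asp?bogus=1&Pagina={page}"
)


def page_urls(max_pages: int) -> Iterator[str]:
    """Yield the listing URL for pages 1..max_pages, in order."""
    for page in range(1, max_pages + 1):
        yield URL_TEMPLATE.format(page=page)


def fetch_pages(max_pages: int, delay_seconds: float = 1.0) -> List[bytes]:
    """Download the pages sequentially, pausing between requests.

    Sequential access with a pause mirrors the advice to disable parallel
    execution; the one-second default is an arbitrary assumption.
    """
    bodies = []
    for url in page_urls(max_pages):
        with urllib.request.urlopen(url) as response:
            bodies.append(response.read())
        time.sleep(delay_seconds)
    return bodies


# Building the URLs needs no network access.
first_three = list(page_urls(3))
for url in first_three:
    print(url)
```

Calling `fetch_pages(n)` would then download the first `n` listing pages; the maximum page number still has to be looked up on the site by hand.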

     

    I hope this gets you closer to the expected result,

    Edin