Crawl Web - empty results (PHP script)

User: "mspiess"
New Altair Community Member
Updated by Jocelyn

Hello there!

 

I'm a social scientist learning to use RapidMiner for data/text mining and text analysis. 

 

I've been trying to apply "Crawl Web" for the following address http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso with no crawling rules applied and depth of 1, but I keep getting empty results.

 

I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?

 

Also, does anyone have hints on setting the crawling rules so that I get only the links with a specific link text? For example, in the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".

 

Greetings from Brazil,

Maiko Spiess

    User: "sgenzer"
    Altair Employee

    hello @mspiess - welcome to the community. Have you tried looking at other threads in the community? A quick search revealed a thread that may be useful. https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480

     

    Scott

     

    User: "mspiess"
    New Altair Community Member
    OP


    Hi @sgenzer! Thanks for replying.

     

    I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked fine. On this particular page, however, I am still getting empty results.

     

    So, crawl rules aside, I'm still wondering if this is something related to the page's PHP script. Any thoughts?

     

    Greetings,

    Maiko

    User: "sgenzer"
    Altair Employee
    Accepted Answer

    So it seems that there is a bot block on that site. If you uncheck "ignore robot exclusion", you get results (I crawled only two pages just to test). Ethically, I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl it.


    Scott
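
To see what a robot exclusion rule does before deciding how to crawl, here is a minimal sketch in Python (outside RapidMiner) using the standard library's `urllib.robotparser`. The robots.txt content, the `example.org` URLs, the `MyCrawler` user agent, and the helper name `is_allowed` are all hypothetical and are not taken from the site discussed above; the sketch only shows how a crawler that obeys robots.txt decides whether a URL may be fetched.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt, NOT the real file of any site in this thread.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /scielo.php
"""

blocked = is_allowed(
    EXAMPLE_ROBOTS, "MyCrawler",
    "http://example.org/scielo.php?script=sci_issuetoc",
)
open_page = is_allowed(EXAMPLE_ROBOTS, "MyCrawler", "http://example.org/about.html")
print(blocked, open_page)
```

With a rule like this, a crawler that respects robots.txt skips every URL under the disallowed path, which from the outside looks the same as an empty result set.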

     

    User: "mspiess"
    New Altair Community Member
    OP

    Okay! Got it!

     

    Thank you for your attention.
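
On the link-text question from the original post: as a rough sketch outside RapidMiner, and only for pages you have permission to crawl, the links whose anchor text is exactly "Texto em Português" could be picked out of already-downloaded HTML with Python's standard `html.parser`. The class name `LinkTextFilter`, the helper `links_with_text`, and the sample HTML are made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser


class LinkTextFilter(HTMLParser):
    """Collect href values of <a> tags whose visible text equals wanted_text."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted_text = wanted_text
        self.matches = []
        self._href = None       # href of the <a> currently open, if any
        self._text_parts = []   # text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace before comparing.
            text = " ".join("".join(self._text_parts).split())
            if text == self.wanted_text:
                self.matches.append(self._href)
            self._href = None


def links_with_text(html, wanted_text):
    parser = LinkTextFilter(wanted_text)
    parser.feed(html)
    parser.close()
    return parser.matches


# Made-up sample markup, not copied from the real page.
SAMPLE_HTML = (
    '<a href="/article1?lng=pt">Texto em Português</a> '
    '<a href="/article1.pdf">PDF</a> '
    '<a href="/article2?lng=pt">Texto em  Portugu&ecirc;s</a>'
)

print(links_with_text(SAMPLE_HTML, "Texto em Português"))
```

The matching is on the visible link text rather than on the URL, so it still works when the target URLs share no common pattern.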