"Crawl Web -- Enable Basic Auth"

carl
carl New Altair Community Member
edited November 2024 in Community Q&A

Hi - I'm brand new to RapidMiner as of this week (using Studio 7.3).  I'm using Crawl Web to access the web page http://www.thetimes.co.uk/search?q= (with added search parameters), and I can successfully return a set of news articles.  However, each search result returns only the first few paragraphs of each article because my login has not been recognized.  I've entered the correct account credentials in "Enable Basic Auth".  Any ideas please?

Answers

  • sgenzer
    sgenzer
    Altair Employee

     

    Hi...yes, I have had the same issues with this operator.  You're doing everything correctly, but I have found the "basic auth" feature rather hit-or-miss.  It's basic <grin>.  Note that the help documentation says to use this only over https because it places the auth credentials in the header.  But on a news site like the New York Times (where I have a subscription), that's not how the login works.  I am not an expert in authentication, so I will defer to others on the differences here.

     

    That said, I have gotten this kind of thing to work in RapidMiner but it will not be one click like you are hoping...


    Scott
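
To illustrate Scott's point about credentials being placed in the header, here is a minimal Python sketch of how an HTTP Basic auth header is built. It is not taken from the Crawl Web operator; the function name `basic_auth_header` is hypothetical, and the credentials are the standard example pair from the Basic auth specification, not real ones.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic auth."""
    # The credentials are joined with a colon and base64-encoded.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


header = basic_auth_header("Aladdin", "open sesame")
print(header)

# base64 is an encoding, not encryption: anyone who sees the header
# can recover the username and password.
encoded = header["Authorization"].split(" ", 1)[1]
recovered = base64.b64decode(encoded).decode("utf-8")
print(recovered)
```

Because the header can be decoded this easily, it only makes sense to send it over https. A site that logs you in through a web form never looks at this header, which would explain why filling in "Enable Basic Auth" has no effect there.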

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Might I suggest checking out our Mozenda extension. Although you pay some $ to Mozenda, you can scrape things way easier. 

  • sgenzer
    sgenzer
    Altair Employee

    Thanks, Tom.  I forgot about Mozenda because it is never an option for me (it requires a Windows client) and it is very expensive.  But for those with Windows and a budget of $99+ per month, it is certainly a good option.


    Scott

  • carl
    carl New Altair Community Member

    Thanks for the responses.  I did take a brief look at Mozenda (looks interesting), but was hoping there might be an alternative approach for the same reasons as Scott, i.e. because I use a MacBook and because of the cost.  I know it is an option if I install a virtual machine program like VMware Fusion, so I may yet have to reconsider.  The Times login is https://login.thetimes.co.uk, so I had hoped that maybe I'd just misstepped in my set-up of Enable Basic Auth.

  • Thomas_Ott
    Thomas_Ott New Altair Community Member
    Answer ✓

    I remember watching this video https://www.youtube.com/watch?v=-Sr3i7klRHM a while back, and I believe they passed Twitter OAuth credentials in RapidMiner using Generate User Data and something else. This was right before we came out with the Twitter operators, but if you hack this you might be able to get into your login.

  • sgenzer
    sgenzer
    Altair Employee

    That's a very helpful video.  Thanks, Tom.

  • carl
    carl New Altair Community Member

    Thanks Thomas.  I haven't had chance to try the video idea as I'm wrestling with Process Documents from the Web at the moment.  But will take a look when I get chance.

  • JEdward
    JEdward New Altair Community Member

    That's a great Youtube video!  Looks like it's also using one of my example processes from back in the day too!  #FeelingProud

    http://community.rapidminer.com/t5/RapidMiner-Server/SOLVED-Open-File-with-basic-authentication-in-RapidAnalytics/m-p/24073

    You might need to change a bit of the XML on this link to convert it from 5.3 to 7.3 formatting. 

     

    I have a whole set of template processes somewhere around that set up OAuth integration for a couple of email marketing APIs (Silverpop & DotMailer) as well as Twitter authentication.

     

  • Marco_Boeck
    Marco_Boeck New Altair Community Member

    Hi,

     

    Basic auth is the authentication where your browser shows the ugly input dialog box overlay.

    If you have a form login (a login embedded in the web page), that's not basic auth anymore. The problem is that those logins would theoretically be supported, but due to Cross Site Request Forgery prevention, it almost never works :( Thus it was excluded from the operator.

     

    Regards,

    Marco
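
To make Marco's distinction concrete, here is a rough Python sketch of what a form login typically requires, as opposed to basic auth. Everything in it is an assumption for illustration: the sample HTML, the field names (`csrf_token`, `username`, `password`) and the helper names are invented, and the actual login form at The Times has not been inspected.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_html, username, password):
    """Combine the form's hidden fields with the user's credentials."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    payload = dict(parser.hidden)      # includes the anti-CSRF token, if any
    payload["username"] = username     # hypothetical field name
    payload["password"] = password     # hypothetical field name
    return payload


# Invented stand-in for a login page; a real page will differ.
SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

payload = build_login_payload(SAMPLE_LOGIN_PAGE, "carl", "s3cret")
body = urlencode(payload)  # what would be sent as the POST body
print(body)
```

The hidden token is usually tied to a session cookie issued with the login page, so a client has to fetch the page, keep the cookie, and post the token back within the same session. A crawler that only attaches an Authorization header does none of this, which fits Marco's explanation of why form logins almost never work with the operator.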