Read data from HTML tables on web pages

Flixport
Flixport New Altair Community Member
edited November 5 in Community Q&A
Hey all,

Has the HTML Reader operator been removed from the new version, or is there another reason I cannot find it?
It would be great if someone could answer. Thanks.

Answers

  • Flixport
    Flixport New Altair Community Member
    Hello @varunm1

    As I understand it, the Web Table Extraction extension extracts data from an HTML table. But the data we are interested in is often not in a table. Is there a solution for this?

    thanks

  • varunm1
    varunm1 New Altair Community Member
    Hi @Flixport

    I am not sure about this. @Telcontar120 or @mschmitz may be able to suggest something.

    Thanks
  • Telcontar120
    Telcontar120 New Altair Community Member
    There are definitely ways to get data from web pages into RapidMiner but it is not necessarily simple or straightforward depending on the page structure (that's why there's a whole expert training class just on web mining!).  It's also complicated by the fact that some of the web mining operators have not been updated in some time and so there are some "quirks" you need to be aware of.  But if you are interested in this topic you should download the free web mining extension from the marketplace and take a look at the Get Page operator to start.  This will allow you to pull in any html page and then you can try to extract the information you need with some of the other text mining operators (from the underlying html).
  • sgenzer
    sgenzer
    Altair Employee
    Yes, and just to be clear, there are actually two extensions we're talking about here: the Web Mining extension and the Web Table Extraction extension.

    The Web Mining extension is a rather dated one and the advice from @Telcontar120 should help you there.

    The Web Table Extraction extension was developed out of RapidMiner Research in Dortmund; my colleague @ey wrote the extension and an accompanying Knowledge Base article about a year ago that may help.

    Scott
  • Flixport
    Flixport New Altair Community Member
    edited March 2019
    Hey all,

    Thanks for the answers. Would another solution be to convert the HTML document into an XML document and extract the data from that, or is that not possible?

    thanks
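
To illustrate the idea discussed above (pull in the raw HTML, turn it into XML, then pick out non-tabular values), here is a minimal sketch in plain Python rather than RapidMiner. It is not how the Get Page operator or any RapidMiner extension works internally. The names `HtmlToXml`, `html_to_element`, and `SAMPLE_HTML`, as well as the `product`, `name`, and `price` class names, are made up for the example. It assumes reasonably clean markup: unclosed tags are simply closed when their parent closes, and HTML's full implicit-closing rules are not implemented, so messy real-world pages will likely need a more lenient parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Hypothetical helper: rebuild parsed HTML as a well-formed XML tree."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") become empty strings.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.open_tags:
            return  # ignore stray closing tags
        # Close any unclosed children first, then the element itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # skip text outside the root element
            self.builder.data(data)

    def to_element(self):
        while self.open_tags:  # close whatever is still open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_element(html):
    """Parse an HTML string and return the root as an ElementTree element."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.to_element()


# Made-up page fragment: the data is in <div>/<span> elements, not a table.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><br><span class="price">9.99 EUR</span></div>
  <div class="product"><span class="name">Gadget</span><br><span class="price">24.50 EUR</span></div>
</body></html>
"""

root = html_to_element(SAMPLE_HTML)
xml_text = ET.tostring(root, encoding="unicode")  # the page as an XML string

# Once the page is XML, XPath-style queries can pick out individual values.
prices = [span.text for span in root.findall(".//span[@class='price']")]
print(prices)
```

Once the page is in XML form, the same kind of path expression can target any element by tag, attribute, or position, which is what makes the conversion useful for data that is not in a table.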