Many pages on the web contain useful data in the form of simple HTML tables. Here's an example:
https://en.wikipedia.org/wiki/List_of_metropolitan_areas_of_the_United_States
I know how to use RapidMiner to retrieve this data automatically in HTML form using "Get Page" and store it as a document, and I even know how to do this iteratively if a set of related pages is required. I also have some familiarity with manipulating documents. What I really want, though, is to extract the information in the HTML table into a usable example set in RapidMiner. Is there a relatively simple way to do the following:
- collect the table column headers and use them as attribute names
- collect each data row from the table and store it as an example
- identify and set the appropriate data type for each resulting attribute
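To make the three steps above concrete, here is a rough sketch of the logic in plain Python (standard library only), outside RapidMiner. It is only an illustration of what I mean, not a claim about how RapidMiner would or should do it. The names `TableParser`, `convert_column` and `html_table_to_columns` are made up for this post, and the sample table contains invented values. It assumes the first table on the page is the one wanted, that every `<tr>`, `<th>` and `<td>` has an explicit closing tag, that there are no nested tables or merged cells, and that the first row holds the headers.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of the first <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per table row
        self._in_table = False  # True while inside the first table
        self._done = False      # True once the first table has closed
        self._row = None        # cells of the row being read
        self._cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("th", "td") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._in_table:
            self._done = True


def convert_column(values):
    """Return the column as ints, else floats, else the original strings."""
    for cast in (int, float):
        try:
            # Assumption: commas are thousands separators, as in "1,234,567".
            return [cast(v.replace(",", "")) for v in values]
        except ValueError:
            continue
    return values


def html_table_to_columns(html):
    """Map each header name to a typed list of that column's values."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        raise ValueError("no table rows found")
    header, *data = parser.rows  # first row supplies the attribute names
    return {
        name: convert_column([row[i] for row in data])
        for i, name in enumerate(header)
    }


# Invented sample data, only to show the shape of the input.
sample = """
<table>
  <tr><th>Rank</th><th>Metropolitan area</th><th>Population</th><th>Change</th></tr>
  <tr><td>1</td><td>Alpha City, AA</td><td>1,234,567</td><td>2.5</td></tr>
  <tr><td>2</td><td>Beta Town, BB</td><td>987,654</td><td>-0.75</td></tr>
</table>
"""
table = html_table_to_columns(sample)
```

Here `table` maps each column header to a list of values, with a whole column converted to integers or floats only when every value in it parses, and left as text otherwise. What I am after is the equivalent of this inside a RapidMiner process.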
An operator that did all of this automatically, called "HTML table to data" or something similar, seems like it would be incredibly useful. I'm fairly certain that such an operator doesn't exist (yet), and I'm not even sure which combination of existing operators would be needed to do all of the above. Any ideas, @mschmitz or @Thomas_Ott?
Thanks.