Extracting data from web pages

Hi,
I am trying to extract data from HTML pages. I have tried both regular expressions and XPath queries.

I was able to extract some details using XPath queries, but the HTML page I am extracting from is so complex that I cannot make out the tag hierarchy. That makes it very difficult to write XPath queries for all the data.

Is there any other way to find out the hierarchy of the HTML, so that I can extract the details using XPath queries?

regards,
siju sony mathew

land

Hi,
you might try to solve your problem by using tags with certain attribute values as anchors for your XPath query, for example div tags with a class, id or name attribute.
For easier orientation in the DOM tree, you could use a DOM explorer, which is available for most browsers. It shows the DOM as an expandable tree, much like a file explorer. Some even let you select a tag by clicking the corresponding area of the web page itself.
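As a rough illustration of the anchoring idea, here is a minimal Python sketch using only the standard library. The sample markup, the id and class values, and the function names `extract_headlines` and `outline` are all made up for this example, and `xml.etree.ElementTree` only accepts well-formed markup and a limited subset of XPath, so treat it as a sketch rather than a general solution.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; a real page will look different.
PAGE = """
<html>
  <body>
    <div id="header"><span>Site name</span></div>
    <div id="content">
      <div class="article">
        <h2>First headline</h2>
        <p class="summary">First summary</p>
      </div>
      <div class="article">
        <h2>Second headline</h2>
        <p class="summary">Second summary</p>
      </div>
    </div>
  </body>
</html>
"""


def extract_headlines(markup):
    """Return (headline, summary) pairs, anchored on id/class attributes."""
    root = ET.fromstring(markup.strip())
    # Anchor on attribute values instead of spelling out the full path
    # from the document root.
    articles = root.findall(".//div[@id='content']/div[@class='article']")
    return [(a.findtext("h2"), a.findtext("p[@class='summary']")) for a in articles]


def outline(element, depth=0):
    """Return the tag hierarchy as indented lines, with id/class shown."""
    label = element.tag
    for key in ("id", "class"):
        if key in element.attrib:
            label += f" {key}={element.attrib[key]}"
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print(extract_headlines(PAGE))
    print("\n".join(outline(ET.fromstring(PAGE.strip()))))
```

The `outline` helper is a text-only stand-in for what a DOM explorer shows graphically: it lists each tag with its id and class, which are the values you would pick as XPath anchors.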

Greetings,
Sebastian

sijusony

Hi,

Thank you for your suggestion. I was able to extract data from some intranet RSS feeds.
But I have two problems now:
1) With the user agent I am using (i.e. RapidMiner's default user agent), I am not able to crawl internet RSS feeds. Is there a user agent with which we would be able to crawl such sites? I am trying to crawl www.ndtv.com, but I cannot do so with RapidMiner's default user agent. Is there any way to find out which user agents a website supports?
2) If the web page is not well-formed HTML, is there any way to extract the data, given that XPath queries only work on well-formed HTML pages?

Greetings,
Siju

land

Hi Siju,
most sites should accept the user agent of one of the most common browsers, especially Internet Explorer. If that does not work, the site might be excluding crawlers in its robots.txt.
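To make the robots.txt point concrete, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the crawler name `ExampleCrawler`, the browser-like user agent string and the example.com URLs are all invented for illustration; they say nothing about what any real site publishes. No network request is sent.

```python
from urllib import robotparser
from urllib.request import Request

# Hypothetical robots.txt: one named crawler is blocked entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /

User-agent: *
Disallow: /private/
"""

# Assumed browser-like user agent string, for illustration only.
BROWSER_UA = "Mozilla/5.0 (compatible; ExampleBrowser/1.0)"


def is_allowed(robots_text, user_agent, url):
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url, user_agent):
    """Build (but do not send) a request carrying a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    feed = "http://example.com/feed.xml"
    print(is_allowed(ROBOTS_TXT, "ExampleCrawler/1.0", feed))
    print(is_allowed(ROBOTS_TXT, BROWSER_UA, feed))
    print(build_request(feed, BROWSER_UA).get_header("User-agent"))
```

Note that robots.txt only states the site's wishes. A server can also reject a user agent outright, in which case the request fails no matter what robots.txt says.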
If XPath does not work, you could use regular expressions to specify the regions of interest.
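A minimal Python sketch of the regular expression fallback, assuming the data of interest sits between recognizable markers. The broken fragment, the `Title:` label and the name `extract_titles` are hypothetical; a real page needs its own pattern.

```python
import re

# Hypothetical malformed fragment: unquoted attributes, unclosed <div>,
# <p> and <br> tags. An XML parser would reject this.
BROKEN = """
<div class=item><b>Title:</b> First story<br>
<p>Some text without a closing tag
<div class=item><b>Title:</b> Second story<br>
"""

# Capture whatever sits between the "Title:" label and the next <br.
TITLE_RE = re.compile(
    r"<b>\s*Title:\s*</b>\s*(.*?)\s*<br",
    re.IGNORECASE | re.DOTALL,
)


def extract_titles(markup):
    """Return the text of every region matched by TITLE_RE."""
    return [match.strip() for match in TITLE_RE.findall(markup)]


if __name__ == "__main__":
    print(extract_titles(BROKEN))
```

The pattern depends only on the local markers around each value, not on the page's overall tag structure, which is why it still works when the tags are unbalanced. The trade-off is that it breaks as soon as those markers change.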

Greetings,
Sebastian