Text extraction from HTML
After getting a nearly immediate response from the community to my first question, I feel encouraged to ask another one.
I'm trying to extract text from HTML. I would like to use XPath with the Extract Information operator instead of the Extract Content operator.
But when I copy and paste the XPath I get from Google Chrome (//*[@id="content"]/div[1]/div/p[1]/text()), I get no proper results.
In another post I read that I have to insert the h: prefix, like //h:*[@id="content"]/h:div[1]/h:div/h:p[1]/text(), but that brought no improvement.
Could you tell me what I did wrong?
Thanks in advance!
Answers
Hi,
Is it possible to post the full process? It's a bit hard to do this on the fly.
To be honest, I got a bit lazy lately. The Aylien extension provides an Extract Article option which makes the parsing obsolete. The free API is capped at 1k pages/day, though.
~Martin
I've done a bit of XPath extraction using RapidMiner and found that you can't just paste the XPath that Google Chrome gives you. Not sure why, but it doesn't work 99% of the time. I would look at the structure of the page and then build the XPath accordingly.
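
For anyone who wants to check an XPath outside RapidMiner first, here is a minimal sketch in Python using only the standard library. It shows the general effect of the h: prefix mentioned in the question: when a document is parsed as XHTML, its elements live in the XHTML namespace, and unprefixed names match nothing. The HTML snippet is made up, with a layout that merely mirrors the path from the question. Whether the Extract Information operator behaves exactly the same way is an assumption, not something confirmed in this thread. Note also that ElementTree supports only a subset of XPath and has no text() step, so the text is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet; the layout only mirrors the XPath in the question.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <div>
        <div>
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </div>
      </div>
    </div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Map the h: prefix to the XHTML namespace URI.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Unprefixed names do not match elements that live in the XHTML namespace.
plain = root.findall(".//div[@id='content']/div[1]/div/p[1]")

# Every step carries the prefix, so each element name resolves correctly.
prefixed = root.findall(".//h:div[@id='content']/h:div[1]/h:div/h:p[1]", NS)

# ElementTree has no text() step; take the text from the matched elements.
texts = [p.text for p in prefixed]

print(len(plain), texts)
```

If the prefixed path still returns nothing on a real page, the structure seen by the parser may differ from what the browser shows, for example because scripts changed the page after loading. In that case a shorter path anchored on a stable id is usually easier to debug than a long positional one.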