text extraction from html

New Altair Community Member

Jul 18, 2016

Updated Nov 5, 2024 by Jocelyn

After getting a nearly immediate response from the community to my first question, I feel encouraged to ask another one.

I'm trying to extract text from HTML, and I would like to use XPath with the Extract Information operator instead of the Extract Content operator.
But when I copy and paste the XPath I get from Google Chrome (//*[@id="content"]/div[1]/div/p[1]/text()), I get no proper results.

In another post I read that I have to insert the h: prefix, as in //h:*[@id="content"]/h:div[1]/h:div/h:p[1]/text(), but that brought no improvement.

Could you tell me what I did wrong?

Thanks in advance!

MartinLiebig

Altair Employee

Jul 18, 2016

Hi,

is it possible to post the full process? A bit hard to do this on the fly.

To be honest, I got a bit lazy lately. The Aylien extension provides an Extract Article option which makes the parsing obsolete. The free API is capped at 1k pages/day though.

~Martin

Thomas_Ott

New Altair Community Member

Jul 18, 2016

I've done a bit of XPath extraction using RapidMiner and found that you can't just paste the XPath that Chrome gives you. Not sure why, but it doesn't work 99% of the time. I would look at the structure of the page and then build the path accordingly.
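
As a rough illustration of the two points raised in this thread (the h: prefix and building the path from the page structure), here is a minimal Python sketch. It is not RapidMiner's XPath engine: it uses the standard library's ElementTree, which supports only a subset of XPath and has no text() step. The XHTML sample, the `NS` mapping, and the variable names are made up for the example, since the real page is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML sample; the real page structure is not shown in the thread.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <div>
        <div>
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </div>
      </div>
    </div>
  </body>
</html>"""

# Bind the h: prefix to the XHTML namespace (the prefix name itself is arbitrary).
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Without a prefix, the path asks for elements in no namespace and matches nothing.
unprefixed = root.find(".//div[@id='content']/div[1]/div/p[1]")

# With every step prefixed, the same path matches the first paragraph.
prefixed = root.find(".//h:div[@id='content']/h:div[1]/h:div/h:p[1]", NS)

# ElementTree has no text() step, so read the text from the matched element.
first_text = prefixed.text if prefixed is not None else None

# A shorter path anchored on the id depends less on the exact nesting of divs.
all_paragraphs = [p.text for p in root.findall(".//h:div[@id='content']//h:p", NS)]

print(unprefixed, first_text, all_paragraphs)
```

If a prefixed path still returns nothing, one plausible cause is that the path copied from the browser describes the rendered DOM, which can differ from the raw HTML source the tool actually parses. Anchoring on an id and using a short descendant path, as in the last query, tends to be more forgiving.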
