Text extraction from HTML

nrwstudent New Altair Community Member
edited November 5 in Community Q&A

After getting a nearly immediate response from the community to my first question, I feel encouraged to ask another one =)

I'm trying to extract text from HTML, and I would like to use XPath with the Extract Information operator instead of the Extract Content operator.
But when I copy and paste the XPath I get from Google Chrome (//*[@id="content"]/div[1]/div/p[1]/text()), I get no proper results.

In another post I read that I have to insert the h: prefix, as in //h:*[@id="content"]/h:div[1]/h:div/h:p[1]/text(), but that brought no improvement.

 

Could you tell me what I did wrong?

 

Thanks in advance!

Answers

  • MartinLiebig
    Altair Employee

    Hi,

     

    Is it possible to post the full process? It's a bit hard to diagnose this on the fly.

     

    To be honest, I've gotten a bit lazy lately. The Aylien extension provides an Extract Article option, which makes the parsing obsolete. The free API is capped at 1k pages/day, though.

     

    ~Martin

  • Thomas_Ott New Altair Community Member

    I've done a bit of XPath extraction using RapidMiner and found that you can't just paste the XPath that Google Chrome gives you. Not sure why, but it doesn't work 99% of the time. I would look at the structure of the page and then build the XPath accordingly.
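
To illustrate why a copied path can come back empty while an h:-prefixed one matches, here is a minimal sketch in Python (standard library only, not RapidMiner). It assumes the document is XHTML and that the h: prefix is bound to the XHTML namespace; the page content is made up for the example, and whether RapidMiner's operator behaves exactly the same way is not confirmed here. ElementTree only supports a subset of XPath and has no text() step, so the text is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML page, invented for this example. The xmlns attribute
# puts every element into the XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <div>
        <div>
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </div>
      </div>
    </div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Assumption: the h: prefix stands for the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Browser-style path without prefixes: "div" and "p" are looked up with no
# namespace, so nothing in this namespaced document matches.
unprefixed = root.findall(".//*[@id='content']/div[1]/div/p[1]")

# Same path with every element name prefixed.
prefixed = root.findall(".//*[@id='content']/h:div[1]/h:div/h:p[1]", NS)

print(len(unprefixed))
print([p.text for p in prefixed])
```

If the prefixed path still returns nothing on a real page, the path itself may not fit the document being parsed: the browser builds its XPath from the live DOM, which can differ from the HTML source that was downloaded. In that case it helps to start from a short expression such as the element with id "content" and extend it one step at a time.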