Extract Information

Question

Hi - I'd like to try to extract only the company names from this web page https://www.digitalmarketplace.service.gov.uk/g-cloud/search?q=, i.e. the second piece of text in each block.  Are Process Documents from the Web and Extract Information the most efficient way to do this?  I'm new to RapidMiner and XPath, and wondered if anyone could advise on the right XPath query expression to extract only the company name.

Telcontar120 · Accepted Answer

I can't help with the XPath query (I find XPath to be very finicky) but the attached process using simple string matching should do the trick.  You can also do it with RegEx if you prefer.
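For anyone reading without the attachment, here is a rough sketch of the same idea outside RapidMiner, in Python: find the text between a fixed start marker and end marker, first with plain string matching and then with a regular expression. The sample HTML, the class name search-result-supplier, and the company names are all assumptions made up for illustration; check the real page source for the actual tags before relying on either version.

```python
import re

# Hypothetical markup: the real page's tags and class names may differ.
SAMPLE_HTML = """
<div class="search-result">
  <h2 class="search-result-title"><a href="/services/1">Cloud Hosting</a></h2>
  <p class="search-result-supplier">Acme Ltd</p>
  <p class="search-result-excerpt">Managed hosting.</p>
</div>
<div class="search-result">
  <h2 class="search-result-title"><a href="/services/2">Data Backup</a></h2>
  <p class="search-result-supplier">Example Systems plc</p>
  <p class="search-result-excerpt">Offsite backup.</p>
</div>
"""

# Assumed markers that surround each company name.
START = '<p class="search-result-supplier">'
END = "</p>"


def extract_by_string_matching(html, start=START, end=END):
    """Collect every piece of text found between a start and an end marker."""
    names = []
    pos = 0
    while True:
        begin = html.find(start, pos)
        if begin == -1:
            break  # no more start markers
        begin += len(start)
        finish = html.find(end, begin)
        if finish == -1:
            break  # unterminated block, stop scanning
        names.append(html[begin:finish].strip())
        pos = finish + len(end)
    return names


def extract_by_regex(html):
    """Same extraction, expressed as a single non-greedy regular expression."""
    pattern = re.compile(
        r'<p class="search-result-supplier">\s*(.*?)\s*</p>', re.DOTALL
    )
    return pattern.findall(html)


if __name__ == "__main__":
    print(extract_by_string_matching(SAMPLE_HTML))
    print(extract_by_regex(SAMPLE_HTML))
```

Both functions should return the same list for well-formed blocks. The string-matching version is easier to debug when the markers are fixed; the regex version is shorter but more sensitive to small changes in the markup.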

Telcontar120 · Accepted Answer

Yes, because Get Pages returns an ExampleSet rather than a document collection, you may need to add the Nominal to Text and Extract Document operators as well, and put the subsequent processing inside a loop to iterate over multiple pages.
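The loop described above can be sketched in Python as an analogy, not as the RapidMiner process itself. Each fetched page is assumed to arrive as one row of a table with the page source in a column, each row is converted to a plain text document, and the extraction runs once per row. The column name html, the example.com URLs, the class name search-result-supplier, and the company names are hypothetical.

```python
import re

# Assumed pattern for the company name; adjust to the real page markup.
SUPPLIER_PATTERN = re.compile(
    r'<p class="search-result-supplier">\s*(.*?)\s*</p>', re.DOTALL
)


def extract_names_from_pages(rows, html_column="html"):
    """Loop over table rows (one per page) and pool the extracted names."""
    all_names = []
    for row in rows:
        # Treat the cell value as text, roughly what converting a
        # nominal attribute to a text document does.
        document = str(row[html_column])
        all_names.extend(SUPPLIER_PATTERN.findall(document))
    return all_names


# Hypothetical rows standing in for the table of fetched pages.
pages = [
    {
        "url": "https://example.com/search?page=1",
        "html": '<p class="search-result-supplier">Acme Ltd</p>',
    },
    {
        "url": "https://example.com/search?page=2",
        "html": '<p class="search-result-supplier">Example Systems plc</p>'
                '<p class="search-result-supplier">Sample Cloud Ltd</p>',
    },
]

if __name__ == "__main__":
    print(extract_names_from_pages(pages))
```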