Is it possible to extract data from a list of URLs instead of first saving them?
Laser
New Altair Community Member
Hi,
I've recently discovered RapidMiner, and I'm very excited about it. However, I'm still unsure whether the program can help me with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (it has more options than RapidMiner's 'Crawl Web' operator). I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. I was excited when I read that the 'Process Documents from Web' operator didn't need to store the HTML pages, but was disappointed to find that it still needs to do the crawling itself, and it lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPath matches from each of them?
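To make it concrete, here is roughly the loop I'm after, written as a minimal Python sketch (just to illustrate the idea, not RapidMiner code; it assumes the requests and lxml libraries are installed, and the file name and XPath are placeholders):

import requests
from lxml import html

# "urls.txt" stands in for the list my other program generates, one URL per line
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    response = requests.get(url, timeout=30)   # fetch each page into memory only
    tree = html.fromstring(response.content)   # parse the HTML without saving it to disk
    matches = tree.xpath("//h1/text()")        # placeholder XPath; mine would differ
    print(url, matches)

Ideally I'd build exactly that read-fetch-extract loop out of RapidMiner operators, without any intermediate step that saves the pages.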
I've been reading the manual and several pages on this forum (couldn't find an answer), and I've also watched a fair number of tutorials, but I'm still unable to figure it out. I could share the process I have right now, but I've just been collecting operators that look useful to me and haven't been able to connect them successfully, so it probably wouldn't make much sense. Any help is much appreciated. Thanks in advance.
~ George