"Cut Document II Crawling"

Question

Hi there, I noticed there is another post about cutting documents, raised by Roberto and answered by Matthias.

However, the first part of my problem is a little different from that post, and I believe it is an even easier one for people who know how to solve it.

Questions:

1. I will retrieve a web page, e.g. the Terms of Service page of Google. I want to put each paragraph into its own row in the output Excel file. I am not familiar with regular expressions, so please help me here.

2. Does RM support crawling the Internet, say, finding the hundreds of pages returned for the search keyword "Terms of Service"?

Thanks in advance.

colo · Answer

Hi Flake,

this looks good to me. I would probably prefer "Filter Examples" to get rid of the empty rows instead of using "Remove Duplicates", but this isn't really important.

Since you are using more than one cut expression for the "Cut Document" operator, you may want to know where an example came from. If you are interested in this, you can activate "add meta data" for "Process Documents" and identify the source by looking at the attribute query_key (many of the other attributes can be filtered out by using "Select Attributes"). If you don't need this information, you're already fine.

You have some possibilities for changing the operator chaining a bit (e.g. putting the HTML removal inside "Cut Document", putting "Cut Document" inside "Process Documents", etc.), but this doesn't really change anything. If I had created such a process, it would probably look the same.

Regards
Matthias
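To illustrate why filtering out empty rows can be preferable to removing duplicates, here is a minimal Python sketch. It is not RapidMiner code: the sample rows and both helper functions are hypothetical, and the sketch assumes that a duplicate-removal step keeps the first occurrence of each value.

```python
# Hypothetical extracted rows: two empty rows and one legitimately repeated paragraph.
extracted = ["First paragraph.", "", "Second paragraph.", "", "First paragraph."]


def remove_duplicates(items):
    """Keep the first occurrence of each value (assumed duplicate-removal behavior)."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return kept


def filter_empty(items):
    """Drop rows that are empty or contain only whitespace."""
    return [item for item in items if item.strip()]


print(remove_duplicates(extracted))
print(filter_empty(extracted))
```

Under these assumptions, the duplicate-removal approach still leaves one empty row behind and also drops the repeated paragraph, while the filter removes only the empty rows.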

Flake · Answer

Dear Matthias,

Many thanks for your help! It works for my purpose with a few simple tweaks. :)

Below is my process. What I added are the steps that remove the HTML tags and extract only the text. However, I ran into a problem: my solution generates several empty rows, so I had to add a Remove Duplicates operator to get rid of them. Since I am still learning to use RM, I suspect I didn't do it in the best way. If you are interested, could you give some suggestions on how to improve it?

colo · Answer

Hi Flake,

let's see if I can answer the second cut document topic as well ;)

If you want to get each paragraph (or some other HTML element) out of a website, I would probably prefer using XPath rather than writing regular expressions. The expression //h:p will find every paragraph at any depth (h is the default namespace for HTML elements).

RapidMiner provides the "Crawl Web" operator for crawling, but this is very slow when checking keywords within the document content. Perhaps some alternative crawlers (e.g. HTTRACK, Heritrix) will perform much better. Maybe someday an advanced crawler will replace the current implementation. There are one or two older topics with discussions about this.

Regards
Matthias

P.S. Please consider posting questions like this in the "Problems and Support Forum". In my opinion the forum's description is closer to many of the topics created here.
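For readers who want to try the //h:p idea outside RapidMiner, the following is a minimal Python sketch using only the standard library. The sample page, the function names, and the CSV output (standing in for the Excel sheet) are assumptions made for illustration. It also assumes the page is well-formed XHTML; real-world HTML usually needs a more tolerant parser first.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML standing in for a retrieved page.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Terms of Service</h1>
    <div><p>First paragraph.</p></div>
    <p>Second <b>bold</b> paragraph.</p>
    <p>   </p>
  </body>
</html>"""

# Bind the prefix "h" to the XHTML namespace, mirroring the //h:p expression.
NS = {"h": "http://www.w3.org/1999/xhtml"}


def paragraphs(xhtml):
    """Return the text of every <p> element at any depth, skipping empty ones."""
    root = ET.fromstring(xhtml)
    rows = []
    for p in root.findall(".//h:p", NS):
        # itertext() drops nested tags; split/join normalizes whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            rows.append(text)
    return rows


def to_csv(rows):
    """Write one paragraph per row under a single 'paragraph' column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["paragraph"])
    for row in rows:
        writer.writerow([row])
    return buf.getvalue()


print(to_csv(paragraphs(PAGE)))
```

Skipping empty paragraphs at extraction time avoids producing empty rows in the first place, so no later clean-up step is needed in this sketch.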