Different results each time I run this process

I've got my first RapidMiner personal-learning process working, but oddly it gives me a different answer each time I run it.
If I store the data after Step 1 and re-run Step 2, it gives the same answer each time, but a slightly wrong one. If I re-run Steps 1 and 2, I get different results each time, even though I can verify directly that the source site has not changed.
So I'm not sure if it's a bug, or if I've introduced an error somewhere in my process. I've attached my files in the hope that someone may be able to spot something.
Best Answer
In case anyone is interested, I found that if I run Process Documents from Web 5 times (4 didn't quite get there), then Append the 5 outputs, and Remove Duplicates, I get the entire set of data with nothing missing and nothing duplicated.
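For anyone wondering why this works: Append followed by Remove Duplicates is essentially a set union across the five crawl outputs, so as long as every item appears in at least one run, the merged result is complete and duplicate-free. A minimal Python sketch of the idea (the `(service_id, title)` row format is my own illustration; RapidMiner itself works on ExampleSets, not Python lists):

```python
def merge_runs(runs):
    """Union several crawl outputs, keeping the first occurrence of each row
    (roughly what Append followed by Remove Duplicates does)."""
    seen = set()
    merged = []
    for run in runs:
        for row in run:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged

# Illustrative rows only -- real runs here would have ~25k rows each.
run_a = [("100201645788425", "Peer-to-peer support planning and brokerage"),
         ("100201645788426", "Service B")]
run_b = [("100201645788426", "Service B"),      # overlaps run_a
         ("100201645788427", "Service C")]      # new row that run_a missed

merged = merge_runs([run_a, run_b])
print(len(merged))  # 3 unique rows
```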
Answers
I've rebuilt the process in a different way, using Extract Information inside Cut Document inside Process Documents from Web. This approach also produces a small percentage of incorrect results; for example, the following entry is somehow extracted twice. The total number of examples extracted is correct, but some are duplicated and some are missing.
<a href="/g-cloud/services/100201645788425">Peer-to-peer support planning and brokerage</a>
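As a sanity check outside RapidMiner, the extraction step itself is straightforward; a regex stand-in for Extract Information (my own illustration, not the operator's actual configuration) pulls the service id and title from each result link, which suggests the duplicates come from the crawl rather than the extraction:

```python
import re

# Pattern matching result anchors like the one quoted above.
LINK_RE = re.compile(r'<a href="/g-cloud/services/(\d+)">([^<]+)</a>')

html = '<a href="/g-cloud/services/100201645788425">Peer-to-peer support planning and brokerage</a>'
matches = LINK_RE.findall(html)
print(matches)
# [('100201645788425', 'Peer-to-peer support planning and brokerage')]
```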
This is proving a great learning exercise, but I can't for the life of me see where the problem is in my set-up. I wondered if anyone can help?
(One interesting learning here for me is that the Max Crawl Depth seems to be controlling the page iteration without the need for a Loop operator.)
Interesting.
So, I've found the root cause of the problem. It seems that both the RapidMiner process and the source web site are working correctly, but the web site itself has a rather curious feature.
Inspecting the data stored after the Process Documents from Web operator, I could see that less than 3% of the 25k examples were duplicated. And I noticed the duplicates always bridged consecutive pages.
It turns out that when I refresh a page URL directly in the browser (say "page=125"), the site's sort order occasionally toggles to an alternative sequence, presumably every x number of views.
So Process Documents from Web picks up 100 items from page x; then, when crawling page x+1, it may pick something up again from the prior page because of the re-sort, and lose something in exchange. Hence the overall total of 25,260 examples returned by RapidMiner always conspiratorially matched the web site total.
Not sure if there is a clever way to overcome that? Instead of crawling the 253 search results pages, each with 100 items that have the unfortunate tendency to hop about and hide, I could go direct to the 25k lower-level pages, but I would have preferred to mine only the results summary pages.
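The failure mode above can be reproduced with a toy model (entirely my own construction, not the actual site): 500 items paginated 100 per page, where the sort flips to a reversed order partway through the crawl. The page total stays correct while items are silently duplicated and missed:

```python
items = list(range(500))      # toy catalogue of 500 items
PAGE = 100
primary = items[:]            # the usual sort order
alternate = items[::-1]       # the occasional alternative sequence

collected = []
for page in range(5):
    # The sort toggles between one page fetch and the next (here, on page 3).
    order = alternate if page == 3 else primary
    collected += order[page * PAGE:(page + 1) * PAGE]

print(len(collected))       # 500 -- total still matches the site count
print(len(set(collected)))  # 400 -- but 100 items duplicated and 100 missed
```

This is why re-running the crawl several times and de-duplicating the union works: each run misses a different slice, so the runs cover each other's gaps.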