Loop operator issues
rur68
Hello, everyone,
I want to crawl a website with the Loop operator. The website treats its first page as the latest page, but the information I want to get is on the first five of the newest pages. Can the Loop operator in RapidMiner iterate backwards?
Find more posts tagged with: AI Studio, Loops + Branches
Accepted answers
rfuentealba
Hello @rur68,
For the sake of simplicity, I'll be using this URL instead of the one you provided: http://jakarta.id/page/%{iteration}/site
You are iterating over the numbers 1, 2 and 3, right? With number 1, that URL becomes http://jakarta.id/page/1/site; with number 2, it becomes http://jakarta.id/page/2/site; and with number 3, it becomes http://jakarta.id/page/3/site.
If you have a known last page number (e.g., 500), then you may use the Generate Macro operator, giving the generated macro a name (e.g., "calculated_page") with the following expression:
500 + 1 - eval(%{iteration})
That way, with number 1 the URL becomes http://jakarta.id/page/500/site; with number 2, it becomes http://jakarta.id/page/499/site; and with number 3, it becomes http://jakarta.id/page/498/site.
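The reversed-index arithmetic above can be sketched outside RapidMiner. This is a hypothetical Python sketch; `last_page = 500` and the jakarta.id URL pattern are just the example values from this thread, not a real site:

```python
# Reversed loop index: map iterations 1, 2, 3, ... onto last_page, last_page-1, ...
last_page = 500  # known number of the newest page (example value from above)

def page_url(iteration, last_page=last_page):
    # Same formula as the Generate Macro expression: last_page + 1 - iteration
    calculated_page = last_page + 1 - iteration
    return f"http://jakarta.id/page/{calculated_page}/site"

for i in (1, 2, 3):
    print(page_url(i))
# iteration 1 yields page 500, iteration 2 yields 499, iteration 3 yields 498
```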
However, that works for a known number. If you are after an unknown number (e.g., some 1000 new results arrive every day and you want to crawl those), then you might be out of luck (though the people in this community are amazing; they might come up with a solution), and I would recommend using something not-so-RapidMiner-ish like httrack on UNIX machines (Linux, Mac) to grab an updated copy of the site, then use indexes or other tricks up your sleeve to handle the pages as files.
Word of caution: httrack and other site crawlers might be prohibited in your country; your mileage may vary.
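For the unknown-number case, one possible workaround (not from the thread; a hypothetical sketch) is to fetch the first page once and read the highest page number out of its pagination links, then feed that number into the formula above. The HTML below is invented for illustration; a real site's markup will differ:

```python
import re

# Invented pagination markup, standing in for the first page of the target site.
sample_html = """
<div class="pagination">
  <a href="/page/1/site">1</a>
  <a href="/page/2/site">2</a>
  <a href="/page/731/site">Last</a>
</div>
"""

def find_last_page(html):
    # Collect every /page/N/ number that appears in a link and take the largest.
    numbers = [int(n) for n in re.findall(r"/page/(\d+)/", html)]
    return max(numbers) if numbers else None

print(find_last_page(sample_html))  # the highest page number found, here 731
```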
Hope this helps, if I can come up with a better solution, then I'll be back to this thread.
All the best,
Rodrigo.
All comments
rfuentealba
Hello,
Are you able to share an example of your data, and your XML, so we can see it and answer properly? There are many ways to go backwards inside a Loop operator, but I need to figure out what you are doing first.
All the best,
Rodrigo.
rur68
Hello @rfuentealba,
Here is the example process. I want to get the newest three pages of the website, but the process returns the last three pages instead.
crawling web with loop.rmp
rur68
Yeah, it works! Thank you so much, @rfuentealba!!
rfuentealba
Great, glad it helped!