SOLVED: RSS feeds & MySQL - 100 Records Only!
dudester
New Altair Community Member
I'll try to be brief: basically I have an issue trying to scrape complete RSS feeds into a MySQL database. Largely it works OK, but for some reason I can't decipher, it will only read 100 entries into MySQL, and lately it has been freezing my computer, likely due to memory constraints. (I speculate this may be due to recent extension additions - Image Processing, IDA?)
Anyway, according to the log, the RSS feed is pulled in less than 5 seconds, then it hangs while it tries to display results. The system monitor shows available memory down to zero. I believe I have the MySQL settings correct; the example set in RapidMiner never holds more than 100 entries at a time, even though I've set the batch size to 10,000. I need another pair of eyes...
So, here is the XML for the process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
<parameter key="logverbosity" value="all"/>
<process expanded="true" height="466" width="797">
<operator activated="true" class="web:read_rss" compatibility="5.2.000" expanded="true" height="60" name="Read RSS Feed" width="90" x="45" y="30">
<parameter key="url" value="http://some random feed=rss"/>
<parameter key="random_user_agent" value="true"/>
<parameter key="connection_timeout" value="100000"/>
<parameter key="read_timeout" value="100000"/>
</operator>
<operator activated="true" class="write_database" compatibility="5.2.003" expanded="true" height="60" name="Write Database" width="90" x="246" y="75">
<parameter key="connection" value="dbconnectionvalue"/>
<parameter key="use_default_schema" value="false"/>
<parameter key="schema_name" value="schema1"/>
<parameter key="table_name" value="tablename1"/>
<parameter key="overwrite_mode" value="append"/>
<parameter key="batch_size" value="10000"/>
</operator>
<connect from_op="Read RSS Feed" from_port="output" to_op="Write Database" to_port="input"/>
<connect from_op="Write Database" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Why the magic number of only 100 items pulled? I don't see it set anywhere, either here or in the RapidMiner preferences.
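One way to narrow this down is to count the <item> elements in the raw feed outside RapidMiner; if the count is already 100, the cap comes from the feed source rather than from RapidMiner or MySQL. A minimal sketch (Python, with a placeholder URL standing in for the anonymized feed above):

# Minimal check, outside RapidMiner, of how many items the feed itself
# returns; FEED_URL is just a placeholder for the anonymized feed above.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/some-feed?format=rss"  # placeholder

with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
    root = ET.parse(resp).getroot()

# RSS 2.0 nests items under <channel>; if this prints 100, the cap is at the source.
print(len(root.findall("./channel/item")), "items in the raw feed")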
Answers
Oops, my bad... it has nothing to do with either RapidMiner or MySQL.
Apparently Yahoo Pipes limits the amount of data you can scrape at a time to 100 items. There is a workaround of sorts (sketched below the quote), but it's probably best to either use another online mashup, or perhaps a desktop variety, for later input into DM.
From http://pipes.yqlblog.net/.
RSS pagination.
"Initial RSS output is now limited to the first 100 items. Each paginated page is limited to 100 items as well. To access each subsequent page add parameter &page=2…etc. to the pipe.run url to retrieve more items."0