SOLVED: RSS feeds & MySQL - 100 Records Only!

dudester New Altair Community Member
edited November 5 in Community Q&A
I'll try to be brief: basically I have an issue trying to scrape complete RSS feeds into a MySQL database. Largely it works OK, but for some reason I can't decipher, it will only read 100 entries into MySQL, and lately it has been freezing my computer, likely due to memory constraints. (I speculate this may be due to recent extension additions: Image Processing, IDA?)
Anyway, according to the log, the RSS feed is pulled in less than 5 seconds, then it hangs while it tries to display the results. The system monitor shows available memory down to zip. I believe I have the MySQL settings correct, yet the example set in RapidMiner never pulls more than 100 entries at a time, even though I've set the batch size to 10,000. I need another pair of eyes...

So, here is the XML for the process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <parameter key="logverbosity" value="all"/>
   <process expanded="true" height="466" width="797">
     <operator activated="true" class="web:read_rss" compatibility="5.2.000" expanded="true" height="60" name="Read RSS Feed" width="90" x="45" y="30">
       <parameter key="url" value="http://some random feed=rss"/>
       <parameter key="random_user_agent" value="true"/>
       <parameter key="connection_timeout" value="100000"/>
       <parameter key="read_timeout" value="100000"/>
     </operator>
     <operator activated="true" class="write_database" compatibility="5.2.003" expanded="true" height="60" name="Write Database" width="90" x="246" y="75">
       <parameter key="connection" value="dbconnectionvalue"/>
       <parameter key="use_default_schema" value="false"/>
       <parameter key="schema_name" value="schema1"/>
       <parameter key="table_name" value="tablename1"/>
       <parameter key="overwrite_mode" value="append"/>
       <parameter key="batch_size" value="10000"/>
     </operator>
     <connect from_op="Read RSS Feed" from_port="output" to_op="Write Database" to_port="input"/>
     <connect from_op="Write Database" from_port="through" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Why the magic number of only 100 items pulled? I don't see it set anywhere, either here or in the RapidMiner preferences.
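
For reference, one way to check whether the cap is on the MySQL side or the feed side is to count what actually lands in the table, outside of RapidMiner. A rough sketch in Python (connection details are placeholders; schema and table names match the process above):

import pymysql  # pip install pymysql; connection details below are placeholders

conn = pymysql.connect(host="localhost", user="dbuser", password="dbpass", database="schema1")
try:
    with conn.cursor() as cur:
        # Count the rows written by the Write Database operator above
        cur.execute("SELECT COUNT(*) FROM tablename1")
        print("rows currently in table:", cur.fetchone()[0])
finally:
    conn.close()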

Answers

  • dudester New Altair Community Member
    Oops, my bad... it turns out this has nothing to do with either RapidMiner or MySQL.

    Apparently Yahoo Pipes limits the amount of data you can scrape at a time to 100 items. There is a workaround of sorts (see the pagination note and sketch below), but it's probably best to either use another online mashup tool, or perhaps a desktop one, for later input into DM.

    From http://pipes.yqlblog.net/.

    RSS pagination.
    "Initial RSS output is now limited to the first 100 items. Each paginated page is limited to 100 items as well. To access each subsequent page add parameter &page=2…etc. to the pipe.run url to retrieve more items."