Applying a model to a large data set

Krystian (New Altair Community Member)
Hi, I am trying to apply a model to a database with 10 million records. The "Read Database" operator copies all the data from the database into my computer's memory, so it causes an out-of-memory exception, and on top of that the database connection times out. "Stream Database" looks nice, but it seems to work only for building a model, not for applying one (I get an error when applying a model with this operator). I am thinking about building a loop that fetches the data with a parameterized SQL LIMIT; limiting the data to e.g. 10 000 records works very well for applying the model. Please help; I think there must be a smarter way than writing loops. Most ETL tools have a streaming database read.
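
For reference, the chunked loop I have in mind would look roughly like this as plain JDBC code (a minimal sketch; the connection URL, credentials, the "id" column, and the per-row scoring step are placeholders; the LIMIT/OFFSET syntax assumes MySQL or PostgreSQL, and a stable ORDER BY is needed for the paging to be deterministic):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChunkedApply {
    public static void main(String[] args) throws Exception {
        int chunkSize = 10_000; // the chunk size that already worked well
        // Placeholder connection details
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://host/db", "user", "password")) {
            long offset = 0;
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT * FROM user_data_im_monthly_2012_02"
                        + " ORDER BY id LIMIT ? OFFSET ?")) { // 'id' column is assumed
                    ps.setInt(1, chunkSize);
                    ps.setLong(2, offset);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;
                            // apply the model to the current row here
                        }
                    }
                    if (rows < chunkSize) {
                        break; // last (possibly partial) chunk reached
                    }
                    offset += chunkSize;
                }
            }
        }
    }
}
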
Thanks
Answers

  • MariusHelf (New Altair Community Member)
    Hi Krystian,

    Using a loop is a perfectly good workaround if Stream Database does not work for you. As always, posting your process setup and the details of the error message would help with debugging.
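
    If Stream Database keeps failing, another option outside of RapidMiner is a plain JDBC cursor read, which keeps memory bounded without any looping. A minimal sketch (the connection details are placeholders, and the streaming behavior is driver-specific: PostgreSQL only streams when autocommit is off and a fetch size is set, while MySQL's Connector/J needs setFetchSize(Integer.MIN_VALUE) instead):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://host/db", "user", "password")) {
                conn.setAutoCommit(false); // required for PostgreSQL cursor reads
                try (Statement stmt = conn.createStatement()) {
                    stmt.setFetchSize(10_000); // rows buffered in memory at a time
                    try (ResultSet rs = stmt.executeQuery(
                            "SELECT * FROM user_data_im_monthly_2012_02")) {
                        while (rs.next()) {
                            // score one row at a time; memory use stays bounded
                        }
                    }
                }
            }
        }
    }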

    Best, Marius
  • Krystian (New Altair Community Member)
    I got:
    Apr 11, 2012 1:19:44 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
    Apr 11, 2012 1:19:44 PM SEVERE: Here:          Process[1] (Process)
              subprocess 'Main Process'
          ==>  +- Stream Database[1] (Stream Database)
                +- Write CSV[0] (Write CSV)
    Apr 11, 2012 1:19:44 PM SEVERE: java.lang.NullPointerException

    with Stream Database connected only to the Write CSV output, or even straight to the results screen:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
        <process expanded="true" height="341" width="480">
          <operator activated="true" class="stream_database" compatibility="5.2.003" expanded="true" height="60" name="Stream Database" width="90" x="112" y="165">
            <parameter key="connection" value="External"/>
            <parameter key="table_name" value="user_data_im_monthly_2012_02"/>
          </operator>
          <operator activated="true" class="write_csv" compatibility="5.2.003" expanded="true" height="76" name="Write CSV" width="90" x="380" y="165">
            <parameter key="csv_file" value="C:\Documents and Settings\GG\My Documents\tttt"/>
          </operator>
          <connect from_op="Stream Database" from_port="output" to_op="Write CSV" to_port="input"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
    Now I'm testing the PMML export from RapidMiner so that I can use the model in a streaming process in Pentaho. I will write back about how it works. Thanks
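
    In the meantime, for anyone trying the PMML route from Java: scoring an exported PMML file could look roughly like the sketch below. This assumes the open-source JPMML-Evaluator library (its API has changed across versions, so treat the exact class names as an assumption); the model path and the input row are placeholders:

    import java.io.File;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.dmg.pmml.FieldName;
    import org.jpmml.evaluator.Evaluator;
    import org.jpmml.evaluator.EvaluatorUtil;
    import org.jpmml.evaluator.FieldValue;
    import org.jpmml.evaluator.InputField;
    import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

    public class PmmlScoring {
        public static void main(String[] args) throws Exception {
            // Load the PMML file exported from RapidMiner (path is a placeholder)
            Evaluator evaluator = new LoadingModelEvaluatorBuilder()
                    .load(new File("model.pmml"))
                    .build();
            evaluator.verify(); // self-check against embedded verification data, if any

            // One input record, e.g. a row fetched from the database
            Map<String, Object> row = new LinkedHashMap<>();

            // Prepare the raw values into typed PMML arguments
            Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
            for (InputField inputField : evaluator.getInputFields()) {
                FieldName name = inputField.getName();
                arguments.put(name, inputField.prepare(row.get(name.getValue())));
            }

            // Score the record and decode the predicted target value
            Map<FieldName, ?> results = evaluator.evaluate(arguments);
            Object target = results.get(evaluator.getTargetFields().get(0).getName());
            System.out.println(EvaluatorUtil.decode(target));
        }
    }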