Querying TERADATA takes too long

RapidMinerUser12 New Altair Community Member
edited November 5 in Community Q&A
Hi all,

I have a very large Teradata table with more than 35,000,000 rows.
When I query 1,000,000 rows, the Read Database operator executes in about 12 s, but when I tried to select all rows, the process ran for more than 40 minutes and I had to stop it.

My question is: is this waiting time normal? If not, how can I shorten it and import all of the data from Teradata into RapidMiner?
I want to do the ETL in RapidMiner.

Thank you in advance.

Best Answer

  • BalazsBarany New Altair Community Member
    Answer ✓
    Hi,

    there are a few tricks you can try.

    The first is using LIMIT ... OFFSET ... in SQL. However, this is often inefficient, so talk to your database admins. 

    Another would be finding a nominal attribute with 10 to 100 possible values and looping over them, selecting the subset with the current value in each iteration. This works if the distribution in your table is more or less balanced (no single value makes up, say, 90 % of the table).

    If you have numeric IDs, you can use a modulo function to only select IDs ending with 0, 1, 2, 3, etc. 

    It would be interesting for the RapidMiner developers if you could benchmark the performance with 1, 5, 10, 15 etc. million rows and post the threshold where it becomes impractical. Maybe they could optimize the memory handling with that information.

    Regards,

    Balázs
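    The LIMIT ... OFFSET and modulo tricks above can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in database; Teradata's actual row-limiting syntax differs (it uses TOP and QUALIFY rather than LIMIT ... OFFSET), and the table and column names here are made up for the example:

    ```python
    import sqlite3

    # SQLite stands in for Teradata here; adjust the SQL dialect for real use.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO big_table (id, val) VALUES (?, ?)",
                     [(i, f"row{i}") for i in range(1, 101)])

    BATCH = 25
    offset = 0
    total = 0
    while True:
        # One batch per iteration: each Loop iteration in RapidMiner
        # would run a query like this with a different offset.
        rows = conn.execute(
            "SELECT id, val FROM big_table ORDER BY id LIMIT ? OFFSET ?",
            (BATCH, offset)).fetchall()
        if not rows:
            break
        total += len(rows)   # process the batch here
        offset += BATCH

    print(total)   # all 100 rows seen, 25 at a time

    # Modulo variant: partition the table by the last digit of a numeric id.
    bucket0 = conn.execute(
        "SELECT COUNT(*) FROM big_table WHERE id % 10 = 0").fetchone()[0]
    print(bucket0)   # 10 ids ending in 0 (10, 20, ..., 100)
    ```

    The modulo variant avoids the OFFSET inefficiency because each of the ten partitions is selected by a simple WHERE condition, at the cost of the database scanning the table once per partition unless the id column is indexed appropriately.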

Answers

  • BalazsBarany New Altair Community Member
    Hi!

    This is probably a memory issue. RapidMiner works by reading complete data sets into your computer's main memory. If you overwhelm the available memory with data (and it sounds like you are), everything gets slow, e.g. because of swapping.

    It's better to process the 35 M rows in batches, like you did with the 1 M rows. You could, for example, use one of the Loop operators.

    With this kind of big data, always try to do as much as possible inside the database. It is better at filtering, joining and sorting than a separate in-memory process can ever be. 

    You don't even have to learn SQL for this if you use the In-Database Processing extension.

    Regards,

    Balázs
  • RapidMinerUser12 New Altair Community Member
    Hi, 

    Thanks for your swift answer.

    We have 256 GB of memory on our machines. The In-Database Processing extension doesn't work with Teradata.
    Our requirement is that we do everything, including the ETL, in RapidMiner, not with hand-written queries.

    Can you explain further how we can process the data in batches? We would need some pointer-like indicator that tells the database where to start the next batch of data.

    Thank you in advance.
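    The "pointer-like indicator" asked about here is essentially keyset pagination: remember the last id fetched and start the next batch after it. A minimal sketch, again with SQLite standing in for Teradata (the row-limiting syntax would need adapting) and a hypothetical big_table:

    ```python
    import sqlite3

    # SQLite stand-in; the pointer is simply the last id seen in a batch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO big_table (id, val) VALUES (?, ?)",
                     [(i, f"row{i}") for i in range(1, 51)])

    BATCH = 20
    last_id = 0   # pointer: start before the first id
    seen = 0
    while True:
        rows = conn.execute(
            "SELECT id, val FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH)).fetchall()
        if not rows:
            break
        seen += len(rows)       # process the batch here
        last_id = rows[-1][0]   # advance the pointer past this batch

    print(seen)   # 50: every row visited exactly once
    ```

    Unlike LIMIT ... OFFSET, each query here only needs an index seek on id rather than skipping all previously read rows, so the cost per batch stays roughly constant as you work through the table.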