Querying TERADATA takes too long
RapidMinerUser12
Hi all,
I have a very large Teradata table with more than 35,000,000 rows.
When I query 1,000,000 rows, the Read Database operator finishes in 12 s, but when I tried to select all rows, the process ran for more than 40 minutes and I had to stop it.
Is this waiting time normal? If not, how can I shorten it and import all of the data from Teradata into RapidMiner?
I want to do the ETL in RapidMiner.
Thank you in advance.
Tagged with: AI Studio, Teradata, Read Database
Accepted answers
BalazsBaranyRM
Hi,
there are a few tricks you can do.
The first is using LIMIT ... OFFSET ... in SQL. However, this is often inefficient, so talk to your database admins.
Another would be finding a nominal attribute with 10 to 100 possible values and looping over these, selecting the subset with the current value in each iteration. This works if the distribution in your table is reasonably balanced (no single value accounting for, say, 90 % of the rows).
If you have numeric IDs, you can use a modulo function to only select IDs ending with 0, 1, 2, 3, etc.
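A minimal sketch of the modulo trick, using SQLite in Python as a stand-in for the Teradata connection (the table and column names `big_table`, `id`, and `payload` are made up for illustration):

```python
import sqlite3

# SQLite stands in for Teradata here; a real setup would use the JDBC/ODBC
# connection configured in RapidMiner instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO big_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(100)],
)

# Split the table into 10 disjoint batches by the last digit of the numeric ID.
total = 0
for remainder in range(10):
    batch = conn.execute(
        "SELECT id, payload FROM big_table WHERE id % 10 = ?", (remainder,)
    ).fetchall()
    total += len(batch)  # process the batch here instead of just counting

print(total)  # every row is seen exactly once across the 10 batches
```

Because the batches are disjoint and cover all IDs, the loop touches each row exactly once, and each iteration only has to hold one tenth of the table in memory.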
It would be interesting for the RapidMiner developers if you could benchmark the performance with 1, 5, 10, 15, etc. million rows and post here the threshold where it becomes impractical. Maybe they could optimize the memory handling with that information.
Regards,
Balázs
All comments
BalazsBaranyRM
Hi!
This is probably a memory issue. RapidMiner works by reading complete data sets into the main memory of your computer. If you overwhelm the available memory with data (and it sounds like you are doing that), everything gets slow, e.g. because of swapping.
It's better to process the 35 million rows in batches, as you did with the 1 million rows. You could, for example, use one of the Loop operators.
With this kind of big data, always try to do as much as possible inside the database. It is better at filtering, joining and sorting than a separate in-memory process can ever be.
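One common way to loop over batches is keyset pagination, where the highest ID seen so far acts as the pointer for the next batch. A minimal sketch, again with SQLite standing in for Teradata and made-up table and column names:

```python
import sqlite3

# SQLite stands in for Teradata; table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO big_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(100)],
)

BATCH_SIZE = 30
last_id = -1        # the "pointer": highest ID processed so far
rows_seen = 0
while True:
    batch = conn.execute(
        "SELECT id, payload FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH_SIZE),
    ).fetchall()
    if not batch:
        break
    rows_seen += len(batch)  # process the batch here
    last_id = batch[-1][0]   # advance the pointer to the last ID of the batch

print(rows_seen)
```

Unlike LIMIT ... OFFSET, this only ever scans forward from an indexed ID, so each batch stays cheap even deep into the table. In RapidMiner, the same pattern can be built with a Loop operator feeding the current pointer value into the Read Database query as a macro.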
You don't even have to learn SQL for this if you use the In-Database Processing extension.
Regards,
Balázs
RapidMinerUser12
Hi,
Thanks for your swift answer.
We have 256 GB of memory on our machines. The In-Database Processing extension doesn't work with Teradata.
Our requirements state that the entire ETL has to happen in RapidMiner, not in SQL queries.
Can you explain further how we can process the data in batches? We would need some pointer-like indicator that tells the database where to start the next batch.
Thank you in advance.