Challenge with RM Server - Running out of memory
Hi there!
I am a newbie and this is my first post in the community. We have an RM Server installation on top of an MS SQL Server box, with a job container that has 64 GB of RAM. I built some workflows against sample data in the Studio environment and am now trying to run those processes, after the necessary changes, in the Server environment connected to the original SQL data tables. These workflows mainly involve basic data joins and summarization after applying a few domain-specific business rules.
When I try to run a flow, I quickly run into an out-of-memory error. The challenge is that even the first part of my flow, which only reads a few variables from a 40 GB dataset, does not complete. Due to the nature of the data and the business knowledge involved, I am not in a position to share the XML flow or log files here.
I have a few specific questions for the community:
1. How does RM Server handle memory internally? Will the whole source data file be read and kept in memory while processing?
2. What is the maximum source database size that a single 64 GB container can handle?
3. Would you recommend RM Server for very large data processing operations (e.g. data close to a TB in size)?
Thanks,
Ramesh
Answers
Hi @Ramesh_T,
if you try to load a whole 40 GB table into RapidMiner, it is quite possible that you run into memory issues with 64 GB of RAM. If you only need a few variables from the data set, you can either query them directly in the Read Database operator (an example query is sketched below), or, if you have a more complex ETL workflow, take a look at the In-Database Processing extension on the RapidMiner Marketplace. With that you can shift most of the pre-processing workload from RapidMiner to the database.
Best,
David
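As an illustration of the first suggestion, here is a minimal sketch of a column-restricted query that could go into the Read Database operator instead of reading the full 40 GB table; the table and column names (dbo.sales_fact, customer_id, order_date, revenue) and the date filter are hypothetical placeholders:

-- Hypothetical query: pull only the variables the process actually needs,
-- rather than SELECT * over the whole table.
SELECT customer_id, order_date, revenue
FROM dbo.sales_fact
WHERE order_date >= '2023-01-01';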
Hi @Ramesh_T - I of course agree with everything @David_A has said above. If you are doing basic ETL like joins on large DB tables, you are almost always going to be better off doing those in-database rather than in RapidMiner (a small sketch of that follows this post). The In-Database Processing extension is quite good, especially if you don't want to or don't like writing SQL.
Another nice tool to use for cases like this is the caching operators from Old World Computing in their Jackhammer extension. They have just published some new blog articles showing how this is done - you can find part 1 here: https://oldworldcomputing.com/tutorial-introduction-to-caching-functions-of-the-jackhammer-extension-by-old-world-computing/ It is designed almost exactly for your use case. I'm cc'ing @land in case he has something more to add.
Scott
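To make the in-database idea concrete, here is a rough sketch, with made-up table and column names, of a join plus summarization that runs entirely inside SQL Server, so that RapidMiner only receives the already aggregated result:

-- Hypothetical sketch: join and summarize inside SQL Server; only the
-- aggregated rows are transferred to RapidMiner.
SELECT c.region,
       COUNT(*)       AS order_count,
       SUM(f.revenue) AS total_revenue
FROM dbo.sales_fact AS f
JOIN dbo.customer_dim AS c
  ON c.customer_id = f.customer_id
GROUP BY c.region;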
Hi @David_A, I didn't know about the In-Database Processing extension, it is quite useful, thanks!
@Ramesh_T, if you don't need to summarize across all rows (or if you can reconstruct the summary of all rows from sub-summaries, for example for the mean of a column), you can also fetch the rows in batches (a small sketch follows below). Otherwise, it makes sense to preprocess the data in the database and then use RM for the machine learning parts.
Regards,
Sebastian
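A rough sketch of the batching idea, again with hypothetical table and column names: each batch query returns a partial sum and count, and the overall mean is the sum of all batch sums divided by the sum of all batch counts:

-- Hypothetical batch fetch (SQL Server OFFSET/FETCH paging over a stable sort key);
-- run it repeatedly, increasing the offset by the batch size each time.
SELECT revenue
FROM dbo.sales_fact
ORDER BY sales_id
OFFSET 0 ROWS FETCH NEXT 500000 ROWS ONLY;

-- Per-batch sub-summary that can be recombined afterwards:
-- overall mean = SUM(batch_sum over all batches) / SUM(batch_cnt over all batches)
SELECT SUM(revenue) AS batch_sum, COUNT(*) AS batch_cnt
FROM dbo.sales_fact
WHERE sales_id BETWEEN 1 AND 500000;  -- hypothetical batch boundary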