Java OutOfMemoryErrors since upgrading to RapidMiner Server 9.6
PaulM
New Altair Community Member
We recently upgraded to RapidMiner Server 9.6 (from 9.3) and run several quite memory-intensive processes. Some of these have started failing with Java OutOfMemoryErrors, and what is more curious is that a process will run fine once and then fail later despite processing the same amount of data.
These symptoms made me wonder whether RapidMiner Server or Java is suffering from a memory leak. Has anyone else encountered anything similar? Any suggestions on how to diagnose the issue? (The Job Container is already running with maximum memory, so we can't increase that.)
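One JVM-level diagnostic I'm aware of, independent of RapidMiner itself, is to have the JVM write a heap dump when the OutOfMemoryError occurs and analyse it afterwards with a tool such as Eclipse MAT or VisualVM. These are standard HotSpot options (the dump path is just an example, and where JVM options for a Job Container are configured will depend on the deployment):

```
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/rm-heapdumps
```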
Answers
Hi @PaulM. Thanks for your question. If the processes sometimes succeed and sometimes fail, the behavior is likely related to the underlying JVM. The amount of memory shown on the job details page is the allocated memory, so even if the allocated memory is at 100%, the memory actually used by a job does not have to be at 100% as well.
9.6 introduced the new Job Containers, which are persistent across executions. Could you try setting a restart policy so that the Job Container gets restarted after each execution? This would simulate the old 9.3 behavior, and then we can see whether the problem still exists. For more information about the different policies, please see our documentation and the agent.properties file: https://docs.rapidminer.com/latest/server/configure/jobs/job-agents.html
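To make the allocated-vs-used distinction concrete, here is a minimal plain-Java illustration using the standard Runtime API (nothing RapidMiner-specific): totalMemory() is the heap the JVM has currently reserved, which can sit at the configured maximum even while much of it is internally free.

```java
public class HeapStats {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long allocated = rt.totalMemory();        // heap currently reserved by the JVM
        long used = allocated - rt.freeMemory();  // portion of the reserved heap in use
        long max = rt.maxMemory();                // upper bound, typically set via -Xmx
        System.out.printf("used=%d MB, allocated=%d MB, max=%d MB%n",
                used >> 20, allocated >> 20, max >> 20);
    }
}
```

So a job details page showing 100% allocation does not necessarily mean the job itself is using that much memory.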
Thanks @aschaferdiek. Before receiving your response, I re-engineered the processes that were loading the biggest datasets into memory so that they break the data into more manageable chunks and loop through them (the idea is sketched at the end of this post), and on the face of it this enabled things to run again.
However, last night the process failed with the error:
Job container forcefully killed
Job container '1' was killed forcefully and therefore the job execution has been stopped. Reason: Restart of job container has been invoked.
The restart wasn't something we triggered manually, so it would seem your assessment of which part of the system is causing the issue is correct. We will do as you suggest and report back.
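For anyone else who hits the same wall, the batching idea above boils down to reading and processing a large input in fixed-size chunks so that only one chunk is ever resident in memory. This is a generic Java sketch of the technique, not our actual RapidMiner process; the file name and chunk size are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ChunkedProcessor {

    private static final int CHUNK_SIZE = 10_000; // rows per batch; tune to fit the heap

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-dataset.csv"))) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    process(chunk);
                    chunk.clear(); // release the batch before reading the next one
                }
            }
            if (!chunk.isEmpty()) {
                process(chunk); // handle the final, partial batch
            }
        }
    }

    private static void process(List<String> rows) {
        // placeholder for the real per-chunk work
        System.out.println("processed " + rows.size() + " rows");
    }
}
```

Peak heap usage is then bounded by the chunk size rather than by the full dataset.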
Apologies for not reporting back sooner; we revised the restart policy as you suggested, and this resolved the issue. Does 9.7 introduce any further behaviour changes in this area?
Hi @PaulM. No, the introduction of persistent Job Containers is the only important change in this regard, but as you noticed, you can simulate the old behavior with the restart policy. I would still recommend updating to the latest release, as some (other) bugs have been fixed. If you experience any further problems, feel free to open a new thread.