job container forcibly killed
rur68
New Altair Community Member
I run a modeling job on RapidMiner Server. After all the operators finish successfully and the result is stored correctly, the process still ends in an error state:
"Job container '1' was killed forcefully and therefore the job execution has been stopped. Reason: Restart of job container has been invoked".
I didn't change the container restart policies, so they are still at their defaults.
Has anyone else encountered anything similar? Any suggestions on how to diagnose the issue?
"Job container '1' was killed forcefully and therefore the job execution has been stopped. Reason: Restart of job container has been invoked".
i didnt change the behaviour of container restart policies so it still default.
Has anyone else encountered anything similar? Any suggestions on to diagnose the issue?
Tagged:
0
Best Answer
-
Hi @rur68. With 9.5.0 we introduced the persistent job container, which speeds up execution. Job containers are no longer shut down after each job execution but instead kept alive, so jobs don't need to wait for the job container boot time any more. Just mentioning it as it's a fundamental change in architecture. Related to your problem:
- Would it be possible to try out our latest 9.7.1 Server/AI Hub release?
- Do you experience any network-related problems on the machine the JobAgent is running on? The timeout we see in the log could be related to network problems.
- Do you have enough memory on the machine where the JobAgent is running? Does the process maybe need more memory than the job container has available?
- You could try to increase the maximum error amount and the time between the health checks of the Job Container. If those limits are exceeded, the Job Container is flagged as unresponsive, killed and then restarted. You can give it a try by adding the following properties to the agent.properties file (see the illustrative sketch after the snippet for how these two settings interact). If they help, then your machine is probably overloaded or your local network interface might experience hiccups.
# amount of errors tolerated before shutdown
jobagent.container.maxErrorAmountBeforeSpawn = 10
# time between errors in milliseconds
jobagent.container.maxTimeBetweenErrors = 10000
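For intuition only, here is one way to read how these two settings could interact: consecutive health-check failures are counted, failures that are far enough apart are treated as unrelated, and the container is only flagged for a restart once the limit is reached. This is a sketch, not the actual Job Agent implementation; the property names are only referenced in comments and the exact semantics may differ.

import java.time.Duration;
import java.time.Instant;

// Illustrative sketch only: NOT the real Job Agent code.
public class HealthCheckTracker {
    private final int maxErrorsBeforeSpawn;       // cf. jobagent.container.maxErrorAmountBeforeSpawn
    private final Duration maxTimeBetweenErrors;  // cf. jobagent.container.maxTimeBetweenErrors (ms)
    private int consecutiveErrors = 0;
    private Instant lastError = null;

    public HealthCheckTracker(int maxErrorsBeforeSpawn, long maxTimeBetweenErrorsMillis) {
        this.maxErrorsBeforeSpawn = maxErrorsBeforeSpawn;
        this.maxTimeBetweenErrors = Duration.ofMillis(maxTimeBetweenErrorsMillis);
    }

    // Call when a health check fails; returns true if the container would be restarted.
    public boolean recordError(Instant now) {
        if (lastError != null && Duration.between(lastError, now).compareTo(maxTimeBetweenErrors) > 0) {
            consecutiveErrors = 0; // failures too far apart: reset the counter
        }
        lastError = now;
        consecutiveErrors++;
        return consecutiveErrors >= maxErrorsBeforeSpawn;
    }

    // Call when a health check succeeds.
    public void recordSuccess() {
        consecutiveErrors = 0;
        lastError = null;
    }
}

Under this reading, raising both values simply makes the agent more tolerant of slow or missed health-check responses before it decides to kill and respawn the container.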
Answers
-
Hi. Job containers are configured by default never to restart. If they do, it's typically because of some problem with the job. Is there any other problem in the log? Does this always happen with that job?
-
Hi @jpuente, thank you for your response. Actually, I already solved this.
I got many "matrix is singular" warnings in the log, probably because of the data I'm trying to predict on. I excluded the problematic part and then it ran well.
But this error kept coming back after I upgraded to RapidMiner Server 9.6. Some of my jobs that were fine in the previous version now end with an error state like this. I don't know what's going on; is version 9.6 somehow more sensitive to warnings like this?
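For context, and as a general linear-algebra note rather than anything specific to a RapidMiner operator: a "matrix is singular" warning usually means some attributes are linearly dependent (duplicated, constant, or perfectly correlated columns), so an internal matrix cannot be inverted. A minimal sketch of how that happens:

public class SingularMatrixDemo {
    public static void main(String[] args) {
        // A covariance-style matrix for two perfectly correlated attributes:
        // the rows are multiples of each other, so the determinant is 0 and
        // the matrix cannot be inverted.
        double[][] m = {
                {1.0, 2.0},
                {2.0, 4.0}   // second row = 2 * first row
        };
        double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
        System.out.println("determinant = " + det); // prints 0.0 -> singular
    }
}

Removing or deduplicating such attributes is what typically makes the warning go away.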
-
Hi. There's no change that should have changed behaviour that way. We could try to dig a bit deeper if you send the agent config file and the full log.
-
Hi @jpuente, here's the agent config file and the log.
FYI, the previous version I used was 9.0. And this is not the only job that gets the job container killed; I have another job that always ends in an error state like this even though the result is successfully stored.
-
It looks like the JC becomes unresponsive right after completing the job. I'll share internally and see what we can find.
-
Hi, thank you for your answer.
1. Unfortunately it's not possible to try 9.7.1 Server/AI Hub right now. But what is the fundamental change in architecture in these versions?
2. I don't think we have a network problem, because other jobs run fine.
3. It's also not the memory; I already increased it.
4. This was the only option I could try, and I already did, and it works. But I'm still confused why it ran well in version 9.0 while 9.6 gets errors like this.
Anyway, thank you very much @aschaferdiek.
-
There's no fundamental change in architecture from 9.6 to 9.7.1, but it's always a good idea to have the latest version running. From 9.0 to 9.6 there is a fundamental change: Job Agent and Job Containers communicate internally via HTTP/REST. In 9.0 there was no such communication, so this couldn't pop up, because there was no link between them at all (the Job Container just started as a separate and entirely standalone OS process). Glad that changing the properties helped. Since it did, it's still very likely that this is some weird networking/machine problem; the timeout message also suggests that. I know we cannot be sure, but there's no other reason why a simple HTTP request would time out on localhost otherwise. Thank you for taking the time to try this out together with me; we'll consider increasing the defaults here.
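To make the "timeout on localhost" point concrete, here is a small probe that times a plain HTTP GET against a local endpoint. The URL and port are placeholders, not the actual Job Agent or Job Container API; on a healthy, unloaded machine such a request should answer almost instantly, so repeated timeouts point at machine overload or a misbehaving local network stack.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class LocalhostProbe {
    public static void main(String[] args) {
        // Placeholder endpoint; point it at any local HTTP service you want to probe.
        URI target = URI.create("http://localhost:8080/health");
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(target)
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        long start = System.nanoTime();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("status=" + response.statusCode() + " in " + millis + " ms");
        } catch (Exception e) {
            System.out.println("request failed or timed out: " + e);
        }
    }
}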
-
Hi @aschaferdiek,
I already added the following properties to the agent.properties file, and it worked before.
Now the same job doesn't succeed at all because of a connection refused while building the model with deep learning; the process ends with the error state "Job container '1' was killed forcefully and therefore the job execution has been stopped. Reason: Restart of job container has been invoked". I've attached the log file. Any suggestions on how to handle this connection refused in the middle of the process?
# amount of errors tolerated before shutdown
jobagent.container.maxErrorAmountBeforeSpawn = 10
# time between errors in milliseconds
jobagent.container.maxTimeBetweenErrors = 100000
-
Hi @rur68. The main problem seems to be the same: the container gets killed because it has been unreachable for a certain number of retries. Do you see the Job Container java process running in the operating system process list during RapidMiner execution? How about resource consumption? To me it still looks like an overload of resources on the Job Container machine or a network problem. Could you monitor resources during execution? You could also try setting up a Job Agent on another machine with more resources?
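If a small programmatic check is easier than eyeballing the process list, here is a sketch in plain Java (nothing RapidMiner-specific, and the filter is just a heuristic) that lists running java processes with their command lines and CPU time, so you can confirm the Job Container process is alive and see roughly how busy it is:

public class ListJavaProcesses {
    public static void main(String[] args) {
        // Walk all OS processes visible to this JVM and keep those whose
        // executable looks like "java"; the command line usually reveals
        // which one is the Job Container.
        ProcessHandle.allProcesses()
                .filter(ph -> ph.info().command()
                        .map(cmd -> cmd.endsWith("java") || cmd.endsWith("java.exe"))
                        .orElse(false))
                .forEach(ph -> System.out.printf("pid=%d cpu=%s cmd=%s%n",
                        ph.pid(),
                        ph.info().totalCpuDuration().map(Object::toString).orElse("n/a"),
                        ph.info().commandLine().orElse("n/a")));
    }
}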