RM Server: scheduled job terminating abnormally

PaulM
PaulM New Altair Community Member
edited November 5 in Community Q&A
Hi,

I have a long-running scheduled job on RM Server 9.6 that normally runs without any issues. However, it failed last night and shows the following error:

Execution exited abnormally
Failed to submit job. Reason: I/O error on POST request for "http://localhost:10002/jobs": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
com.rapidminer.execution.jobagent.service.exception.ServiceException: Failed to submit job. Reason: I/O error on POST request for "http://localhost:10002/jobs": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
at com.rapidminer.execution.jobagent.clients.rest.JobContainerRestClient.submitJob(JobContainerRestClient.java:73)
at com.rapidminer.execution.jobagent.service.executor.JobExecutorService.startProcess(JobExecutorService.java:227)
at com.rapidminer.execution.jobagent.service.executor.JobExecutorService.executeProcess(JobExecutorService.java:150)
at com.rapidminer.execution.jobagent.queue.JobMessageConsumer.executeJob(JobMessageConsumer.java:171)
at com.rapidminer.execution.jobagent.queue.JobMessageConsumer.acceptJobMessage(JobMessageConsumer.java:90)
at sun.reflect.GeneratedMethodAccessor171.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:180)
at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:112)
at org.springframework.jms.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:104)
at org.springframework.jms.listener.adapter.MessagingMessageListenerAdapter.onMessage(MessagingMessageListenerAdapter.java:69)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:719)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:679)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:649)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:317)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:255)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1168)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1062)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on POST request for "http://localhost:10002/jobs": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:675)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:637)
at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:558)
at com.rapidminer.execution.jobagent.clients.rest.JobContainerRestClient.submitJob(JobContainerRestClient.java:69)
... 21 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at org.springframework.http.client.SimpleClientHttpResponse.getRawStatusCode(SimpleClientHttpResponse.java:52)
at org.springframework.web.client.DefaultResponseErrorHandler.hasError(DefaultResponseErrorHandler.java:54)
at org.springframework.web.client.RestTemplate.handleResponse(RestTemplate.java:697)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:662)
... 24 more
Log : (Last 1000 lines)
Last update was 7 minutes ago

This is the second time in 14 days that it has failed in this way, so it's not a one-off. Any ideas on where I should look to resolve this?

Many thanks,
Paul
Answers

  • aschaferdiek
    aschaferdiek New Altair Community Member
    edited August 2020
    This looks very similar to the following issue (a lighter version of it), and it may have the same cause: a network or machine hiccup. Your machine may simply be overloaded and unable to accept the request coming from the Job Agent (which I guess is submitting a new job; at least the POST suggests that).
    Are your Job Containers being restarted because something similar with a GET pops up several times in a row, or do they recover and proceed normally afterwards?
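
To make the "Read timed out" in the stack trace concrete, here is a minimal Python sketch of the failure mode: the connection is accepted, but the response does not arrive before the client's read timeout expires. It is not RapidMiner code. The local stand-in server, `start_slow_server`, `submit_job`, and the delay and timeout values are all assumptions for illustration; only the `/jobs` path and the POST method come from the error message.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request


def make_handler(delay_seconds):
    """Build a handler that answers POST requests only after a delay."""

    class SlowHandler(http.server.BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)      # consume the request body
            time.sleep(delay_seconds)    # stand-in for an overloaded machine
            try:
                self.send_response(200)
                self.send_header("Content-Length", "0")
                self.end_headers()
            except OSError:
                pass                     # the client has already given up

        def log_message(self, *args):
            pass                         # keep the demo quiet

    return SlowHandler


def start_slow_server(delay_seconds):
    """Start the stand-in server on a free local port (hypothetical helper)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), make_handler(delay_seconds))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def submit_job(url, read_timeout):
    """POST an empty JSON body and report whether the read timed out."""
    request = urllib.request.Request(url, data=b"{}", method="POST")
    # Ignore any proxy settings so the request really goes to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=read_timeout) as response:
            return f"HTTP {response.status}"
    except socket.timeout:
        return "read timed out"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.timeout):
            return "read timed out"
        raise
```

With a server delay of one second, calling `submit_job(url, read_timeout=0.2)` reports a read timeout, while a generous timeout of several seconds gets the normal response. A slow or overloaded receiver is therefore enough to produce this error, without the network or the process itself being broken.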

  • PaulM
    PaulM New Altair Community Member
    Thanks for your reply @aschaferdiek. I re-ran the same process that failed and it completed fine. It has also run fine every night since August 12th, until tonight, when it failed again at 2am with the same error and in the same sub-process: one that loops through a repository folder, reading and processing many datasets.

    However, I made an interesting discovery. The machine runs two job containers, and in the other container there is a scheduled process that runs every 3 minutes; at 2am every night it fails when it attempts to write to an external MySQL database. I need to talk to the DB team, but maybe they run some kind of backup process or similar on that database at that time, which causes that error.

    The fact that both of these issues happen at 2am can't be a coincidence. During that error, would RM be dumping a lot of error logging to the file system, which could cause the intermittent failure in the other job container running on the same machine?

    Thanks,
    Paul
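
One way to check the 2am overlap more systematically is to bucket the failure times of both jobs by hour of day. This is a rough Python sketch, assuming the failure timestamps have already been collected as ISO 8601 strings. The function names are hypothetical, and any timestamps passed in are supplied by the reader; none are taken from the actual logs.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps):
    """Count failure timestamps (ISO 8601 strings) per hour of day."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def shared_failure_hours(job_a_failures, job_b_failures):
    """Return the hours of day in which both jobs failed at least once."""
    hours_a = set(failures_by_hour(job_a_failures))
    hours_b = set(failures_by_hour(job_b_failures))
    return sorted(hours_a & hours_b)
```

If the only shared hour is 2, that supports the idea of a common trigger at that time, although it does not say which of the two jobs (or which external event) is the cause.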
  • aschaferdiek
    aschaferdiek New Altair Community Member
    That both errors pop up at the same time might be an indicator, yes.
    RapidMiner Server/AI Hub does not dump massive amounts of error logs or anything like that, just the one error.log file in the Job Agent's jobs directory. However, I don't know what the process is doing or whether it's resource heavy. To verify, could you maybe set it up on a different machine?
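
Before moving the process to another machine, it may help to record how loaded the current one is around the failure time. The following Python sketch is an assumption about how one might do that with the standard library only; it is not a RapidMiner feature. `sample_resources` is a hypothetical helper, and the load average is only available on Unix-like systems.

```python
import os
import shutil
import time


def sample_resources(path="."):
    """Take one snapshot of system load and free disk space for `path`."""
    usage = shutil.disk_usage(path)
    try:
        load_1min = os.getloadavg()[0]   # Unix-only; 1-minute load average
    except (AttributeError, OSError):
        load_1min = None                 # not available on this platform
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "load_1min": load_1min,
        "disk_free_bytes": usage.free,
    }
```

Calling this every minute or so from a small loop or a cron job between roughly 1:30am and 2:30am, and appending the result to a file, would show whether load spikes or disk space drops when the two jobs fail.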