Troubleshoot a slow vovserver response (NetworkComputer)
Usually, the NC vovserver response time is fast enough to be unnoticeable, even when handling hundreds to thousands of clients and millions of queued jobs.
However, you may encounter a noticeable lag when submitting jobs or checking their status. This can be caused by other processes on the NC vovserver machine, workload conditions, jobs on the subsystems, scheduler conditions, or vovserver issues.
The vovserver contains a number of subsystems that do the following:
- Handle incoming jobs and place them into jobqueue buckets of similar jobs.
- Match jobs with resources and place them on vovslave machines to be run.
- Manage all client connections, including vovslaves, the NC GUI, monitors, and HTTP browsers.
- Manage the jobs database.
Evaluate potential issues within each subsystem accordingly.
You should also evaluate the scheduler. Several conditions can contribute to extra work for the scheduler:
- A large number of jobqueue buckets (more than about 100)
- Jobs whose requested resources contain deep OR expressions (5 or more terms)
- A large number of notify clients
To resolve potential issues, perform the following steps:
- Use the uptime command to check the load average on the NC vovserver machine. If the machine is loaded by other programs, less processing power is available to vovserver. It is best for the NC's vovserver to have a dedicated host.
- Use the ps command to see whether other processes on the NC vovserver machine are consuming unusual amounts of CPU time, memory, I/O or other shared resources. Note the percentage of CPU consumed by vovserver, and if it is near 100%, continue with these steps.
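The load and process checks above can be sketched as follows (assumes GNU ps on Linux):

```shell
# Load averages: values persistently above the machine's core count
# indicate the host is overloaded.
uptime

# Top CPU consumers on the host, sorted by %CPU, descending.
# Note the %CPU value reported for vovserver.
ps aux --sort=-%cpu | head -n 10
```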
- View the Admin->Clients page, if possible, or check the number of clients from the command line with vovshow -clients.
- View the Jobqueue buckets page and note the number of buckets. If the number of buckets is over 200, investigate how jobs are being submitted. Consider using vnc_policy to quantize RAM/ and other resource requests to reduce the number of buckets.
- Use the RTDA vsi command to get info about the vovserver:
% nc cmd vsi
This shows the vovserver memory consumption, the number of job slots, buckets, and other important statistics about vovserver. Memory consumption is typically about 1 KB per node in the flow graph; for example, a flow of one million nodes should consume roughly 1 GB. If consumption is more than 3-4 times this estimate, investigate why.
- Check the number of files relative to total jobs in the system, and the number of 'done' and 'failed' jobs. If the number of done jobs is a significant fraction of total jobs in the system, check the values of the vovserver autoforget parameters. If jobs need to be retained for analysis, consider using the -keepfor option with 'nc run'. Also consider using -depset and 'vovset count' to get set statistics rather than 'nc list'.
- Check for large numbers of empty sets. Automated submission scripts sometimes put jobs into a set; the jobs are later auto-forgotten, but the sets remain. Clean up with:
% vovforget -emptysets
The following steps must be performed by the NC owner or by a privileged (root) account:
- Strace the vovserver process to see where it is spending time. Compress the strace output before sending it to customer support; it can be large.
% timeout --signal=INT 10 strace -ttt -T -o strace.log -p 16201
- Use the 'vovpstack' command (based on gdb), gdb, or pstack (more dangerous) to find the routines vovserver is calling. Repeat 5-10 times while vovserver is exhibiting the slow response:
% gdb -batch -ex bt -p 16201
This output is small enough to email to customer support uncompressed. If using pstack, DO NOT interrupt it with ^C; that may kill vovserver. A normal vovserver shows poll() in the main event loop. The presence of generator() or applyNodeSet() indicates that vovserver is handling queries such as 'nc list' or 'vtk_node_format'. Try to reduce such queries or limit their scope to a given set rather than System:jobs.
- Another kind of slowness is associated with client connection timeouts. When vovserver receives incoming TCP connections at a very high rate, the listening queue becomes saturated and new connections are not accepted. The ss command is useful in this case.
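For example, the listening socket can be inspected as follows (a sketch; ss is part of iproute2 on Linux):

```shell
# For listening sockets, ss reports the current accept-queue length in
# Recv-Q and the configured backlog limit in Send-Q. A Recv-Q persistently
# at or near Send-Q means new connections are being dropped or delayed.
ss -ltn

# Count connections stuck in SYN-RECV, another sign of accept-queue pressure.
ss -tn state syn-recv | wc -l
```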