Use vovreconciled to Release Resources Unused by Jobs (NetworkComputer)


Occasionally, jobs request resources but don't use them. For resources that are in short supply, this can prevent legitimate jobs from running, because the resources are unavailable. In this case, you must track down the jobs that are requesting the resources but not using them, and then ask their owners to relinquish the resources. This is a very time-consuming process that does not guarantee success.

Runtime provides a way to handle this automatically, called "reconciliation." A daemon called vovreconciled wakes up periodically (every loopTime), checks running jobs for resources that they hold but are not using, and revokes those resources once a configurable amount of time (revokeDelay) has passed. This is mainly used for licenses, because they are the most expensive and limited resources. Reconciliation can be set up for individual licenses, for job classes, globally, or any combination of the three.
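The decision described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the daemon's actual code: the Grab class, the to_revoke function, and the second and third job IDs are hypothetical, and the exact comparison (>= versus >) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Grab:
    """One resource held by a running job (hypothetical model, not a VOV type)."""
    job_id: str
    resource: str
    age_s: int         # seconds the job has been running
    checked_out: bool  # True if the job has actually checked out the license


def to_revoke(grabs, revoke_delay_s):
    """Return (job_id, resource) pairs that are unused and at least as old as the delay."""
    return [
        (g.job_id, g.resource)
        for g in grabs
        if not g.checked_out and g.age_s >= revoke_delay_s
    ]


grabs = [
    Grab("000084874", "License:Verilog-XL", 323, False),  # idle past the delay
    Grab("000084875", "License:Verilog-XL", 323, True),   # license in use: keep
    Grab("000084876", "License:Verilog-XL", 120, False),  # idle, but too young
]
flagged = to_revoke(grabs, revoke_delay_s=300)
print(flagged)
```

Only the first job is flagged: it is older than the delay and has never checked out the license it reserved.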

To set up reconciliation, perform the following steps:

1. Create a configuration file in a vovreconciled directory in the project swd directory.

2. Set up the vovreconciled daemon.

3. Use the files included at the bottom of this file to set up and test vovreconciled, following the steps below:

### setup and start vovreconciled
cd vnc.swd
mkdir vovreconciled
cp .../vovreconciled_config.tcl vovreconciled/config.tcl
cp .../start_vovreconciled.tcl autostart/start_vovreconciled.tcl

### you will also need to edit the resources.tcl file to disable the
### resources daemon license revocation by adding the following line.
echo "set RESD(doResRevoke) 0" >> ./resources.tcl

### stop and restart nc to start the vovreconciled daemon
ncmgr stop
ncmgr start
### you are still in vnc.swd, so change into the daemon directory
cd vovreconciled
### look at the log file to make sure there are no errors running vovreconciled
more vovreconciled.log

### test that vovreconciled is working
### create a test directory
mkdir ~/testdir
cd    ~/testdir
cp .../setupNC.csh .
cp .../test_vovreconciled1.csh .
chmod 755 setupNC.csh test_vovreconciled1.csh
### edit the setupNC.csh file to set the paths, the admin user name,
### and the nc queue name for your site

### to run the test
./test_vovreconciled1.csh

The test will start a sleep job that runs for 10 minutes.

The vovreconciled daemon checks all running jobs every 30 seconds. When a job has run for 5 minutes and holds a license that it has not checked out, that job is flagged to have the resource removed.
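Because the check is periodic, the revocation does not happen at exactly 5 minutes, but at the first check on or after that age. A small worked example, assuming the daemon compares the job age against the delay at each loop (the function name and the 23-second starting age are hypothetical; 23 is chosen so that the checks line up with the ages 323 and 353 shown in the log further down):

```python
import math


def first_revoke_age(first_check_age_s, loop_time_s, revoke_delay_s):
    """Job age at the first periodic check where age >= revoke delay."""
    if first_check_age_s >= revoke_delay_s:
        return first_check_age_s
    # Number of additional loops needed to reach or pass the delay.
    ticks = math.ceil((revoke_delay_s - first_check_age_s) / loop_time_s)
    return first_check_age_s + ticks * loop_time_s


age = first_revoke_age(first_check_age_s=23, loop_time_s=30, revoke_delay_s=300)
print(age)  # 323
```

In general, under this assumption the revocation lands somewhere between revokeDelay and revokeDelay + loopTime after the job starts.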

You will see lines similar to the following in the log as the daemon checks jobs.

vovreconciled 03/01/2016 22:41:14: msg-3: Checking revocation for all running jobs
vovreconciled 03/01/2016 22:41:14: msg-4: Checking revocation for RETRACING 000084874 323 {} {License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1} {} {}
vovreconciled 03/01/2016 22:41:14: msg-3: Checking revocation for 000084874  age=323
Grabbed: License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1
vovreconciled 03/01/2016 22:41:14: msg-3: FindLeastDelay
AggressiveClass:  10000000
REVOKE_DELAY in class:  10000000
REVOKE_DELAY in ResMap:  10000000
Global:  300
vovreconciled 03/01/2016 22:41:14: msg-3: Revoke delay for res=License:Verilog-XL in jobclass= is 5m00s
vovreconciled 03/01/2016 22:41:14: message: Save info about revoking 1 tokens of License:Verilog-XL from 000084874
vovreconciled 03/01/2016 22:41:14: msg-3: Checking reassignment for all running jobs
vovreconciled 03/01/2016 22:41:14: msg-4:     Checking reassigning for RETRACING 000084874 {} {}
vovreconciled 03/01/2016 22:41:14: msg-3: Sleeping for 30s
vovreconciled 03/01/2016 22:41:44: msg-3: Checking revocation for all running jobs
vovreconciled 03/01/2016 22:41:44: msg-4: Checking revocation for RETRACING 000084874 353 {} {License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1} {} {}
vovreconciled 03/01/2016 22:41:44: msg-3: Checking revocation for 000084874  age=353
Grabbed: License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1
vovreconciled 03/01/2016 22:41:44: msg-3: FindLeastDelay
AggressiveClass:  10000000
REVOKE_DELAY in class:  10000000
REVOKE_DELAY in ResMap:  10000000
Global:  300
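The FindLeastDelay lines in the log suggest that the daemon collects one candidate delay per source and uses the smallest. A minimal sketch of that selection follows, using the values printed above. The function names are hypothetical, and treating 10000000 as a "not set" placeholder is an assumption based only on this log.

```python
def find_least_delay(candidates):
    """Return (source, seconds) for the smallest candidate delay."""
    source = min(candidates, key=candidates.get)
    return source, candidates[source]


def format_delay(seconds):
    """Render seconds in the 'XmYYs' style seen in the log, e.g. 300 -> '5m00s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


# Values copied from the FindLeastDelay block in the log.
candidates = {
    "AggressiveClass": 10000000,
    "REVOKE_DELAY in class": 10000000,
    "REVOKE_DELAY in ResMap": 10000000,
    "Global": 300,
}
source, delay = find_least_delay(candidates)
label = format_delay(delay)
print(source, label)
```

Here the global setting wins because no per-license or per-jobclass delay is configured, which matches the "is 5m00s" line in the log.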

At this point, double-click the job in the nc gui. When the job info form opens, you will see that the following property is now on the job:
    "PROP CHANGEGRAB=1433333764 License:Verilog-XL -1"
The job is no longer reserving this license.
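If you want to check for this property in a script rather than in the gui, the value can be split into its three fields. The sketch below assumes the fields are a Unix timestamp, the resource name, and the change in token count; that reading is inferred from the example value, and parse_changegrab is a hypothetical helper, not part of NetworkComputer.

```python
from datetime import datetime, timezone


def parse_changegrab(prop):
    """Split 'CHANGEGRAB=<ts> <resource> <delta>' into typed fields."""
    name, _, value = prop.partition("=")
    if name != "CHANGEGRAB":
        raise ValueError(f"not a CHANGEGRAB property: {prop!r}")
    ts, resource, delta = value.split()
    return int(ts), resource, int(delta)


ts, resource, delta = parse_changegrab("CHANGEGRAB=1433333764 License:Verilog-XL -1")
# Assumption: the first field is seconds since the Unix epoch.
when = datetime.fromtimestamp(ts, tz=timezone.utc)
print(resource, delta, when.isoformat())
```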

###
### F I L E S    F O R   V O V R E C O N C I L E D
###

vovreconciled_config.tcl
==========
### in vovreconciled/config.tcl
### Optional: set revocation delay for specific Licenses
#vtk_resourcemap_set_revocation_delay License:abc 2m

### Optional: set revocation delay for specific job classes.
#vtk_jobclass_set_revocation_delay  Regression 5m

### set the revocation delay globally
set RESD(revokeDelay) 5m

### set the loop time for the checking. default is 30 seconds
set RESD(loopTime)    30
==========
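The configuration above mixes suffixed values such as 5m and 2m with a bare number of seconds (30). As a rough aid for sanity-checking such values, here is a sketch that converts them to seconds. It assumes the suffixes s, m, h, d, and w and that a bare integer means seconds; the function is hypothetical and is not the parser the daemon uses.

```python
import re

# Assumed unit suffixes; only "m" and bare seconds appear in the config above.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def timespec_to_seconds(spec):
    """Convert '5m', '1h30m', or a bare '30' into a number of seconds."""
    spec = str(spec).strip()
    if spec.isdigit():
        return int(spec)
    parts = re.findall(r"(\d+)([smhdw])", spec)
    # Reject input that contains anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != spec:
        raise ValueError(f"unrecognized time spec: {spec!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


print(timespec_to_seconds("5m"), timespec_to_seconds("2m"), timespec_to_seconds(30))
```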

start_vovreconciled.tcl
==========
# This is the autostart script for vovreconciled.
# We simply assume that everything is set up correctly:
#  The directory XXX.swd/vovreconciled exists
#  The file      XXX.swd/vovreconciled/config.tcl exists
#
set daemonName "vovreconciled"
VovMessage "Launching $daemonName"

indir [vovGetProjectFileName $daemonName] {
    if {[catch {exec $daemonName -n -v -v -v -v -v >>& $daemonName.log &} errmsg]} {
        VovFatalError "Cannot start $daemonName: $errmsg"
    }
}
exit 0
==========

setupNC.csh
==========
### you need to edit this file to correct these variables
### so they point to your directory structure and your queue name and
### admin user name

### the path to your RTDA installation version example /tools/rtda/2016.03
set rtdaVersionDir="/tools/rtda/2016.03"
### the file to setup your rtda version
set setupFile="${rtdaVersionDir}/common/etc/vovrc.csh"
### the vnc swd directory path
set vncswd="${rtdaVersionDir}/../vnc"
### the rtda queue name (default is "vnc")
set queueName="vnc"
### the admin user name, needed so vovproject enable will work
set adminUser="batadmin"
### The license we want to test with, it must exist in your vnc resources
set license="Verilog-XL"
set fsLicense=`echo ${license} | tr '-' '_'`

### check out all directories and paths

if ( -d "${rtdaVersionDir}" ) then
    if ( -f "${setupFile}" ) then
        if ( -d "${vncswd}" ) then
            if ( -d "${vncswd}/${queueName}.swd" ) then
                ### everything checks out so setup NC

                ### source the vovrc.csh file for your cluster
                source ${setupFile}

                ### setup the NC_QUEUE variable so things work smoothly
                setenv NC_QUEUE ${queueName}

                ### setup the VNCSWD variable for the swd location
                setenv VNCSWD ${vncswd}

                ### enable the project so vov commands work from the command line
                vovproject enable -u ${adminUser} ${NC_QUEUE}
            else
                echo "-E-: rtda vnc project swd directory does not exist, ${vncswd}/${queueName}.swd"
            endif
        else
            echo "-E-: rtda vnc directory does not exist, ${vncswd}"
        endif
    else
        echo "-E-: rtda environment setup file does not exist, ${setupFile}"
    endif
else
    echo "-E-: rtda version directory does not exist, ${rtdaVersionDir}"
endif
==========

test_vovreconciled1.csh
==========
#!/bin/csh -f

### source the setup file
source ./setupNC.csh

## open up the gui to see all your jobs
#nc gui &

### forget all previous jobs that may be running
nc forget -mine -forcerunning

### submit a job that requests a license but doesn't use it
nc run -p 4 -x 600 -fstokens 1 -nodb -nolog -r RAM/10 CORES/1 License:${license} -- sleep 600

### tail the reconcile daemon log file to see that the license
### reservation is revoked
tail -f ${vncswd}/${NC_QUEUE}.swd/vovreconciled/vovreconciled.log
==========