Use vovreconciled to Release Resources Unused by Jobs (NetworkComputer)
Occasionally, jobs request resources but do not use them. For resources that are in short supply, this can prevent legitimate jobs from running because the resources are unavailable. Without automation, you must track down the jobs that are requesting a resource but not using it, and then ask their owners to relinquish it. This is a very time-consuming process that does not guarantee success.
Runtime provides a way to handle this automatically, called "reconciliation." A daemon called vovreconciled wakes up periodically (at an interval set by loopTime). It checks running jobs for resources they hold but are not using, and revokes those resources from a job once the job has run for a set amount of time (revokeDelay). This is mainly used for licenses, because they are the most expensive and limited resources. Reconciliation can be set up for individual licenses, for job classes, globally, or any combination of the three.
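The log output shown later in this document includes a FindLeastDelay line that lists four candidate delays (AggressiveClass, REVOKE_DELAY in class, REVOKE_DELAY in ResMap, and Global), followed by a line reporting a revoke delay of 5m00s, which matches the smallest of the four. One plausible reading is that the shortest applicable delay wins. The sketch below only illustrates that reading; least_delay is a hypothetical helper, not a vovreconciled function, and the actual selection logic inside the daemon is an assumption.

```shell
# Hypothetical helper: print the smallest of the delays (in seconds) passed in.
# This illustrates one reading of the FindLeastDelay log line; it is not
# taken from the vovreconciled source.
least_delay() {
    least=$1
    shift
    for d in "$@"; do
        if [ "$d" -lt "$least" ]; then
            least=$d
        fi
    done
    echo "$least"
}

# Candidates in the order the log prints them:
# AggressiveClass, REVOKE_DELAY in class, REVOKE_DELAY in ResMap, Global.
least_delay 10000000 10000000 10000000 300
```

With the values from the log, the global setting of 300 seconds is the smallest, which is consistent with the reported delay of 5m00s.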
To set up reconciliation, perform the following steps:
1. Create a configuration file in a vovreconciled directory in the project swd directory.
2. Set up the vovreconciled daemon.
3. Test that vovreconciled is working. The files you need to set up and test vovreconciled are included at the bottom of this file. The commands for these steps are as follows:
### setup and start vovreconciled
cd vnc.swd
mkdir vovreconciled
cp .../vovreconciled_config.tcl vovreconciled/config.tcl
cp .../start_vovreconciled.tcl autostart/start_vovreconciled.tcl

### you will also need to edit the resources.tcl file to disable the
### resources daemon license revocation by adding the following line.
echo "set RESD(doResRevoke) 0" >> ./resources.tcl

### stop and restart nc to start the vovreconciled daemon
ncmgr stop
ncmgr start
cd vnc.swd/vovreconciled

### look at the log file to make sure there are no errors running vovreconciled
more vovreconciled.log

### test vovreconciled is working
### create a test directory
mkdir ~/testdir
cd ~/testdir
cp .../setupNC.csh .
cp .../test_vovreconciled1.csh .
chmod 755 setupNC.csh test_vovreconciled1.csh

### edit the setupNC.csh file to set paths and user ID's
### and the nc queue name for your site

### to run the test
./test_vovreconciled1.csh
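Before restarting nc in the sequence above, it can help to confirm that the files and the resources.tcl setting are in place. The following is a minimal sketch and not part of the product: check_reconcile_setup is a hypothetical helper name, and it looks only for the paths and the setting used in the commands above.

```shell
# Hypothetical helper: report which pieces of the setup are missing under
# the given swd directory. Returns non-zero if anything is missing.
check_reconcile_setup() {
    swd=$1
    status=0
    # Files copied into place by the setup commands.
    for f in vovreconciled/config.tcl autostart/start_vovreconciled.tcl; do
        if [ ! -f "$swd/$f" ]; then
            echo "missing: $swd/$f"
            status=1
        fi
    done
    # Line appended to resources.tcl to disable the resources daemon revocation.
    if ! grep -qF 'set RESD(doResRevoke) 0' "$swd/resources.tcl" 2>/dev/null; then
        echo "missing: 'set RESD(doResRevoke) 0' in $swd/resources.tcl"
        status=1
    fi
    return $status
}
```

For example, run check_reconcile_setup vnc.swd from the directory that contains vnc.swd; no output and a zero exit status mean all three items were found.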
The test starts a sleep job that runs for 10 minutes.
The vovreconciled daemon checks all running jobs every 30 seconds. When a job has run for 5 minutes and holds a license that has not been checked out, the job is flagged to have that resource removed.
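The timing can be worked through with a small sketch. It assumes that a job is flagged at the first periodic check where its age has reached the revoke delay; to_seconds and is_due_for_revoke are hypothetical helpers written for this illustration, not vovreconciled code. The age 323 appears in the sample log in this document; 293 is an assumed value for the check one 30-second loop earlier.

```shell
# Hypothetical helper: convert a delay such as "5m", "30s", or "300" to seconds.
# Compound forms such as "5m00s" are not handled in this sketch.
to_seconds() {
    case "$1" in
        *m) echo $(( ${1%m} * 60 )) ;;
        *s) echo "${1%s}" ;;
        *)  echo "$1" ;;
    esac
}

# Hypothetical helper: succeed when a job's age (seconds) has reached the delay.
is_due_for_revoke() {
    age=$1
    delay=$(to_seconds "$2")
    [ "$age" -ge "$delay" ]
}

# Two consecutive checks, 30 seconds apart, around the 5-minute mark.
for age in 293 323; do
    if is_due_for_revoke "$age" 5m; then
        echo "age=$age: flag unused license for revocation"
    else
        echo "age=$age: keep waiting"
    fi
done
```

Because checks happen only every loopTime seconds, under this assumption the revocation can occur up to one loop interval after the delay expires.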
As the daemon checks jobs, the log shows lines similar to the following:
vovreconciled 03/01/2016 22:41:14: msg-3: Checking revocation for all running jobs
vovreconciled 03/01/2016 22:41:14: msg-4: Checking revocation for RETRACING 000084874 323 {} {License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1} {} {}
vovreconciled 03/01/2016 22:41:14: msg-3: Checking revocation for 000084874 age=323 Grabbed: License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1
vovreconciled 06/01/2016 22:41:14: msg-3: FindLeastDelay AggressiveClass: 10000000 REVOKE_DELAY in class: 10000000 REVOKE_DELAY in ResMap: 10000000 Global: 300
vovreconciled 03/01/2016 22:41:14: msg-3: Revoke delay for res=License:Verilog-XL in jobclass= is 5m00s
vovreconciled 03/01/2016 22:41:14: message: Save info about revoking 1 tokens of License:Verilog-XL from 000084874
vovreconciled 03/01/2016 22:41:14: msg-3: Checking reassignment for all running jobs
vovreconciled 03/01/2016 22:41:14: msg-4: Checking reassigning for RETRACING 000084874 {} {}
vovreconciled 03/01/2016 22:41:14: msg-3: Sleeping for 30s
vovreconciled 03/01/2016 22:41:44: msg-3: Checking revocation for all running jobs
vovreconciled 03/01/2016 22:41:44: msg-4: Checking revocation for RETRACING 000084874 353 {} {License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1} {} {}
vovreconciled 03/01/2016 22:41:44: msg-3: Checking revocation for 000084874 age=353 Grabbed: License:Verilog-XL#1 Priority:normal#1 User:csg#1 Group:dv#1
vovreconciled 03/01/2016 22:41:44: msg-3: FindLeastDelay AggressiveClass: 10000000 REVOKE_DELAY in class: 10000000 REVOKE_DELAY in ResMap: 10000000 Global: 300

At this point, double-click the job in the nc gui. When the job info form opens, you will see that the following property is now on the job:
"PROP CHANGEGRAB=1433333764 License:Verilog-XL -1"
The job is no longer reserving this license.
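As an alternative to checking in the gui, the log itself can be searched for the "Save info about revoking" message. The sketch below assumes the message format stays exactly as in the sample log above; list_revocations is a hypothetical helper, and the temporary file only stands in for a real vovreconciled.log.

```shell
# Hypothetical helper: print "<jobId> <resource> <tokens>" for each revocation
# message found in a vovreconciled log. The message format is assumed to match
# the sample log lines in this document.
list_revocations() {
    sed -n 's/.*Save info about revoking \([0-9][0-9]*\) tokens of \([^ ]*\) from \([0-9][0-9]*\).*/\3 \2 \1/p' "$1"
}

# Sample lines copied from the log output above.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
vovreconciled 03/01/2016 22:41:14: msg-3: Revoke delay for res=License:Verilog-XL in jobclass= is 5m00s
vovreconciled 03/01/2016 22:41:14: message: Save info about revoking 1 tokens of License:Verilog-XL from 000084874
vovreconciled 03/01/2016 22:41:14: msg-3: Sleeping for 30s
EOF

list_revocations "$sample_log"
rm -f "$sample_log"
```

Against a live system, the same function would be pointed at the vovreconciled.log file in the vovreconciled directory.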
###
### F I L E S   F O R   V O V R E C O N C I L E D
###

vovreconciled_config.tcl
==========
### in vovreconciled/config.tcl

### Optional: set revocation delay for specific Licenses
#vtk_resourcemap_set_revocation_delay License:abc 2m

### Optional: set revocation delay for specific job classes.
#vtk_jobclass_set_revocation_delay Regression 5m

### set the revocation delay globally
set RESD(revokeDelay) 5m

### set the loop time for the checking. default is 30 seconds
set RESD(loopTime) 30
==========

start_vovreconciled.tcl
==========
# This is the autostart script for vovreconciled.
# We simply assume that everything is set up correctly:
# The directory XXX.swd/vovreconciled exists
# The file XXX.swd/vovreconciled/config.tcl exists
#
set daemonName "vovreconciled"
VovMessage "Launching $daemonName"
indir [vovGetProjectFileName $daemonName] {
    if [catch {exec $daemonName -n -v -v -v -v -v >>& $daemonName.log &} errmsg] {
        VovFatalError "Cannot start $daemonName: $errmsg"
    }
}
exit 0
==========

setupNC.csh
==========
### you need to edit this file to correct these variables
### so they point to your directory structure and your queue name and
### admin user name

### the path to your RTDA installation version example /tools/rtda/2016.03
set rtdaVersionDir="/tools/rtda/2016.03"

### the file to set up your rtda version
set setupFile="${rtdaVersionDir}/common/etc/vovrc.csh"

### the vnc swd directory path
set vncswd="${rtdaVersionDir}/../vnc"

### the rtda queue name (default is "vnc")
set queueName="vnc"

### the admin user name, needed so vovproject enable will work
set adminUser="batadmin"

### The license we want to test with, it must exist in your vnc resources
set license="Verilog-XL"
set fsLicense=`echo ${license} | tr '-' '_'`

### check out all directories and paths
if ( -d "${rtdaVersionDir}" ) then
    if ( -f "${setupFile}" ) then
        if ( -d "${vncswd}" ) then
            if ( -d "${vncswd}/${queueName}.swd" ) then
                ### everything checks out so set up NC
                ### source the vovrc.csh file for your cluster
                source ${setupFile}
                ### set up the NC_QUEUE variable so things work smoothly
                setenv NC_QUEUE ${queueName}
                ### set up the VNCSWD variable for the swd location
                setenv VNCSWD ${vncswd}
                ### enable the project so vov commands work from the command line
                vovproject enable -u ${adminUser} ${NC_QUEUE}
            else
                echo "-E-: rtda vnc project swd directory does not exist, ${vncswd}/${queueName}.swd"
            endif
        else
            echo "-E-: rtda vnc directory does not exist, ${vncswd}"
        endif
    else
        echo "-E-: rtda environment setup file does not exist, ${setupFile}"
    endif
else
    echo "-E-: rtda version directory does not exist, ${rtdaVersionDir}"
endif
==========

test_vovreconciled1.csh
==========
#!/bin/csh -f

### source the setup file
source ./setupNC.csh

## open up the gui to see all your jobs
#nc gui &

### forget all previous jobs that may be running
nc forget -mine -forcerunning

### submit the job
### which requests a license but doesn't use it
nc run -p 4 -x 600 -fstokens 1 -nodb -nolog -r RAM/10 CORES/1 License:${license} -- sleep 600

### tail the reconcile daemon log file to see that the license
### reservation is revoked
tail -f ${vncswd}/${NC_QUEUE}.swd/vovreconciled/vovreconciled.log
==========