Basics of Preemptive Scheduling


Introduction

 

Altair PBS Professional is a fast, powerful workload manager designed to improve productivity, optimize utilization and efficiency, and simplify administration for HPC clusters, clouds, and supercomputers. PBS Professional automates job scheduling, management, monitoring, and reporting, and it's the trusted solution for complex Top500 systems as well as smaller clusters.

 

Scheduling Policy

Scheduling policies are algorithms used to allocate system resources (CPU, memory, etc.) to tasks, improving execution efficiency and reducing delays and wait times. PBS Professional provides many scheduling policies that can be applied depending on the requirement; some of the most frequently used are FIFO, backfilling, and fairshare.

 

Definition

Preemption is the act of temporarily interrupting or preventing the execution of low-priority tasks in order to provide enough resources for high-priority tasks to execute. There are four different methods by which running jobs can be preempted, and these are discussed in the subsequent sections.

 

Challenges

In certain scenarios where we need high-priority tasks to execute immediately, we can use preemption to set aside the normal scheduling policy and run the higher-priority jobs. Suppose there is a critical task which needs to be completed for a delivery or to meet a target, and the cluster is filled with low-priority jobs. In such cases, preemption can be very helpful. Preemption allows high-priority tasks to acquire resources by suspending, requeuing, checkpointing, or deleting one or more lower-priority tasks.


Scheduling policies (e.g., FIFO, backfilling, fairshare) are not preemptive by default. These policies only adjust job execution priority, which in turn increases the probability that certain jobs run sooner. Preemption can be used alongside a scheduling policy to provide more control over job execution. Note that preemption can sometimes lead to resource starvation for lower-priority jobs.

 

Configuring Preemption

Preemption policy is always used together with other scheduling policies (e.g., FIFO, backfilling, fairshare). By default, the jobs in a cluster are not categorized into levels. When preemption is enabled, jobs are divided into different preemption levels (e.g., express_queue, normal_jobs). The preemption level of a job depends on the priority of the queue in which the job was submitted or on the type of job (e.g., reservation, preempted).

The preemption level is a characteristic of a job that determines its preemption priority (which job takes precedence over another).

By default, there are two preemption levels, namely express_queue and normal_jobs, but other preemption levels can be used depending on the policy/attributes configured and defined in the preempt_prio scheduler parameter. The preemption level is mainly determined by the queue in which the job is submitted. Queues are divided into normal and express queues based on the queue priority and the value of preempt_queue_prio. By default, all queues are normal queues. Queues whose priority has been set greater than preempt_queue_prio (default 150) are considered express queues, and the jobs submitted to these queues belong to the express job class.
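For illustration, a minimal sketch (the queue name highprio is hypothetical): a queue becomes an express queue once its priority exceeds preempt_queue_prio.

qmgr -c "set sched preempt_queue_prio = 150"
qmgr -c "create queue highprio queue_type=execution"
qmgr -c "set queue highprio priority = 200"
qmgr -c "set queue highprio enabled = true"
qmgr -c "set queue highprio started = true"

With priority 200 > 150, jobs submitted to highprio fall into the express job class.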

 

There are two ways to implement preemption: by using scheduler parameters (set in sched_config) or by using scheduler attributes (set with qmgr); both are covered in the configuration sections below.

PBS calculates two different priorities for jobs: job execution priority and job preemption priority.

 

Job execution priority and job preemption priority are independent of each other. Job execution priority determines the top job (the job with the highest execution priority, i.e., the job next in line to be executed). If the scheduler cannot run the top job and preemption is enabled, the scheduler checks the top job's preemption priority. If it cannot find enough lower-priority jobs that would release sufficient resources to run the high-priority job, it does not preempt any jobs.

 

Job Class

PBS groups jobs into classes and applies class-specific sorting rules when deciding execution order. There are four job classes: Reservation, Express, Preempted, and Normal. The jobs in each class are sorted based on rules specific to that class.

Jobs are sorted into the express class only when preemption is enabled. Jobs whose preemption priority is higher than that of normal jobs are placed in the express class. The preemption priority of jobs is determined by the preempt_prio scheduler parameter, which defines a list of preemption levels and their priority relative to each other; the priority is determined by the order in which the levels are listed. Every job in the cluster belongs to one of the preemption levels.

 

preempt_prio = "express_queue, starving_jobs, normal_jobs"

 

In the above example, jobs in the express_queue and starving_jobs preemption levels belong to the express class.

 

Note: Jobs running in the Reservation class have the highest job execution priority and cannot be preempted.

 

Class: Reservation
Description: Jobs submitted to an advance or standing reservation
Sort applied within class: Formula, job sort key, submission time

Class: Express
Description: All jobs with preemption priority higher than normal jobs. Preemption priority is defined in the scheduler's preempt_prio parameter. Jobs are sorted into this class only when preemption is enabled.
Sort applied within class: First by preemption priority, then by preemption time, then starving time, then by formula, fairshare, or job sort key, followed by job submission time

Class: Preempted
Description: All jobs that have been preempted
Sort applied within class: First by preemption time, then starving time, then by formula, fairshare, or job sort key, followed by job submission time

Class: Normal
Description: Jobs that do not belong in any of the special classes
Sort applied within class: Queue order, if it exists, then formula, fairshare, or job sort key, followed by job submission time

 

Preemption Level

The default value of the preempt_prio scheduler parameter is "express_queue, normal_jobs", but other levels can be added depending on the policy configured for the site. If other preemption levels are not defined in preempt_prio, jobs that would belong to levels other than express_queue are treated as normal jobs.


Jobs submitted to a high-priority queue (a queue with a priority greater than preempt_queue_prio) are grouped into the express_queue level and are often referred to as preempting or preemptive jobs.


preempt_prio = "express_queue, starving_jobs, normal_jobs"


The order in which jobs belonging to different levels are considered for preemption is based on the value of the scheduler attribute preempt_prio.

 

Preemption Level: express_queue
Description: Jobs in an express queue

Preemption Level: starving_jobs
Description: Jobs that have exceeded the wait time

Preemption Level: normal_jobs
Description: Jobs that do not fit into any other levels

Preemption Level: fairshare
Description: An entity owning a job exceeds its fairshare limit

Preemption Level: queue_softlimits
Description: Jobs that have exceeded their queue soft limits

Preemption Level: server_softlimits
Description: Jobs that have exceeded their server soft limits

 

Advantage

Sometimes jobs submitted in low-priority queues occupy resources in the cluster, which can delay the execution of high-priority or important jobs. Preemption helps in preventing such situations from occurring in the cluster.

If preemptive scheduling is enabled and many low-priority jobs are occupying cluster resources, preventing or delaying the execution of high-priority jobs, the scheduler chooses and preempts one or more low-priority jobs to release enough resources for the high-priority jobs to run. The preemptable jobs are selected based on the preempt_sort scheduler attribute, which accepts only min_time_since_start as a value. If the attribute is set (preempt_sort: min_time_since_start), all jobs eligible for preemption are sorted by their start time, and the most recently started jobs (least elapsed run time) are selected for preemption. If the attribute is unset, the longest-running jobs are selected for preemption instead.
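As a minimal sketch, the sort behavior can be toggled with qmgr:

qmgr -c "set sched preempt_sort = min_time_since_start"
qmgr -c "unset sched preempt_sort"

The first command makes the most recently started eligible jobs the first candidates for preemption; unsetting the attribute makes the longest-running eligible jobs the first candidates.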

 

Parameter vs Attributes

Parameters are set in a configuration file (sched_priv/sched_config or mom_priv/config), and the corresponding daemon (pbs_sched or pbs_mom) needs to be reloaded or restarted for the changes to take effect.
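For example, a sketch of the parameter workflow:

# vi $PBS_HOME/sched_priv/sched_config
# kill -HUP $(pgrep -f pbs_sched)

Editing sched_config (for example, the preemptive_sched parameter) takes effect only after pbs_sched re-reads the file, which the HUP signal triggers.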


In PBS, each daemon (pbs_server, pbs_sched) is represented by an object that has state and behavior. The state of an object is represented by data values called attributes, which provide a way to modify the object's state directly. Each daemon exposes attributes that can be modified without restarting the daemon. Most attributes are set using the qmgr command, where each attribute is addressed through the object it modifies.

 

            qmgr -c "set <object> <attribute> = <value>"

            qmgr -c "set server pbs_ default_queue = workq"      

            qmgr -c "set sched scheduling = True"

 

The characteristics of an attribute are:

Note: In PBS versions 19.2.x and earlier, preemption was configured through scheduler parameters.

 

Configuration files

There are two configuration files that are used to set parameters for preemptive scheduling: the scheduler configuration file ($PBS_HOME/sched_priv/sched_config) and the MoM configuration file ($PBS_HOME/mom_priv/config).

Scheduler Configuration

The scheduler configuration file ($PBS_HOME/sched_priv/sched_config) is used to enable preemptive scheduling, and the scheduler attributes are used for configuration. The substate and queue attributes of a job determine its preemption level. The preemption level, in turn, determines the preemption priority, i.e., the order in which jobs belonging to different levels are considered (which jobs preempt and which get preempted).

 

The scheduler parameters and attributes used for configuring preemption are:

 

              preemptive_sched: true     ALL

   

set sched preempt_queue_prio = 150

set sched preempt_prio = "express_queue, normal_jobs"

set sched preempt_order = S

set sched preempt_sort = min_time_since_start

 

MoM Configuration

The MoM configuration file is used to specify the scripts to trigger (for checkpointing) or to change the state of jobs in response to signals received from the server. The signals are sent from the server to pbs_mom, and the script corresponding to a signal is executed by pbs_mom on the execution host. The file can also be used to change the suspend and resume signals for jobs using the $suspendsig parameter.
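As a hypothetical sketch of $PBS_HOME/mom_priv/config (the script paths are placeholders, and the exact $action and $suspendsig syntax should be verified against the Admin Guide for your PBS version):

$action checkpoint_abort 60 !/var/spool/pbs/mom_priv/ckpt.sh %jobid %sid
$action restart 60 !/var/spool/pbs/mom_priv/restart.sh %jobid %sid

Here the checkpoint_abort action is run when a job is checkpointed and requeued during preemption, and restart is run when the requeued job is started again; $suspendsig (not shown) would override the default suspend/resume signals.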

 

Preemption Methods

The preemption method can be chosen depending on the site policy and the nature of applications running on the cluster. The preemption method is applied to all the jobs submitted in the cluster, so the method for preemption should be considered carefully, taking all scenarios into account.

There are four different ways that jobs can be preempted: suspension (S), checkpointing (C), requeuing (R), and deletion (D).



Some applications do not have the capability to checkpoint and resume, or to recover the state of the job from the information (state) stored in a file. However, PBS Professional provides the flexibility to combine multiple preemption methods (e.g., SCR) and to vary them by how much of the job has completed. The preemption method is set using the preempt_order scheduler attribute, and the default is SCR.
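As a sketch of combining methods, preempt_order can specify different method orderings for different stages of job completion (consult the Admin Guide for the exact threshold semantics):

qmgr -c 'set sched preempt_order = "SCR 80 SC 50 S"'

With a single ordering such as the default SCR, the scheduler first tries to suspend a job, then to checkpoint it, and finally to requeue it.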


Suspend (S)

Scenario: When the application has just started executing and the site chooses to preserve the state of the job and resume it from the same state later.

Pros:

Cons:

 

Checkpoint (C)

Scenario: When an application running in the cluster has the capability to store the current state of the application and use that information to continue processing when execution resumes.

Pros

Cons:

 

Requeue (R)

Scenario: When jobs running in the cluster are small jobs and do not require much CPU time for completion or when jobs run for a short duration.

Pros:

Cons:

 

Deletion (D)

Scenario: When the jobs running in the cluster are no longer required, or when other methods for preemption have failed to release the resources occupied by the lower-priority job, or other jobs require a clean system to run.

Pros:

Cons:

 

Configuring Preemption Method - Suspend

Now let’s set up preemption in a test environment and analyze how preemption works with the default options.


Steps to configure:

Disable scheduling while making configuration changes:

# qmgr -c "set sched scheduling=false"

 

Update $PBS_HOME/sched_priv/sched_config to enable preemptive scheduling:

preemptive_sched:           true ALL

 

Update scheduler attributes using qmgr. To verify the default configuration of the scheduler:

qmgr -c "print sched"

qmgr -c "set sched preempt_queue_prio = 150"
qmgr -c "set sched preempt_prio = 'express_queue, normal_jobs'"
qmgr -c "set sched preempt_order = S"
qmgr -c "set sched preempt_sort = min_time_since_start"

 

 

 

Reload the scheduler so it re-reads sched_config:

# kill -HUP $(pgrep -f pbs_sched)

 

# qmgr -c "create queue expressq queue_type=execution,started=true,enabled=true,priority=200"

# qmgr -c "create queue normalq queue_type=execution,started=true,enabled=true"

 

# qmgr -c "set sched scheduling=true"

 

Test the Environment

Queue configuration 

[root@master ~]# qstat -Qf expressq normalq

Queue: expressq

    queue_type = Execution

    Priority = 200

    total_jobs = 0

    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0

    enabled = True

    started = True

 

Queue: normalq

    queue_type = Execution

    total_jobs = 0

    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0

    enabled = True

    started = True

 

Node configuration:

 

[root@master ~]# pbsnodes -Sav

vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment

--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------

master          free            --       --       master          --              0 b       0       0       0 --

master_node1    free            --       --       master          --              2gb       2       0       0 --

master_node2    free            --       --       master          --              2gb       2       0       0 --

master_node3    free            --       --       master          --              2gb       2       0       0 --

master_node4    free            --       --       master          --              2gb       2       0       0 --

 

Testing

Submit a few jobs to the low-priority queue (normalq) to fill up the cluster and create the circumstances for preemption, for example:
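The jobs used in the runs below were small placeholder jobs along these lines (the sleep payload and resource requests are illustrative):

qsub -q normalq -N low_prio_job -l select=1:ncpus=2 -l walltime=00:02:00 -- /bin/sleep 120

Repeating this with different ncpus and walltime values fills the available CPUs, as shown in the qstat output below.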

                                                                                                 

Without Preemption 

 

            At T0

                                                                                                                                           Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19492.master                   test_user       normalq         low_prio_job            11316    1     2    --  00:02 R 00:00

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:00

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:00

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:00

 

            

            At T1:                            

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19492.master                   test_user       normalq         low_prio_job            11316    1     2    --  00:02 R 00:01

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:01

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:01

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:01

 

At T2

 

                                                                                                                                           Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19492.master                   test_user       normalq         low_prio_job            11316    1     2    --  00:02 R 00:02

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:02

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:02

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:02

19496.master                   test_user       expressq        urgent_job           --     1     6    --  00:02 Q   --

                                                                                                                                              

At T3

             Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:03

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:03

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:03

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 Q   --

 

            At T4

                         Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:04

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:04

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:04

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 Q   --

            

            At T5

                                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:05

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 Q   --

 

            At T6

                                                 Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 R 00:00



 

 

Analysis

Without preemption, the jobs are executed in the same sequence in which they were submitted to the cluster. The high-priority (urgent) job must wait until enough of the running jobs finish and sufficient resources are released for its execution.

 

With Preemption 

 

At T0

 

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19689.master                   test_user       normalq         low_prio_job       11737    1     2    --  00:02 R 00:00

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 R 00:00

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 R 00:00

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 R 00:00

 

At T1

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19689.master                   test_user       normalq         low_prio_job       11737    1     2    --  00:02 R 00:01

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 R 00:01

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 R 00:01

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 R 00:01

 

Currently there are no free resources available in the cluster to run new jobs. So, if we submit a new job to the high-priority queue (expressq), it should acquire resources from the low-priority jobs in order to execute. For example:
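Again with an illustrative sleep payload (adjust the select statement to your node layout):

qsub -q expressq -N urgent_job -l select=1:ncpus=6 -l walltime=00:02:00 -- /bin/sleep 120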

                                                                                                  

At T2

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19689.master                   test_user       normalq         low_prio_job       11737    1     2    --  00:02 R 00:02

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 S 00:01

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 S 00:01

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 S 00:01

19693.master                   test_user       expressq        urgent_job         11802    1     6    --  00:02 R 00:01

 

At T3

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 S 00:01

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 S 00:01

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 S 00:01

19693.master                   test_user       expressq        urgent_job         11802    1     6    --  00:02 R 00:02

  

At T4

  

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 R 00:02

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 R 00:02

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 R 00:02

 

 


Analysis

We submitted five jobs to the cluster. Jobs 19689.master, 19690.master, 19691.master, and 19692.master belong to the normal_jobs preemption level, whereas job 19693.master belongs to the express_queue preemption level and the Express job class. Because express_queue is listed before normal_jobs in the preempt_prio parameter, the job in the express_queue (19693.master) can preempt jobs belonging to normal_jobs. The scheduler attribute preempt_sort is set to min_time_since_start, so all jobs eligible for preemption (normal_jobs) are sorted by start time, and the most recently started jobs (19690.master, 19691.master, and 19692.master) are selected for preemption.
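To see when and why a particular job was suspended and resumed, the scheduler and MoM log entries for that job can be summarized with tracejob (job ID taken from the run above):

# tracejob 19690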

 

Conclusion

Preemption, or preemptive scheduling, provides the capability to control the execution of jobs based on the priority of the tasks. There are four different methods by which resources can be reclaimed from low-priority jobs. This article covers the basic configuration for implementing preemption.

 

References 

PBS Professional 2021.1 Admin Guide