Basics of Preemptive Scheduling


Introduction

 

Altair PBS Professional is a fast, powerful workload manager designed to improve productivity, optimize utilization and efficiency, and simplify administration for HPC clusters, clouds, and supercomputers. PBS Professional automates job scheduling, management, monitoring, and reporting, and it's the trusted solution for complex Top500 systems as well as smaller clusters.

 

Scheduling Policy

Scheduling policies are algorithms used to allocate system resources (CPU, memory, etc.) to tasks, improving execution efficiency and reducing delays and wait times. PBS Professional provides many scheduling policies that can be applied depending on the requirement; some of the most frequently used are FIFO, backfilling, and fairshare.

 

Definition

Preemption is the act of temporarily interrupting or preventing the execution of low-priority tasks in order to provide enough resources for high-priority tasks to execute. There are four different methods by which running jobs can be preempted, and these are discussed in the subsequent sections.

 

Challenges

In certain scenarios where we need high-priority tasks to execute immediately, we can use preemption to set aside the normal scheduling policy and run the higher-priority jobs. Suppose there is a critical task which needs to be completed for a delivery or to meet a target, and the cluster is filled with low-priority jobs. In such cases, preemption can be very helpful. Preemption allows high-priority tasks to acquire resources by suspending, requeuing, checkpointing, or deleting one or more lower-priority tasks.


Scheduling policies (e.g., FIFO, backfilling, fairshare) are not preemptive by default. These policies only adjust job execution priority, which in turn increases the probability that certain jobs run sooner. Preemption can be used alongside a scheduling policy to provide more control over job execution. Note that preemption can sometimes lead to resource starvation for lower-priority jobs.

 

Configuring Preemption

Preemption policy is always used together with other scheduling policies (e.g., FIFO, backfilling, fairshare). By default, the jobs in a cluster are not categorized into levels. When preemption is enabled, jobs are divided into different preemption levels (e.g., express_queue, normal_jobs). The preemption level of a job depends on the priority of the queue in which the job was submitted or on the type of job (e.g., reservation, preempted).

The preemption level is a characteristic of a job that determines its preemption priority (which job takes precedence over another).

By default, there are two preemption levels, namely express_queue and normal_jobs, but other preemption levels can be used depending on the policy/attributes configured and defined in the preempt_prio scheduler parameter. The preemption level is mainly determined by the queue in which the job is submitted. Queues are divided into normal and express queues based on the queue priority and the value of preempt_queue_prio. By default, all queues are normal queues. Queues whose priority has been set greater than preempt_queue_prio (default 150) are considered express queues, and the jobs submitted to these queues belong to the express job class.
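For illustration, a minimal sketch (the queue name highprio is hypothetical): a queue becomes an express queue once its priority exceeds preempt_queue_prio.

qmgr -c "set sched preempt_queue_prio = 150"
qmgr -c "create queue highprio queue_type=execution"
qmgr -c "set queue highprio priority = 200"
qmgr -c "set queue highprio enabled = true"
qmgr -c "set queue highprio started = true"

With priority 200 > 150, jobs submitted to highprio fall into the express job class.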

 

There are two ways to implement preemption: by using scheduler parameters (set in sched_config) or by using scheduler attributes (set with qmgr); both are covered in the configuration sections below.

PBS calculates two different priorities for jobs: job execution priority and job preemption priority.

 

Job execution priority and job preemption priority are independent of each other. Job execution priority determines the top job (the job with the highest execution priority, i.e., the job next in line to be executed). If the scheduler cannot run the top job and preemption is enabled, the scheduler checks the top job's preemption priority. If it cannot find enough lower-priority jobs that would release sufficient resources to run the high-priority job, it does not preempt any jobs.

 

Job Class

PBS groups jobs into classes and applies class-specific sorting rules when deciding execution order. There are four job classes: Reservation, Express, Preempted, and Normal. The jobs in each class are sorted based on rules specific to that class.

Jobs are sorted into the express class only when preemption is enabled. Jobs whose preemption priority is higher than that of normal jobs are placed in the express class. The preemption priority of jobs is determined by the preempt_prio scheduler parameter, which defines a list of preemption levels and their priority relative to each other; the priority is determined by the order in which the levels are listed. Every job in the cluster belongs to one of the preemption levels.

 

preempt_prio = "express_queue, starving_jobs, normal_jobs"

 

In the above example, jobs in the express_queue and starving_jobs preemption levels belong to the express class.

 

Note: Jobs running in the Reservation class have the highest job execution priority and cannot be preempted.

 

Class: Reservation
Description: Jobs submitted to an advance or standing reservation
Sort applied within class: Formula, job sort key, submission time

Class: Express
Description: All jobs with preemption priority higher than normal jobs. Preemption priority is defined in the scheduler's preempt_prio parameter. Jobs are sorted into this class only when preemption is enabled.
Sort applied within class: First by preemption priority, then by preemption time, then starving time, then by formula, fairshare, or job sort key, followed by job submission time

Class: Preempted
Description: All jobs that have been preempted
Sort applied within class: First by preemption time, then starving time, then by formula, fairshare, or job sort key, followed by job submission time

Class: Normal
Description: Jobs that do not belong in any of the special classes
Sort applied within class: Queue order, if it exists, then formula, fairshare, or job sort key, followed by job submission time

 

Preemption Level

The default value of the preempt_prio scheduler parameter is "express_queue, normal_jobs", but other levels can be added depending on the policy configured for the site. If other preemption levels are not defined in preempt_prio, jobs that would belong to levels other than express_queue are treated as normal jobs.


Jobs submitted to a high-priority queue (a queue with a priority greater than preempt_queue_prio) are grouped into the express_queue level and are often referred to as preempting or preemptive jobs.


preempt_prio = "express_queue, starving_jobs, normal_jobs"


The order in which jobs belonging to different levels are considered for preemption is based on the value of the scheduler attribute preempt_prio.

 

Preemption Level: express_queue
Description: Jobs in an express queue

Preemption Level: starving_jobs
Description: Jobs that have exceeded the wait time

Preemption Level: normal_jobs
Description: Jobs that do not fit into any other levels

Preemption Level: fairshare
Description: An entity owning a job exceeds its fairshare limit

Preemption Level: queue_softlimits
Description: Jobs that have exceeded their queue soft limits

Preemption Level: server_softlimits
Description: Jobs that have exceeded their server soft limits

 

Advantage

Sometimes jobs submitted in low-priority queues occupy resources in the cluster, which can delay the execution of high-priority or important jobs. Preemption helps in preventing such situations from occurring in the cluster.

If preemptive scheduling is enabled and many low-priority jobs are occupying cluster resources, preventing or delaying the execution of high-priority jobs, the scheduler chooses and preempts one or more low-priority jobs to release enough resources for the high-priority jobs to run. The preemptable jobs are selected based on the preempt_sort scheduler attribute, which accepts only min_time_since_start as a value. If the attribute is set (preempt_sort: min_time_since_start), all jobs eligible for preemption are sorted by their start time, and the most recently started jobs (least elapsed run time) are selected for preemption. If the attribute is unset, the longest-running jobs are selected for preemption instead.
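As a minimal sketch, the sort behavior can be toggled with qmgr:

qmgr -c "set sched preempt_sort = min_time_since_start"
qmgr -c "unset sched preempt_sort"

The first command makes the most recently started eligible jobs the first candidates for preemption; unsetting the attribute makes the longest-running eligible jobs the first candidates.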

 

Parameter vs Attributes

Parameters are set in a configuration file (sched_priv/sched_config or mom_priv/config), and the corresponding daemon (pbs_sched or pbs_mom) needs to be reloaded or restarted for the changes to take effect.
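For example, a sketch of the parameter workflow:

# vi $PBS_HOME/sched_priv/sched_config
# kill -HUP $(pgrep -f pbs_sched)

Editing sched_config (for example, the preemptive_sched parameter) takes effect only after pbs_sched re-reads the file, which the HUP signal triggers.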


In PBS, each daemon (pbs_server, pbs_sched) is represented by an object that has state and behavior. The state of an object is represented by data values called attributes, which provide a way to modify the object's state directly. Each daemon exposes attributes that can be modified without restarting the daemon. Most attributes are set using the qmgr command, where each attribute is addressed through the object it modifies.

 

            qmgr -c "set <object> <attribute> = <value>"

            qmgr -c "set server pbs_ default_queue = workq"      

            qmgr -c "set sched scheduling = True"

 

The characteristics of an attribute are:

Note: In PBS versions 19.2.x and earlier, preemption was configured through scheduler parameters.

 

Configuration files

There are two configuration files that are used to set parameters for preemptive scheduling: the scheduler configuration file ($PBS_HOME/sched_priv/sched_config) and the MoM configuration file ($PBS_HOME/mom_priv/config).

Scheduler Configuration

The scheduler configuration file ($PBS_HOME/sched_priv/sched_config) is used to enable preemptive scheduling, and the scheduler attributes are used for configuration. The substate and queue attributes of a job determine its preemption level. The preemption level, in turn, determines the preemption priority, i.e., the order in which jobs belonging to different levels are considered (which jobs preempt and which get preempted).

 

The scheduler parameters and attributes used for configuring preemption are:

 

              preemptive_sched: true     ALL

   

set sched preempt_queue_prio = 150

set sched preempt_prio = "express_queue, normal_jobs"

set sched preempt_order = S

set sched preempt_sort = min_time_since_start

 

MoM Configuration

The MoM configuration file is used to specify the scripts to trigger (for checkpointing) or to change the state of jobs in response to signals received from the server. The signals are sent from the server to pbs_mom, and the script corresponding to a signal is executed by pbs_mom on the execution host. The file can also be used to change the suspend and resume signals for jobs using the $suspendsig parameter.
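As a hypothetical sketch of $PBS_HOME/mom_priv/config (the script paths are placeholders, and the exact $action and $suspendsig syntax should be verified against the Admin Guide for your PBS version):

$action checkpoint_abort 60 !/var/spool/pbs/mom_priv/ckpt.sh %jobid %sid
$action restart 60 !/var/spool/pbs/mom_priv/restart.sh %jobid %sid

Here the checkpoint_abort action is run when a job is checkpointed and requeued during preemption, and restart is run when the requeued job is started again; $suspendsig (not shown) would override the default suspend/resume signals.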

 

Preemption Methods

The preemption method can be chosen depending on the site policy and the nature of applications running on the cluster. The preemption method is applied to all the jobs submitted in the cluster, so the method for preemption should be considered carefully, taking all scenarios into account.

There are four different ways that jobs can be preempted: suspension (S), checkpointing (C), requeuing (R), and deletion (D).



Some applications do not have the capability to checkpoint and resume, or to recover the state of the job from the information (state) stored in a file. However, PBS Professional provides the flexibility to combine multiple preemption methods (e.g., SCR) and to vary them by how much of the job has completed. The preemption method is set using the preempt_order scheduler attribute, and the default is SCR.
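As a sketch of combining methods, preempt_order can specify different method orderings for different stages of job completion (consult the Admin Guide for the exact threshold semantics):

qmgr -c 'set sched preempt_order = "SCR 80 SC 50 S"'

With a single ordering such as the default SCR, the scheduler first tries to suspend a job, then to checkpoint it, and finally to requeue it.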


Suspend (S)

Scenario: When the application has just started executing and the site chooses to preserve the state of the job and resume it from the same state later.

Pros:

Cons:

 

Checkpoint (C)

Scenario: When an application running in the cluster has the capability to store the current state of the application and use that information to continue processing when execution resumes.

Pros

Cons:

 

Requeue (R)

Scenario: When jobs running in the cluster are small jobs and do not require much CPU time for completion or when jobs run for a short duration.

Pros:

Cons:

 

Deletion (D)

Scenario: When the jobs running in the cluster are no longer required, or when other methods for preemption have failed to release the resources occupied by the lower-priority job, or other jobs require a clean system to run.

Pros:

Cons:

 

Configuring Preemption Method - Suspend

Now let’s set up preemption in a test environment and analyze how preemption works with the default options.


Steps to configure:

Disable scheduling while making configuration changes:

# qmgr -c "set sched scheduling=false"

 

Update $PBS_HOME/sched_priv/sched_config to enable preemptive scheduling:

preemptive_sched:           true ALL

 

Update scheduler attributes using qmgr. To verify the default configuration of the scheduler:

qmgr -c "print sched"

qmgr -c "set sched preempt_queue_prio = 150"
qmgr -c "set sched preempt_prio = 'express_queue, normal_jobs'"
qmgr -c "set sched preempt_order = S"
qmgr -c "set sched preempt_sort = min_time_since_start"

 

 

 

Reload the scheduler so it re-reads sched_config:

# kill -HUP $(pgrep -f pbs_sched)

 

# qmgr -c "create queue expressq queue_type=execution,started=true,enabled=true,priority=200"

# qmgr -c "create queue normalq queue_type=execution,started=true,enabled=true"

 

# qmgr -c "set sched scheduling=true"

 

Test the Environment

Queue configuration 

[root@master ~]# qstat -Qf expressq normalq

Queue: expressq

    queue_type = Execution

    Priority = 200

    total_jobs = 0

    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0

    enabled = True

    started = True

 

Queue: normalq

    queue_type = Execution

    total_jobs = 0

    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0

    enabled = True

    started = True

 

Node configuration:

 

[root@master ~]# pbsnodes -Sav

vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment

--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------

master          free            --       --       master          --              0 b       0       0       0 --

master_node1    free            --       --       master          --              2gb       2       0       0 --

master_node2    free            --       --       master          --              2gb       2       0       0 --

master_node3    free            --       --       master          --              2gb       2       0       0 --

master_node4    free            --       --       master          --              2gb       2       0       0 --

 

Testing

Submit a few jobs to the low-priority queue (normalq) to fill up the cluster and create the circumstances for preemption, for example:
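The jobs used in the runs below were small placeholder jobs along these lines (the sleep payload and resource requests are illustrative):

qsub -q normalq -N low_prio_job -l select=1:ncpus=2 -l walltime=00:02:00 -- /bin/sleep 120

Repeating this with different ncpus and walltime values fills the available CPUs, as shown in the qstat output below.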

                                                                                                 

Without Preemption 

 

            At T0

                                                                                                                                           Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19492.master                   test_user       normalq         low_prio_job            11316    1     2    --  00:02 R 00:00

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:00

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:00

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:00

 

            

            At T1:                            

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19492.master                   test_user       normalq         low_prio_job            11316    1     2    --  00:02 R 00:01

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:01

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:01

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:01

 

At T2

 

                                                                                                                                           Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19492.master                   test_user       normalq         low_prio_job            11316    1     2    --  00:02 R 00:02

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:02

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:02

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:02

19496.master                   test_user       expressq        urgent_job           --     1     6    --  00:02 Q   --

                                                                                                                                              

At T3

             Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:03

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:03

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:03

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 Q   --

 

            At T4

                         Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19493.master                   test_user       normalq         low_prio_job            11318    1     2    --  00:04 R 00:04

19494.master                   test_user       normalq         low_prio_job            11319    1     1    --  00:04 R 00:04

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:04

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 Q   --

            

            At T5

                                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19495.master                   test_user       normalq         low_prio_job            11320    1     3    --  00:05 R 00:05

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 Q   --

 

            At T6

                                                 Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19496.master                   test_user       expressq        urgent_job       --     1     6    --  00:02 R 00:00



 

 

Analysis

Without preemption, the jobs are executed in the same sequence in which they were submitted to the cluster. The high-priority (urgent) job must wait until enough of the running jobs finish and sufficient resources are released for its execution.

 

With Preemption 

 

At T0

 

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19689.master                   test_user       normalq         low_prio_job       11737    1     2    --  00:02 R 00:00

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 R 00:00

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 R 00:00

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 R 00:00

 

At T1

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19689.master                   test_user       normalq         low_prio_job       11737    1     2    --  00:02 R 00:01

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 R 00:01

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 R 00:01

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 R 00:01

 

Currently there are no free resources available in the cluster to run new jobs. So, if we submit a new job to the high-priority queue (expressq), it should acquire resources from the low-priority jobs in order to execute. For example:
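Again with an illustrative sleep payload (adjust the select statement to your node layout):

qsub -q expressq -N urgent_job -l select=1:ncpus=6 -l walltime=00:02:00 -- /bin/sleep 120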

                                                                                                  

At T2

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19689.master                   test_user       normalq         low_prio_job       11737    1     2    --  00:02 R 00:02

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 S 00:01

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 S 00:01

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 S 00:01

19693.master                   test_user       expressq        urgent_job         11802    1     6    --  00:02 R 00:01

 

At T3

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 S 00:01

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 S 00:01

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 S 00:01

19693.master                   test_user       expressq        urgent_job         11802    1     6    --  00:02 R 00:02

  

At T4

  

                                                                                                                                               Req'd  Req'd   Elap

Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time

------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----

19690.master                   test_user       normalq         low_prio_job       11741    1     2    --  00:04 R 00:02

19691.master                   test_user       normalq         low_prio_job       11747    1     1    --  00:04 R 00:02

19692.master                   test_user       normalq         low_prio_job       11755    1     3    --  00:05 R 00:02

 

 


Analysis

We submitted five jobs to the cluster. Jobs 19689.master, 19690.master, 19691.master, and 19692.master belong to the normal_jobs preemption level, whereas job 19693.master belongs to the express_queue preemption level and the Express job class. Because express_queue is listed before normal_jobs in the preempt_prio parameter, the job in the express_queue (19693.master) can preempt jobs belonging to normal_jobs. The scheduler attribute preempt_sort is set to min_time_since_start, so all jobs eligible for preemption (normal_jobs) are sorted by start time, and the most recently started jobs (19690.master, 19691.master, and 19692.master) are selected for preemption.
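To see when and why a particular job was suspended and resumed, the scheduler and MoM log entries for that job can be summarized with tracejob (job ID taken from the run above):

# tracejob 19690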

 

Conclusion

Preemption, or preemptive scheduling, provides the capability to control the execution of jobs based on the priority of the tasks. There are four different methods by which resources can be reclaimed from low-priority jobs. This article covers the basic configuration for implementing preemption.

 

References 

PBS Professional 2021.1 Admin Guide