Preemption
SLURM supports job preemption, the act of stopping one or more "low-priority" jobs to let a "high-priority" job run uninterrupted until it completes. Job preemption is implemented as a variation of SLURM's Gang Scheduling logic. When a high-priority job has been allocated resources that have already been allocated to one or more low priority jobs, the low priority job(s) are preempted. The low priority job(s) can resume once the high priority job completes. Alternatively, in newer versions of SLURM, the low priority job(s) can be configured to be requeued and started using other resources.
In SLURM version 2.0 and earlier, high priority work is identified by the priority of the job's partition, and low priority jobs are always suspended. The job preemption logic is within the sched/gang plugin. In SLURM version 2.1 and higher, the job's partition priority or its Quality Of Service (QOS) can be used to identify which jobs can preempt or be preempted by other jobs.
SLURM version 2.1 offers several options for the job preemption mechanism, including checkpoint, requeue, or cancel. Checkpointed jobs are not automatically requeued or restarted. Requeued jobs may restart faster by using different resources. All of these new job preemption mechanisms release a job's memory space for use by other jobs. In SLURM version 2.1, some job preemption logic was moved into the select plugin and main code base to permit use of both job preemption and the backfill scheduler plugin, sched/backfill.
SLURM version 2.2 offers the ability to configure the preemption mechanism used on a per partition or per QOS basis. For example, jobs in a low priority queue may get requeued, while jobs in a medium priority queue may get suspended.
Configuration
There are several important configuration parameters relating to preemption (a combined configuration sketch follows this list):
- SelectType: SLURM job preemption logic supports nodes allocated by the select/linear plugin and socket/core/CPU resources allocated by the select/cons_res plugin.
- SelectTypeParameters: Since resources may be over-allocated with jobs (suspended jobs remain in memory), the resource selection plugin should be configured to track the amount of memory used by each job to ensure that memory page swapping does not occur. When select/linear is chosen, we recommend setting SelectTypeParameters=CR_Memory. When select/cons_res is chosen, we recommend including Memory as a resource (e.g. SelectTypeParameters=CR_Core_Memory). NOTE: Unless PreemptMode=SUSPEND,GANG, these memory management parameters are not critical.
- DefMemPerCPU: Since job requests may not explicitly specify a memory requirement, we also recommend configuring DefMemPerCPU (default memory per allocated CPU) or DefMemPerNode (default memory per allocated node). It may also be desirable to configure MaxMemPerCPU (maximum memory per allocated CPU) or MaxMemPerNode (maximum memory per allocated node) in slurm.conf. Users can use the --mem or --mem-per-cpu options at job submission time to specify their memory requirements. NOTE: Unless PreemptMode=SUSPEND,GANG, these memory management parameters are not critical.
- GraceTime: Specifies a time period for a job to execute after it is selected to be preempted. This option can be specified by partition or QOS using the slurm.conf file or database, respectively. This option is only honored if PreemptMode=CANCEL. The GraceTime is specified in seconds and the default value is zero, which results in no preemption delay. Once a job has been selected for preemption, its end time is set to the current time plus GraceTime, and the mechanism used to terminate jobs upon reaching their time limit is used to cancel the job.
- JobAcctGatherType and JobAcctGatherFrequency: The "maximum data segment size" and "maximum virtual memory size" system limits will be configured for each job to ensure that the job does not exceed its requested amount of memory. If you wish to enable additional enforcement of memory limits, configure job accounting with the JobAcctGatherType and JobAcctGatherFrequency parameters. When accounting is enabled and a job exceeds its configured memory limits, it will be canceled in order to prevent it from adversely affecting other jobs sharing the same resources. NOTE: Unless PreemptMode=SUSPEND,GANG, these memory management parameters are not critical.
- PreemptMode: Specifies the mechanism used to preempt low priority jobs. The PreemptMode can be specified on a system-wide basis or on a per-partition basis when PreemptType=preempt/partition_prio. Note that when specified on a partition, a compatible mode must also be specified system-wide; specifically, if PreemptMode is set to SUSPEND for any partition, then the system-wide PreemptMode must include the GANG parameter so that the module responsible for resuming jobs executes. Configure to CANCEL, CHECKPOINT, SUSPEND or REQUEUE depending on the desired action for low priority jobs. The GANG option must also be specified if gang scheduling is desired or a PreemptMode of SUSPEND is used for any jobs.
- A value of CANCEL will always cancel the job.
- A value of CHECKPOINT will checkpoint (if possible) or kill low priority jobs. Checkpointed jobs are not automatically restarted.
- A value of REQUEUE will requeue (if possible) or kill low priority jobs. Requeued jobs are permitted to be restarted on different resources.
- A value of SUSPEND will suspend and automatically resume the low priority jobs. The SUSPEND option must be used with the GANG option (e.g. "PreemptMode=SUSPEND,GANG") and with PreemptType=preempt/partition_prio (the logic to suspend and resume jobs currently only has the data structures to support partitions).
- A value of GANG may be used with any of the above values and will execute a module responsible for resuming jobs previously suspended for either gang scheduling or job preemption with suspension.
- PreemptType: Configure to the desired mechanism used to identify which jobs can preempt other jobs.
- preempt/none indicates that jobs will not preempt each other (default).
- preempt/partition_prio indicates that jobs from one partition can preempt jobs from lower priority partitions.
- preempt/qos indicates that jobs from one Quality Of Service (QOS) can preempt jobs from a lower QOS. These jobs can be in the same partition or different partitions. PreemptMode must be set to CANCEL, CHECKPOINT, or REQUEUE. This option requires the use of a database identifying available QOS and their preemption rules. This option is not compatible with PreemptMode=OFF or PreemptMode=SUSPEND (i.e. preempted jobs must be removed from the resources).
- Priority: Configure the partition's Priority setting relative to other partitions to control the preemptive behavior when PreemptType=preempt/partition_prio. This option is not relevant if PreemptType=preempt/qos. If two jobs from two different partitions are allocated to the same resources, the job in the partition with the greater Priority value will preempt the job in the partition with the lesser Priority value. If the Priority values of the two partitions are equal then no preemption will occur. The default Priority value is 1. NOTE: Unless PreemptType=preempt/partition_prio, the partition Priority is not critical.
- Shared: Configure the partition's Shared setting to FORCE for all partitions in which job preemption using a suspend/resume mechanism is used, or to NO otherwise. The FORCE option supports an additional parameter that controls how many jobs can share a resource (FORCE[:max_share]). By default the max_share value is 4. In order to preempt jobs (and not gang schedule them), always set max_share to 1. To allow up to 2 jobs from this partition to be allocated to a common resource (and gang scheduled), set Shared=FORCE:2.
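To tie these parameters together, here is a minimal slurm.conf sketch for partition-based preemption with suspend/resume. It is illustrative only: the node names, partition names, memory sizes, and priorities are placeholders, not values taken from this document.

# Sketch: partition-based preemption via suspend/resume (placeholder values)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory  # track memory so suspended jobs are accounted for
DefMemPerCPU=1024                    # default MB per allocated CPU
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG             # suspend low priority jobs; GANG resumes them
# NodeName definitions omitted
PartitionName=low  Nodes=n[1-5] Default=YES Shared=FORCE:1 Priority=1
PartitionName=high Nodes=n[1-5] Default=NO  Shared=FORCE:1 Priority=2

For QOS-based preemption, PreemptType would instead be preempt/qos with a PreemptMode of CANCEL, CHECKPOINT or REQUEUE, and the preemption rules live on the QOS records in the database. Assuming QOS named high and low already exist, a command along these lines would let high preempt low (syntax may vary by version):

sacctmgr modify qos where name=high set preempt=low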
To enable preemption after making the configuration changes described above, restart SLURM if it is already running. Any change to the plugin settings in SLURM requires a full restart of the daemons. If you just change the partition Priority or Shared setting, this can be updated with scontrol reconfig.
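For example, assuming the daemons are managed by an init script (the script name and location vary by installation):

# A plugin change (e.g. PreemptType or PreemptMode) requires a full restart:
/etc/init.d/slurm stop && /etc/init.d/slurm start
# A partition Priority or Shared change only needs slurm.conf to be re-read:
scontrol reconfig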
Preemption Design and Operation
The select plugin will identify resources where a pending job can begin execution. When PreemptMode is configured to CANCEL, CHECKPOINT, SUSPEND or REQUEUE, the select plugin will also preempt running jobs as needed to initiate the pending job. When PreemptMode=SUSPEND,GANG the select plugin will initiate the pending job and rely upon the gang scheduling logic to perform job suspend and resume as described below.
When enabled, the gang scheduling logic (which also supports job preemption) keeps track of the resources allocated to all jobs. For each partition an "active bitmap" is maintained that tracks all concurrently running jobs in the SLURM cluster. Each partition also maintains a job list for that partition, and a list of "shadow" jobs. The "shadow" jobs are high priority job allocations that "cast shadows" on the active bitmaps of the low priority jobs. Jobs caught in these "shadows" will be preempted.
Each time a new job is allocated to resources in a partition and begins running, the gang scheduler adds a "shadow" of this job to all lower priority partitions. The active bitmaps of these lower priority partitions are then rebuilt, with the shadow jobs added first. Any existing jobs that were replaced by one or more "shadow" jobs are suspended (preempted). Conversely, when a high priority running job completes, its "shadow" goes away and the active bitmaps of the lower priority partitions are rebuilt to see if any suspended jobs can be resumed.
The gang scheduler plugin is designed to be reactive to the resource allocation decisions made by the "select" plugins. The "select" plugins have been enhanced to recognize when job preemption has been configured, and to factor in the priority of each partition when selecting resources for a job. When choosing resources for each job, the selector avoids resources that are in use by other jobs (unless sharing has been configured, in which case it does some load-balancing). However, when job preemption is enabled, the select plugins may choose resources that are already in use by jobs from partitions with a lower priority setting, even when sharing is disabled in those partitions.
This leaves the gang scheduler in charge of controlling which jobs should run on the over-allocated resources. If PreemptMode=SUSPEND, jobs are suspended using the same internal functions that support scontrol suspend and scontrol resume. A good way to observe the operation of the gang scheduler is by running squeue -i<time> in a terminal window.
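For example, to redraw the queue display every five seconds while jobs are being preempted and resumed:

[user@n16 ~]$ squeue -i5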
A Simple Example
The following example is configured with select/linear and PreemptMode=SUSPEND,GANG. This example takes place on a cluster of 5 nodes:
[user@n16 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     5   idle n[12-16]
hipri        up   infinite     5   idle n[12-16]
Here are the Partition settings:
[user@n16 ~]$ grep PartitionName /shared/slurm/slurm.conf
PartitionName=DEFAULT Shared=FORCE:1 Nodes=n[12-16]
PartitionName=active Priority=1 Default=YES
PartitionName=hipri Priority=2
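These partition definitions assume matching cluster-wide settings. Based on the description above, they would be along these lines (a sketch, not the original configuration file):

# Assumed cluster-wide settings for this example
SelectType=select/linear
SelectTypeParameters=CR_Memory
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG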
The runit.pl script launches a simple load-generating app that runs for the given number of seconds. Submit 5 single-node runit.pl jobs to run on all nodes:
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 485
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 486
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 487
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 488
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 489
[user@n16 ~]$ squeue -Si
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST
  485    active runit.pl  user  R  0:06     1 n12
  486    active runit.pl  user  R  0:06     1 n13
  487    active runit.pl  user  R  0:05     1 n14
  488    active runit.pl  user  R  0:05     1 n15
  489    active runit.pl  user  R  0:04     1 n16
Now submit a short-running 3-node job to the hipri partition:
[user@n16 ~]$ sbatch -N3 -p hipri ./runit.pl 30
sbatch: Submitted batch job 490
[user@n16 ~]$ squeue -Si
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST
  485    active runit.pl  user  S  0:27     1 n12
  486    active runit.pl  user  S  0:27     1 n13
  487    active runit.pl  user  S  0:26     1 n14
  488    active runit.pl  user  R  0:29     1 n15
  489    active runit.pl  user  R  0:28     1 n16
  490     hipri runit.pl  user  R  0:03     3 n[12-14]
Job 490 in the hipri partition preempted jobs 485, 486, and 487 from the active partition. Jobs 488 and 489 in the active partition remained running.
This state persisted until job 490 completed, at which point the preempted jobs were resumed:
[user@n16 ~]$ squeue
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST
  485    active runit.pl  user  R  0:30     1 n12
  486    active runit.pl  user  R  0:30     1 n13
  487    active runit.pl  user  R  0:29     1 n14
  488    active runit.pl  user  R  0:59     1 n15
  489    active runit.pl  user  R  0:58     1 n16
Another Example
In this example we have three different partitions using three different job preemption mechanisms: jobs in the low partition can be requeued, jobs in the med partition can be suspended (and later resumed), and jobs in the hi partition are never preempted.
# Excerpt from slurm.conf
PartitionName=low Nodes=linux Default=YES Shared=NO      Priority=10 PreemptMode=requeue
PartitionName=med Nodes=linux Default=NO  Shared=FORCE:1 Priority=20 PreemptMode=suspend
PartitionName=hi  Nodes=linux Default=NO  Shared=FORCE:1 Priority=30 PreemptMode=off
$ sbatch tmp
Submitted batch job 94
$ sbatch -p med tmp
Submitted batch job 95
$ sbatch -p hi tmp
Submitted batch job 96
$ squeue
JOBID PARTITION NAME USER ST  TIME NODES NODELIST(REASON)
   96        hi  tmp  moe  R  0:04     1 linux
   94       low  tmp  moe PD  0:00     1 (Resources)
   95       med  tmp  moe  S  0:02     1 linux

(after job 96 completes)

$ squeue
JOBID PARTITION NAME USER ST  TIME NODES NODELIST(REASON)
   94       low  tmp  moe PD  0:00     1 (Resources)
   95       med  tmp  moe  R  0:24     1 linux
Future Ideas
More intelligence in the select plugins: This implementation of preemption relies on intelligent job placement by the select plugins. In SLURM version 2.0 preemptive placement support was added to the SelectType plugins, but there is still room for improvement.
Take the following example:
[user@n8 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     5   idle n[1-5]
hipri        up   infinite     5   idle n[1-5]
[user@n8 ~]$ sbatch -N1 -n2 ./sleepme 60
sbatch: Submitted batch job 17
[user@n8 ~]$ sbatch -N1 -n2 ./sleepme 60
sbatch: Submitted batch job 18
[user@n8 ~]$ sbatch -N1 -n2 ./sleepme 60
sbatch: Submitted batch job 19
[user@n8 ~]$ squeue
JOBID PARTITION    NAME    USER ST  TIME NODES NODELIST(REASON)
   17    active sleepme cholmes  R  0:03     1 n1
   18    active sleepme cholmes  R  0:03     1 n2
   19    active sleepme cholmes  R  0:02     1 n3
[user@n8 ~]$ sbatch -N3 -n6 -p hipri ./sleepme 20
sbatch: Submitted batch job 20
[user@n8 ~]$ squeue -Si
JOBID PARTITION    NAME    USER ST  TIME NODES NODELIST(REASON)
   17    active sleepme cholmes  S  0:16     1 n1
   18    active sleepme cholmes  S  0:16     1 n2
   19    active sleepme cholmes  S  0:15     1 n3
   20     hipri sleepme cholmes  R  0:03     3 n[1-3]
[user@n8 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     3  alloc n[1-3]
active*      up   infinite     2   idle n[4-5]
hipri        up   infinite     3  alloc n[1-3]
hipri        up   infinite     2   idle n[4-5]
Ideally the "hipri" job would have been placed on nodes n[3-5], which would allow jobs 17 and 18 to continue running. However, a new "intelligent" algorithm would have to include factors such as job size and required nodes in order to support ideal placements such as this, which can quickly complicate the design. Any and all help is welcome here!
Last modified 16 May 2011