Sharing Consumable Resources
CPU Management
(Disclaimer: In this "CPU Management" section, the term "consumable resource" does not include memory. The management of memory as a consumable resource is discussed in its own section below.)
As of SLURM version 1.3, the select/cons_res plugin supports sharing consumable resources via the per-partition Shared setting. Previously the select/cons_res plugin ignored this setting, since it was technically already "sharing" the nodes when it scheduled the resources of each node to different jobs. Now the per-partition Shared setting applies to the entity being selected for scheduling:
- When the default select/linear plugin is enabled, the per-partition Shared setting controls whether or not the nodes are shared among jobs.
- When the select/cons_res plugin is enabled, the per-partition Shared setting controls whether or not the configured consumable resources are shared among jobs.

When a consumable resource such as a core, socket, or CPU is shared, it means that more than one job can be assigned to it.
The following table describes this new functionality in more detail:
| Selection Setting | Per-partition Shared Setting | Resulting Behavior |
|---|---|---|
| SelectType=select/linear | Shared=NO | Whole nodes are allocated to jobs. No node will run more than one job. |
| | Shared=YES | Same as Shared=FORCE if the job request specifies the --shared option. Otherwise the same as Shared=NO. |
| | Shared=FORCE | Whole nodes are allocated to jobs. A node may run more than one job. |
| SelectType=select/cons_res with SelectTypeParameters=CR_Core or CR_Core_Memory | Shared=NO | Cores are allocated to jobs. No core will run more than one job. |
| | Shared=YES | Whole nodes are allocated if the job request specifies the --exclusive option. Otherwise the same as Shared=FORCE. |
| | Shared=FORCE | Cores are allocated to jobs. A core may run more than one job. |
| SelectType=select/cons_res with SelectTypeParameters=CR_CPU or CR_CPU_Memory | Shared=NO | CPUs are allocated to jobs. No CPU will run more than one job. |
| | Shared=YES | Whole nodes are allocated if the job request specifies the --exclusive option. Otherwise the same as Shared=FORCE. |
| | Shared=FORCE | CPUs are allocated to jobs. A CPU may run more than one job. |
| SelectType=select/cons_res with SelectTypeParameters=CR_Socket or CR_Socket_Memory | Shared=NO | Sockets are allocated to jobs. No socket will run more than one job. |
| | Shared=YES | Whole nodes are allocated if the job request specifies the --exclusive option. Otherwise the same as Shared=FORCE. |
| | Shared=FORCE | Sockets are allocated to jobs. A socket may run more than one job. |
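For concreteness, here is a minimal slurm.conf sketch that enables core-level sharing of consumable resources; the node and partition names are hypothetical:

```
# Treat individual cores as the consumable resource.
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# Hypothetical nodes and a partition that allows core sharing.
NodeName=node[01-04] Sockets=2 CoresPerSocket=4
PartitionName=batch Nodes=node[01-04] Shared=FORCE Default=YES
```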
When Shared=FORCE is configured, the consumable resources are scheduled for jobs using a least-loaded algorithm. Thus, idle CPUs/cores/sockets will be allocated to a job before busy ones, and CPUs/cores/sockets running one job will be allocated to a job before ones running two or more jobs. This is the same approach that the select/linear plugin uses when allocating "shared" nodes.
Note that the granularity of the "least-loaded" algorithm is what distinguishes the two selection plugins (cons_res and linear) when Shared=FORCE is configured. With the select/cons_res plugin enabled, the CPUs of a node are not overcommitted as long as idle CPUs remain available on other nodes. Thus if one job allocates half of the CPUs on a node and a second job is then submitted that requires more than half of the CPUs, the select/cons_res plugin will attempt to place this new job on other busy nodes that have more than half of their CPUs available for use. The select/linear plugin simply counts jobs on nodes, and does not track the CPU usage on each node.
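As a hypothetical illustration, assume a partition of 8-CPU nodes with select/cons_res, CR_CPU, and Shared=FORCE configured (the job commands and sizes are invented):

```
srun -N1 -n4 ./step1 &   # job 1: allocates 4 of the 8 CPUs on one node
srun -N1 -n6 ./step2 &   # job 2: needs 6 CPUs; placed on another node
                         # with 6 or more idle CPUs, if one exists,
                         # before any CPU is overcommitted
```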
This new functionality also supports the new Shared=FORCE:&lt;num&gt; syntax. If Shared=FORCE:3 is configured with select/cons_res and CR_Core or CR_Core_Memory, then the select/cons_res plugin will run up to 3 jobs on each core of each node in the partition. If CR_Socket or CR_Socket_Memory is configured, then the select/cons_res plugin will run up to 3 jobs on each socket of each node in the partition.
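A hypothetical partition definition using this syntax:

```
# Allow up to 3 jobs to run on each core in this partition.
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PartitionName=batch Nodes=node[01-04] Shared=FORCE:3
```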
Nodes in Multiple Partitions
SLURM has supported configuring nodes in more than one partition since version 0.7.0. The Shared=FORCE support in the select/cons_res plugin accounts for this "multiple partition" support. Here are several scenarios with the select/cons_res plugin enabled to help understand how all of this works together:
| SLURM configuration | Resulting Behavior |
|---|---|
| Two Shared=NO partitions assigned the same set of nodes | Jobs from either partition will be assigned to all available consumable resources. No consumable resource will be shared. One node could have 2 jobs running on it, and each job could be from a different partition. |
| Two partitions assigned the same set of nodes: one partition is Shared=FORCE, and the other is Shared=NO | A node will only run jobs from one partition at a time. If a node is running jobs from the Shared=NO partition, then none of its consumable resources will be shared. If a node is running jobs from the Shared=FORCE partition, then its consumable resources can be shared. |
| Two Shared=FORCE partitions assigned the same set of nodes | Jobs from either partition will be assigned consumable resources. All consumable resources can be shared. One node could have 2 jobs running on it, and each job could be from a different partition. |
| Two partitions assigned the same set of nodes: one partition is Shared=FORCE:3, and the other is Shared=FORCE:5 | Generally the same behavior as above. However, no consumable resource will ever run more than 3 jobs from the first partition, and no consumable resource will ever run more than 5 jobs from the second partition. A consumable resource could therefore have up to 8 jobs running on it at one time. |
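The "mixed shared setting" scenario in row #2 above could be configured as follows (the partition and node names are hypothetical):

```
# Two partitions over the same nodes with different Shared settings.
PartitionName=noshare Nodes=node[01-04] Shared=NO
PartitionName=share   Nodes=node[01-04] Shared=FORCE
```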
Note that the "mixed shared setting" configuration (row #2 above) introduces the
possibility of starvation between jobs in each partition. If a set of
nodes are running jobs from the Shared=NO
partition, then these
nodes will continue to only be available to jobs from that partition, even if
jobs submitted to the Shared=FORCE
partition have a higher
priority. This works in reverse also, and in fact it's easier for jobs from the
Shared=FORCE
partition to hold onto the nodes longer because the
consumable resource "sharing" provides more resource availability for new jobs
to begin running "on top of" the existing jobs. This happens with the
select/linear
plugin also, so it's not specific to the
select/cons_res
plugin.
Memory Management
The management of memory as a consumable resource remains unchanged and can be used to prevent oversubscription of memory, which would result in having memory pages swapped out and severely degraded performance.
| Selection Setting | Resulting Behavior |
|---|---|
| SelectType=select/linear | Memory allocation is not tracked. Jobs are allocated to nodes without considering whether there is enough free memory. Swapping could occur! |
| SelectType=select/cons_res with SelectTypeParameters=CR_Core, CR_CPU, or CR_Socket | Memory allocation is not tracked. Jobs are allocated to consumable resources without considering whether there is enough free memory. Swapping could occur! |
| SelectType=select/cons_res with SelectTypeParameters=CR_Memory, CR_Core_Memory, CR_CPU_Memory, or CR_Socket_Memory | Memory allocation for all jobs is tracked. Nodes that do not have enough available memory to meet a job's memory requirement will not be allocated to that job. |
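For example, the following sketch (with a hypothetical node definition) tracks both cores and memory:

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# RealMemory (in megabytes) tells SLURM how much memory
# each node has available for job allocations.
NodeName=node[01-04] Sockets=2 CoresPerSocket=4 RealMemory=16384
```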
Users can specify their job's memory requirements in one of two ways:

- --mem=&lt;num&gt; specifies the job's memory requirement on a per-allocated-node basis. This option is probably best suited for use with the select/linear plugin, which allocates whole nodes to jobs.
- --mem-per-cpu=&lt;num&gt; specifies the job's memory requirement on a per-allocated-CPU basis. This is probably best suited for use with the select/cons_res plugin, which can allocate individual CPUs to jobs.
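Hypothetical job submissions illustrating the two options (memory values are in megabytes):

```
# Request 4096 MB on each allocated node (suits select/linear):
sbatch --mem=4096 job.sh

# Request 8 CPUs with 512 MB per allocated CPU (suits select/cons_res):
sbatch -n8 --mem-per-cpu=512 job.sh
```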
Default and maximum values for memory on a per-node or per-CPU basis can be configured using the following options: DefMemPerCPU, DefMemPerNode, MaxMemPerCPU, and MaxMemPerNode.
When defaults are configured, users can still specify their own memory requirements at job submission time with the --mem or --mem-per-cpu option.
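A sketch of these slurm.conf options with hypothetical values (in megabytes):

```
# Apply 512 MB per CPU to jobs that specify no memory requirement;
# cap any job's request at 2048 MB per CPU.
DefMemPerCPU=512
MaxMemPerCPU=2048
```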
Enforcement of a job's memory allocation is performed by the accounting plugin, which periodically gathers data about running jobs. Set the JobAcctGatherType and JobAcctGatherFrequency parameters to values suitable for your system.
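A minimal sketch, assuming a Linux cluster; the 30-second interval is an arbitrary choice:

```
# Gather data about running jobs via the Linux plugin every 30 seconds.
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
```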
Last modified 8 July 2008