Consumable Resources in SLURM

SLURM, using the default node allocation plug-in, allocates nodes to jobs in exclusive mode. This means that even when not all the resources within a node are utilized by a given job, another job will not have access to those resources. Nodes possess resources such as processors, memory, swap, local disk, etc., and jobs consume these resources. The exclusive-use default policy in SLURM can result in inefficient utilization of the cluster and of its nodes' resources.

A plug-in supporting CPUs as a consumable resource is available in SLURM 0.5.0 and newer versions. Information on how to use this plug-in is described below.

Using the Consumable Resource Node Allocation Plugin: select/cons_res

  1. SLURM version 1.2 and newer
    • Consumable resources have been enhanced with several new resources, namely CPU (same as in previous versions), Socket, Core, and Memory, as well as any combination of the logical processors with Memory:
      • CPU (CR_CPU): CPU as a consumable resource.
        • No notion of sockets, cores, or threads.
        • On a multi-core system CPUs will be cores.
        • On a multi-core/hyperthread system CPUs will be threads.
        • On a single-core system CPUs are CPUs. ;-)
      • Socket (CR_Socket): Socket as a consumable resource.
      • Core (CR_Core): Core as a consumable resource.
      • Memory (CR_Memory): Memory only as a consumable resource. Note! CR_Memory assumes Shared=Yes.
      • Socket and Memory (CR_Socket_Memory): Socket and Memory as consumable resources.
      • Core and Memory (CR_Core_Memory): Core and Memory as consumable resources.
      • CPU and Memory (CR_CPU_Memory): CPU and Memory as consumable resources.
    • In the cases where Memory is the consumable resource, or one of the two consumable resources, the RealMemory parameter, which defines a node's amount of real memory in slurm.conf, must be set when FastSchedule=1 (see the configuration sketch after this list).
    • srun's -E extension for sockets, cores, and threads is ignored within the node allocation mechanism when CR_CPU or CR_CPU_MEMORY is selected. It is only used to compute the total number of tasks when -n is not specified.
    • A new srun switch, --job-mem=MB, was added to allow users to specify the maximum amount of real memory per node required by their application. This switch is needed in environments where Memory is a consumable resource. It is important to specify enough memory since slurmd will not allow the application to use more than the requested amount of real memory per node. The default value for --job-mem is 1 MB. See the srun man page for more details.
    • All CR_s assume Shared=No or Shared=Force EXCEPT for CR_MEMORY, which assumes Shared=Yes.
    • The consumable resource plugin is enabled via SelectType and SelectTypeParameters in slurm.conf.
    • #
      # "SelectType"         : node selection logic for scheduling.
      #    "select/bluegene" : the default on BlueGene systems, aware of
      #                        system topology, manages bglblocks, etc.
      #    "select/cons_res" : allocate individual consumable resources
      #                        (i.e. processors, memory, etc.)
      #    "select/linear"   : the default on non-BlueGene systems,
      #                        no topology awareness, oriented toward
      #                        allocating nodes to jobs rather than
      #                        resources within a node (e.g. CPUs)
      #
      # SelectType=select/linear
      SelectType=select/cons_res
      
      # o Define parameters to describe the SelectType plugin. For
      #    - select/bluegene - this parameter is currently ignored
      #    - select/linear   - this parameter is currently ignored
      #    - select/cons_res - the parameters available are
      #       - CR_CPU  (1)  - CPUs as consumable resources.
      #                        No notion of sockets, cores, or threads.
      #                        On a multi-core system CPUs will be cores
      #                        On a multi-core/hyperthread system CPUs
      #                                        will be threads
      #                        On a single-core system CPUs are CPUs.
      #      - CR_Socket (2) - Sockets as a consumable resource.
      #      - CR_Core   (3) - Cores as a consumable resource.
      #      - CR_Memory (4) - Memory as a consumable resource.
      #                        Note! CR_Memory assumes Shared=Yes
      #      - CR_Socket_Memory (5) - Socket and Memory as consumable
      #                               resources.
      #      - CR_Core_Memory (6)   - Core and Memory as consumable
      #                               resources. (Not yet implemented)
      #      - CR_CPU_Memory (7)    - CPU and Memory as consumable
      #                               resources.
      #
      # (#) refer to the output of "scontrol show config"
      #
      # NB!:   The -E extension for sockets, cores, and threads
      #        are ignored within the node allocation mechanism
      #        when CR_CPU or CR_CPU_MEMORY is selected.
      #        They are considered to compute the total number of
      #        tasks when -n is not specified
      #
      # NB! All CR_s assume Shared=No or Shared=Force EXCEPT for
      #        CR_MEMORY which assumes Shared=Yes
      #
      #SelectTypeParameters=CR_CPU (default)
      
    • Using --overcommit or -O is allowed in this new version of consumable resources. When process-to-logical-processor pinning is enabled (task/affinity plug-in), the extra processes will not affect co-scheduled jobs other than other jobs started with the -O flag. We are currently investigating alternative approaches for handling the pinning of jobs started with --overcommit.
    • -c or --cpus-per-task works in this version of consumable resources.
  2. General comments
    • SLURM's default select/linear plugin uses a best-fit algorithm based on the number of consecutive nodes. The same node allocation approach is used in select/cons_res for consistency.
    • The select/cons_res plugin is enabled or disabled cluster-wide.
    • In the case where select/cons_res is not enabled, the normal SLURM behaviors are not disrupted. The only change users see when using the select/cons_res plug-in is that jobs can be co-scheduled on nodes when resources permit it. The rest of SLURM, such as srun and its switches (except srun -s ...), is not affected by this plugin. From a user's point of view, SLURM works the same way as when using the default node selection scheme.
    • The --exclusive srun switch allows users to request nodes in exclusive mode even when consumable resources is enabled. See "man srun" for details.
    • srun's -s or --share is incompatible with the consumable resource environment and will therefore not be honored. Since in this environment nodes are shared by default, --exclusive allows users to obtain dedicated nodes.
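
Below is a minimal configuration sketch, for illustration only, of the memory-related settings described above. The node definition mirrors the hydra nodes used in the examples later in this document; the node parameters and values shown here are illustrative and should be adjusted to your own cluster:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# RealMemory (in MB) must be defined for each node when Memory is one
# of the consumable resources and FastSchedule=1
FastSchedule=1
NodeName=hydra[12-16] Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=2007

A job would then request its per-node real memory limit with, for example:

# srun -N 2 -n 4 --job-mem=500 sleep 100 &

The active selection plug-in and its parameters can be verified with "scontrol show config".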

Limitations and future work

We are aware of several limitations with the current consumable resource plug-in and plan to enhance it as time permits and as we receive requests from users, which help us prioritize the features. Please send comments and requests about the consumable resources to slurm-dev@lists.llnl.gov.

  1. Issue with --max_nodes, --max_sockets_per_node, --max_cores_per_socket and --max_threads_per_core
    • Problem: The example below was obtained when using CR_CPU (the default mode). The systems are all dual-socket, dual-core, single-threaded systems (= 4 cpus per system).
    • The first 3 serial jobs are being allocated to node hydra12 which means that one CPU is still available on hydra12.
    • The 4th job "srun -N 2-2 -E 2:2 sleep 100" requires 8 CPUs and, since the algorithm fills up nodes in consecutive order (when not in dedicated mode), it will want to use the remaining CPU on hydra12 first. Because the user has requested a maximum of two nodes, the allocation puts the job on hold until hydra12 becomes available or, if backfill is enabled, until hydra12's remaining CPU gets allocated to another job, which allows the 4th job to get two dedicated nodes.
    • Note! This problem is fixed in SLURM version 1.3.
    • Note! If you want to specify --max_???? this problem can be solved in the current implementation by asking for the nodes in dedicated mode using --exclusive (see the sketch after this list).
    • Example:
      # srun sleep 100 &
      # srun sleep 100 &
      # srun sleep 100 &
      # squeue
      JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
       1132  allNodes  sleep   sballe   R   0:05      1 hydra12
       1133  allNodes  sleep   sballe   R   0:04      1 hydra12
       1134  allNodes  sleep   sballe   R   0:02      1 hydra12
      # srun -N 2-2 -E 2:2 sleep 100 &
      srun: job 1135 queued and waiting for resources
      #squeue
      JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
       1135  allNodes  sleep   sballe  PD   0:00      2 (Resources)
       1132  allNodes  sleep   sballe   R   0:24      1 hydra12
       1133  allNodes  sleep   sballe   R   0:23      1 hydra12
       1134  allNodes  sleep   sballe   R   0:21      1 hydra12
      
    • Proposed solution: Enhance the selection mechanism to go through {node,socket,core,thread}-tuplets to find an available match for a specific request (a bounded knapsack problem).
  2. Binding of processes in the case when --overcommit is specified.
    • In the current implementation (SLURM 1.2) we have chosen not to bind processes that have been started with the --overcommit flag. The reasoning behind this decision is that the Linux scheduler will move non-bound processes to available resources when jobs with process pinning enabled are started. The non-bound jobs do not affect the bound jobs, but co-scheduled non-bound jobs would affect each other's runtime. We have decided that, for now, this is an adequate solution.
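
As a sketch of the --exclusive workaround mentioned above, the 4th job from the example could simply be resubmitted asking for dedicated nodes. Assuming the other hydra nodes are idle (as in the example output), it would then be allocated two whole nodes instead of waiting for hydra12's last CPU:

# srun -N 2-2 -E 2:2 --exclusive sleep 100 &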

Examples of CR_Memory, CR_Socket_Memory, and CR_CPU_Memory type consumable resources

# sinfo -lNe
NODELIST     NODES PARTITION  STATE  CPUS  S:C:T MEMORY
hydra[12-16]     5 allNodes*  ...       4  2:2:1   2007

Using select/cons_res plug-in with CR_Memory

Example:
# srun -N 5 -n 20 --job-mem=1000 sleep 100 &  <-- running (reserves 1000 MB on each node)
# srun -N 5 -n 20 --job-mem=10 sleep 100 &    <-- running (reserves another 10 MB on each node)
# srun -N 5 -n 10 --job-mem=1000 sleep 100 &  <-- queued and waiting for resources (only 997 of 2007 MB per node remain)

# squeue
JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
 1820  allNodes  sleep   sballe  PD   0:00      5 (Resources)
 1818  allNodes  sleep   sballe   R   0:17      5 hydra[12-16]
 1819  allNodes  sleep   sballe   R   0:11      5 hydra[12-16]

Using select/cons_res plug-in with CR_Socket_Memory (2 sockets/node)

Example 1:
# srun -N 5 -n 5 --job-mem=1000 sleep 100 &        <-- running (one socket and 1000 MB per node)
# srun -n 1 -w hydra12 --job-mem=2000 sleep 100 &  <-- queued and waiting for resources (hydra12 has a free socket but only 1007 MB left)

# squeue
JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
 1890  allNodes  sleep   sballe  PD   0:00      1 (Resources)
 1889  allNodes  sleep   sballe   R   0:08      5 hydra[12-16]

Example 2:
# srun -N 5 -n 10 --job-mem=10 sleep 100 & <-- running (both sockets on every node)
# srun -n 1 --job-mem=10 sleep 100 &       <-- queued and waiting for resources (no free socket)

# squeue
JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
 1831  allNodes  sleep   sballe  PD   0:00      1 (Resources)
 1830  allNodes  sleep   sballe   R   0:07      5 hydra[12-16]

Using select/cons_res plug-in with CR_CPU_Memory (4 CPUs/node)

Example 1:
# srun -N 5 -n 5 --job-mem=1000 sleep 100 &  <-- running (one CPU and 1000 MB per node)
# srun -N 5 -n 5 --job-mem=10 sleep 100 &    <-- running (one CPU and 10 MB per node)
# srun -N 5 -n 5 --job-mem=1000 sleep 100 &  <-- queued and waiting for resources (CPUs are free but only 997 MB per node remain)

# squeue
JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
 1835  allNodes  sleep   sballe  PD   0:00      5 (Resources)
 1833  allNodes  sleep   sballe   R   0:10      5 hydra[12-16]
 1834  allNodes  sleep   sballe   R   0:07      5 hydra[12-16]

Example 2:
# srun -N 5 -n 20 --job-mem=10 sleep 100 & <-- running (all 4 CPUs on every node)
# srun -n 1 --job-mem=10 sleep 100 &       <-- queued and waiting for resources (no free CPU)

# squeue
JOBID PARTITION   NAME     USER  ST   TIME  NODES NODELIST(REASON)
 1837  allNodes  sleep   sballe  PD   0:00      1 (Resources)
 1836  allNodes  sleep   sballe   R   0:11      5 hydra[12-16]

Example of Node Allocations Using Consumable Resource Plugin

The following example illustrates the different ways four jobs are allocated across a cluster using (1) SLURM's default allocation (exclusive mode) and (2) a processors-as-consumable-resources approach.

It is important to understand that the example listed below is a contrived example and is only given here to illustrate the use of CPUs as consumable resources. Job 2 and Job 3 call for the node count to equal the processor count. This would typically be done because each task requires all of a node's memory, disk space, etc.; the bottleneck would not be processor count.

Trying to execute more than one job per node will almost certainly severely impact a parallel job's performance. The biggest beneficiaries of CPUs as consumable resources will be serial jobs or jobs with modest parallelism, which can effectively share resources. On many systems with larger processor counts, jobs typically run one fewer task than there are processors to minimize interference from the kernel and daemons.

The example cluster is composed of 4 nodes (10 cpus in total):

  • linux01 (2 cpus)
  • linux02 (2 cpus)
  • linux03 (2 cpus)
  • linux04 (4 cpus)

The four jobs are the following:

  • Job 2: requires one cpu on each of four nodes (4 cpus in total).
  • Job 3: requires one cpu on each of three nodes (3 cpus in total).
  • Job 4: requires one cpu on a single node.
  • Job 5: requires three cpus (on one or more nodes).

The user launches them in the same order as listed above.

Using SLURM's Default Node Allocation (Non-shared Mode)

The four jobs have been launched and 3 of the jobs are now pending, waiting to get resources allocated to them. Only Job 2 is running since it uses one cpu on all 4 nodes. This means that linux01 to linux03 each have one idle cpu and linux04 has 3 idle cpus.

# squeue
JOBID PARTITION   NAME   USER  ST   TIME  NODES NODELIST(REASON)
    3       lsf  sleep   root  PD   0:00      3 (Resources)
    4       lsf  sleep   root  PD   0:00      1 (Resources)
    5       lsf  sleep   root  PD   0:00      1 (Resources)
    2       lsf  sleep   root   R   0:14      4 linux[01-04]

Once Job 2 is finished, Job 3 is scheduled and runs on linux01, linux02, and linux03. Job 3 is only using one cpu on each of the 3 nodes. Job 4 can be allocated onto the remaining idle node (linux04) so Job 3 and Job 4 can run concurrently on the cluster.

Job 5 has to wait for idle nodes to be able to run.

# squeue
JOBID PARTITION   NAME   USER  ST   TIME  NODES NODELIST(REASON)
    5       lsf  sleep   root  PD   0:00      1 (Resources)
    3       lsf  sleep   root   R   0:11      3 linux[01-03]
    4       lsf  sleep   root   R   0:11      1 linux04

Once Job 3 finishes, Job 5 is allocated resources and can run.

The advantage of the exclusive-mode scheduling policy is that a job gets all the resources of the assigned nodes for optimal parallel performance. The drawback is that a job can tie up a large amount of resources that it does not use and which cannot be shared with other jobs.

Using a Processor Consumable Resource Approach

The output of squeue shows that we have 3 out of the 4 jobs allocated and running. This is two more running jobs than with SLURM's default approach.

Job 2 is running on nodes linux01 to linux04. Job 2's allocation is the same as with SLURM's default allocation: it uses one cpu on each of the 4 nodes. Once Job 2 is scheduled and running, nodes linux01, linux02, and linux03 still have one idle cpu each and node linux04 has 3 idle cpus. The main difference between this approach and the exclusive-mode approach described above is that idle cpus within a node are now allowed to be assigned to other jobs.

It is important to note that "assigned" does not mean oversubscription. The consumable resource approach tracks how much of each available resource (in our case cpus) must be dedicated to a given job. This allows us to prevent per-node oversubscription of resources (cpus).

Once Job 2 is running, Job 3 is scheduled onto nodes linux01, linux02, and linux03 (using one cpu on each of the nodes) and Job 4 is scheduled onto one of the remaining idle cpus on linux04.

Job 2, Job 3, and Job 4 are now running concurrently on the cluster.


# squeue
JOBID PARTITION   NAME   USER  ST   TIME  NODES NODELIST(REASON)
    5       lsf  sleep   root  PD   0:00      1 (Resources)
    2       lsf  sleep   root   R   0:13      4 linux[01-04]
    3       lsf  sleep   root   R   0:09      3 linux[01-03]
    4       lsf  sleep   root   R   0:05      1 linux04

# sinfo -lNe
NODELIST     NODES PARTITION       STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03]     3      lsf*   allocated    2   2981        1      1   (null) none
linux04          1      lsf*   allocated    4   3813        1      1   (null) none

Once Job 2 finishes, Job 5, which was pending, is allocated available resources and is then running as illustrated below:

# squeue
JOBID PARTITION   NAME   USER  ST   TIME  NODES NODELIST(REASON)
   3       lsf   sleep   root   R   1:58      3 linux[01-03]
   4       lsf   sleep   root   R   1:54      1 linux04
   5       lsf   sleep   root   R   0:02      3 linux[01-03]
# sinfo -lNe
NODELIST     NODES PARTITION       STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03]     3      lsf*   allocated    2   2981        1      1   (null) none
linux04          1      lsf*        idle    4   3813        1      1   (null) none

Job 3, Job 4, and Job 5 are now running concurrently on the cluster.

# squeue
JOBID PARTITION   NAME   USER  ST   TIME  NODES NODELIST(REASON)
    5       lsf  sleep   root   R   1:52      3 linux[01-03]

Job 3 and Job 4 have finished and Job 5 is still running on nodes linux[01-03].

The advantage of the consumable resource scheduling policy is that job throughput can increase dramatically. The overall throughput/productivity of the cluster increases, reducing the amount of time users have to wait for their jobs to complete and improving the overall efficiency of the cluster. The drawback is that users do not have entire nodes dedicated to their jobs, since they have to share nodes with other jobs if they do not use all of the resources on the nodes.

We have added an "--exclusive" switch to srun which allows users to specify that they would like their allocated nodes in exclusive mode. For more information see "man srun". The reason for this is that users may have MPI/threaded/OpenMP programs that will take advantage of all the cpus within a node but need only one MPI process per node.
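
For illustration, a minimal sketch of such a request on the example cluster above; my_hybrid_app is a placeholder name for an MPI/threaded/OpenMP program that starts one process per node and then uses all of that node's cpus:

# srun -N 4 -n 4 --exclusive my_hybrid_app   <-- one process per node, whole nodes dedicated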

Last modified 8 July 2008