Generic Resource (GRES) Scheduling

Beginning in SLURM version 2.2, generic resource (Gres) scheduling is supported through a flexible plugin mechanism. Support is initially provided for Graphics Processing Units (GPUs), although support for any resource is possible.

Configuration

SLURM manages no generic resources in the default configuration. One must explicitly specify which resources are to be managed in the slurm.conf configuration file. The configuration parameters of interest are GresTypes (the types of generic resources to be managed) and Gres (the generic resources available on each node).
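For example, a slurm.conf excerpt enabling four GPUs per node might look like the following sketch (the node name tux0 is illustrative, not part of this document):

# slurm.conf excerpt (illustrative node name)
GresTypes=gpu
NodeName=tux0 Gres=gpu:4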

Note that the Gres specification for each node works in the same fashion as the other resources managed for the node. Depending upon the value of the FastSchedule parameter, nodes which are found to have fewer resources than configured will be placed in a DOWN state.

Note that the Gres specification is not supported on BlueGene systems.

Each compute node with generic resources must also contain a gres.conf file describing which resources are available on the node, their count, the associated device files, and the CPUs which should be used with those resources. The configuration parameters available are Name, Count, CPUs and File.

Sample gres.conf file:

# Configure support for our four GPUs and a bandwidth resource
Name=gpu File=/dev/nvidia0 CPUs=0,1
Name=gpu File=/dev/nvidia1 CPUs=0,1
Name=gpu File=/dev/nvidia2 CPUs=2,3
Name=gpu File=/dev/nvidia3 CPUs=2,3
Name=bandwidth Count=20M

Running Jobs

Jobs will not be allocated any generic resources unless specifically requested at job submit time using the --gres option supported by the salloc, sbatch and srun commands. The option requires an argument specifying which generic resource is required and how many of them. The resource specification is of the form name[:count]. The name is the same name as specified by the GresTypes and Gres configuration parameters. count specifies how many of the resource are required and has a default value of 1. For example:
sbatch --gres=gpu:2 ....
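The same option is accepted by salloc and srun, for example (the executable name my_program is illustrative):

salloc --gres=gpu:1
srun --gres=gpu:2 -n2 my_program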

Jobs will be allocated specific generic resources as needed to satisfy the request. If the job is suspended, those resources do not become available for use by other jobs.

Job steps can be allocated generic resources from those allocated to the job by using the --gres option with the srun command as described above. By default, a job step is allocated none of the generic resources allocated to the job; it must explicitly request the generic resources it needs. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all of the job's generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. A job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.

#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash
#
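# Three job steps run concurrently, each with its own subset of the
# job's four GPUs (--exclusive prevents the steps from sharing CPUs).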
srun --gres=gpu:2 -n2 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait

GPU Management

In the case of SLURM's GRES plugin for GPUs, the environment variable CUDA_VISIBLE_DEVICES is set for each job step to determine which GPUs are available for its use on each node. This environment variable is only set when tasks are launched on a specific compute node: no global environment variable is set for the salloc command, and the environment variable set for the sbatch command only reflects the GPUs allocated to that job on that node (node zero of the allocation). CUDA version 3.1 (or higher) uses this environment variable when multiple jobs or job steps run on a node with GPUs to ensure that the resources assigned to each are unique. In the example above, the allocated node may have four or more graphics devices. In that case, CUDA_VISIBLE_DEVICES will reference unique devices for each job step and the output might resemble this:

JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1
JobStep=1234.1 CUDA_VISIBLE_DEVICES=2
JobStep=1234.2 CUDA_VISIBLE_DEVICES=3
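
The show_device.sh script used in the example is not included in this document; a minimal sketch that would produce output like the lines above might be (assuming the standard SLURM_JOB_ID and SLURM_STEP_ID environment variables are set for the step):

#!/bin/bash
# Hypothetical show_device.sh: report the GPUs visible to this job step
echo "JobStep=${SLURM_JOB_ID}.${SLURM_STEP_ID} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"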

NOTE: Be sure to specify the File parameters in the gres.conf file and ensure they are in increasing numeric order.

Last modified 1 August 2011