MPI Use Guide

MPI use depends upon the MPI implementation being used. There are three fundamentally different modes of operation used by these various implementations:

  1. SLURM directly launches the tasks and performs initialization of communications (Quadrics MPI, MPICH2, MPICH-GM, MPICH-MX, MVAPICH, MVAPICH2, some MPICH1 modes, and future versions of OpenMPI).
  2. SLURM creates a resource allocation for the job and then mpirun launches tasks using SLURM's infrastructure (OpenMPI, LAM/MPI and HP-MPI).
  3. SLURM creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than SLURM, such as SSH or RSH (BlueGene MPI and some MPICH1 modes). These tasks are initiated outside of SLURM's monitoring or control. SLURM's epilog should be configured to purge these tasks when the job's allocation is relinquished.
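A minimal epilog for that purpose might look like the sketch below. This is only an illustration: it assumes that slurmd exports SLURM_JOB_UID to the epilog and that no other job from the same user shares the node (the slurm.epilog.clean script distributed with SLURM handles such cases more carefully):

#!/bin/bash
# Sketch of an Epilog script (slurm.conf: Epilog=/etc/slurm/epilog)
# purging tasks launched outside of SLURM's control.
# ASSUMPTION: slurmd exports SLURM_JOB_UID when invoking the epilog.
if [ -n "$SLURM_JOB_UID" ] && [ "$SLURM_JOB_UID" -ge 1000 ]; then
    pkill -9 -U "$SLURM_JOB_UID"   # kill processes owned by the job's user
fi
exit 0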

Two SLURM parameters control which MPI implementation will be supported. Proper configuration is essential for SLURM to establish the proper environment for the MPI job, such as setting the appropriate environment variables. The MpiDefault configuration parameter in slurm.conf establishes the system default MPI to be supported. The srun option --mpi= (or the equivalent environment variable SLURM_MPI_TYPE) can be used to specify a different MPI implementation for an individual job.
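For example, one plugin can be made the system default while a different one is selected for a single job (the plugin names shown are illustrative; use those matching your MPI libraries):

MpiDefault=openmpi                 # system-wide default in slurm.conf

$ srun --mpi=mvapich -n16 a.out    # per-job override on the command line

$ export SLURM_MPI_TYPE=mvapich    # equivalent override via the
$ srun -n16 a.out                  # environment variable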

Links to instructions for using several varieties of MPI with SLURM are provided below.


Open MPI

The current versions of SLURM and Open MPI support task launch using the srun command. This relies upon SLURM (version 2.0 or higher) managing reservations of communication ports for use by Open MPI (version 1.5 or higher). The system administrator must specify the range of ports to be reserved in the slurm.conf file using the MpiParams parameter. For example:
MpiParams=ports=12000-12999

Launch tasks using the srun command plus the option --resv-ports. The ports reserved on every allocated node will be identified in an environment variable available to the tasks as shown here:
SLURM_STEP_RESV_PORTS=12000-12015

If the ports reserved for a job step are found by the Open MPI library to be in use, a message of this form will be printed and the job step will be re-launched:
srun: error: sun000: task 0 unable to claim reserved port, retrying
After three failed attempts, the job step will be aborted. Repeated failures should be reported to your system administrator, who can rectify the problem by cancelling the processes holding those ports.

Older releases

Older versions of Open MPI and SLURM rely upon SLURM to allocate resources for the job and then mpirun to initiate the tasks. For example:

$ salloc -n4 sh    # allocates 4 processors
                   # and spawns shell for job
> mpirun a.out
> exit             # exits shell spawned by
                   # initial salloc command


Quadrics MPI

Quadrics MPI relies upon SLURM to allocate resources for the job and srun to initiate the tasks. One would build the MPI program in the normal manner then initiate it using a command line of this sort:

$ srun [options] <program> [program args]

LAM/MPI

LAM/MPI relies upon the SLURM salloc or sbatch command to allocate resources. In either case, specify the maximum number of tasks required for the job. Then execute the lamboot command to start the lamd daemons. lamboot utilizes SLURM's srun command to launch these daemons. Do not directly execute the srun command to launch LAM/MPI tasks. For example:

$ salloc -n16 sh  # allocates 16 processors
                  # and spawns shell for job
> lamboot
> mpirun -np 16 foo args
1234 foo running on adev0 (o)
2345 foo running on adev1
etc.
> lamclean
> lamhalt
> exit            # exits shell spawned by
                  # initial salloc command

Note that any direct use of srun will only launch one task per node when the LAM/MPI plugin is configured as the default plugin. To launch more than one task per node using the srun command, the --mpi=none option would be required to explicitly disable the LAM/MPI plugin if that is the system default.
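For instance, to run a non-MPI command with four tasks on a system where LAM/MPI is the default plugin (hostname is used here purely for illustration):

$ srun -n4 --mpi=none hostname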


HP-MPI

HP-MPI uses the mpirun command with the -srun option to launch jobs. For example:

$MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out


MPICH2

MPICH2 jobs are launched using the srun command. Just link your program with SLURM's implementation of the PMI library so that tasks can communicate host and port information at startup. (The system administrator can add these options to the mpicc and mpif77 commands directly, so users will not need to bother.) For example:

$ mpicc -L<path_to_slurm_lib> -lpmi ...
$ srun -n20 a.out


MPICH-GM

MPICH-GM jobs can be launched directly by the srun command. SLURM's mpichgm MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by setting the configuration parameter MpiDefault=mpichgm in slurm.conf or by using srun's --mpi=mpichgm option.

$ mpicc ...
$ srun -n16 --mpi=mpichgm a.out

MPICH-MX

MPICH-MX jobs can be launched directly by the srun command. SLURM's mpichmx MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by setting the configuration parameter MpiDefault=mpichmx in slurm.conf or by using srun's --mpi=mpichmx option.

$ mpicc ...
$ srun -n16 --mpi=mpichmx a.out

MVAPICH

MVAPICH jobs can be launched directly by the srun command. SLURM's mvapich MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by setting the configuration parameter MpiDefault=mvapich in slurm.conf or by using srun's --mpi=mvapich option.

$ mpicc ...
$ srun -n16 --mpi=mvapich a.out
NOTE: If MVAPICH is used in the shared memory model, with all tasks running on a single node, then use the mpich1_shmem MPI plugin instead.
NOTE (for system administrators): Configure PropagateResourceLimitsExcept=MEMLOCK in slurm.conf and start the slurmd daemons with an unlimited locked memory limit. For more details, see MVAPICH documentation for "CQ or QP Creation failure".


MVAPICH2

MVAPICH2 jobs can be launched directly by the srun command. SLURM's none MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by setting the configuration parameter MpiDefault=none in slurm.conf or by using srun's --mpi=none option. The program must also be linked with SLURM's implementation of the PMI library so that tasks can communicate host and port information at startup. (The system administrator can add these options to the mpicc and mpif77 commands directly, so users will not need to bother.) Do not use SLURM's mvapich plugin for MVAPICH2.

$ mpicc -L<path_to_slurm_lib> -lpmi ...
$ srun -n16 --mpi=none a.out

BlueGene MPI

BlueGene MPI relies upon SLURM to create the resource allocation and then uses the native mpirun command to launch tasks. Build a job script containing one or more invocations of the mpirun command. Then submit the script to SLURM using sbatch. For example:

$ sbatch -N512 my.script

Note that the node count specified with the -N option indicates the base partition count. See BlueGene User and Administrator Guide for more information.


MPICH1

MPICH1 development ceased in 2005. It is recommended that you convert to MPICH2 or some other MPI implementation. If you still want to use MPICH1, note that it has several different programming models. If you are using the shared memory model (DEFAULT_DEVICE=ch_shmem in the mpirun script), then initiate the tasks using the srun command with the --mpi=mpich1_shmem option.

$ srun -n16 --mpi=mpich1_shmem a.out

NOTE: Using a configuration of MpiDefault=mpich1_shmem will result in one task being launched per node with the expectation that the MPI library will launch the remaining tasks based upon environment variables set by SLURM. Non-MPI jobs started in this configuration will lack the mechanism to launch more than one task per node unless srun's --mpi=none option is used.

If you are using MPICH P4 (DEFAULT_DEVICE=ch_p4 in the mpirun script) and SLURM version 1.2.11 or newer, then it is recommended that you apply the patch in the SLURM distribution's file contribs/mpich1.slurm.patch. Follow directions within the file to rebuild MPICH. Applications must be relinked with the new library. Initiate tasks using the srun command with the --mpi=mpich1_p4 option.

$ srun -n16 --mpi=mpich1_p4 a.out

Note that SLURM launches one task per node, and the MPICH library linked into your application launches the other tasks, with shared memory used for communications between them. The only real anomaly is that all output from all spawned tasks on a node appears to SLURM as coming from the one task that it launched. If the srun --label option is used, the task ID labels will be misleading.

Other MPICH1 programming models currently rely upon the SLURM salloc or sbatch command to allocate resources. In either case, specify the maximum number of tasks required for the job. You may then need to build a list of hosts to be used and use that as an argument to the mpirun command. For example:

$ cat mpich.sh
#!/bin/bash
srun hostname -s | sort -u >slurm.hosts
mpirun [options] -machinefile slurm.hosts a.out
rm -f slurm.hosts
$ sbatch -n16 mpich.sh
sbatch: Submitted batch job 1234

Note that in this example, mpirun uses the rsh command to launch tasks. These tasks are not managed by SLURM since they are launched outside of its control.

Last modified 15 October 2010