SLURM User and Administrator Guide for Cray systems
User Guide
This document describes the unique features of SLURM on Cray computers. You should be familiar with SLURM's mode of operation on Linux clusters before studying the differences in Cray system operation described in this document.
SLURM version 2.3 is designed to operate as a job scheduler over Cray's Application Level Placement Scheduler (ALPS). Use SLURM's sbatch or salloc commands to create a resource allocation in ALPS, then use ALPS' aprun command to launch parallel jobs within that allocation. The resource allocation is terminated once the batch script or the salloc command terminates.
Alternatively, an aprun wrapper distributed with SLURM in contribs/cray/srun translates srun options into the equivalent aprun options. This wrapper also executes salloc as needed to create a job allocation in which to run the aprun command. The srun script adds two options: --man prints a summary of the options, including notes about which srun options are not supported, and --alps= passes aprun options that have no equivalent in srun. For example, srun --alps="-a xt" -n 4 a.out. To build and install the wrapper, run "configure" with the --with-srun2aprun option or add %_with_srun2aprun 1 to your ~/.rpmmacros file.
Since aprun is used to launch tasks (the equivalent of a SLURM job step), the job steps will not be visible using SLURM commands. Other than SLURM's srun command being replaced by aprun and the job steps not being visible, all other SLURM commands operate as expected.
Node naming and node geometry on Cray XT/XE systems
SLURM node names will be of the form "nid#####" where "#####" is a five-digit sequence number. Other information available about the node is its XYZ coordinate in the node's NodeAddr field and its component label in the NodeHostName field. The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order: cabinet, row, cage, blade or slot, and node. For example "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.
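As a minimal sketch of how the component label breaks down, the helper below splits a label into its five fields. The function name parse_component_label is hypothetical and not part of SLURM or ALPS; only the label format comes from the text above.

```shell
# Hypothetical helper: split a Cray component label into its fields.
# Label format as documented: c<cabinet>-<row>c<cage>s<slot>n<node>
parse_component_label() {
    label=$1
    # Reject anything that does not match the documented pattern.
    if ! printf '%s\n' "$label" | grep -Eq '^c[0-9]+-[0-9]+c[0-9]+s[0-9]+n[0-9]+$'; then
        echo "invalid component label: $label" >&2
        return 1
    fi
    # Turn every non-digit into a space, leaving the five numbers in order.
    set -- $(printf '%s\n' "$label" | tr -c '0-9\n' ' ')
    echo "cabinet=$1 row=$2 cage=$3 slot=$4 node=$5"
}

parse_component_label c0-1c2s5n3
```

The label could come from the NodeHostName field reported by 'scontrol show node nid#####'.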
Cray XT/XE systems come with a 3D torus by default. On smaller systems the cabling in X dimension is omitted, resulting in a two-dimensional torus (1 x Y x Z). On Gemini/XE systems, pairs of adjacent nodes (nodes 0/1 and 2/3 on each blade) share one network interface each. This causes the same Y coordinate to be assigned to those nodes, so that the number of distinct torus coordinates is half the number of total nodes.
The SLURM smap and sview tools can visualize node torus positions. Clicking on a particular node shows its NodeAddr field, which is its (X,Y,Z) torus coordinate base-36 encoded as a 3-character string. For example, a NodeAddr of '07A' corresponds to the coordinates X = 0, Y = 7, Z = 10. The NodeAddr of a node can also be shown using 'scontrol show node nid#####'.
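The base-36 decoding described above can be sketched in a few lines of shell. The function name decode_nodeaddr is an illustrative assumption; the encoding (one base-36 digit per coordinate, in X, Y, Z order) is taken from the paragraph above.

```shell
# Hypothetical helper: decode a 3-character NodeAddr into X, Y, Z.
decode_nodeaddr() {
    addr=$(printf '%s' "$1" | tr 'a-z' 'A-Z')
    [ ${#addr} -eq 3 ] || { echo "NodeAddr must have 3 characters: $1" >&2; return 1; }
    digits=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
    coords=
    rest=$addr
    while [ -n "$rest" ]; do
        tail=${rest#?}            # everything after the first character
        c=${rest%"$tail"}         # the first character itself
        prefix=${digits%%"$c"*}   # digits that precede c in the alphabet
        # The length of that prefix is the digit's value; 36 means "not found".
        [ ${#prefix} -lt 36 ] || { echo "invalid base-36 digit: $c" >&2; return 1; }
        coords="$coords ${#prefix}"
        rest=$tail
    done
    set -- $coords
    echo "X=$1 Y=$2 Z=$3"
}

decode_nodeaddr 07A   # prints: X=0 Y=7 Z=10
```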
Please note that the sbatch/salloc options "--geometry" and "--no-rotate" are BlueGene-specific and have no impact on Cray systems. Topological node placement depends on what Cray makes available via the ALPS_NIDORDER configuration option (see below).
Specifying thread depth
For threaded applications, use the --cpus-per-task/-c parameter of sbatch/salloc to set the thread depth per task. This corresponds to mppdepth in PBS and to the aprun -d parameter. Please note that SLURM does not set the OMP_NUM_THREADS environment variable. Hence, if an application spawns 4 threads, an example script would look like
#SBATCH --comment="illustrate the use of thread depth and OMP_NUM_THREADS"
#SBATCH --ntasks=3
#SBATCH -c 4
export OMP_NUM_THREADS=4
aprun -n 3 -d $OMP_NUM_THREADS ./my_exe
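To avoid repeating the thread count by hand, the depth could be read from the job environment. The sketch below assumes that SLURM exports SLURM_CPUS_PER_TASK and SLURM_NTASKS to the batch script when -c and --ntasks are given; verify this on your installation. The function name is hypothetical, and the command is printed rather than executed.

```shell
# Sketch: derive the thread depth from the job environment.
# Assumption: SLURM_CPUS_PER_TASK and SLURM_NTASKS are set by SLURM;
# both fall back to 1 when they are absent.
thread_depth_command() {
    depth=${SLURM_CPUS_PER_TASK:-1}
    ntasks=${SLURM_NTASKS:-1}
    echo "OMP_NUM_THREADS=$depth aprun -n $ntasks -d $depth $*"
}

thread_depth_command ./my_exe
```

With --ntasks=3 and -c 4, and under the assumption stated above, this prints the same aprun invocation as the hand-written script.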
Specifying number of tasks per node
SLURM uses the same default as ALPS, assigning each task to a single core/CPU. To make more resources available per task, reduce the number of processing elements per node (aprun -N parameter, mppnppn in PBS) with the --ntasks-per-node option of sbatch/salloc. This is necessary in particular when tasks require more memory than the per-CPU default.
Specifying per-task memory
In Cray terminology, a task is also called a "processing element" (PE), hence below we refer to the per-task memory and "per-PE" memory interchangeably. The per-PE memory requested through the batch system corresponds to the aprun -m parameter.
Due to the implicit default assumption that 1 task runs per core/CPU, the default memory available per task is the per-CPU share of node_memory / number_of_cores. For example, on an XT5 system with 16000MB per 12-core node, the per-CPU share is 1333MB.
If nothing else is specified, the --mem option to sbatch/salloc can only be used to reduce the per-PE memory below the per-CPU share. This is also the only way that the --mem-per-cpu option can be applied (besides, the --mem-per-cpu option is ignored if the user forgets to set --ntasks/-n). Thus, the preferred way of specifying memory is the more general --mem option.
Increasing the per-PE memory that can be set via the --mem option requires making more per-task resources available using the --ntasks-per-node option to sbatch/salloc. This allows --mem to request up to node_memory / ntasks_per_node megabytes.
When --ntasks-per-node is 1, the entire node memory may be requested by the application. Setting --ntasks-per-node to the number of cores per node yields the minimum value, which is the default per-CPU share.
For all cases in between these extremes, set --mem=per_task_memory and
--ntasks-per-node=floor(node_memory / per_task_memory)
whenever per_task_memory needs to be larger than the per-CPU share.
Example: An application with 64 tasks needs 7500MB per task on a cluster with 32000MB and 24 cores per node. Hence ntasks_per_node = floor(32000/7500) = 4.
#SBATCH --comment="requesting 7500MB per task on 32000MB/24-core nodes"
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --mem=7500
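The arithmetic behind this example can be sketched as a small shell function. The name mem_layout and its argument order are illustrative assumptions; the node count it reports is simply ceil(ntasks / ntasks_per_node) and is not something the options above require you to specify.

```shell
# Sketch of the memory arithmetic; all names are illustrative.
# Usage: mem_layout NTASKS PER_TASK_MB NODE_MB CORES_PER_NODE
mem_layout() {
    ntasks=$1 per_task_mb=$2 node_mb=$3 cores=$4
    if [ "$per_task_mb" -gt "$node_mb" ]; then
        echo "per-task memory exceeds node memory" >&2
        return 1
    fi
    # floor(node_memory / per_task_memory); shell division truncates.
    per_node=$(( node_mb / per_task_mb ))
    # Never place more tasks on a node than it has cores.
    if [ "$per_node" -gt "$cores" ]; then
        per_node=$cores
    fi
    # ceil(ntasks / per_node) nodes are needed to hold all tasks.
    nodes=$(( (ntasks + per_node - 1) / per_node ))
    echo "--ntasks=$ntasks --ntasks-per-node=$per_node --mem=$per_task_mb nodes=$nodes"
}

mem_layout 64 7500 32000 24
```

For the example above this yields --ntasks-per-node=4, so the 64 tasks occupy 16 nodes.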
If you would like to fine-tune the memory limit of your application, you can set the same parameters in a salloc session and then check directly, using
apstat -rvv -R $BASIL_RESERVATION_ID
to see how much memory has been requested.
Using aprun -B
CLE 3.x allows a nice aprun shortcut via the -B option, which reuses all the batch system parameters (--ntasks, --ntasks-per-node, --cpus-per-task, --mem) at application launch, as if the corresponding (-n, -N, -d, -m) parameters had been set; see the aprun(1) manpage on CLE 3.x systems for details.
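For systems without -B, or to make the launch parameters visible in a script, the same mapping can be spelled out explicitly. This is a sketch only: explicit_aprun is a hypothetical helper that prints the command line instead of running it, and the mapping is the one given in the paragraph above.

```shell
# Sketch: spell out the aprun options that -B would take over from the
# batch system. Mapping as described in the text:
#   --ntasks -> -n, --ntasks-per-node -> -N, --cpus-per-task -> -d, --mem -> -m
# Usage: explicit_aprun NTASKS NTASKS_PER_NODE CPUS_PER_TASK MEM_MB COMMAND...
explicit_aprun() {
    n=$1 N=$2 d=$3 m=$4
    shift 4
    echo "aprun -n $n -N $N -d $d -m $m $*"
}

explicit_aprun 64 4 1 7500 ./my_exe
```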
Node ordering options
SLURM honours the node ordering policy set for Cray's Application Level Placement Scheduler (ALPS). Node ordering is a configurable system option (ALPS_NIDORDER in /etc/sysconfig/alps). The current setting is reported by 'apstat -svv' (look for the line starting with "nid ordering option") and cannot be changed at runtime. The resulting effective node ordering is revealed by 'apstat -no' (if no special node ordering has been configured, 'apstat -no' shows the same order as 'apstat -n').
SLURM uses exactly the same order as 'apstat -no' when selecting nodes for a job. With the --contiguous option to sbatch/salloc you can request a contiguous (relative to the current ALPS nid ordering) set of nodes. Note that on a busy system there is typically more fragmentation, hence it may take longer (or even prove impossible) to allocate contiguous sets of a larger size.
Cray/ALPS node ordering is a topic of ongoing work. Some information can be found in the CUG-2010 paper "ALPS, Topology, and Performance" by Carl Albing and Mark Baker.
Administrator Guide
Install supporting RPMs
The build requires a few -devel RPMs listed below. You can obtain these from SUSE/Novell.
- CLE 2.x uses SUSE SLES 10 packages (RPMs may be on the normal ISOs)
- CLE 3.x uses SUSE SLES 11 packages (RPMs are on the SDK ISOs; there are two ISO files for the SDK)
You can check by logging onto the boot node and running
boot: # xtopview
default: # rpm -qa
The list of packages that should be installed is:
- expat-2.0.xxx
- libexpat-devel-2.0.xxx
- cray-MySQL-devel-enterprise-5.0.64 (this should be on the Cray iso)
For example, loading MySQL can be done like this:
smw: # mkdir mnt
smw: # mount -o loop,ro xe-sles11sp1-trunk.201107070231a03.iso mnt
smw: # find mnt -name cray-MySQL-devel-enterprise\*
mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
smw: # scp mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64
Then switch to the boot node and run:
boot: # xtopview
default: # rpm -ivh /software/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
default: # exit
All Cray-specific PrgEnv and compiler modules should be removed and root privileges will be required to install these files.
Create a build root
The build is done on a normal service node, in any directory you like (e.g. /ufs/slurm/build would work). Most scripts check for the environment variable LIBROOT. You can either edit the scripts or export this variable. The easiest way is:
login: # export LIBROOT=/ufs/slurm/build
login: # mkdir -vp $LIBROOT
login: # cd $LIBROOT
Install SLURM modulefile
This file is distributed as part of the SLURM tar-ball in contribs/cray/opt_modulefiles_slurm. Install it as /opt/modulefiles/slurm (or anywhere else in your module path). It means that you can use Munge as soon as it is built.
login: # scp ~/slurm/contribs/cray/opt_modulefiles_slurm root@boot:/rr/current/software/
Build and install Munge
Note the Munge installation process on Cray systems differs somewhat from that described in the MUNGE Installation Guide.
Munge is the authentication daemon required by SLURM. Download munge-0.5.10.tar.bz2 or newer from http://code.google.com/p/munge/downloads/list. The following shows how to build it on a login node and install it.
login: # cd $LIBROOT
login: # cp ~/slurm/contribs/cray/munge_build_script.sh $LIBROOT
login: # mkdir -p ${LIBROOT}/munge/zip
login: # curl -O http://munge.googlecode.com/files/munge-0.5.10.tar.bz2
login: # cp munge-0.5.10.tar.bz2 ${LIBROOT}/munge/zip
login: # chmod u+x ${LIBROOT}/munge_build_script.sh
login: # ${LIBROOT}/munge_build_script.sh
(generates lots of output and generates a tar-ball called $LIBROOT/munge_build-.*YYYY-MM-DD.tar.gz)
login: # scp munge_build-2011-07-12.tar.gz root@boot:/rr/current/software
Install the tar-ball on the boot node and build an encryption key file by executing:
boot: # xtopview
default: # tar -zxvf $LIBROOT/munge_build-*.tar.gz -C /rr/current /
default: # dd if=/dev/urandom bs=1 count=1024 >/opt/slurm/munge/etc/munge.key
default: # chmod go-rxw /opt/slurm/munge/etc/munge.key
default: # exit
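The key-generation step can also be written so that the key file is never readable by group or others, not even briefly. The following is a minimal sketch, not part of the Munge distribution: make_munge_key is a hypothetical helper, and the demonstration writes to a scratch file rather than to /opt/slurm/munge/etc/munge.key.

```shell
# Sketch: create a 1024-byte key with restrictive permissions from the start.
# Setting umask before dd closes the window between file creation and chmod.
make_munge_key() {
    keyfile=$1
    (
        umask 077
        dd if=/dev/urandom of="$keyfile" bs=1 count=1024 2>/dev/null
    ) || return 1
    chmod 600 "$keyfile"
    # Sanity check: the key must be exactly 1024 bytes long.
    [ "$(wc -c < "$keyfile" | tr -d ' ')" -eq 1024 ]
}

# Demonstration on a scratch file.
tmpkey=$(mktemp) && make_munge_key "$tmpkey" && echo "key written: $(wc -c < "$tmpkey" | tr -d ' ') bytes"
rm -f "$tmpkey"
```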
Configure Munge
The following steps apply to each login node and the sdb, where
- The slurmd or slurmctld daemon will run and/or
- Users will be submitting jobs
login: # mkdir --mode=0711 -vp /var/lib/munge
login: # mkdir --mode=0700 -vp /var/log/munge
login: # mkdir --mode=0755 -vp /var/run/munge
login: # module load slurm
sdb: # mkdir --mode=0711 -vp /var/lib/munge
sdb: # mkdir --mode=0700 -vp /var/log/munge
sdb: # mkdir --mode=0755 -vp /var/run/munge
Start the munge daemon and test it.
login: # munged --key-file /opt/slurm/munge/etc/munge.key
login: # munge -n
MUNGE:AwQDAAAEy341MRViY+LacxYlz+mchKk5NUAGrYLqKRUvYkrR+MJzHTgzSm1JALqJcunWGDU6k3vpveoDFLD7fLctee5+OoQ4dCeqyK8slfAFvF9DT5pccPg=:
When done, verify network connectivity by executing:
- munge -n | ssh other-login-host /opt/slurm/munge/bin/unmunge
If you decide to keep the installation, you may be interested in automating the process using an init.d script distributed with Munge. This should be installed on all nodes running munge, e.g., 'xtopview -c login' and 'xtopview -n sdbNodeID'
boot: # xtopview -c login
login: # cp /software/etc_init_d_munge /etc/init.d/munge
login: # chmod u+x /etc/init.d/munge
login: # chkconfig munge on
login: # exit
boot: # xtopview -n 31
node/31: # cp /software/etc_init_d_munge /etc/init.d/munge
node/31: # chmod u+x /etc/init.d/munge
node/31: # chkconfig munge on
node/31: # exit
Enable the Cray job service
This is a common dependency on Cray systems. ALPS relies on the Cray job service to generate cluster-unique job container IDs (PAGG IDs). These identifiers are used by ALPS to track running (aprun) job steps. The default (session IDs) is not unique across multiple login nodes. This standard procedure is described in chapter 9 of S-2393 and takes only two steps, both to be done on all 'login' class nodes (xtopview -c login):
- make sure that the /etc/init.d/job service is enabled (chkconfig) and started
- enable the pam_job.so module from /opt/cray/job/default in /etc/pam.d/common-session
(NB: the default pam_job.so is very verbose; a simpler and quieter variant is provided in contribs/cray.)
The latter step is required only if you would like to run interactive salloc sessions.
boot: # xtopview -c login
login: # chkconfig job on
login: # emacs -nw /etc/pam.d/common-session
(uncomment the pam_job.so line)
session optional /opt/cray/job/default/lib64/security/pam_job.so
login: # exit
boot: # xtopview -n 31
node/31:# chkconfig job on
node/31:# emacs -nw /etc/pam.d/common-session
(uncomment the pam_job.so line as shown above)
Build and Configure SLURM
SLURM can be built and installed as on any other computer, as described in the Quick Start Administrator Guide. An example of building and installing SLURM version 2.3.0 is shown below.
NOTE: By default neither the salloc command nor the srun command wrapper can be executed as a background process. This is done for two reasons:
- Only one ALPS reservation can be created from each session ID. The salloc command cannot change its session ID without disconnecting itself from the terminal and its parent process, meaning the process could not later be put into the foreground or easily identified
- To better identify every process spawned under the salloc process using terminal foreground process group IDs
You can optionally enable salloc and srun to execute as background processes by using the configure option "--enable-salloc-background". However, doing so will result in failed resource allocations (error: Failed to allocate resources: Requested reservation is in use) if the commands are not executed sequentially, and will increase the likelihood of orphaned processes.
login: # mkdir build && cd build
login: # slurm/configure \
  --prefix=/opt/slurm/2.3.0 \
  --with-munge=/opt/slurm/munge/ \
  --with-mysql_config=/opt/cray/MySQL/5.0.64-1.0000.2899.20.2.gem/bin \
  --with-srun2aprun
login: # make -j
login: # mkdir install
login: # make DESTDIR=/tmp/slurm/build/install install
login: # make DESTDIR=/tmp/slurm/build/install install-contrib
login: # cd install
login: # tar czf slurm_opt.tar.gz opt
login: # scp slurm_opt.tar.gz boot:/rr/current/software
boot: # xtopview
default: # tar xzf /software/slurm_opt.tar.gz -C /
default: # cd /opt/slurm/
default: # ln -s 2.3.0 default
When building SLURM's slurm.conf configuration file, use the NodeName parameter to specify all batch nodes to be scheduled. If nodes are defined in ALPS, but not defined in the slurm.conf file, a complete list of all batch nodes configured in ALPS will be logged by the slurmctld daemon when it starts. One would typically use this information to modify the slurm.conf file and restart the slurmctld daemon. Note that the NodeAddr and NodeHostName fields should not be configured, but will be set by SLURM using data from ALPS. NodeAddr will be set to the node's XYZ coordinate and be used by SLURM's smap and sview commands. NodeHostName will be set to the node's component label. The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order: cabinet, row, cage, blade or slot, and node. For example "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.
The slurmd daemons will not execute on the compute nodes, but will execute on one or more front end nodes. It is from here that batch scripts will execute aprun commands to launch tasks. This is specified in the slurm.conf file by using the FrontendName and optionally the FrontendAddr fields as seen in the examples below.
Note that SLURM will by default kill running jobs when a node goes DOWN, while a DOWN node in ALPS only prevents new jobs from being scheduled on the node. To help avoid confusion, we recommend that SlurmdTimeout in the slurm.conf file be set to the same value as the suspectend parameter in ALPS' nodehealth.conf file.
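A consistency check between the two settings could be sketched as follows. This is illustrative only: check_timeouts is a hypothetical helper, the file paths are passed as arguments because their locations differ between installations, and the assumed line formats ("SlurmdTimeout=600" and "suspectend: 600") are taken from the slurm.conf excerpt later in this guide.

```shell
# Sketch: compare SlurmdTimeout (slurm.conf) with suspectend (nodehealth.conf).
# Usage: check_timeouts SLURM_CONF NODEHEALTH_CONF
check_timeouts() {
    slurm_conf=$1 nodehealth_conf=$2
    # Take the last uncommented occurrence of each setting.
    slurm_val=$(sed -n 's/^[[:space:]]*SlurmdTimeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$slurm_conf" | tail -n 1)
    alps_val=$(sed -n 's/^[[:space:]]*suspectend[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$nodehealth_conf" | tail -n 1)
    if [ -z "$slurm_val" ] || [ -z "$alps_val" ]; then
        echo "could not find both settings" >&2
        return 2
    fi
    if [ "$slurm_val" -eq "$alps_val" ]; then
        echo "consistent: $slurm_val"
    else
        echo "mismatch: SlurmdTimeout=$slurm_val suspectend=$alps_val"
        return 1
    fi
}
```

It would be called as check_timeouts followed by the paths to your slurm.conf and nodehealth.conf files.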
You need to specify the appropriate resource selection plugin (the SelectType option in SLURM's slurm.conf configuration file). Configure SelectType to select/cray. The select/cray plugin provides an interface to ALPS and issues calls to the select/linear plugin, which selects resources for jobs using a best-fit algorithm to allocate whole nodes to jobs (rather than individual sockets, cores or threads).
Note that the system topology is based upon information gathered from the ALPS database and is based upon the ALPS_NIDORDER configuration in /etc/sysconfig/alps. Excerpts of a slurm.conf file for use on Cray systems follow:
#---------------------------------------------------------------------
# SLURM USER
#---------------------------------------------------------------------
# SLURM user on cray systems must be root
# This requirement derives from Cray ALPS:
# - ALPS reservations can only be created by the job owner or root
#   (confirmation may be done by other non-privileged users)
# - Freeing a reservation always requires root privileges
SlurmUser=root

#---------------------------------------------------------------------
# PLUGINS
#---------------------------------------------------------------------
# Network topology (handled internally by ALPS)
TopologyPlugin=topology/none

# Scheduling
SchedulerType=sched/backfill

# Node selection: use the special-purpose "select/cray" plugin.
# Internally this uses select/linear, i.e. nodes are always allocated
# in units of nodes (other allocation is currently not possible, since
# ALPS does not yet allow to run more than 1 executable on the same
# node, see aprun(1), section LIMITATIONS).
#
# Add CR_Memory as parameter to support --mem/--mem-per-cpu.
SelectType=select/cray
SelectTypeParameters=CR_Memory

# Proctrack plugin: only/default option is proctrack/sgi_job
# ALPS requires cluster-unique job container IDs and thus the /etc/init.d/job
# service needs to be started on all slurmd and login nodes, as described in
# S-2393, chapter 9. Due to this requirement, ProctrackType=proctrack/sgi_job
# is the default on Cray and need not be specified explicitly.

#---------------------------------------------------------------------
# PATHS
#---------------------------------------------------------------------
SlurmdSpoolDir=/ufs/slurm/spool
StateSaveLocation=/ufs/slurm/spool/state

# main logfile
SlurmctldLogFile=/ufs/slurm/log/slurmctld.log
# slurmd logfiles (using %h for hostname)
SlurmdLogFile=/ufs/slurm/log/%h.log

# PIDs
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

#---------------------------------------------------------------------
# COMPUTE NODES
#---------------------------------------------------------------------
# Return DOWN nodes to service when e.g. slurmd has been unresponsive
ReturnToService=1

# Configure the suspectend parameter in ALPS' nodehealth.conf file to the same
# value as SlurmdTimeout for consistent behavior (e.g. "suspectend: 600")
SlurmdTimeout=600

# Controls how a node's configuration specifications in slurm.conf are
# used.
# 0 - use hardware configuration (must agree with slurm.conf)
# 1 - use slurm.conf, nodes with fewer resources are marked DOWN
# 2 - use slurm.conf, but do not mark nodes down as in (1)
FastSchedule=2

# Per-node configuration for PALU AMD G34 dual-socket "Magny Cours"
# Compute Nodes. We deviate from slurm's idea of a physical socket
# here, since the Magny Cours hosts two NUMA nodes each, which is
# also visible in the ALPS inventory (4 Segments per node, each
# containing 6 'Processors'/Cores).
NodeName=DEFAULT Sockets=4 CoresPerSocket=6 ThreadsPerCore=1
NodeName=DEFAULT RealMemory=32000 State=UNKNOWN

# List the nodes of the compute partition below (service nodes are not
# allowed to appear)
NodeName=nid00[002-013,018-159,162-173,178-189]

# Frontend nodes: these should not be available to user logins, but
# have all filesystems mounted that are also
# available on a login node (/scratch, /home, ...).
FrontendName=palu[7-9]

#---------------------------------------------------------------------
# ENFORCING LIMITS
#---------------------------------------------------------------------
# Enforce the use of associations: {associations, limits, wckeys}
AccountingStorageEnforce=limits

# Do not propagate any resource limits from the user's environment to
# the slurmd
PropagateResourceLimits=NONE

#---------------------------------------------------------------------
# Resource limits for memory allocation:
# * the Def/Max 'PerCPU' and 'PerNode' variants are mutually exclusive;
# * use the 'PerNode' variant for both default and maximum value, since
#   - slurm will automatically adjust this value depending on
#     --ntasks-per-node
#   - if using a higher per-cpu value than possible, salloc will just
#     block.
#---------------------------------------------------------------------
# XXX replace both values below with your values from 'xtprocadmin -A'
DefMemPerNode=32000
MaxMemPerNode=32000

#---------------------------------------------------------------------
# PARTITIONS
#---------------------------------------------------------------------
# defaults common to all partitions
PartitionName=DEFAULT Nodes=nid00[002-013,018-159,162-173,178-189]
PartitionName=DEFAULT MaxNodes=178
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60

# "User Support" partition with a higher priority
PartitionName=usup Hidden=YES Priority=10 MaxTime=720 AllowGroups=staff

# normal partition available to all users
PartitionName=day Default=YES Priority=1 MaxTime=01:00:00
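When adapting the node list, it helps to confirm that MaxNodes matches the number of nodes actually named. The sketch below counts the nodes in a bracketed range expression; count_nodes is a hypothetical helper that handles only the single-bracket form used in the excerpt above, not the full SLURM hostlist syntax.

```shell
# Sketch: count the nodes in an expression such as
# nid00[002-013,018-159,162-173,178-189].
count_nodes() {
    ranges=${1#*\[}       # drop everything up to and including "["
    ranges=${ranges%\]*}  # drop "]" and anything after it
    total=0
    old_ifs=$IFS
    IFS=,
    for r in $ranges; do
        lo=${r%-*}
        hi=${r#*-}
        # Strip leading zeros so the shell does not read the numbers as octal.
        lo=$(echo "$lo" | sed 's/^0*\([0-9]\)/\1/')
        hi=$(echo "$hi" | sed 's/^0*\([0-9]\)/\1/')
        total=$(( total + $hi - $lo + 1 ))
    done
    IFS=$old_ifs
    echo "$total"
}

count_nodes 'nid00[002-013,018-159,162-173,178-189]'   # prints: 178
```

The result agrees with MaxNodes=178 in the partition defaults of the excerpt.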
SLURM supports an optional cray.conf file containing Cray-specific configuration parameters. This file is NOT needed for production systems, but is provided for advanced configurations. If used, cray.conf must be located in the same directory as the slurm.conf file. Configuration parameters supported by cray.conf are listed below.
- apbasil
- Fully qualified pathname to the apbasil command. The default value is /usr/bin/apbasil.
- apkill
- Fully qualified pathname to the apkill command. The default value is /usr/bin/apkill.
- SDBdb
- Name of the ALPS database. The default value is XTAdmin.
- SDBhost
- Hostname of the database server. The default value is based upon the contents of the 'my.cnf' file used to store default database access information and that defaults to 'sdb'.
- SDBpass
- Password used to access the ALPS database. The default value is based upon the contents of the 'my.cnf' file used to store default database access information and that defaults to 'basic'.
- SDBport
- Port used to access the ALPS database. The default value is 0.
- SDBuser
- Name of user used to access the ALPS database. The default value is based upon the contents of the 'my.cnf' file used to store default database access information and that defaults to user 'basic'.
# Example cray.conf file
apbasil=/opt/alps_simulator_40_r6768/apbasil.sh
SDBhost=localhost
SDBuser=alps_user
SDBdb=XT5istanbul
One additional configuration script can be used to ensure that the slurmd daemons execute with the highest resource limits possible, overriding default limits on SUSE systems. Depending upon what resource limits are propagated from the user's environment, lower limits may apply to user jobs, but this script will ensure that higher limits are possible. Copy the file contribs/cray/etc_sysconfig_slurm into /etc/sysconfig/slurm for these limits to take effect. This script is executed from /etc/init.d/slurm, which is typically executed to start the SLURM daemons. An excerpt of contribs/cray/etc_sysconfig_slurm is shown below.
#
# /etc/sysconfig/slurm for Cray XT/XE systems
#
# Cray is SuSe-based, which means that ulimits from
# /etc/security/limits.conf will get picked up any time SLURM is
# restarted e.g. via pdsh/ssh. Since SLURM respects configured limits,
# this can mean that for instance batch jobs get killed as a result
# of configuring CPU time limits. Set sane start limits here.
#
# Values were taken from pam-1.1.2 Debian package
ulimit -t unlimited     # max amount of CPU time in seconds
ulimit -d unlimited     # max size of a process's data segment in KB
SLURM's init.d script should also be installed to automatically start SLURM daemons when nodes boot as shown below. Be sure to edit the script as appropriate to reference the proper file location (modify the variable PREFIX).
login: # scp /home/crayadm/ben/slurm/etc/init.d.slurm boot:/rr/current/software/
SLURM will ignore any interactive jobs or nodes in interactive mode, so set all your nodes to batch mode from any service node. Dropping the -n option will make all nodes batch.
# xtprocadmin -k m batch -n NODEIDS
Now create the needed directories for logs and state files, then start the daemons on the sdb and login nodes as shown below.
sdb: # mkdir -p /ufs/slurm/log
sdb: # mkdir -p /ufs/slurm/spool
sdb: # /etc/init.d/slurm start
login: # /etc/init.d/slurm start
Srun wrapper configuration
The srun wrapper to aprun might require modification to run as desired. Specifically the $aprun variable could be set to the absolute pathname of that executable file. Without that modification, the aprun command executed will depend upon the user's search path.
In order to debug the srun wrapper, uncomment the line
print "comment=$command\n"
If the srun wrapper is executed from within an existing SLURM job allocation (i.e. within salloc or an sbatch script), then it just executes the aprun command with appropriate options. If executed without an allocation, the wrapper executes salloc, which then executes the srun wrapper again. This second execution of the srun wrapper is required in order to process environment variables that are set by the salloc command based upon the resource allocation.
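The two-pass behaviour described above can be pictured with the following sketch. It is an illustration of the control flow only, not the wrapper's actual source. It assumes that a set SLURM_JOB_ID environment variable signals an existing allocation, and it prints the command it would run instead of executing it; srun_flow is a hypothetical name.

```shell
# Illustration only: control flow of the srun wrapper as described in the text.
# Assumption: SLURM_JOB_ID is set inside salloc sessions and sbatch scripts.
srun_flow() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Inside an allocation: translate the options and launch through aprun.
        echo "aprun (options translated from: $*)"
    else
        # No allocation yet: salloc creates one and runs the wrapper again,
        # which then sees the environment variables set by salloc.
        echo "salloc srun $*"
    fi
}
```

Calling srun_flow -n 4 a.out outside an allocation prints the salloc form; inside an allocation it prints the aprun form.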
Last modified 13 March 2012