Quick Start Administrator Guide
Overview
Please see the Quick Start User Guide for a general overview.
Super Quick Start
- Make sure that you have synchronized clocks plus consistent users and groups (UIDs and GIDs) across the cluster.
- Install MUNGE for authentication. Make sure that all nodes in your cluster have the same munge.key. Make sure the MUNGE daemon, munged, is started before you start the SLURM daemons.
- bunzip2 the distributed tar-ball and untar the files:
tar --bzip -x -f slurm*tar.bz2
- cd to the directory containing the SLURM source and type ./configure with appropriate options, typically --prefix= and --sysconfdir=
- Type make to compile SLURM.
- Type make install to install the programs, documentation, libraries, header files, etc.
- Build a configuration file using your favorite web browser and
doc/html/configurator.html.
NOTE: The SlurmUser must be created as needed prior to starting SLURM and must exist on all nodes of the cluster.
NOTE: The parent directories for SLURM's log files, process ID files, state save directories, etc. are not created by SLURM. They must be created and made writable by SlurmUser as needed prior to starting SLURM daemons.
- Install the configuration file in <sysconfdir>/slurm.conf.
NOTE: You will need to install this configuration file on all nodes of the cluster.
- Start the slurmctld and slurmd daemons.
NOTE: Items 3 through 6 can be replaced with
- rpmbuild -ta slurm*.tar.bz2
- rpm --install <the rpm files>
Building and Installing SLURM
Instructions to build and install SLURM manually are shown below. See the README and INSTALL files in the source distribution for more details.
- bunzip2 the distributed tar-ball and untar the files: tar --bzip -x -f slurm*tar.bz2
- cd to the directory containing the SLURM source and type ./configure with appropriate options (see below).
- Type make to compile SLURM.
- Type make install to install the programs, documentation, libraries, header files, etc.
A full list of configure options will be returned by the command configure --help. The most commonly used arguments to the configure command include:
--enable-debug
Enable additional debugging logic within SLURM.
--prefix=PREFIX
Install architecture-independent files in PREFIX; default value is /usr/local.
--sysconfdir=DIR
Specify the location of the SLURM configuration file. The default value is PREFIX/etc.
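As a minimal sketch of how the options above combine, the following shell function assembles a configure command line and prints it for review instead of running it. The function name configure_command and the two directory values are hypothetical site choices, not SLURM defaults.

```shell
# Sketch only: PREFIX and SYSCONFDIR are hypothetical site choices.
PREFIX=/opt/slurm
SYSCONFDIR=/etc/slurm

# Print the configure command line rather than executing it.
# $1 = installation prefix, $2 = configuration directory,
# $3 = optional literal "debug" to add --enable-debug.
configure_command() {
    cmd="./configure --prefix=$1 --sysconfdir=$2"
    if [ "${3:-}" = "debug" ]; then
        cmd="$cmd --enable-debug"
    fi
    echo "$cmd"
}

configure_command "$PREFIX" "$SYSCONFDIR" debug
# prints: ./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm --enable-debug
```

Run the printed command from the top of the SLURM source directory once the options look right.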
If required libraries or header files are in non-standard locations, set CFLAGS and LDFLAGS environment variables accordingly. Optional SLURM plugins will be built automatically when the configure script detects that the required build requirements are present. Build dependencies for various plugins and commands are denoted below.
- MUNGE The auth/munge plugin will be built if the MUNGE authentication library is installed. MUNGE is used as the default authentication mechanism.
- Authd The auth/authd plugin will be built and installed if the libauth library and its dependency libe are installed.
- Federation The switch/federation plugin will be built and installed if the IBM Federation switch library is installed.
- QsNet support in the form of the switch/elan plugin requires that the qsnetlibs package (from Quadrics) be installed along with its development counterpart (i.e. the qsnetheaders package). The switch/elan plugin also requires the presence of the libelanhosts library and the /etc/elanhosts configuration file (see the elanhosts(5) man page in that package for more details). Define the nodes in the SLURM configuration file slurm.conf in the same order as they are defined in the elanhosts configuration file so that node allocation for jobs can be performed so as to optimize their performance. We highly recommend assigning each node a numeric suffix equal to its Elan address for ease of administration and because the Elan driver does not seem to function otherwise. For example, /etc/elanhosts would contain two lines of this sort:
eip [0-15] linux[0-15]
eth [0-15] linux[0-15]
for sixteen nodes with a prefix of "linux" and a numeric suffix between zero and 15. Finally, the "ptrack" kernel patch is required for process tracking.
- sview The sview command will be built only if gtk+-2.0 is installed.
To build RPMs directly, copy the distributed tar-ball into the directory
/usr/src/redhat/SOURCES and execute a command of this sort (substitute
the appropriate SLURM version number):
rpmbuild -ta slurm-0.6.0-1.tar.bz2
You can control some aspects of the RPM built with a .rpmmacros file in your home directory. Special macro definitions will likely only be required if files are installed in unconventional locations. Some macro definitions that may be used in building SLURM include:
- _enable_debug: Specify if debugging logic within SLURM is to be enabled
- _prefix: Pathname of directory to contain the SLURM files
- slurm_sysconfdir: Pathname of directory containing the slurm.conf configuration file
- with_munge: Specifies the MUNGE (authentication library) installation location
- with_proctrack: Specifies AIX process tracking kernel extension header file location
- with_ssl: Specifies SSL library installation location
To build SLURM on our AIX system, the following .rpmmacros file is used:
#
# .rpmmacros
# For AIX at LLNL
# Override some RPM macros from /usr/lib/rpm/macros
# Set SLURM-specific macros for unconventional file locations
#
%_enable_debug     "--with-debug"
%_prefix           /admin/llnl
%slurm_sysconfdir  %{_prefix}/etc/slurm
%_defaultdocdir    %{_prefix}/doc
%with_munge        "--with-munge=/opt/freeware"
%with_proctrack    "--with-proctrack=/admin/llnl/include"
%with_ssl          "--with-ssl=/opt/freeware"
Daemons
slurmctld is sometimes called the "controller" daemon. It orchestrates SLURM activities, including queuing of jobs, monitoring node states, and allocating resources to jobs. There is an optional backup controller that automatically assumes control in the event the primary controller fails (see the High Availability section below). The primary controller resumes control whenever it is restored to service. The controller saves its state to disk whenever there is a change in state (see "StateSaveLocation" in Configuration section below). This state can be recovered by the controller at startup time. State changes are saved so that jobs and other state information can be preserved when the controller moves (to or from a backup controller) or is restarted.
We recommend that you create a Unix user slurm for use by slurmctld. This user name will also be specified using the SlurmUser in the slurm.conf configuration file. This user must exist on all nodes of the cluster for authentication of communications. Note that files and directories used by slurmctld will need to be readable or writable by the user SlurmUser (the slurm configuration files must be readable; the log file directory and state save directory must be writable).
The slurmd daemon executes on every compute node. It resembles a remote shell daemon to export control to SLURM. Because slurmd initiates and manages user jobs, it must execute as the user root.
If you want to archive job accounting records to a database, the slurmdbd (SLURM DataBase Daemon) should be used. We recommend that you defer adding accounting support until after basic SLURM functionality is established on your system. An Accounting web page contains more information.
slurmctld and/or slurmd should be initiated at node startup time per the SLURM configuration. A file etc/init.d/slurm is provided for this purpose. This script accepts commands start, startclean (ignores all saved state), restart, and stop.
High Availability
A backup controller can be configured (see "BackupController" in the Configuration section below) to take over for the primary slurmctld if it ever fails. The backup controller should be hosted on a node different from the node hosting the primary slurmctld. However, both hosts should mount a common file system containing the state information (see "StateSaveLocation" in the Configuration section below).
The backup controller detects when the primary fails and takes over for it. When the primary returns to service, it notifies the backup. The backup then saves state and returns to backup mode. The primary reads the saved state and resumes normal operation. Other than a brief period of non-responsiveness, the transition back and forth should go undetected.
Infrastructure
User and Group Identification
There must be a uniform user and group name space (including UIDs and GIDs) across the cluster. It is not necessary to permit user logins to the control hosts (ControlMachine or BackupController), but the users and groups must be configured on those hosts.
Authentication of SLURM communications
All communications between SLURM components are authenticated. The authentication infrastructure is provided by a dynamically loaded plugin chosen at runtime via the AuthType keyword in the SLURM configuration file. Currently available authentication types include authd, munge, and none. The default authentication infrastructure is "munge", but this does require the installation of the MUNGE package. An authentication type of "none" requires no infrastructure, but permits any user to execute any job as another user with limited programming effort. This may be fine for testing purposes, but certainly not for production use. Configure some AuthType value other than "none" if you want any security. We recommend the use of MUNGE unless you are experienced with authd. If using MUNGE, all nodes in the cluster must be configured with the same munge.key file. The MUNGE daemon, munged, must also be started before SLURM daemons.
While SLURM itself does not rely upon synchronized clocks on all nodes of a cluster for proper operation, its underlying authentication mechanisms do have this requirement.
MPI support
SLURM supports many different MPI implementations. For more information, see MPI.
Scheduler support
SLURM can be configured with rather simple or quite sophisticated scheduling algorithms depending upon your needs and willingness to manage the configuration (much of which requires a database). The first configuration parameter of interest is PriorityType with two options available: basic (first-in-first-out) and multifactor. The multifactor plugin will assign a priority to jobs based upon a multitude of configuration parameters (age, size, fair-share allocation, etc.) and its details are beyond the scope of this document. See the Multifactor Job Priority Plugin document for details.
The SchedulerType configuration parameter controls how queued jobs are scheduled and several options are available.
- builtin will initiate jobs strictly in their priority order, typically first-in-first-out.
- backfill will initiate a lower-priority job if doing so does not delay the expected initiation time of higher priority jobs; essentially using smaller jobs to fill holes in the resource allocation plan. Effective backfill scheduling does require users to specify job time limits.
- gang time-slices jobs in the same partition/queue and can be used to preempt jobs from lower-priority queues in order to execute jobs in higher priority queues.
- wiki is an interface for use with The Maui Scheduler
- wiki2 is an interface for use with the Moab Cluster Suite
For more information about scheduling options see Gang Scheduling, Preemption, Resource Reservation Guide, Resource Limits and Sharing Consumable Resources.
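The backfill rule described above can be sketched as a single comparison. This is a simplified illustration, not SLURM's scheduler logic: it assumes that the two jobs compete for the same nodes, that all times are in minutes from a common origin, and that the function name can_backfill is hypothetical. It also shows why time limits matter: without one, the end time of the lower-priority job cannot be bounded.

```shell
# Sketch only, under the assumptions stated above.
# $1 = current time, $2 = time limit of the lower-priority job (may be empty),
# $3 = expected start time of the higher-priority job.
can_backfill() {
    if [ -z "$2" ]; then
        # No time limit: the job might still be running at time $3.
        echo "no"
        return 0
    fi
    # The job must finish no later than the higher-priority job's start.
    if [ $(( $1 + $2 )) -le "$3" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

can_backfill 0 30 45    # prints: yes
can_backfill 0 60 45    # prints: no
```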
Resource selection
The resource selection mechanism used by SLURM is controlled by the SelectType configuration parameter. If you want to execute multiple jobs per node, but apportion the processors, memory and other resources, the cons_res (consumable resources) plugin is recommended. If you tend to dedicate entire nodes to jobs, the linear plugin is recommended. For more information, please see Consumable Resources in SLURM. For BlueGene systems, the bluegene plugin is required (it is topology aware and interacts with the BlueGene bridge API).
Logging
SLURM uses the syslog function to record events. It uses a range of importance levels for these messages. Be certain that your system's syslog functionality is operational.
Accounting
SLURM supports accounting records being written to a simple text file, directly to a database (MySQL or PostgreSQL), or to a daemon securely managing accounting data for multiple clusters. For more information see Accounting.
Corefile format
SLURM is designed to support generating a variety of core file formats for application codes that fail (see the --core option of the srun command). As of now, SLURM only supports a locally developed lightweight corefile library which has not yet been released to the public. It is expected that this library will be available in the near future.
Parallel debugger support
SLURM exports information for parallel debuggers using the specification detailed here. This is meant to be exploited by any parallel debugger (notably, TotalView), and support is unconditionally compiled into SLURM code.
The following lines should also be added to the global .tvdrc file for TotalView to operate with SLURM:
dset TV::parallel_configs { name: SLURM; description: SLURM; starter: srun %s %p %a; style: manager_process; tasks_option: -n; nodes_option: -N; env: ; force_env: false; }
Compute node access
SLURM does not by itself limit access to allocated compute nodes, but it does provide mechanisms to accomplish this. There is a Pluggable Authentication Module (PAM) for restricting access to compute nodes available for download. When installed, the SLURM PAM module will prevent users from logging into any node that has not been assigned to that user. On job termination, any processes initiated by the user outside of SLURM's control may be killed using an Epilog script configured in slurm.conf. An example of such a script is included as etc/slurm.epilog.clean. Without these mechanisms any user can log in to any compute node, even those allocated to other users.
Configuration
The SLURM configuration file includes a wide variety of parameters. This configuration file must be available on each node of the cluster and must have consistent contents. A full description of the parameters is included in the slurm.conf man page. Rather than duplicate that information, a minimal sample configuration file is shown below. Your slurm.conf file should define at least the configuration parameters defined in this sample and likely additional ones. Any text following a "#" is considered a comment. The keywords in the file are not case sensitive, although the argument typically is (e.g., "SlurmUser=slurm" might be specified as "slurmuser=slurm"). The control machine, like all other machine specifications, can include both the host name and the name used for communications. In this case, the host's name is "mcri" and the name "emcri" is used for communications. In this case "emcri" is the private management network interface for the host "mcri". Port numbers to be used for communications are specified as well as various timer values.
The SlurmUser must be created as needed prior to starting SLURM. The parent directories for SLURM's log files, process ID files, state save directories, etc. are not created by SLURM. They must be created and made writable by SlurmUser as needed prior to starting SLURM daemons.
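Because SLURM does not create these parent directories, it can help to check them before the first start. The following is a minimal sketch, with a hypothetical function name check_parent_dirs, that reports any parent directory that is missing or not writable by the user running it. It is meant to be run as SlurmUser; note that root passes the writability test regardless of permissions.

```shell
# Sketch only: pass the log, PID and state file paths from your slurm.conf.
# Prints one line per problem and returns non-zero if any problem was found.
check_parent_dirs() {
    status=0
    for path in "$@"; do
        parent=$(dirname "$path")
        if [ ! -d "$parent" ]; then
            echo "missing: $parent"
            status=1
        elif [ ! -w "$parent" ]; then
            echo "not writable: $parent"
            status=1
        fi
    done
    return "$status"
}

# Example call (hypothetical paths):
#   check_parent_dirs /var/log/slurm/slurmctld.log /var/run/slurm/slurmctld.pid
```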
A description of the nodes and their grouping into partitions is required. A simple node range expression may optionally be used to specify ranges of nodes to avoid building a configuration file with large numbers of entries. The node range expression can contain one pair of square brackets with a sequence of comma separated numbers and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or "lx[15,18,32-33]"). On BlueGene systems only, the square brackets should contain pairs of three digit numbers separated by a "x". These numbers indicate the boundaries of a rectangular prism (e.g. "bgl[000x144,400x544]"). See our Blue Gene User and Administrator Guide for more details. Up to two numeric ranges can be included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or more numeric expressions are included, one of them must be at the end of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can always be used in a comma separated list.
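To make the simple form of a node range expression concrete, here is a minimal shell sketch that expands one pair of square brackets into individual node names. It is an illustration only, not SLURM's parser: the function name expand_node_range is hypothetical, and it does not handle zero-padded numbers, two bracket pairs, or the BlueGene "x" notation.

```shell
# Sketch only: expands e.g. "lx[15,18,32-33]" into one node name per line.
expand_node_range() {
    expr_in=$1
    case $expr_in in
        *\[*\]*) ;;                          # contains a bracket pair
        *) echo "$expr_in"; return 0 ;;      # plain name, print unchanged
    esac
    prefix=${expr_in%%\[*}                   # text before the first "["
    rest=${expr_in#*\[}
    body=${rest%%\]*}                        # text between "[" and "]"
    suffix=${rest#*\]}                       # text after "]" (normally empty)
    old_ifs=$IFS
    IFS=,
    set -- $body                             # split the body on commas
    IFS=$old_ifs
    for item in "$@"; do
        case $item in
            *-*) lo=${item%-*}; hi=${item#*-} ;;
            *)   lo=$item; hi=$item ;;
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            echo "$prefix$i$suffix"
            i=$((i + 1))
        done
    done
}

expand_node_range 'lx[15,18,32-33]'
# prints lx15, lx18, lx32 and lx33, one per line
```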
Node names can have up to three name specifications: NodeName is the name used by all SLURM tools when referring to the node, NodeAddr is the name or IP address SLURM uses to communicate with the node, and NodeHostname is the name returned by the command /bin/hostname -s. Only NodeName is required (the others default to the same name), although supporting all three parameters provides complete control over naming and addressing the nodes. See the slurm.conf man page for details on all configuration parameters.
Nodes can be in more than one partition and each partition can have different constraints (permitted users, time limits, job size limits, etc.). Each partition can thus be considered a separate queue. Partition and node specifications use node range expressions to identify nodes in a concise fashion. This configuration file defines a 1154-node cluster for SLURM, but it might be used for a much larger cluster by just changing a few node range expressions. Specify the minimum processor count (CPUs), real memory space (RealMemory, megabytes), and temporary disk space (TmpDisk, megabytes) that a node should have to be considered available for use. Any node lacking these minimum configuration values will be considered DOWN and not scheduled. Note that a more extensive sample configuration file is provided in etc/slurm.conf.example. We also have a web-based configuration tool which can be used to build a simple configuration file, which can then be manually edited for more complex configurations.
#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri
ControlAddr=emcri
BackupController=mcrj
BackupAddr=emcrj
#
AuthType=auth/munge
Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/tmp/jette/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
SwitchType=switch/elan
TreeWidth=50
#
# Node Configurations
#
NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151] NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151]
Security
Besides authentication of SLURM communications based upon the value of the AuthType, digital signatures are used in job step credentials. This signature is used by slurmctld to construct a job step credential, which is sent to srun and then forwarded to slurmd to initiate job steps. This design offers improved performance by removing much of the job step initiation overhead from the slurmctld daemon. The digital signature mechanism is specified by the CryptoType configuration parameter and the default mechanism is MUNGE.
OpenSSL
If using OpenSSL digital signatures, unique job credential keys must be created for your site using the program openssl. You must use openssl and not ssh-keygen to construct these keys. An example of how to do this is shown below. Specify file names that match the values of JobCredentialPrivateKey and JobCredentialPublicCertificate in your configuration file. The JobCredentialPrivateKey file must be readable only by SlurmUser. The JobCredentialPublicCertificate file must be readable by all users. Note that you should build the key files on one node and then distribute them to all nodes in the cluster. This ensures that all nodes have a consistent set of digital signature keys. These keys are used by slurmctld to construct a job step credential, which is sent to srun and then forwarded to slurmd to initiate job steps.
openssl genrsa -out <sysconfdir>/slurm.key 1024
openssl rsa -in <sysconfdir>/slurm.key -pubout -out <sysconfdir>/slurm.cert
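The file permission requirements stated above can be applied with a short helper. This is a sketch with a hypothetical function name, set_key_permissions. It only sets the mode bits; making SlurmUser the owner of the private key needs root and is left as a comment, with "slurm" assumed as the account name.

```shell
# Sketch only: $1 is the JobCredentialPrivateKey file,
# $2 is the JobCredentialPublicCertificate file.
set_key_permissions() {
    key=$1
    cert=$2
    chmod 600 "$key" || return 1     # private key: owner read/write only
    chmod 644 "$cert" || return 1    # public certificate: readable by all
    # Ownership change requires root and an existing SlurmUser account
    # (assumed here to be named "slurm"):
    #   chown slurm "$key"
}
```

Call it with the same two paths given to the openssl commands, then distribute both files to all nodes.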
MUNGE
If using MUNGE digital signatures, no SLURM keys are required. This will be addressed in the installation and configuration of MUNGE.
Authentication
Authentication of communications (identifying who generated a particular message) between SLURM components can use a different security mechanism that is configurable. You must specify one "auth" plugin for this purpose using the AuthType configuration parameter. Currently, only three authentication plugins are supported: auth/none, auth/authd, and auth/munge. The auth/none plugin is built by default, but either Brent Chun's authd or LLNL's MUNGE should be installed in order to get properly authenticated communications. Unless you are experienced with authd, we recommend the use of MUNGE. The configure script in the top-level directory of this distribution will determine which authentication plugins may be built. The configuration file specifies which of the available plugins will be utilized.
Pluggable Authentication Module (PAM) support
A PAM module (Pluggable Authentication Module) is available for SLURM that can prevent a user from accessing a node that has not been allocated to that user, if that mode of operation is desired.
Starting the Daemons
For testing purposes you may want to start by just running slurmctld and slurmd on one node. By default, they execute in the background. Use the -D option for each daemon to execute them in the foreground and logging will be done to your terminal. The -v option will log events in more detail with more v's increasing the level of detail (e.g. -vvvvvv). You can use one window to execute "slurmctld -D -vvvvvv", a second window to execute "slurmd -D -vvvvv". You may see errors such as "Connection refused" or "Node X not responding" while one daemon is operative and the other is being started, but the daemons can be started in any order and proper communications will be established once both daemons complete initialization. You can use a third window to execute commands such as "srun -N1 /bin/hostname" to confirm functionality.
Another important option for the daemons is "-c" to clear previous state information. Without the "-c" option, the daemons will restore any previously saved state information: node state, job state, etc. With the "-c" option all previously running jobs will be purged and node state will be restored to the values specified in the configuration file. This means that a node configured down manually using the scontrol command will be returned to service unless also noted as being down in the configuration file. In practice, SLURM is consistently restarted with state preservation, that is, without the "-c" option.
A thorough battery of tests written in the "expect" language is also available.
Administration Examples
scontrol can be used to print all system information and modify most of it. Only a few examples are shown below. Please see the scontrol man page for full details. The commands and options are all case insensitive.
Print detailed state of all jobs in the system.
adev0: scontrol
scontrol: show job
JobId=475 UserId=bob(6885) Name=sleep JobState=COMPLETED
   Priority=4294901286 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:53:41 EndTime=03/19-12:53:59
   NodeList=adev8 NodeListIndecies=-1
   NumCPUs=0 MinNodes=0 Shared=0 Contiguous=0
   MinCPUs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1
JobId=476 UserId=bob(6885) Name=sleep JobState=RUNNING
   Priority=4294901285 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:54:01 EndTime=NONE
   NodeList=adev8 NodeListIndecies=8,8,-1
   NumCPUs=0 MinNodes=0 Shared=0 Contiguous=0
   MinCPUs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1
Print the detailed state of job 477 and change its priority to zero. A priority of zero prevents a job from being initiated (it is held in "pending" state).
adev0: scontrol
scontrol: show job 477
JobId=477 UserId=bob(6885) Name=sleep JobState=PENDING
   Priority=4294901286 Partition=batch BatchFlag=0
   more data removed....
scontrol: update JobId=477 Priority=0
Print the state of node adev13 and drain it. To drain a node specify a new state of DRAIN, DRAINED, or DRAINING. SLURM will automatically set it to the appropriate value of either DRAINING or DRAINED depending on whether the node is allocated or not. Return it to service later.
adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=ALLOCATED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=DRAIN
scontrol: show node adev13
NodeName=adev13 State=DRAINING CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null)
scontrol: quit
Later
adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=DRAINED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=IDLE
Reconfigure all SLURM daemons on all nodes. This should be done after changing the SLURM configuration file.
adev0: scontrol reconfig
Print the current SLURM configuration. This also reports if the primary and secondary controllers (slurmctld daemons) are responding. To just see the state of the controllers, use the command ping.
adev0: scontrol show config
Configuration data as of 03/19-13:04:12
AuthType = auth/munge
BackupAddr = eadevj
BackupController = adevj
BOOT_TIME = 01/10-09:19:21
CacheGroups = 0
CheckpointType = checkpoint/none
ControlAddr = eadevi
ControlMachine = adevi
...
WaitTime = 0
Slurmctld(primary/backup) at adevi/adevj are UP/UP
Shutdown all SLURM daemons on all nodes.
adev0: scontrol shutdown
OS X, Darwin
Build using the following execute line:
sh configure && MACOSX_DEPLOYMENT_TARGET=10.5 make all
Testing
An extensive test suite is available within the SLURM distribution in testsuite/expect. There are about 250 tests which will execute on the order of 2000 jobs and 5000 job steps. Depending upon your system configuration and performance, this test suite will take roughly 80 minutes to complete. The file testsuite/expect/globals contains default paths and procedures for all of the individual tests. You will need to edit this file to specify where SLURM and other tools are installed. Set your working directory to testsuite/expect before starting these tests. Tests may be executed individually by name (e.g. test1.1) or the full test suite may be executed with the single command regression. See testsuite/expect/README for more information.
Upgrades
Background: The SLURM version numbers contain three digits, which represent the major, minor and micro release numbers in that order (e.g. 2.1.3 is major=2, minor=1, micro=3). Changes in the RPCs (remote procedure calls) will only be made if the major and/or minor release number changes. Changes in the micro release number generally represent only bug fixes, but may also include minor enhancements.
If the SlurmDBD daemon is used, it must be at the same or higher minor release number as the Slurmctld daemons. In other words, when changing the version to a higher release number (e.g. from 2.0 to 2.1) always upgrade the SlurmDBD daemon first. There is no need to upgrade the SlurmDBD daemon when performing an update at the micro level (e.g. from 2.1.0 to 2.1.1).
When upgrading to a new major or minor release of SLURM prior to version 2.2 (e.g. 2.0.x to 2.1.x) all running and pending jobs will be purged due to changes in state save information. When upgrading to a new micro release of SLURM (e.g. 2.1.1 to 2.1.2) all running and pending jobs will be preserved. Just install a new version of SLURM and restart the daemons. When going from version 2.1.x to version 2.2.x and higher version numbers, we do not expect that any running or pending jobs will be lost although a limited number of prior releases may be supported (e.g. 2.1.0 to 2.2.0 will work fine, but 2.1.0 to 2.9.0 may not). An exception to this is that jobs may be lost when installing new pre-release versions (e.g. 2.3.0-pre1 to 2.3.0-pre2). We'll try to note these cases in the NEWS file. Contents of major releases are also described in the RELEASE_NOTES file.
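The version rules above can be sketched as a small shell helper that classifies an upgrade and reports whether the SlurmDBD daemon should be upgraded first. The function names classify_upgrade and slurmdbd_first are hypothetical, and the sketch assumes plain dotted version strings such as 2.1.3 (no pre-release suffixes).

```shell
# Sketch only: compare the major and minor fields of two versions.
# $1 = installed version, $2 = new version; prints major, minor or micro.
classify_upgrade() {
    old=$1
    new=$2
    old_major=${old%%.*}; old_rest=${old#*.}; old_minor=${old_rest%%.*}
    new_major=${new%%.*}; new_rest=${new#*.}; new_minor=${new_rest%%.*}
    if [ "$old_major" != "$new_major" ]; then
        echo "major"
    elif [ "$old_minor" != "$new_minor" ]; then
        echo "minor"
    else
        echo "micro"
    fi
}

# Prints "yes" when SlurmDBD must be upgraded before the other daemons.
slurmdbd_first() {
    case $(classify_upgrade "$1" "$2") in
        micro) echo "no" ;;
        *)     echo "yes" ;;
    esac
}

classify_upgrade 2.1.1 2.1.2    # prints: micro
slurmdbd_first 2.0.3 2.1.0      # prints: yes
```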
Last modified 9 November 2010