SLURM: A Highly Scalable Resource Manager
SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
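The three functions can be pictured with a toy model. The sketch below is not SLURM's implementation or API: the names `ToyResourceManager`, `Job`, `submit`, and `finish` are hypothetical, the scheduling policy is assumed to be strict first-in, first-out, and every allocation is assumed to be exclusive.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int
    allocated: list = field(default_factory=list)


class ToyResourceManager:
    """Illustrative model only; not SLURM's design or API."""

    def __init__(self, node_names):
        self.free_nodes = list(node_names)  # nodes not allocated to any job
        self.pending = deque()              # FIFO queue of waiting jobs
        self.running = {}                   # job name -> Job

    def submit(self, job):
        """Queue a job, then start whatever fits (function 3: arbitration)."""
        self.pending.append(job)
        self._schedule()

    def _schedule(self):
        # Strict FIFO: stop at the first queued job that does not fit.
        while self.pending and self.pending[0].nodes_needed <= len(self.free_nodes):
            job = self.pending.popleft()
            # Function 1: grant exclusive access to a set of nodes.
            job.allocated = [self.free_nodes.pop(0) for _ in range(job.nodes_needed)]
            # Function 2: "launch" the job and track it on its allocated nodes.
            self.running[job.name] = job

    def finish(self, name):
        """Release a finished job's nodes, then try to start queued jobs."""
        job = self.running.pop(name)
        self.free_nodes.extend(job.allocated)
        self._schedule()


rm = ToyResourceManager(["n0", "n1", "n2", "n3"])
rm.submit(Job("a", 3))
rm.submit(Job("b", 2))   # only one node is free, so "b" waits in the queue
queued_before = [j.name for j in rm.pending]
rm.finish("a")           # frees three nodes, so "b" can start
running_after = sorted(rm.running)
```

In this model a job that asks for more nodes than the cluster has would block the queue indefinitely; a real resource manager has to reject or otherwise handle such requests.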
SLURM's design is highly modular, with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton), and Intel uses it on its 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
While other resource managers do exist, SLURM is unique in several respects:
- It is designed to operate in a heterogeneous cluster with up to 65,536 nodes and hundreds of thousands of processors.
- It can sustain a throughput rate of over 120,000 jobs per hour with bursts of job submissions at several times that rate.
- Its source code is freely available under the GNU General Public License.
- It is portable, being written in C with a GNU autoconf configuration engine. Although it was initially written for Linux, other UNIX-like operating systems should be easy porting targets.
- It is highly tolerant of system failures, including failure of the node executing its control functions.
- A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
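To put the quoted sustained throughput of 120,000 jobs per hour in perspective, the short calculation below converts it to a per-second rate and to an average time budget per job. The per-job budget assumes, purely for illustration, that jobs are handled one at a time; it says nothing about how SLURM actually processes submissions.

```python
JOBS_PER_HOUR = 120_000   # sustained rate quoted in the list above
SECONDS_PER_HOUR = 3_600

jobs_per_second = JOBS_PER_HOUR / SECONDS_PER_HOUR
# Average time available per job if jobs were handled strictly in sequence.
ms_per_job = 1_000 / jobs_per_second

print(f"{jobs_per_second:.1f} jobs/s, {ms_per_job:.0f} ms per job")
```

That works out to roughly 33 jobs per second, or about 30 ms per job on average.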
SLURM provides resource management on many of the most powerful computers in the world including:
- Tianhe-1A, designed by the National University of Defense Technology (NUDT) in China, with 14,336 Intel CPUs, 7,168 NVIDIA Tesla M2050 GPUs, and a peak performance of 2.507 Petaflops.
- Tera 100 at CEA, Europe's most powerful supercomputer, with 140,000 Intel Xeon 7500 processing cores, 300 TB of central memory, and a theoretical computing power of 1.25 Petaflops.
- Dawn, a BlueGene/P system at LLNL with 147,456 PowerPC 450 cores and a peak performance of 0.5 Petaflops.
- Rosa, a Cray XT5 at the Swiss National Supercomputing Centre, named after Monte Rosa in the Swiss-Italian Alps (elevation 4,634 m). It has 3,688 AMD hexa-core Opteron processors at 2.4 GHz, 28.8 TB of DDR2 RAM, 290 TB of disk, and 9.6 GB/s interconnect bandwidth (Seastar).
- EKA at Computational Research Laboratories, India, with 14,240 Xeon processors and an InfiniBand interconnect.
- MareNostrum, a Linux cluster at the Barcelona Supercomputing Center with 10,240 PowerPC processors and a Myrinet switch.
- Anton, a massively parallel supercomputer designed and built by D. E. Shaw Research for molecular dynamics simulation, using 512 custom-designed ASICs and a three-dimensional torus interconnect.
Last modified 5 May 2011