High Throughput Computing Administration Guide

This document contains SLURM administrator information specifically for high throughput computing, namely the execution of many short jobs. Getting optimal performance for high throughput computing does require some tuning, and this document should help get you off to a good start. A working knowledge of SLURM should be considered a prerequisite for this material.

Performance Results

SLURM has been validated to process 100,000 jobs and job steps per hour on a sustained basis, with short bursts of activity at a much higher level. Actual performance depends upon the jobs to be executed plus the hardware and configuration used.

System configuration

Three system configuration parameters must be set to support a large number of open files and TCP connections with large bursts of messages. Changes can be made in /etc/rc.d/rc.local or /etc/sysctl.conf so that they are preserved after a reboot. Values can also be written directly into the corresponding files under /proc to take effect immediately (e.g. "echo 32832 > /proc/sys/fs/file-max").
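A minimal sketch of such settings in /etc/sysctl.conf is shown below. The file-max value matches the example above; the TCP-related parameters and their values are illustrative assumptions, not recommendations from this guide, and should be tuned for the site.

    # /etc/sysctl.conf -- applied at boot, or immediately with "sysctl -p"
    fs.file-max = 32832                   # maximum number of concurrently open files
    net.ipv4.tcp_max_syn_backlog = 4096   # pending TCP connection backlog (illustrative value)
    net.core.somaxconn = 4096             # listen queue depth (illustrative value)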

The transmit queue length (txqueuelen) may also need to be modified using the ifconfig command. A value of 4096 has been found to work well for one site with a very large cluster (e.g. "ifconfig <interface> txqueuelen 4096").
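For example, assuming the node's cluster-facing interface is named eth0 (substitute the actual interface name), the setting can be applied with either the traditional net-tools command or its iproute2 equivalent:

    ifconfig eth0 txqueuelen 4096          # traditional net-tools command
    ip link set dev eth0 txqueuelen 4096   # equivalent iproute2 command

Like the /proc settings above, this change does not survive a reboot unless it is added to a boot-time script such as /etc/rc.d/rc.local.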

User limits

The ulimit values in effect for the slurmctld daemon should be set quite high for memory size, open file count and stack size.
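One way to do this is to raise the limits in the script or wrapper that launches slurmctld, before the daemon is started, so that the daemon inherits them. The values below are illustrative assumptions, not recommendations from this guide:

    # Raise resource limits for slurmctld before starting the daemon
    ulimit -n 262144      # open file count (illustrative value)
    ulimit -s unlimited   # stack size
    ulimit -v unlimited   # virtual memory size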

SLURM Configuration

Several SLURM configuration parameters should be adjusted to reflect the needs of high throughput computing.
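The fragment below is a minimal sketch of the kind of slurm.conf adjustments involved. The parameter choices and values are illustrative assumptions rather than recommendations from this section, and should be checked against the slurm.conf man page for the installed version:

    # Illustrative slurm.conf fragment for high throughput workloads
    SchedulerType=sched/builtin            # simple FIFO scheduling, lower per-job overhead
    SlurmctldDebug=3                       # keep slurmctld logging modest
    SlurmdDebug=3                          # keep slurmd logging modest
    MinJobAge=300                          # purge completed job records after this many seconds
    JobAcctGatherType=jobacct_gather/none  # disable per-task accounting overhead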

Last modified 30 August 2010