High Throughput Computing Administration Guide
This document contains SLURM administrator information specifically for high throughput computing, namely the execution of many short jobs. Getting optimal performance for high throughput computing does require some tuning, and this document should help you get off to a good start. A working knowledge of SLURM should be considered a prerequisite for this material.
Performance Results
SLURM has been validated to process 100,000 jobs and job steps per hour on a sustained basis, with short bursts of activity at a much higher level. Actual performance depends upon the jobs to be executed plus the hardware and configuration used.
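To put the sustained figure in context, the Python sketch below converts it to a per-second rate and estimates how many completed job records slurmctld would still hold for a given MinJobAge (MinJobAge and MaxJobCount are described under SLURM Configuration). It assumes, for simplicity, that all 100,000 items are jobs rather than job steps, so the result is an upper bound. The function name retained_completed_jobs is hypothetical.

```python
def retained_completed_jobs(jobs_per_hour, min_job_age_seconds):
    """Estimate completed-job records still held in slurmctld memory.

    Assumes a steady completion rate and that every record is purged
    exactly min_job_age_seconds after its job completes.
    """
    jobs_per_second = jobs_per_hour / 3600
    return jobs_per_second * min_job_age_seconds


if __name__ == "__main__":
    rate = 100_000  # sustained figure from the text, treated as jobs only
    for min_job_age in (300, 10):
        held = retained_completed_jobs(rate, min_job_age)
        print(f"MinJobAge={min_job_age}s -> about {held:.0f} completed records")
```

At the default MinJobAge of 300 seconds the estimate is roughly 8,333 completed records, which is close to the default MaxJobCount of 10,000 before any pending or running jobs are counted. This is one way to see why reducing MinJobAge and raising MaxJobCount both matter at this throughput.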
System configuration
Three system configuration parameters must be set to support a large number of open files and TCP connections with large bursts of messages. To preserve the changes after a reboot, make them in the /etc/rc.d/rc.local script or the /etc/sysctl.conf file. To apply a value immediately, write it directly into the corresponding /proc file (e.g. "echo 32832 > /proc/sys/fs/file-max").
- /proc/sys/fs/file-max: The maximum number of concurrently open files. We recommend a limit of at least 32,832.
- /proc/sys/net/ipv4/tcp_max_syn_backlog: Maximum number of remembered connection requests that have not yet received an acknowledgment from the connecting client. The default value is 1024 for systems with more than 128 MB of memory, and 128 for low-memory machines. If the server suffers from overload, try increasing this number.
- /proc/sys/net/core/somaxconn: Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 128. The value should be raised substantially to support bursts of requests. For example, to support a burst of 1024 requests, set somaxconn to 1024.
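As a rough illustration of the three settings above, the following Python sketch reads each /proc file and reports any value below a chosen floor. The file-max floor and the somaxconn figure are taken from the text. The tcp_max_syn_backlog floor of 1024 is an assumption, since the text gives only its usual default. The names RECOMMENDED_MINIMUMS, read_proc_value and find_low_settings are hypothetical.

```python
from pathlib import Path

# Floors for each setting. file-max and somaxconn follow the text above;
# the tcp_max_syn_backlog floor is an assumption made for this sketch.
RECOMMENDED_MINIMUMS = {
    "/proc/sys/fs/file-max": 32832,
    "/proc/sys/net/ipv4/tcp_max_syn_backlog": 1024,
    "/proc/sys/net/core/somaxconn": 1024,
}


def read_proc_value(path):
    """Return the integer stored in a /proc/sys file, or None if unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def find_low_settings(minimums=RECOMMENDED_MINIMUMS, reader=read_proc_value):
    """Return {path: (current, minimum)} for settings below their floor."""
    low = {}
    for path, minimum in minimums.items():
        current = reader(path)
        # Unreadable files are skipped rather than reported as low.
        if current is not None and current < minimum:
            low[path] = (current, minimum)
    return low


if __name__ == "__main__":
    for path, (current, minimum) in find_low_settings().items():
        print(f"{path}: {current} is below the suggested minimum {minimum}")
```

The sketch only reads values, so it does not need root privileges. Changing them is still done as described above.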
The transmit queue length (txqueuelen) may also need to be modified using the ifconfig command. A value of 4096 has been found to work well for one site with a very large cluster (e.g. "ifconfig
User limits
The ulimit values in effect for the slurmctld daemon should be set quite high for memory size, open file count and stack size.
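One way to see which limits a process would inherit is to query them with Python's standard resource module, as in the sketch below. It reports the limits of the process that runs it, not of a running slurmctld, so it is only meaningful when launched from the same environment that starts the daemon. Mapping "memory size" to RLIMIT_AS is an assumption of this sketch, and the names LIMITS, format_limit and current_limits are hypothetical.

```python
import resource

# The three limits the text singles out. Using RLIMIT_AS for
# "memory size" is an assumption made for this sketch.
LIMITS = {
    "memory size (RLIMIT_AS)": resource.RLIMIT_AS,
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
}


def format_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the calling process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={format_limit(soft)} hard={format_limit(hard)}")
```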
SLURM Configuration
Several SLURM configuration parameters should be adjusted to reflect the needs of high throughput computing.
- MaxJobCount: Controls how many jobs may be in the slurmctld daemon records at any point in time (pending, running, suspended or completed [temporarily]). The default value is 10,000.
- MessageTimeout: Controls how long to wait for a response to messages. The default value is 10 seconds. While the slurmctld daemon is highly threaded, its responsiveness is load dependent. This value might need to be increased somewhat.
- MinJobAge: Controls how soon the record of a completed job can be purged from the slurmctld memory, after which it is no longer visible using the squeue command. The record of jobs run will be preserved in accounting records and logs. The default value is 300 seconds. The value should be reduced to a few seconds if possible.
- SchedulerParameters: Several scheduling parameters are available.
  - Setting the defer option avoids attempting to schedule each job individually at job submit time and instead defers scheduling until a later time, when scheduling multiple jobs simultaneously may be possible. This option may improve system responsiveness when large numbers of jobs (many hundreds) are submitted at the same time, but it will delay the initiation time of individual jobs.
  - A variation of defer would be to configure default_queue_depth to a relatively small number to avoid attempting to schedule large numbers of jobs every time some job completes or another routine action occurs. (NOTE: the default value of default_queue_depth should be fine in most cases).
  - The sched/backfill plugin has relatively high overhead if used with large numbers of jobs. Configuring max_job_bf to a modest size (say 100 jobs or less) and interval to 30 seconds or more will limit the overhead of backfill scheduling (NOTE: the default values are fine for both of these parameters).
- SlurmctldPort: It is desirable to configure the slurmctld daemon to accept incoming messages on more than one port in order to avoid having incoming messages discarded by the operating system due to exceeding the SOMAXCONN limit described above. Using between two and ten ports is suggested when large numbers of simultaneous requests are to be supported.
- Other: Configure logging, accounting and other overhead to a minimum appropriate for your environment.
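To show how the parameters above might fit together, the Python sketch below renders a fragment of slurm.conf lines from a pair of dictionaries. All values are illustrative rather than tested recommendations. The option names are copied from the text and should be checked against the slurm.conf man page for your release. Writing several ports as a range ("6820-6823") is an assumption about the accepted syntax, and render_slurm_conf is a hypothetical helper.

```python
def render_slurm_conf(settings, scheduler_options):
    """Render slurm.conf lines from plain settings plus SchedulerParameters options."""
    lines = [f"{key}={value}" for key, value in settings.items()]
    # SchedulerParameters is written as one comma-separated list. A flag
    # such as "defer" carries no value, so None marks a bare flag here.
    options = [
        name if value is None else f"{name}={value}"
        for name, value in scheduler_options.items()
    ]
    if options:
        lines.append("SchedulerParameters=" + ",".join(options))
    return "\n".join(lines)


# Illustrative values only; the port range syntax is an assumption.
settings = {
    "MaxJobCount": 100000,
    "MessageTimeout": 30,
    "MinJobAge": 10,
    "SlurmctldPort": "6820-6823",
}
scheduler_options = {"defer": None, "max_job_bf": 100, "interval": 30}

if __name__ == "__main__":
    print(render_slurm_conf(settings, scheduler_options))
```

Keeping the SchedulerParameters options in their own dictionary makes it easy to drop defer or the backfill limits when the defaults are adequate, as the notes above suggest they often are.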
Last modified 30 August 2010