Power Saving Guide

SLURM provides an integrated power saving mechanism for idle nodes. Nodes that remain idle for a configurable period of time can be placed in a power saving mode. The nodes will be restored to normal operation once work is assigned to them. Beginning with version 2.0.0, nodes can be fully powered down. Earlier releases of SLURM do not support powering nodes down, but only support reducing their performance and thus their power consumption. For example, power saving can be accomplished using a cpufreq governor that changes CPU frequency and voltage (note that the cpufreq driver must be enabled in the Linux kernel configuration). Of particular note, SLURM can power nodes up or down at a configurable rate to prevent rapid changes in power demand. For example, starting a 1000-node job on an idle cluster could result in an instantaneous surge in power demand of multiple megawatts without SLURM's support for increasing power demand in a gradual fashion.

Configuration

A great deal of flexibility is offered in terms of when and how idle nodes are put into or removed from power save mode. Note that the SLURM control daemon, slurmctld, must be restarted to initially enable power saving mode. Changes in the configuration parameters (e.g. SuspendTime) will take effect after modifying the slurm.conf configuration file and executing "scontrol reconfig". The following configuration parameters are available:
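
SuspendTime: Nodes become eligible for power saving mode after being idle for this number of seconds; a negative value disables power saving.
SuspendProgram: Program executed to place nodes into power saving mode. Its argument is the names of the nodes to be suspended, in SLURM's hostlist expression format.
ResumeProgram: Program executed to remove nodes from power saving mode. Its argument is the names of the nodes to be restored.
SuspendTimeout: Maximum time permitted for a node to complete a suspend request.
ResumeTimeout: Maximum time permitted for a node to complete a resume request and become available for use.
SuspendRate: Maximum number of nodes to be placed into power saving mode per minute (zero imposes no limit).
ResumeRate: Maximum number of nodes to be removed from power saving mode per minute (zero imposes no limit).
SuspendExcNodes: Nodes which are never to be placed into power saving mode.
SuspendExcParts: Partitions whose nodes are never to be placed into power saving mode.

The summaries above are brief; see the slurm.conf man page for complete descriptions and default values. As an illustrative sketch (the values and script paths are placeholders, not recommendations), a power saving configuration in slurm.conf might resemble:

# slurm.conf excerpt (illustrative values, hypothetical script paths)
SuspendTime=1800
SuspendRate=60
ResumeRate=60
SuspendTimeout=120
ResumeTimeout=600
SuspendProgram=/usr/sbin/slurm_suspend
ResumeProgram=/usr/sbin/slurm_resume
SuspendExcNodes=tux[0-3]
SuspendExcParts=debug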

Note that SuspendProgram and ResumeProgram execute as SlurmUser on the node where the slurmctld daemon runs (primary and backup server nodes). Use of sudo may be required for SlurmUser to power down and restart nodes. If you need to convert SLURM's hostlist expression into individual node names, the scontrol show hostnames command may prove useful. The commands used to boot or shut down nodes will depend upon your cluster management tools.
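
For example, a hostlist expression can be expanded into individual node names as follows (the node names are illustrative):

$ scontrol show hostnames tux[4-6]
tux4
tux5
tux6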

Note that SuspendProgram and ResumeProgram are not subject to any time limits. They should perform the required action, ideally verify the action (e.g. confirm that the node has booted and its slurmd daemon has started, so the node is no longer non-responsive to slurmctld), and then terminate. Long-running programs will be logged by slurmctld, but not aborted.

Also note that the stdout and stderr of the suspend and resume programs are not logged. If logging is desired, it should be added to the scripts, as shown in the examples below.

#!/bin/bash
# Example SuspendProgram
# $1 is a hostlist expression naming the nodes to be placed into power
# saving mode (e.g. "tux[4-6]")
echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames "$1"`
for host in $hosts
do
   # node_shutdown is a site-specific command to power down a node
   sudo node_shutdown "$host"
done

#!/bin/bash
# Example ResumeProgram
# $1 is a hostlist expression naming the nodes to be restored to service
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames "$1"`
for host in $hosts
do
   # node_startup is a site-specific command to boot a node
   sudo node_startup "$host"
done

Subject to the various rates, limits and exclusions, the power save code follows this logic:

  1. Identify nodes which have been idle for at least SuspendTime.
  2. Execute SuspendProgram with an argument of the idle node names.
  3. Identify the nodes which are in power save mode (a flag in the node's state field), but have been allocated to jobs.
  4. Execute ResumeProgram with an argument of the allocated node names.
  5. Once a node's slurmd daemon responds, initiate any jobs and/or job steps allocated to it.
  6. If the slurmd fails to respond within the value configured for SlurmdTimeout, the node will be marked DOWN and the job requeued if possible.
  7. Repeat indefinitely.

The slurmctld daemon will periodically (every 10 minutes) log how many nodes are in power save mode using messages of this sort:

[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
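
The history of these messages can be extracted with a command such as the following (the log file location is an assumption here; it is set by SlurmctldLogFile in slurm.conf):

grep "Power save mode" /var/log/slurm/slurmctld.log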

Using these logs you can easily see the effect of SLURM's power saving support. You can also configure SuspendProgram and ResumeProgram to run programs that perform no action in order to assess the potential impact of power saving mode before enabling it.
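
A minimal no-op script, suitable for use as both SuspendProgram and ResumeProgram during such an assessment, might look like this sketch (the log file path is an arbitrary choice):

#!/bin/bash
# No-op suspend/resume program: record which nodes would have been
# suspended or resumed, but take no action
echo "`date` $0 invoked for nodes $1" >>/var/log/power_save_test.log
exit 0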

Use of Allocations

A resource allocation request will be granted as soon as resources are selected for use, possibly before all of the allocated nodes are available for use. The launching of job steps will be delayed until the required nodes have been restored to service (srun prints a warning about waiting for the nodes to become available and periodically retries until they are).

In the case of an sbatch command, the batch program will start when node zero of the allocation is ready for use, and any needed pre-processing can be performed before srun is used to launch job steps. Waiting for all nodes to be booted can be accomplished by adding the command "scontrol wait_job $SLURM_JOBID" within the script, or by adding that command to the system Prolog or PrologSlurmctld as configured in slurm.conf, which would create the delay for all jobs on the system. Insure that the Prolog exits with a code of zero to avoid draining the node; do not use the exit code of scontrol, which may be non-zero (for example, if the job is explicitly cancelled during startup) and would drain the node. Note that the scontrol wait_job command was added in SLURM version 2.2. When using earlier versions of SLURM, one may first execute "srun /bin/true" or some other trivial command to insure that all nodes are booted and ready for use.
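
For example, a batch script might wait for all of its nodes with a sketch like the following (my_application is a placeholder for the real workload):

#!/bin/bash
#SBATCH --nodes=16
# Wait until every node in the allocation has booted and its slurmd
# daemon is responding (scontrol wait_job requires SLURM 2.2 or later)
scontrol wait_job $SLURM_JOBID
# All nodes are now available; launch the parallel job step
srun my_application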

In SLURM version 2.2, the salloc and srun commands which create a resource allocation automatically wait for the nodes to power up. When using earlier versions of SLURM, salloc will return immediately after a resource allocation is made, and one can execute "srun /bin/true" to insure that all nodes are booted and ready for use.
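
For example, with a pre-2.2 release one might run a trivial job step before the real work (my_application is again a placeholder):

# SLURM releases prior to 2.2: salloc returns as soon as the allocation
# is granted, so run a trivial job step first to wait for the nodes to boot
salloc --nodes=16 bash -c 'srun /bin/true && srun my_application'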

Fault Tolerance

If the slurmctld daemon is terminated gracefully, it will wait up to SuspendTimeout or ResumeTimeout (whichever is larger) for any spawned SuspendProgram or ResumeProgram to terminate before the daemon exits. If the spawned program does not terminate within that time period, the event will be logged and slurmctld will exit in order to permit another slurmctld daemon to be initiated. Synchronization problems could also occur when the slurmctld daemon crashes (a rare event) and is restarted.

In either event, the newly initiated slurmctld daemon (or the backup server) will recover saved node state information that may not accurately describe the actual node state. In the case of a failed SuspendProgram, the negative impact is limited to increased power consumption, so no special action is currently taken to execute SuspendProgram multiple times in order to insure the node is in a reduced power mode. The case of a failed ResumeProgram is more serious in that the node could be placed into a DOWN state and/or jobs could fail. In order to minimize this risk, when the slurmctld daemon is started and a node which should be allocated to a job fails to respond, the ResumeProgram will be executed (possibly for a second time).

Booting Different Images

SLURM's PrologSlurmctld configuration parameter can identify a program to boot different operating system images for each job based upon its constraint field (or possibly its comment). If you want ResumeProgram to boot various images according to job specifications, it will need to be a fairly sophisticated program and perform the following actions (a sketch follows the list below):

  1. Determine which jobs are associated with the nodes to be booted
  2. Determine which image is required for each job and
  3. Boot the appropriate image for each node
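
The following sketch illustrates one way such a ResumeProgram could be structured. It assumes a hypothetical site-specific boot command (node_boot), hypothetical image names, and that your version of squeue supports the --noheader (-h), --nodelist (-w) and format (%f, requested features) options; the mapping from a job's constraints to an image is entirely site-specific.

#!/bin/bash
# Sketch of a ResumeProgram that boots a different image per node based
# upon the features (constraints) requested by the job assigned to it.
# node_boot and the image names are hypothetical, site-specific values.
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
for host in `scontrol show hostnames "$1"`
do
   # Features requested by the first job allocated to this node
   features=`squeue -h -w "$host" -o "%f" | head -n 1`
   case "$features" in
      *bigmem*) image=bigmem_image ;;
      *)        image=default_image ;;
   esac
   sudo node_boot "$host" "$image"
done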

Last modified 28 April 2010