BlueGene User and Administrator Guide

Overview

This document describes the unique features of SLURM on the IBM BlueGene systems. You should be familiar with SLURM's mode of operation on Linux clusters before studying the relatively few differences in BlueGene operation described in this document.

BlueGene systems have several unique features making for a few differences in how SLURM operates there. BlueGene systems consist of one or more base partitions or midplanes connected in a three-dimensional (BlueGene/L and BlueGene/P systems) or five-dimensional (BlueGene/Q) torus. Each base partition typically includes 512 c-nodes or compute nodes, each containing two or more cores; one core is typically designated primarily for managing communications while the other cores are used primarily for computations. Each c-node can execute only one process and thus is unable to execute both the user's application and SLURM's slurmd daemon. Thus the slurmd daemon(s) executes on one or more of the BlueGene Front End Nodes. The slurmd daemons provide (almost) all of the normal SLURM services for every base partition on the system.

Internally SLURM treats each base partition as one node with a processor count equal to the number of cores on the base partition, which keeps the number of entities being managed by SLURM more reasonable. Since the current BlueGene software can sub-allocate a base partition into smaller blocks, more than one user job can execute on each base partition (subject to system administrator configuration). In the case of BlueGene/Q systems, more than one user job can also execute in each block. To effectively utilize this environment, SLURM tools present the user with the view that each c-node is a separate node, so allocation requests and status information use c-node counts. Since the c-node count can be very large, the suffix "k" can be used to represent multiples of 1024 or "m" for multiples of 1,048,576 (1024 x 1024). For example, "2k" is equivalent to "2048".

User Tools

The normal set of SLURM user tools (sbatch, scancel, sinfo, squeue, and scontrol) provides all of the expected services except support for job steps, which is detailed later. Nine new sbatch options are available:

--geometry (specify job size in each dimension)
--no-rotate (disable rotation of geometry)
--conn-type (specify interconnect type between base partitions, mesh or torus)
--blrts-image (specify alternative blrts image for the bluegene block; the default is used if not set; BGL only)
--cnload-image (specify alternative c-node image for the bluegene block; the default is used if not set; BGP only)
--ioload-image (specify alternative io image for the bluegene block; the default is used if not set; BGP only)
--linux-image (specify alternative linux image for the bluegene block; the default is used if not set; BGL only)
--mloader-image (specify alternative mloader image for the bluegene block; the default is used if not set)
--ramdisk-image (specify alternative ramdisk image for the bluegene block; the default is used if not set; BGL only)

The --nodes option with a minimum and (optionally) maximum node count continues to be available. Note that this is a c-node count.
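For example, a hypothetical submission requesting 4096 c-nodes laid out as a 2x2x2 torus of midplanes might look like the following. This is only a sketch: the script name is illustrative, and it assumes 512 c-nodes per midplane with --geometry expressed in midplanes.

sbatch --nodes=4k --geometry=2x2x2 --conn-type=torus --no-rotate my.sh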

Task Launch on BlueGene/Q only

Use SLURM's srun command to launch tasks (srun is a wrapper for IBM's runjob command). SLURM job step information, including accounting, functions as expected.
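A minimal sketch of launching tasks under an allocation on BlueGene/Q, assuming a hypothetical executable named my_app (the allocation size is illustrative only):

salloc --nodes=1k
srun ./my_app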

Task Launch on BlueGene/L and BlueGene/P only

SLURM performs resource allocation for the job, but initiation of tasks is performed using the mpirun command. SLURM has no concept of a job step on BlueGene/L or BlueGene/P systems. To reiterate: salloc or sbatch are used to create a job allocation, but mpirun is used to launch the parallel tasks. The script that you submit to SLURM can contain multiple invocations of mpirun as well as any desired commands for pre- and post-processing. The mpirun command will get its bgblock information from the MPIRUN_PARTITION environment variable as set by SLURM. A sample script is shown below.

#!/bin/bash
# pre-processing
date
# processing
mpirun -exec /home/user/prog -cwd /home/user -args 123
mpirun -exec /home/user/prog -cwd /home/user -args 124
# post-processing
date

Naming Conventions

The naming of base partitions includes a numeric suffix representing its coordinates with a zero origin. The suffix contains three digits on BlueGene/L and BlueGene/P systems, while four digits are required for BlueGene/Q systems. For example, "bgp012" represents the base partition whose coordinates are X=0, Y=1 and Z=2. SLURM uses an abbreviated format for describing base partitions in which the end-points of the enclosed block are specified in square brackets and separated by an "x". For example, "bgp[620x731]" is used to represent the eight base partitions enclosed in a block with end-points bgp620 and bgp731 (bgp620, bgp621, bgp630, bgp631, bgp720, bgp721, bgp730 and bgp731).

IMPORTANT: SLURM can support up to 36 elements in each BlueGene dimension by accepting "A-Z" as valid digits. SLURM requires the prefix to be lower case and any letters in the suffix must always be upper case. This schema must be used in both the slurm.conf and bluegene.conf configuration files when specifying midplane/node names (the prefix is optional). This schema should also be used to specify midplanes or locations in the configure mode of smap:
valid: bgl[000xC44], bgl000, bglZZZ
invalid: BGL[000xC44], BglC00, bglb00, Bglzzz

In a system configured with small blocks (any block less than a full base partition) there will be divisions in the base partition notation. On BlueGene/L and BlueGene/P systems, the base partition name may be followed by square brackets enclosing the ID numbers of the IO nodes associated with the block. For example, if there are 64 psets in a BlueGene/L configuration, "bgl012[0-15]" represents the first quarter or first 16 IO nodes of a midplane. In BlueGene/L this would be a 128 c-node block. To represent the first nodecard in the second quarter or IO nodes 16-19, the notation would be "bgl012[16-19]", or a 32 c-node block. On BlueGene/Q systems, the specific c-nodes would be identified in square brackets using their five digit coordinates. For example, "bgq0123[00000x11111]" would represent the 32 c-nodes in midplane "bgq0123" having coordinates (within that midplane) from zero to one in each of the five dimensions.

Two topology-aware graphical user interfaces are provided: smap and sview (sview provides more viewing and configuring options). See each command's man page for details. A sample of smap output is provided below showing the location of five jobs. Note the format of the list of base partitions allocated to each job. Also note that idle (unassigned) base partitions are indicated by a period. Down and drained base partitions (those not available for use) are indicated by a number sign (bg703 in the display below). The legend is for illustrative purposes only. The origin (zero in every dimension) is shown at the rear left corner of the bottom plane. Each set of four consecutive lines represents a plane in the Y dimension. Values in the X dimension increase to the right. Values in the Z dimension increase down and toward the left.

   a a a a b b d d    ID JOBID PARTITION BG_BLOCK USER   NAME ST TIME NODES BP_LIST
  a a a a b b d d     a  12345 batch     RMP0     joseph tst1 R  43:12  32k bg[000x333]
 a a a a b b c c      b  12346 debug     RMP1     chris  sim3 R  12:34   8k bg[420x533]
a a a a b b c c       c  12350 debug     RMP2     danny  job3 R   0:12   4k bg[622x733]
                      d  12356 debug     RMP3     dan    colu R  18:05   8k bg[600x731]
   a a a a b b d d    e  12378 debug     RMP4     joseph asx4 R   0:34   2k bg[612x713]
  a a a a b b d d
 a a a a b b c c
a a a a b b c c

   a a a a . . d d
  a a a a . . d d
 a a a a . . e e              Y
a a a a . . e e               |
                              |
   a a a a . . d d            0----X
  a a a a . . d d            /
 a a a a . . . .            /
a a a a . . . #            Z

Note that jobs enter the SLURM state RUNNING as soon as they have been allocated a bgblock. If the bgblock is in a READY state, the job will begin execution almost immediately. Otherwise the execution of the job will not actually begin until the bgblock is in a READY state, which can require booting the block and a delay of minutes to do so. You can identify the bgblock associated with your job using the command smap -Dj -c and the state of the bgblock with the command smap -Db -c. The time to boot a bgblock is related to its size, but should range from a few minutes to about 15 minutes for a bgblock containing 128 base partitions. Only after the bgblock is READY will your job's output file be created and the script execution begin. If the bgblock boot fails, SLURM will attempt to reboot several times before draining the associated base partitions and aborting the job.

The job will continue to be in a RUNNING state until the bgjob has completed and the bgblock ownership is changed. The time for completing a bgjob has frequently been on the order of five minutes. In summary, your job may appear in SLURM as RUNNING from up to 15 minutes before the script actually begins until about 5 minutes after it completes. These delays are the result of BlueGene infrastructure issues and are not due to anything in SLURM.

When using smap in default output mode you can scroll through the different windows using the arrow keys. The up and down arrow keys scroll the window containing the grid, and the left and right arrow keys scroll the window containing the text information.

System Administration

Building a BlueGene compatible system is dependent upon the configure program locating some expected files. In particular for a BlueGene/L system, the configure script searches for libdb2.so in the directories /home/bgdb2cli/sqllib and /u/bgdb2cli/sqllib. If your DB2 library file is in a different location, use the configure option --with-db2-dir=PATH to specify the parent directory. If you have the same version of the operating system on both the Service Node (SN) and the Front End Nodes (FEN) then you can configure and build one set of files on the SN and install them on both the SN and FEN. Note that all smap functionality will be provided on the FEN except for the ability to map SLURM node names to and from row/rack/midplane data, which requires direct use of the Bridge API calls only available on the SN.
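For example, if the DB2 library were installed under a non-standard parent directory, the build might be configured as follows (the path is illustrative only):

./configure --with-db2-dir=/opt/ibm/db2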

If you have different versions of the operating system on the SN and FEN (as was the case for some early system installations), then you will need to configure and build two sets of files for installation. One set will be for the Service Node (SN), which has direct access to the Bridge APIs. The second set will be for the Front End Nodes (FEN), which lack access to the Bridge APIs and interact with them indirectly using Remote Procedure Calls to the slurmctld daemon. You should see "#define HAVE_BG 1" and "#define HAVE_FRONT_END 1" in the "config.h" file for both the SN and FEN builds. You should also see "#define HAVE_BG_FILES 1" in config.h on the SN before building SLURM.

The slurmctld daemon should execute on the system's service node. If an optional backup daemon is used, it must be in some location where it is capable of executing Bridge APIs. The slurmd daemons execute the user scripts and there must be at least one front end node configured for this purpose. Multiple front end nodes may be configured for slurmd use to improve performance and fault tolerance. Each slurmd can execute jobs for every base partition and the work will be distributed among the slurmd daemons to balance the workload. You can use the scontrol command to drain individual compute nodes as desired and return them to service, as shown below.
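For example, a base partition might be drained and later returned to service as follows (the node name and reason are illustrative only):

scontrol update NodeName=bg000 State=DRAIN Reason="midplane maintenance"
scontrol update NodeName=bg000 State=RESUME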

The slurm.conf (configuration) file needs to have the value of InactiveLimit set to zero or not specified (it defaults to a value of zero). This is because, with no job steps, we do not want jobs to be purged prematurely. The value of SelectType must be set to "select/bluegene" in order to have node selection performed by a plugin aware of the system's topography and interfaces. The value of Prolog should be set to the full pathname of a program that will delay execution until the bgblock identified by the MPIRUN_PARTITION environment variable is ready for use. It is recommended that you construct a script that serves this function and calls the supplied program sbin/slurm_prolog. The value of Epilog should be set to the full pathname of a program that will wait until the bgblock identified by the MPIRUN_PARTITION environment variable is no longer usable by this job. It is recommended that you construct a script that serves this function and calls the supplied program sbin/slurm_epilog. The prolog and epilog programs are used to ensure proper synchronization between the slurmctld daemon, the user job, and MMCS. A multitude of other functions may also be placed into the prolog and epilog as desired (e.g. enabling/disabling user logins, purging file systems, etc.). Sample prolog and epilog scripts follow.

#!/bin/bash
# Sample BlueGene Prolog script
#
# Wait for bgblock to be ready for this job's use
/usr/sbin/slurm_prolog

#!/bin/bash
# Sample BlueGene Epilog script
#
# Cancel job to start the termination process for this job
# and release the bgblock
/usr/bin/scancel $SLURM_JOB_ID
#
# Wait for bgblock to be released from this job's use
/usr/sbin/slurm_epilog

Since jobs with different geometries or other characteristics might not interfere with each other, scheduling is somewhat different on a BlueGene system than on typical clusters. SLURM's builtin scheduler on BlueGene will sort pending jobs and then attempt to schedule all of them in priority order. This essentially functions as if there is a separate queue for each job size. SLURM's backfill scheduler on BlueGene will enforce FIFO (first-in first-out) scheduling with backfill (lower priority jobs will start early if doing so will not impact the expected initiation time of a higher priority job). As on other systems, effective backfill relies upon users setting reasonable job time limits. Note that SLURM does support different partitions with an assortment of different scheduling parameters. For example, SLURM can have a partition defined for full-system jobs that is enabled to execute jobs only at certain times, while a default partition could be configured to execute jobs at other times (see the sketch below). Jobs could still be queued in a partition that is configured in a DOWN state and scheduled to execute when it is changed to an UP state. Base partitions can also be moved between SLURM partitions either by changing the slurm.conf file and restarting the slurmctld daemon or by using the scontrol reconfig command.
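A hypothetical slurm.conf fragment illustrating this arrangement (partition names, node ranges, and time limits are illustrative only):

# Full-system partition, normally DOWN, enabled by the administrator at set times
PartitionName=full Nodes=bg[000x733] MaxTime=24:00:00 State=DOWN
# Default partition for all other work
PartitionName=debug Nodes=bg[000x733] MaxTime=4:00:00 Default=YES State=UP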

SLURM node and partition descriptions should make use of the naming conventions described above. For example, "NodeName=bg[000x733] CPUs=1024" is used in slurm.conf to define a BlueGene system with 128 midplanes in an 8 by 4 by 4 matrix, with each midplane configured with 1024 processors (cores). The node name prefix of "bg" defined by NodeName can be anything you want, but needs to be consistent throughout the slurm.conf file. No computer is actually expected to have a hostname of "bg000" and no attempt will be made to route message traffic to this address.

Front end nodes used for executing the slurmd daemons must also be defined in the slurm.conf file. It is recommended that at least two front end nodes be dedicated to use by the slurmd daemons for fault tolerance. For example: "FrontendName=frontend[00-03] State=UNKNOWN" is used to define four front end nodes for running slurmd daemons.

# Portion of slurm.conf for BlueGene system
InactiveLimit=0
SelectType=select/bluegene
Prolog=/usr/sbin/prolog
Epilog=/usr/sbin/epilog
#
FrontendName=frontend[00-01] State=UNKNOWN
NodeName=bg[000x733] CPUs=1024 State=UNKNOWN

While users are unable to initiate SLURM job steps on BlueGene/L or BlueGene/P systems, this restriction does not apply to user root or SlurmUser. Be advised that the slurmd daemon is unable to manage a large number of job steps, so this ability should be used only to verify normal SLURM operation. If large numbers of job steps are initiated by slurmd, expect the daemon to fail due to lack of memory or other resources. It is best to minimize other work on the front end nodes executing slurmd so as to maximize its performance and minimize other risk factors.

Bluegene.conf File Creation

In addition to the normal slurm.conf file, a new bluegene.conf configuration file is required with information pertinent to the system. Put bluegene.conf into the SLURM configuration directory with slurm.conf. A sample file is installed as bluegene.conf.example. System administrators should use the smap tool to build the appropriate configuration file for static partitioning. Note that smap -Dc can be run without the SLURM daemons active to establish the initial configuration. Note that the bgblocks defined using smap may not overlap (except for the full-system bgblock, which is implicitly created). See the smap man page for more information.

There are three different modes in which the system administrator can define the BlueGene partitions (or bgblocks) available to execute jobs: static, overlap, and dynamic. Jobs must then execute in one of the created bgblocks. (NOTE: bgblocks are unrelated to SLURM partitions.)

The default mode of partitioning is static. In this mode, the system administrator must explicitly define each of the bgblocks in the bluegene.conf file. Each of these bgblocks is explicitly configured with either a mesh or torus interconnect. They must also not overlap, except for the implicitly defined full-system bgblock. Note that bgblocks are not rebooted between jobs in this mode except when going to/from full-system jobs. Eliminating bgblock booting can significantly improve system utilization (by eliminating boot time) and reliability.

The second mode is overlap partitioning. Overlap partitioning is very similar to static partitioning in that each bgblock must be explicitly defined in the bluegene.conf file, but these bgblocks may overlap each other. In this mode it is highly recommended that none of the bgblocks have any passthroughs in the X-dimension associated with them. Usually this is only an issue on larger BlueGene systems. Use this mode with extreme caution. Make sure you know what you are doing to assure the bgblocks will boot without dependency on the state of any base partition not included in the bgblock.

In the two previous modes you must ensure that the base partitions defined in bluegene.conf are consistent with those defined in slurm.conf. Note that the bluegene.conf file contains only the numeric coordinates of base partitions, while slurm.conf contains the name prefix in addition to the numeric coordinates.

The final mode is dynamic partitioning. Dynamic partitioning was developed primarily for smaller BlueGene systems, but can be used on larger systems. Dynamic partitioning may introduce fragmentation of resources. This fragmentation may be severe since SLURM will run a job anywhere resources are available with little thought of the future. As with overlap partitioning, use dynamic partitioning with caution! This mode can result in job starvation since smaller jobs will run if resources are available and prevent larger jobs from running. Bgblocks need not be assigned in the bluegene.conf file for this mode.

Blocks can be freed or set in an error state with scontrol (e.g. "scontrol update BlockName=RMP0 state=error"). This will end any job on the block and set the state of the block to ERROR so that no job will run on the block. To set it back to a usable state, set the state to free (e.g. "scontrol update BlockName=RMP0 state=free").

Alternatively, if only part of a base partition needs to be put into an error state and it isn't already in a block of the size you need, you can set a collection of IO nodes into an error state using scontrol (e.g. "scontrol update subbpname=bg000[0-3] state=error"). This will end any job on the nodes listed, create a block there, and set the state of the block to ERROR so that no job will run on the block. To set it back to a usable state, set the state to free (e.g. "scontrol update BlockName=RMP0 state=free" or "scontrol update subbpname=bg000[0-3] state=free"). This is helpful to allow other jobs to run on the unaffected nodes in the base partition.

One of these modes must be defined in the bluegene.conf file with the option LayoutMode=MODE (where MODE=STATIC, DYNAMIC or OVERLAP).

The number of c-nodes in a base partition and in a node card must be defined. This is done using the keywords BasePartitionNodeCnt=NODE_COUNT and NodeCardNodeCnt=NODE_COUNT respectively in the bluegene.conf file (i.e. BasePartitionNodeCnt=512 and NodeCardNodeCnt=32).

Note that the Numpsets value defined in bluegene.conf is used only when SLURM creates bgblocks; this value determines whether the system is IO rich or not. For most BlueGene/L systems this value is either 8 (for IO poor systems) or 64 (for IO rich systems).

The image file specifications identify which images are used when booting a bgblock, and the valid images are different for each BlueGene system type (e.g. L, P and Q). Their values can change during job allocation based on input from the user. If you change the bgblock layout, then slurmctld and slurmd should both be cold-started (without preserving any state information, e.g. "/etc/init.d/slurm startclean").

If you wish to modify the Numpsets values for existing bgblocks, either modify them manually or destroy the bgblocks and let SLURM recreate them. Note that in addition to the bgblocks defined in bluegene.conf, an additional bgblock is created containing all resources defined in all of the other defined bgblocks. Make use of the SLURM partition mechanism to control access to these bgblocks. A sample bluegene.conf file is shown below.

###############################################################################
# Global specifications for a BlueGene/L system
#
# BlrtsImage:           BlrtsImage used for creation of all bgblocks.
# LinuxImage:           LinuxImage used for creation of all bgblocks.
# MloaderImage:         MloaderImage used for creation of all bgblocks.
# RamDiskImage:         RamDiskImage used for creation of all bgblocks.
#
# You may add extra images which a user can specify from the srun
# command line (see man srun).  When adding these images you may also add
# a Groups= at the end of the image path to specify which groups can
# use the image.
#
# AltBlrtsImage:           Alternative BlrtsImage(s).
# AltLinuxImage:           Alternative LinuxImage(s).
# AltMloaderImage:         Alternative MloaderImage(s).
# AltRamDiskImage:         Alternative RamDiskImage(s).
#
# LayoutMode:           Mode in which slurm will create blocks:
#                       STATIC:  Use defined non-overlapping bgblocks
#                       OVERLAP: Use defined bgblocks, which may overlap
#                       DYNAMIC: Create bgblocks as needed for each job
# BasePartitionNodeCnt: Number of c-nodes per base partition
# NodeCardNodeCnt:      Number of c-nodes per node card.
# Numpsets:             The Numpsets used for creation of all bgblocks
#                       equals this value multiplied by the number of
#                       base partitions in the bgblock.
#
# BridgeAPILogFile:  Pathname of file in which to write the
#                    Bridge API logs.
# BridgeAPIVerbose:  How verbose the BG Bridge API logs should be
#                    0: Log only error and warning messages
#                    1: Log level 0 and information messages
#                    2: Log level 1 and basic debug messages
#                    3: Log level 2 and more debug messages
#                    4: Log all messages
# DenyPassthrough:   Prevents use of passthrough ports in specific
#                    dimensions, X, Y, and/or Z, plus ALL
#
# NOTE: The bgl_serial value is set at configuration time using the
#       "--with-bgl-serial=" option. Its default value is "BGL".
###############################################################################
# These are the default images which are used if the user doesn't specify
# which image they want
BlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
LinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
RamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf

#Only group jette can use these images
AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw2.rts Groups=jette
AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage2.elf Groups=jette
AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader2.rts Groups=jette
AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk2.elf Groups=jette

# Since no groups are specified here any user can use them
AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw3.rts
AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage3.elf
AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader3.rts
AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk3.elf

# Another option for images would be a "You can use anything you like image" *
# This allows the user to use any image entered with no security checking
AltBlrtsImage=* Groups=da,adamb
AltLinuxImage=* Groups=da,adamb
AltMloaderImage=* Groups=da,adamb
AltRamDiskImage=*  Groups=da,adamb

LayoutMode=STATIC
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
NumPsets=64	# An I/O rich environment
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=0

#DenyPassthrough=X,Y,Z

###############################################################################
# Define the static/overlap partitions (bgblocks)
#
# BPs: The base partitions (midplanes) in the bgblock using XYZ coordinates
# Type:  Connection type "MESH" or "TORUS" or "SMALL", default is "TORUS"
#        Type SMALL will divide a midplane into multiple bgblocks
#        based off options NodeCards and Quarters to determine type of
#        small blocks.
#
# IMPORTANT NOTES:
# * Ordering is very important for laying out switch wires.  Please create
#   blocks with smap, and once done don't move the order of blocks
#   created.
# * A bgblock is implicitly created containing all resources on the system
# * Bgblocks must not overlap (except for implicitly created bgblock)
#   This will be the case when smap is used to create a configuration file
# * All Base partitions defined here must also be defined in the slurm.conf file
# * Define only the numeric coordinates of the bgblocks here. The prefix
#   will be based upon the name defined in slurm.conf
###############################################################################
# LEAVE NEXT LINE AS A COMMENT, Full-system bgblock, implicitly created
# BPs=[000x001] Type=TORUS       # 1x1x2 = 2 midplanes
###############################################################################
# volume = 1x1x1 = 1
BPs=[000x000] Type=TORUS                            # 1x1x1 =  1 midplane
BPs=[001x001] Type=SMALL 32CNBlocks=4 128CNBlocks=3 # 1x1x1 = 4-Nodecard sized
                                                    # c-node blocks 3-Base
                                                    # Partition Quarter sized
                                                    # c-node blocks

The above bluegene.conf file defines multiple bgblocks to be created in a single midplane (see the "SMALL" option). Using this mechanism, up to 32 independent jobs, each consisting of 32 c-nodes, can be executed simultaneously on a one-rack BlueGene system. If defining bgblocks of Type=SMALL, the SLURM partition containing them as defined in slurm.conf must have the parameter Shared=force to enable scheduling of multiple jobs on what SLURM considers a single node, as illustrated below. SLURM partitions that do not contain bgblocks of Type=SMALL may have the parameter Shared=no for a slight improvement in scheduler performance. As in all SLURM configuration files, parameters and values are case insensitive.
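For example, a SLURM partition containing the Type=SMALL bgblocks defined above might be configured in slurm.conf along these lines (the partition name is illustrative only):

PartitionName=pdebug Nodes=bg[000x001] Shared=FORCE State=UP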

The valid image names on a BlueGene/P system are CnloadImage, MloaderImage, and IoloadImage. The only image name on BlueGene/Q systems is MloaderImage. Alternate images may be specified as described above for all BlueGene system types.
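As a sketch, the corresponding default image entries in a BlueGene/P bluegene.conf might look like the following (the paths are placeholders, not actual image locations):

CnloadImage=/path/to/default/cnload/image
IoloadImage=/path/to/default/ioload/image
MloaderImage=/path/to/default/mloader/image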

One more thing is required to support SLURM interactions with the DB2 database (at least as of the time this was written). DB2 database access is required by the slurmctld daemon only. All other SLURM daemons and commands interact with DB2 using remote procedure calls, which are processed by slurmctld. DB2 access is dependent upon the environment variable BRIDGE_CONFIG_FILE. Make sure this is set appropriately before initiating the slurmctld daemon. If desired, this environment variable can be set, along with any other needed logic, in the script /etc/sysconfig/slurm, which is automatically executed by /etc/init.d/slurm prior to initiating the SLURM daemons.
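A minimal /etc/sysconfig/slurm sketch, assuming a hypothetical location for the bridge configuration file:

# Provide DB2 access information for the slurmctld daemon
export BRIDGE_CONFIG_FILE=/path/to/bridge.config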

When slurmctld is initially started on an idle system, the bgblocks already defined in MMCS are read using the Bridge APIs. If these bgblocks do not correspond to those defined in the bluegene.conf file, the old bgblocks with a prefix of "RMP" are destroyed and new ones created. When a job is scheduled, the appropriate bgblock is identified, its user set, and it is booted. Node use (virtual or coprocessor) is set from the mpirun command line; SLURM has nothing to do with setting the node use. Subsequent jobs use this same bgblock without rebooting, by changing the associated user field. The only time bgblocks should be freed and rebooted, in normal operation, is when going to or from full-system jobs (two or more bgblocks sharing base partitions can not be in a ready state at the same time). When this logic became available at LLNL, approximately 85 percent of bgblock boots were eliminated and the overhead of job startup went from about 24% to about 6% of total job time. Note that bgblocks will remain in a ready (booted) state when the SLURM daemons are stopped. This permits SLURM daemon restarts without loss of running jobs or rebooting of bgblocks.

Be aware that SLURM will issue multiple bgblock boot requests as needed (e.g. when the boot fails). If the bgblock boot requests repeatedly fail, SLURM will configure the failing base partitions to a DRAINED state so as to avoid continuing repeated reboots and the likely failure of user jobs. A system administrator should address the problem before returning the base partitions to service.

If the slurmctld daemon is cold-started (/etc/init.d/slurm startclean or slurmctld -c) it is recommended that the slurmd daemon(s) be cold-started at the same time. Failure to do so may result in errors being reported by both slurmd and slurmctld due to bgblocks that previously existed being deleted.

A new tool sfree has also been added to help system administrators free a bgblock on request (i.e. "sfree --bgblock=<blockname>"). Run sfree --help for more information.

Resource Reservations

SLURM's advance reservation mechanism can accept a node count specification as input rather than identification of specific nodes/midplanes. In that case, SLURM may reserve nodes/midplanes which may not be formed into an appropriate bgblock. Work is planned for SLURM version 2.4 to remedy this problem. Until that time, identifying the specific nodes/midplanes to be included in an advanced reservation may be necessary.

SLURM's advance reservation mechanism is designed to reserve resources at the level of whole nodes, which on a BlueGene system would represent whole midplanes. In order to support advanced reservations with a finer grained resolution, you can configure one license per c-node on the system and reserve c-nodes instead of entire midplanes. Note that reserved licenses are treated somewhat differently than reserved nodes. When nodes are reserved, jobs using that reservation can use only those nodes. Reserved licenses can only be used by jobs associated with that reservation, but licenses not explicitly reserved are available to any job.

For example, in slurm.conf specify something of this sort: "Licenses=cnode*512". Then create an advanced reservation with a command like this:
"scontrol create reservation licenses="cnode*32" starttime=now duration=30:00 users=joe".
Jobs run in this reservation will then have at least 32 c-nodes available for their use, but could use more given an appropriate workload.

There is also a job_submit/cnode plugin available for use that will automatically set a job's license specification to match its c-node request (i.e. a command like "sbatch -N32 my.sh" would automatically be translated to "sbatch -N32 --licenses=cnode*32 my.sh" by the slurmctld daemon). Enable this plugin in the slurm.conf configuration file with the option "JobSubmitPlugins=cnode".

Debugging

All of the testing and debugging guidance provided in the Quick Start Administrator Guide applies to BlueGene systems. One can start the slurmctld and slurmd daemons in the foreground with extensive debugging to establish basic functionality. Once running in production, the log files configured via SlurmctldLogFile and SlurmdLogFile will provide historical system information. On BlueGene systems, there is also a BridgeAPILogFile defined in bluegene.conf which can be configured to contain detailed information about every Bridge API call issued.
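For example, the daemons can be run in the foreground with verbose logging (-D keeps the daemon in the foreground and each -v increases verbosity):

slurmctld -D -vvvv
slurmd -D -vvvv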

Note that slurmctld log messages of the sort "Nodes bg[000x133] not responding" indicate that the slurmd daemon serving as a front-end to those base partitions is not responding (on non-BlueGene systems, the slurmd daemon actually does run on the compute nodes, so the message is more meaningful there).

Note that you can emulate a BlueGene/L system on a stand-alone Linux system. Run configure with the --enable-bgl-emulation option. This will define "HAVE_BG", "HAVE_BGL", and "HAVE_FRONT_END" in the config.h file. You can also emulate a BlueGene/P system with the --enable-bgp-emulation option, which will define "HAVE_BG", "HAVE_BGP", and "HAVE_FRONT_END" in the config.h file, or a BlueGene/Q system with the --enable-bgq-emulation option, which will define "HAVE_BG", "HAVE_BGQ", and "HAVE_FRONT_END" in the config.h file. Then execute make normally. These variables will build the code as if it were running on an actual BlueGene computer, but avoid making calls to the Bridge library (that is controlled by the variable "HAVE_BG_FILES", which is left undefined). You can use this to test configurations, scheduling logic, etc.
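For example, a hypothetical emulation build on a Linux workstation might look like this (the installation prefix is illustrative only):

./configure --enable-bgq-emulation --prefix=/opt/slurm
make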

Last modified 16 August 2011