SLURM Switch Plugin API
Overview
This document describes SLURM switch (interconnect) plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own SLURM switch plugins. This is version 0 of the API. Note that many of the API functions are used by only one of the daemons. For example, the slurmctld daemon builds a job step's switch credential (switch_p_build_jobinfo), while the slurmd daemon enables and disables that credential for the job step's tasks on a particular node (switch_p_job_init, etc.).
SLURM switch plugins are SLURM plugins that implement the SLURM switch or interconnect API described herein. They must conform to the SLURM Plugin API with the following specifications:
const char plugin_type[]
The major type must be "switch". The minor type can be any recognizable
abbreviation for the type of switch. We recommend, for example:
- none: A plugin that implements the API without providing any actual switch service. This is the case for Ethernet and Myrinet interconnects.
- elan: Quadrics Elan3 or Elan4 interconnect.
- federation: IBM Federation interconnects (presently under development).
The plugin_name and plugin_version symbols required by the SLURM Plugin API require no specialization for switch support. Note carefully, however, the versioning discussion below.
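For illustration, the identification symbols of a no-op style plugin might look like the sketch below. The "switch/<minor>" form of plugin_type follows the major/minor convention of the SLURM Plugin API. The plugin_name text, the version value, and the helper is_switch_plugin_type() are assumptions made for this example and are not part of SLURM.

```c
#include <stdint.h>
#include <string.h>

/* Identification symbols required by the SLURM Plugin API.
 * The name string and the version value are illustrative only. */
const char plugin_name[]      = "example switch plugin (no-op)";
const char plugin_type[]      = "switch/none";
const uint32_t plugin_version = 100;

/* Hypothetical helper: returns 1 if `type` has the major type "switch"
 * followed by a non-empty minor type, and 0 otherwise. */
int is_switch_plugin_type(const char *type)
{
    static const char major[] = "switch/";
    size_t len = sizeof(major) - 1;

    return type != NULL &&
           strncmp(type, major, len) == 0 &&
           type[len] != '\0';
}
```

A loader could use a check of this kind to reject a plugin whose major type does not match the API it is being loaded for.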
The programmer is urged to study src/plugins/switch/switch_elan.c and src/plugins/switch/switch_none.c for sample implementations of a SLURM switch plugin.
Data Objects
The implementation must support two opaque data classes. One is used as a job's switch "credential." This class must encapsulate all job-specific information necessary for the operation of the API specification below. The second is a node's switch state record. Both data classes are referred to in SLURM code using an anonymous pointer (void *).
The implementation must maintain (though not necessarily directly export) an enumerated errno so that SLURM can discover, as far as is practical, the reason for any failed API call. Plugin-specific enumerated integer values should be used when appropriate. It is desirable that these values be mapped into the range from ESLURM_SWITCH_MIN to ESLURM_SWITCH_MAX as defined in slurm/slurm_errno.h. The error number should be returned by the function switch_p_get_errno(), and this error number can be converted to an appropriate string description using the switch_p_strerror() function described below.
These values must not be used as return values in integer-valued functions in the API. The proper error return value from integer-valued functions is SLURM_ERROR. The implementation should endeavor to provide useful and pertinent information by whatever means is practical. In some cases this means an errno for each credential, since plugins must be re-entrant. If a plugin maintains a global errno in place of or in addition to a per-credential errno, it is not required to enforce mutual exclusion on it. Successful API calls are not required to reset any errno to a known value. However, the initial value of any errno, prior to any error condition arising, should be SLURM_SUCCESS.
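The errno convention described above can be sketched as follows. The numeric values given to ESLURM_SWITCH_MIN and ESLURM_SWITCH_MAX are stand-ins so that the example compiles on its own (a real plugin takes them from slurm/slurm_errno.h), and the EX_* codes and ex_reserve_window() are hypothetical names invented for this sketch.

```c
#include <stddef.h>

/* Stand-ins so the sketch is self-contained; the range values are
 * assumed, the real ones come from slurm/slurm_errno.h. */
#define SLURM_SUCCESS       0
#define SLURM_ERROR        (-1)
#define ESLURM_SWITCH_MIN   3000
#define ESLURM_SWITCH_MAX   3099

/* Hypothetical plugin-specific error codes inside the switch range. */
enum {
    EX_BAD_MAGIC = ESLURM_SWITCH_MIN,
    EX_NO_WINDOWS,
    EX_STATE_FILE
};

/* Global errno; its initial value is SLURM_SUCCESS. */
static int ex_errno = SLURM_SUCCESS;

int switch_p_get_errno(void)
{
    return ex_errno;
}

char *switch_p_strerror(int errnum)
{
    switch (errnum) {
    case EX_BAD_MAGIC:  return "switch record has a bad magic number";
    case EX_NO_WINDOWS: return "no free switch windows";
    case EX_STATE_FILE: return "cannot access switch state file";
    default:            return NULL;  /* not an error of this plugin */
    }
}

/* Hypothetical API-style function: on failure it returns SLURM_ERROR
 * and records the reason in the errno instead of returning the code. */
int ex_reserve_window(int windows_free)
{
    if (windows_free <= 0) {
        ex_errno = EX_NO_WINDOWS;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;  /* success does not reset the errno */
}
```

A caller that sees SLURM_ERROR would then call switch_p_get_errno() and pass the result to switch_p_strerror() to obtain a message suitable for logging.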
API Functions
The following functions must appear. Functions which are not implemented should be stubbed.
Global Switch State Functions
int switch_p_libstate_save (char *dir_name);
Description: Save any global switch state to a file within the specified directory. The actual file name used is plugin specific. It is recommended that the global switch state contain a magic number for validation purposes. This function is called by the slurmctld daemon on shutdown. Note that if the slurmctld daemon fails, this function will not be called. The plugin may save state independently and/or make use of the switch_p_job_step_allocated function to restore state.
Arguments: dir_name (input) fully-qualified pathname of a directory into which user SlurmUser (as defined in slurm.conf) can create a file and write state information into that file. Cannot be NULL.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_libstate_restore(char *dir_name, bool recover);
Description: Restore any global switch state from a file within the specified directory. The actual file name used is plugin specific. It is recommended that any magic number associated with the global switch state be verified. This function is called by the slurmctld daemon on startup.
Arguments:
dir_name (input) fully-qualified pathname of a directory containing a state information file
from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.
recover (input) true if restarting with state preserved, false if no state recovery
is to be performed.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
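A minimal sketch of the save and restore pair, including the recommended magic number check, is shown below. The state structure, the file name, and the magic value are all hypothetical. For brevity the sketch writes the structure in host byte order and does not record an errno on failure, both of which a real plugin would likely handle differently.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS   0      /* stand-ins for the SLURM definitions */
#define SLURM_ERROR    (-1)

#define EX_STATE_MAGIC  0x53574954u             /* arbitrary value   */
#define EX_STATE_FILE   "example_switch_state"  /* hypothetical name */

/* Hypothetical global switch state. */
struct ex_libstate {
    uint32_t magic;
    uint32_t windows_in_use;
};

static struct ex_libstate ex_state = { EX_STATE_MAGIC, 0 };

/* Build "<dir_name>/<file>" and fail if it does not fit. */
static int ex_state_path(char *path, size_t size, const char *dir_name)
{
    int n = snprintf(path, size, "%s/%s", dir_name, EX_STATE_FILE);

    return (n < 0 || (size_t)n >= size) ? SLURM_ERROR : SLURM_SUCCESS;
}

int switch_p_libstate_save(char *dir_name)
{
    char path[4096];
    FILE *fp;
    size_t written;

    if (ex_state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "wb")) == NULL)
        return SLURM_ERROR;
    written = fwrite(&ex_state, sizeof(ex_state), 1, fp);
    if (fclose(fp) != 0 || written != 1)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}

int switch_p_libstate_restore(char *dir_name, bool recover)
{
    char path[4096];
    struct ex_libstate tmp;
    FILE *fp;
    size_t nread;

    if (!recover) {                 /* start clean, ignore saved file */
        ex_state.windows_in_use = 0;
        return SLURM_SUCCESS;
    }
    if (ex_state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "rb")) == NULL)
        return SLURM_ERROR;
    nread = fread(&tmp, sizeof(tmp), 1, fp);
    fclose(fp);
    if (nread != 1 || tmp.magic != EX_STATE_MAGIC)
        return SLURM_ERROR;         /* truncated or foreign file */
    ex_state = tmp;
    return SLURM_SUCCESS;
}
```

Reading into a temporary structure and copying it only after the magic number has been verified keeps a corrupt file from overwriting the in-memory state.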
int switch_p_libstate_clear (void);
Description: Clear switch state information.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
Node's Switch State Monitoring Functions
Nodes will register with current switch state information when the slurmd daemon is initiated. The slurmctld daemon will also request that slurmd supply current switch state information on a periodic basis.
int switch_p_clear_node_state (void);
Description: Initialize node state. If any switch state has previously been established for a job, it will be cleared. This will be used to establish a "clean" state for the switch on the node upon which it is executed.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_alloc_node_info(switch_node_info_t *switch_node);
Description: Allocate storage for a node's switch state record. It is recommended that the record contain a magic number for validation purposes.
Arguments: switch_node (output) location in which to store a pointer to the newly allocated node switch state record.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_build_node_info(switch_node_info_t switch_node);
Description: Fill in a previously allocated switch state record for the node on which this function is executed. It is recommended that the magic number be validated.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_pack_node_info (switch_node_info_t switch_node, Buf buffer);
Description: Pack the data associated with a node's switch state into a buffer for network transmission.
Arguments:
switch_node (input) an existing
node's switch state record.
buffer (input/output) buffer onto
which the switch state information is appended.
Returns: The number of bytes written should be returned if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_unpack_node_info (switch_node_info_t switch_node, Buf buffer);
Description: Unpack the data associated with a node's switch state record from a buffer.
Arguments:
switch_node (input/output) a
previously allocated node switch state record to be filled in with data read from
the buffer.
buffer (input/output) buffer from
which the record's contents are read.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
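The pack and unpack pair can be sketched as below. SLURM's Buf type and its pack routines are not reproduced here; struct ex_buf, ex_pack32() and ex_unpack32() are simplified stand-ins written for this example, and the fields of the node record are hypothetical. The functions carry an ex_ prefix because their argument types differ from the real API.

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS  0       /* stand-ins for the SLURM definitions */
#define SLURM_ERROR   (-1)
#define EX_NODE_MAGIC  0x4e4f4445u   /* arbitrary value */

/* Minimal stand-in for SLURM's Buf type (assumption for this sketch). */
struct ex_buf {
    unsigned char data[64];
    size_t used;               /* bytes appended so far  */
    size_t pos;                /* read offset for unpack */
};

/* Hypothetical node switch state record. */
struct ex_node_info {
    uint32_t magic;
    uint32_t adapter_count;
    uint32_t windows_free;
};

/* Append a 32-bit value in network (big-endian) byte order. */
static int ex_pack32(struct ex_buf *b, uint32_t v)
{
    if (sizeof(b->data) - b->used < 4)
        return SLURM_ERROR;
    b->data[b->used++] = (unsigned char)(v >> 24);
    b->data[b->used++] = (unsigned char)(v >> 16);
    b->data[b->used++] = (unsigned char)(v >> 8);
    b->data[b->used++] = (unsigned char)v;
    return SLURM_SUCCESS;
}

static int ex_unpack32(struct ex_buf *b, uint32_t *v)
{
    if (b->used - b->pos < 4)
        return SLURM_ERROR;
    *v = ((uint32_t)b->data[b->pos]     << 24) |
         ((uint32_t)b->data[b->pos + 1] << 16) |
         ((uint32_t)b->data[b->pos + 2] << 8)  |
          (uint32_t)b->data[b->pos + 3];
    b->pos += 4;
    return SLURM_SUCCESS;
}

/* Returns the number of bytes appended, or SLURM_ERROR. */
int ex_pack_node_info(const struct ex_node_info *n, struct ex_buf *b)
{
    size_t start = b->used;

    if (n == NULL || n->magic != EX_NODE_MAGIC)
        return SLURM_ERROR;
    if (ex_pack32(b, n->adapter_count) != SLURM_SUCCESS ||
        ex_pack32(b, n->windows_free) != SLURM_SUCCESS) {
        b->used = start;       /* leave the buffer as it was */
        return SLURM_ERROR;
    }
    return (int)(b->used - start);
}

/* Fills a previously allocated record from the buffer. */
int ex_unpack_node_info(struct ex_node_info *n, struct ex_buf *b)
{
    if (n == NULL || n->magic != EX_NODE_MAGIC)
        return SLURM_ERROR;
    if (ex_unpack32(b, &n->adapter_count) != SLURM_SUCCESS ||
        ex_unpack32(b, &n->windows_free) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}
```

The magic number is validated on both sides but is not itself transmitted in this sketch; whether to include it in the wire format is a design choice left to the plugin.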
void switch_p_free_node_info (switch_node_info_t switch_node);
Description: Release the storage associated with a node's switch state record.
Arguments: switch_node (input/output) a previously allocated node switch state record.
Returns: None
char * switch_p_sprintf_node_info (switch_node_info_t switch_node, char *buf, size_t size);
Description: Print the contents of a node's switch state record to a buffer.
Arguments:
switch_node (input) a
node's switch state record.
buf (input/output) pointer to
buffer into which the switch state record is to be written.
size (input) size
of buf in bytes.
Returns: Location of buffer, same as buf.
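As a sketch of the buffer-printing convention, the function below formats a hypothetical node record with snprintf(), so output is truncated rather than overrunning buf, and returns buf itself. The record fields and the output format are assumptions made for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical node switch state record. */
struct ex_node_info {
    uint32_t magic;
    uint32_t adapter_count;
    uint32_t windows_free;
};

/* Writes a one-line summary into buf and returns buf. */
char *ex_sprintf_node_info(const struct ex_node_info *n,
                           char *buf, size_t size)
{
    if (buf == NULL || size == 0)
        return buf;
    if (n == NULL)
        snprintf(buf, size, "(null)");
    else
        snprintf(buf, size, "adapters=%u windows_free=%u",
                 (unsigned)n->adapter_count,
                 (unsigned)n->windows_free);
    return buf;
}
```

Returning buf lets the call be used directly as an argument to a logging function.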
Job's Switch Credential Management Functions
int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job);
Description: Allocate storage for a job's switch credential. It is recommended that the credential contain a magic number for validation purposes.
Arguments: switch_job (output) location in which to store a pointer to the newly allocated job switch credential.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_build_jobinfo (switch_jobinfo_t switch_job, char *nodelist, int *tasks_per_node, int cyclic_alloc, char *network);
Description: Build a job's switch credential. It is recommended that the credential's magic number be validated.
Arguments:
switch_job (input/output) Job's
switch credential to be updated
nodelist (input) List of nodes
allocated to the job. This may contain expressions to specify node ranges (e.g.
"linux[1-20]" or "linux[2,4,6,8]").
tasks_per_node (input) List
of processes per node to be initiated as part of the job.
cyclic_alloc (input) Non-zero
if job's processes are to be allocated across nodes in a cyclic fashion (task 0 on node 0,
task 1 on node 1, etc). If zero, processes are allocated sequentially on a node before
moving to the next node (tasks 0 and 1 on node 0, tasks 2 and 3 on node 1, etc.).
network (input) Job's network
specification from srun command.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
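The effect of cyclic_alloc on task placement can be made concrete with a small sketch. The function ex_map_tasks() below is hypothetical and not part of the API. It computes, for each task id, the index of the node on which the task runs, given tasks_per_node. How cyclic placement treats nodes with unequal task counts is an assumption here: a node is skipped once its count is exhausted.

```c
#include <stddef.h>

/* Fill node_of_task[t] with the node index for task t.
 * Returns the number of tasks placed, or -1 on bad input. */
int ex_map_tasks(const int *tasks_per_node, int nnodes, int cyclic_alloc,
                 int *node_of_task, int max_tasks)
{
    int total = 0, placed = 0, node, round;

    if (tasks_per_node == NULL || node_of_task == NULL || nnodes <= 0)
        return -1;
    for (node = 0; node < nnodes; node++) {
        if (tasks_per_node[node] < 0)
            return -1;
        total += tasks_per_node[node];
    }
    if (total > max_tasks)
        return -1;

    if (!cyclic_alloc) {
        /* Block: fill each node before moving to the next one. */
        for (node = 0; node < nnodes; node++)
            for (round = 0; round < tasks_per_node[node]; round++)
                node_of_task[placed++] = node;
        return placed;
    }

    /* Cyclic: one task per node per round, skipping full nodes. */
    for (round = 0; placed < total; round++)
        for (node = 0; node < nnodes; node++)
            if (round < tasks_per_node[node])
                node_of_task[placed++] = node;
    return placed;
}
```

With two nodes and two tasks per node, the cyclic mapping places tasks 0 and 2 on node 0 and tasks 1 and 3 on node 1, while the block mapping places tasks 0 and 1 on node 0 and tasks 2 and 3 on node 1, matching the description of cyclic_alloc above.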
switch_jobinfo_t switch_p_copy_jobinfo (switch_jobinfo_t switch_job);
Description: Allocate storage for a job's switch credential and copy an existing credential to that location.
Arguments: switch_job (input) an existing job switch credential.
Returns: A newly allocated job switch credential containing a copy of the function argument.
void switch_p_free_jobinfo (switch_jobinfo_t switch_job);
Description: Release the storage associated with a job's switch credential.
Arguments: switch_job (input) an existing job switch credential.
Returns: None
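The allocate, copy, and free functions for a credential might be implemented along the following lines. The credential layout, the magic value, and the ex_ names are hypothetical. The sketch uses an anonymous pointer type in place of switch_jobinfo_t and shows a deep copy, so that the original and the copy can be freed independently.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS  0       /* stand-ins for the SLURM definitions */
#define SLURM_ERROR   (-1)
#define EX_JOB_MAGIC   0x4a4f4253u   /* arbitrary value */

typedef void *ex_jobinfo_t;    /* anonymous pointer, as seen by SLURM */

/* Hypothetical credential layout, private to the plugin. */
struct ex_jobinfo {
    uint32_t magic;
    uint32_t window_id;
    char *nodelist;            /* owned copy, may be NULL */
};

/* Stores a pointer to a new, zeroed credential in *switch_job. */
int ex_alloc_jobinfo(ex_jobinfo_t *switch_job)
{
    struct ex_jobinfo *j;

    if (switch_job == NULL)
        return SLURM_ERROR;
    j = calloc(1, sizeof(*j));
    if (j == NULL)
        return SLURM_ERROR;
    j->magic = EX_JOB_MAGIC;
    *switch_job = j;
    return SLURM_SUCCESS;
}

/* Returns a deep copy, or NULL on failure. */
ex_jobinfo_t ex_copy_jobinfo(ex_jobinfo_t switch_job)
{
    const struct ex_jobinfo *src = switch_job;
    struct ex_jobinfo *dst;

    if (src == NULL || src->magic != EX_JOB_MAGIC)
        return NULL;
    dst = malloc(sizeof(*dst));
    if (dst == NULL)
        return NULL;
    *dst = *src;
    if (src->nodelist != NULL) {
        size_t len = strlen(src->nodelist) + 1;

        dst->nodelist = malloc(len);
        if (dst->nodelist == NULL) {
            free(dst);
            return NULL;
        }
        memcpy(dst->nodelist, src->nodelist, len);
    }
    return dst;
}

void ex_free_jobinfo(ex_jobinfo_t switch_job)
{
    struct ex_jobinfo *j = switch_job;

    if (j == NULL || j->magic != EX_JOB_MAGIC)
        return;
    j->magic = 0;              /* helps catch use after free */
    free(j->nodelist);
    free(j);
}
```

Clearing the magic number before releasing the storage means a later call with the stale pointer is more likely to be rejected by the validation check than to corrupt memory silently.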
int switch_p_pack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);
Description: Pack the data associated with a job's switch credential into a buffer for network transmission.
Arguments:
switch_job (input) an existing job
switch credential.
buffer (input/output) buffer onto
which the credential's contents are appended.
Returns: The number of bytes written should be returned if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_unpack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);
Description: Unpack the data associated with a job's switch credential from a buffer.
Arguments:
switch_job (input/output) a previously
allocated job switch credential to be filled in with data read from the buffer.
buffer (input/output) buffer from
which the credential's contents are read.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_get_jobinfo (switch_jobinfo_t switch_job, int data_type, void *data);
Description: Get some specific data from a job's switch credential.
Arguments:
switch_job (input) a job's switch credential.
data_type (input) identification
as to the type of data requested. The interpretation of this value is plugin dependent.
data (output) filled in with the desired
data. The form of this data is dependent upon the value of data_type and the plugin.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
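Because the meaning of data_type is plugin dependent, each plugin defines its own keys and the type of value stored through data. The sketch below assumes two hypothetical keys and a hypothetical credential layout. The caller must know that both keys yield a uint32_t.

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS  0       /* stand-ins for the SLURM definitions */
#define SLURM_ERROR   (-1)
#define EX_JOB_MAGIC   0x4a4f4253u   /* arbitrary value */

/* Hypothetical keys understood by this plugin only. */
enum ex_data_type {
    EX_DATA_WINDOW_ID  = 1,    /* data points to a uint32_t */
    EX_DATA_NODE_COUNT = 2     /* data points to a uint32_t */
};

/* Hypothetical credential layout. */
struct ex_jobinfo {
    uint32_t magic;
    uint32_t window_id;
    uint32_t node_count;
};

int ex_get_jobinfo(void *switch_job, int data_type, void *data)
{
    const struct ex_jobinfo *j = switch_job;

    if (j == NULL || j->magic != EX_JOB_MAGIC || data == NULL)
        return SLURM_ERROR;

    switch (data_type) {
    case EX_DATA_WINDOW_ID:
        *(uint32_t *)data = j->window_id;
        return SLURM_SUCCESS;
    case EX_DATA_NODE_COUNT:
        *(uint32_t *)data = j->node_count;
        return SLURM_SUCCESS;
    default:
        return SLURM_ERROR;    /* unknown key: leave data untouched */
    }
}
```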
int switch_p_job_step_complete (switch_jobinfo_t switch_job, char *nodelist);
Description: Note that the job step associated with the specified nodelist has completed execution.
Arguments:
switch_job (input)
The completed job's switch credential.
nodelist (input) A list of nodes
on which the job has completed. This may contain expressions to specify node ranges
(e.g. "linux[1-20]" or "linux[2,4,6,8]").
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_job_step_part_comp (switch_jobinfo_t switch_job, char *nodelist);
Description: Note that the job step has completed execution on the specified node list. The job step is not necessarily completed on all nodes, but switch resources associated with it on the specified nodes are no longer in use.
Arguments:
switch_job (input)
The completed job's switch credential.
nodelist (input) A list of nodes
on which the job step has completed. This may contain expressions to specify node ranges
(e.g. "linux[1-20]" or "linux[2,4,6,8]").
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
bool switch_p_part_comp (void);
Description: Indicate if the switch plugin should process partial job step completions (i.e. switch_g_job_step_part_comp). Support of partial completions is compute intensive, so it should be avoided unless switch resources are in short supply (e.g. switch/federation).
Returns: True if partial job step completions are to be recorded. False if only full job step completions are to be noted.
void switch_p_print_jobinfo(FILE *fp, switch_jobinfo_t switch_job);
Description: Print the contents of a job's switch credential to a file.
Arguments:
fp (input) pointer to an open file.
switch_job (input) a job's
switch credential.
Returns: None.
char *switch_p_sprint_jobinfo(switch_jobinfo_t switch_job, char *buf, size_t size);
Description: Print the contents of a job's switch credential to a buffer.
Arguments:
switch_job (input) a job's
switch credential.
buf (input/output) pointer to
buffer into which the job credential information is to be written.
size (input) size of buf in
bytes
Returns: location of buffer, same as buf.
int switch_p_get_data_jobinfo(switch_jobinfo_t switch_job, int key, void *resulting_data);
Description: Get data from a job's switch credential.
Arguments:
switch_job (input) a job's
switch credential.
key (input) identification
of the type of data to be retrieved from the switch credential. NOTE: The
interpretation of this key is dependent upon the switch type.
resulting_data (input/output)
pointer to where the requested data should be stored.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
Node Specific Switch Management Functions
int switch_p_node_init (void);
Description: This function is run from the top level slurmd only once per slurmd run. It may be used, for instance, to perform some one-time interconnect setup or spawn an error handling thread.
Arguments: None
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_node_fini (void);
Description: This function is called once as slurmd exits (slurmd will wait for this function to return before continuing the exit process).
Arguments: None
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
Job Management Functions
=========================================================================
Process 1 (root)        Process 2 (root, user)  |  Process 3 (user task)
                                                |
switch_p_job_preinit                            |
fork ------------------ switch_p_job_init       |
waitpid                 setuid, chdir, etc.     |
                        fork N procs -----------+--- switch_p_job_attach
                        wait all                |    exec mpi process
                        switch_p_job_fini*      |
switch_p_job_postfini                           |
=========================================================================
int switch_p_job_preinit (switch_jobinfo_t switch_job);
Description: Preinit is run as root in the first slurmd process, the so called job manager. This function can be used to perform any initialization that needs to be performed in the same process as switch_p_job_fini().
Arguments: switch_job (input) a job's switch credential.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_job_init (switch_jobinfo_t switch_job, uid_t uid);
Description: Initialize interconnect on node for a job. This function is run from the second slurmd process (some interconnect implementations may require switch_p_job_init() to be executed from a different process than the one executing switch_p_job_postfini() [e.g. Quadrics Elan]).
Arguments:
switch_job (input) a job's
switch credential.
uid (input) the user id
under which the job is to execute.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_job_attach ( switch_jobinfo_t switch_job, char ***env, uint32_t nodeid, uint32_t procid, uint32_t nnodes, uint32_t nprocs, uint32_t rank );
Description: Attach process to interconnect (Called from within the process, so it is appropriate to set interconnect specific environment variables here).
Arguments:
switch_job (input) a job's
switch credential.
env (input/output) the
environment variables to be set upon job initiation. Switch specific environment
variables are added as needed.
nodeid (input) zero-origin
id of this node.
procid (input) zero-origin
process id local to slurmd and not equivalent to the global task id or MPI rank.
nnodes (input) count of
nodes allocated to this job.
nprocs (input) total count of
processes or tasks to be initiated for this job.
rank (input) zero-origin
id of this task.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
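Since switch_p_job_attach runs inside the task's own process, a plugin typically uses it to add variables to the environment array before the user program is started. The sketch below shows one way to append entries through the char ***env argument. The helper ex_env_append(), the EX_SWITCH_* variable names, and the assumption that *env is either NULL or a heap-allocated, NULL-terminated array are all particular to this example. The helper does not replace a variable that is already present.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS  0       /* stand-ins for the SLURM definitions */
#define SLURM_ERROR   (-1)

/* Hypothetical helper: append "name=value" to a NULL-terminated,
 * heap-allocated environment array, growing it by one slot. */
static int ex_env_append(char ***env, const char *name, const char *value)
{
    size_t count = 0, len;
    char **grown, *entry;

    if (env == NULL || name == NULL || value == NULL)
        return SLURM_ERROR;
    if (*env != NULL)
        while ((*env)[count] != NULL)
            count++;

    len = strlen(name) + strlen(value) + 2;     /* '=' and NUL */
    entry = malloc(len);
    if (entry == NULL)
        return SLURM_ERROR;
    snprintf(entry, len, "%s=%s", name, value);

    grown = realloc(*env, (count + 2) * sizeof(char *));
    if (grown == NULL) {
        free(entry);
        return SLURM_ERROR;
    }
    grown[count] = entry;
    grown[count + 1] = NULL;
    *env = grown;
    return SLURM_SUCCESS;
}

/* Attach sketch: exports rank, task count, and node id to the task. */
int ex_job_attach(void *switch_job, char ***env, uint32_t nodeid,
                  uint32_t procid, uint32_t nnodes, uint32_t nprocs,
                  uint32_t rank)
{
    char value[16];

    (void)switch_job; (void)procid; (void)nnodes;  /* unused here */

    snprintf(value, sizeof(value), "%u", (unsigned)rank);
    if (ex_env_append(env, "EX_SWITCH_RANK", value) != SLURM_SUCCESS)
        return SLURM_ERROR;
    snprintf(value, sizeof(value), "%u", (unsigned)nprocs);
    if (ex_env_append(env, "EX_SWITCH_NPROCS", value) != SLURM_SUCCESS)
        return SLURM_ERROR;
    snprintf(value, sizeof(value), "%u", (unsigned)nodeid);
    if (ex_env_append(env, "EX_SWITCH_NODEID", value) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}
```

The extra level of indirection in char ***env exists so that the function can replace the caller's array pointer when the array has to grow.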
int switch_p_job_fini (switch_jobinfo_t switch_job);
Description: This function is run from the same process as switch_p_job_init() after all job tasks have exited. It is *not* run as root, because the process in question has already setuid to the job owner.
Arguments: switch_job (input) a job's switch credential.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int switch_p_job_postfini ( switch_jobinfo_t switch_job, uid_t pgid, uint32_t job_id, uint32_t step_id );
Description: This function is run from the initial slurmd process (same process as switch_p_job_preinit()), and is run as root. Any cleanup routines that need to be run with root privileges should be run from this function.
Arguments:
switch_job (input) a job's
switch credential.
pgid (input) The process
group id associated with this task.
job_id (input) the
associated SLURM job id.
step_id (input) the
associated SLURM job step id.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
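The order in which the three kinds of process invoke the job management functions can be sketched with fork() and waitpid(). The ex_ functions below are empty stubs standing in for the plugin calls, and the control flow is a simplified illustration rather than slurmd's actual code. In particular, the setuid and chdir steps are only noted in a comment, and the task exits instead of calling exec.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLURM_SUCCESS  0       /* stand-ins for the SLURM definitions */
#define SLURM_ERROR   (-1)

/* Stubs standing in for the plugin functions. */
static int ex_job_preinit(void)   { return SLURM_SUCCESS; }
static int ex_job_init(void)      { return SLURM_SUCCESS; }
static int ex_job_attach(int t)   { (void)t; return SLURM_SUCCESS; }
static int ex_job_fini(void)      { return SLURM_SUCCESS; }
static int ex_job_postfini(void)  { return SLURM_SUCCESS; }

/* Process 2: init, fork the tasks, wait for them, fini. Never returns. */
static void ex_process2(int ntasks)
{
    int failed = 0, started = 0, status, task;

    if (ex_job_init() != SLURM_SUCCESS)
        _exit(1);
    /* A real slurmd would setuid() to the job owner and chdir() here. */

    for (task = 0; task < ntasks; task++) {
        pid_t pid = fork();

        if (pid == 0) {
            /* Process 3: attach, then become the user task. */
            if (ex_job_attach(task) != SLURM_SUCCESS)
                _exit(1);
            _exit(0);          /* a real task would exec() here */
        }
        if (pid < 0)
            failed = 1;
        else
            started++;
    }
    while (started-- > 0) {
        if (wait(&status) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    if (ex_job_fini() != SLURM_SUCCESS)
        failed = 1;
    _exit(failed);
}

/* Process 1: preinit, fork process 2, wait for it, postfini. */
int ex_run_job(int ntasks)
{
    int status;
    pid_t pid;

    if (ex_job_preinit() != SLURM_SUCCESS)
        return SLURM_ERROR;
    pid = fork();
    if (pid < 0)
        return SLURM_ERROR;
    if (pid == 0)
        ex_process2(ntasks);   /* does not return */

    if (waitpid(pid, &status, 0) < 0)
        return SLURM_ERROR;
    if (ex_job_postfini() != SLURM_SUCCESS)
        return SLURM_ERROR;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0)
           ? SLURM_SUCCESS : SLURM_ERROR;
}
```

In this sketch the postfini stub runs in process 1 even when process 2 reports a failure, so that cleanup requiring root privileges is not skipped.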
int switch_p_job_step_allocated (switch_jobinfo_t switch_job, char *nodelist);
Description: Note that the identified job step is active at restart time. This function can be used to restore global switch state information based upon job steps known to be active at restart time. Use of this function is preferred over switch state saved and restored by the switch plugin. Direct use of job step switch information eliminates the possibility of inconsistent state information between the switch and job steps.
Arguments:
switch_job (input) a job's
switch credential.
nodelist (input) the nodes
allocated to a job step.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
Error Handling Functions
int switch_p_get_errno (void);
Description: Return the number of a switch specific error.
Arguments: None
Returns: Error number for the last failure encountered by the switch plugin.
char *switch_p_strerror(int errnum);
Description: Return a string description of a switch specific error code.
Arguments: errnum (input) a switch specific error code.
Returns: Pointer to string describing the error or NULL if no description found in this plugin.
Versioning
This document describes version 0 of the SLURM Switch API. Future releases of SLURM may revise this API. A switch plugin conveys its ability to implement a particular API version using the mechanism outlined for SLURM plugins. In addition, the credential is transmitted along with the version number of the plugin that transmitted it. It is at the discretion of the plugin author whether to maintain data format compatibility across different versions of the plugin.
Last modified 5 September 2008