SLURM Job Checkpoint Plugin Programmer Guide
Overview
This document describes SLURM job checkpoint plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own SLURM job checkpoint plugins. This is version 0 of the API.
SLURM job checkpoint plugins are SLURM plugins that implement the SLURM API for checkpointing and restarting jobs. The plugins must conform to the SLURM Plugin API with the following specifications:
const char plugin_type[]
The major type must be "checkpoint". The minor type can be any recognizable
abbreviation for the type of checkpoint mechanism.
We recommend, for example:
- aix: AIX system checkpoint.
- blcr: Berkeley Lab Checkpoint/Restart (BLCR).
- none: No job checkpoint.
- ompi: OpenMPI checkpoint (requires OpenMPI version 1.3 or higher).
The plugin_name and plugin_version symbols required by the SLURM Plugin API require no specialization for job checkpoint support. Note carefully, however, the versioning discussion below.
The programmer is urged to study src/plugins/checkpoint/checkpoint_aix.c for a sample implementation of a SLURM job checkpoint plugin.
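As a minimal sketch of the identification symbols described above, a plugin source file might begin as follows. The minor type "example", the descriptive name, and the version value are placeholders chosen for illustration only, and the helper has_checkpoint_major_type is hypothetical; it is included just to show how the major part of plugin_type can be checked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Identification symbols required by the SLURM Plugin API.
 * All three values are illustrative placeholders. */
const char plugin_name[] = "Checkpoint EXAMPLE plugin";
const char plugin_type[] = "checkpoint/example";  /* major/minor */
const uint32_t plugin_version = 100;  /* placeholder; use the value your release expects */

/* Hypothetical helper: returns 1 if type begins with "checkpoint/". */
int has_checkpoint_major_type(const char *type)
{
    static const char major[] = "checkpoint/";

    assert(type != NULL);
    return strncmp(type, major, strlen(major)) == 0;
}
```

A plugin whose type string fails this check would not be recognized as a job checkpoint plugin, whatever its minor type.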
Data Objects
The implementation must maintain (though not necessarily directly export) an enumerated errno so that SLURM can discover, as far as is practical, the reason for any failed API call. Plugin-specific enumerated integer values may be used when appropriate.
These values must not be used as return values in integer-valued functions in the API. The proper error return value from integer-valued functions is SLURM_ERROR. The implementation should endeavor to provide useful and pertinent information by whatever means is practical. Successful API calls are not required to reset any errno to a known value. However, the initial value of any errno, prior to any error condition arising, should be SLURM_SUCCESS.
There is also a checkpoint-specific error code and message that may be associated with each job step.
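The errno rules above can be sketched as follows. This is one possible arrangement, not the layout used by any shipped plugin: the enumeration values, the accessor example_ckpt_get_errno, and the function example_ckpt_check_dir are all hypothetical names, and SLURM_SUCCESS and SLURM_ERROR are defined locally as stand-ins for the values from the SLURM headers.

```c
#include <assert.h>
#include <stddef.h>

#define SLURM_SUCCESS 0     /* stand-in for the SLURM header value */
#define SLURM_ERROR  (-1)   /* stand-in for the SLURM header value */

/* Hypothetical plugin-specific error values. */
enum example_ckpt_errno {
    EXAMPLE_CKPT_NO_IMAGE_DIR = 1000,
    EXAMPLE_CKPT_NOT_READY
};

/* Plugin-private errno; SLURM_SUCCESS until the first error occurs. */
static int plugin_errno = SLURM_SUCCESS;

static void set_plugin_errno(int code)
{
    /* Plugin-specific values stay distinct from the API return value. */
    assert(code != SLURM_ERROR);
    plugin_errno = code;
}

/* Hypothetical accessor for the most recently recorded error. */
int example_ckpt_get_errno(void)
{
    return plugin_errno;
}

/* Hypothetical integer-valued function: records the specific reason
 * in the errno but returns only SLURM_SUCCESS or SLURM_ERROR. */
int example_ckpt_check_dir(const char *image_dir)
{
    if (image_dir == NULL || image_dir[0] == '\0') {
        set_plugin_errno(EXAMPLE_CKPT_NO_IMAGE_DIR);
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;   /* a successful call leaves the errno untouched */
}
```

Note that a successful call after a failure still leaves the earlier error value in place, which the text above explicitly permits.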
API Functions
The following functions must appear. Functions which are not implemented should be stubbed.
int slurm_ckpt_alloc_job (check_jobinfo_t *jobinfo);
Description: Allocate storage for job-step specific checkpoint data.
Argument: jobinfo (output) returns pointer to the allocated storage.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int slurm_ckpt_free_job (check_jobinfo_t jobinfo);
Description: Release storage for job-step specific checkpoint data that was previously allocated by slurm_ckpt_alloc_job.
Argument: jobinfo (input) pointer to the previously allocated storage.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
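The allocate/free pair above might be sketched like this. The record layout struct check_job_info is hypothetical (each plugin defines its own private data), and check_jobinfo_t is declared locally as a stand-in for the opaque handle type supplied by the SLURM headers; a real plugin would also use SLURM's own allocation routines rather than calloc and free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define SLURM_SUCCESS 0     /* stand-in for the SLURM header value */
#define SLURM_ERROR  (-1)   /* stand-in for the SLURM header value */

/* Hypothetical per-step checkpoint record. */
struct check_job_info {
    uint16_t disabled;    /* non-zero while checkpoints are disabled */
    time_t   time_stamp;  /* start time of the last operation */
    uint32_t error_code;  /* checkpoint-specific error code */
    char    *error_msg;   /* checkpoint-specific error message */
};

/* Stand-in for the opaque handle type from the SLURM headers. */
typedef struct check_job_info *check_jobinfo_t;

int slurm_ckpt_alloc_job(check_jobinfo_t *jobinfo)
{
    assert(jobinfo != NULL);
    /* calloc zero-fills, so every field starts in a known state. */
    *jobinfo = calloc(1, sizeof(struct check_job_info));
    return (*jobinfo != NULL) ? SLURM_SUCCESS : SLURM_ERROR;
}

int slurm_ckpt_free_job(check_jobinfo_t jobinfo)
{
    if (jobinfo != NULL) {
        free(jobinfo->error_msg);   /* free(NULL) is a no-op */
        free(jobinfo);
    }
    return SLURM_SUCCESS;
}
```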
int slurm_ckpt_pack_job (check_jobinfo_t jobinfo, Buf buffer);
Description: Store job-step specific checkpoint data into a buffer.
Arguments:
jobinfo (input) pointer to the previously allocated storage.
buffer (input/output) buffer to which jobinfo has been appended.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int slurm_ckpt_unpack_job (check_jobinfo_t jobinfo, Buf buffer);
Description: Retrieve job-step specific checkpoint data from a buffer.
Arguments:
jobinfo (output) pointer to the previously allocated storage.
buffer (input/output) buffer from which jobinfo has been removed.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
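The pack/unpack pair above can be illustrated with a toy buffer. Everything here except the two function names is an assumption made for the sketch: struct demo_buf is a stand-in for SLURM's real Buf type, the helpers put_bytes and get_bytes are hypothetical replacements for SLURM's pack and unpack routines, and the two-field record is invented. The sketch also copies values in host byte order, which a real implementation exchanging data between machines would not do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLURM_SUCCESS 0     /* stand-in for the SLURM header value */
#define SLURM_ERROR  (-1)   /* stand-in for the SLURM header value */

/* Stand-in for SLURM's Buf: a fixed array with write and read cursors. */
struct demo_buf {
    unsigned char data[64];
    size_t size;        /* bytes written so far */
    size_t processed;   /* bytes read so far */
};
typedef struct demo_buf *Buf;

/* Hypothetical per-step record, reduced to two fields. */
struct check_job_info {
    uint16_t disabled;
    uint32_t error_code;
};
typedef struct check_job_info *check_jobinfo_t;

static int put_bytes(Buf b, const void *src, size_t n)
{
    if (b->size + n > sizeof(b->data))
        return SLURM_ERROR;             /* buffer full */
    memcpy(b->data + b->size, src, n);
    b->size += n;
    return SLURM_SUCCESS;
}

static int get_bytes(Buf b, void *dst, size_t n)
{
    if (b->processed + n > b->size)
        return SLURM_ERROR;             /* not enough data left */
    memcpy(dst, b->data + b->processed, n);
    b->processed += n;
    return SLURM_SUCCESS;
}

int slurm_ckpt_pack_job(check_jobinfo_t jobinfo, Buf buffer)
{
    assert(jobinfo != NULL && buffer != NULL);
    if (put_bytes(buffer, &jobinfo->disabled,
                  sizeof(jobinfo->disabled)) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return put_bytes(buffer, &jobinfo->error_code,
                     sizeof(jobinfo->error_code));
}

int slurm_ckpt_unpack_job(check_jobinfo_t jobinfo, Buf buffer)
{
    assert(jobinfo != NULL && buffer != NULL);
    /* Fields are read back in the order in which they were written. */
    if (get_bytes(buffer, &jobinfo->disabled,
                  sizeof(jobinfo->disabled)) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return get_bytes(buffer, &jobinfo->error_code,
                     sizeof(jobinfo->error_code));
}
```

The point of the sketch is the symmetry: whatever order and widths the pack function uses, the unpack function must mirror exactly.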
int slurm_ckpt_op ( uint32_t job_id, uint32_t step_id, struct step_record *step_ptr, uint16_t op, uint16_t data, char *image_dir, time_t *event_time, uint32_t *error_code, char **error_msg );
Description: Perform some checkpoint operation on a specific job step.
Arguments:
job_id (input) identifies the job to be operated upon.
step_id (input) identifies the job step to be operated upon.
May be SLURM_BATCH_SCRIPT for a batch job or NO_VAL for all steps of the
specified job.
step_ptr (input) pointer to the job step to be operated upon.
Used by checkpoint/aix only.
op (input) specifies the operation to be performed.
Currently supported operations include
CHECK_ABLE (is job step currently able to be checkpointed),
CHECK_DISABLE (disable checkpoints for this job step),
CHECK_ENABLE (enable checkpoints for this job step),
CHECK_CREATE (create a checkpoint for this job step and continue its execution),
CHECK_VACATE (create a checkpoint for this job step and terminate it),
CHECK_RESTART (restart this previously checkpointed job step), and
CHECK_ERROR (return checkpoint-specific error information for this job step).
data (input) operation-specific data.
image_dir (input) directory to be used to save or restore state.
event_time (output) identifies the time of a checkpoint or restart
operation.
error_code (output) returns checkpoint-specific error code
associated with an operation.
error_msg (output) identifies checkpoint-specific error message
associated with an operation.
Returns:
SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set error_code and error_msg
to appropriate values to indicate the reason for failure.
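A rough sketch of how slurm_ckpt_op might dispatch on op and report failures through error_code and error_msg is shown below. It rests on several assumptions: the enum is a local stand-in for the operation codes (the real definitions and their numeric values come from the SLURM headers), the single ckpt_disabled flag and the helper fail are hypothetical, the error codes 1 and 2 are invented, and messages are allocated with malloc for the caller to free, whereas a real plugin would use SLURM's own memory routines.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SLURM_SUCCESS 0     /* stand-in for the SLURM header value */
#define SLURM_ERROR  (-1)   /* stand-in for the SLURM header value */

/* Local stand-in for the operation codes listed above. */
enum check_opts {
    CHECK_ABLE, CHECK_DISABLE, CHECK_ENABLE, CHECK_CREATE,
    CHECK_VACATE, CHECK_RESTART, CHECK_ERROR
};

struct step_record;          /* opaque in this sketch */

static int ckpt_disabled = 0;   /* hypothetical plugin-wide flag */

/* Hypothetical helper: fill in both outputs and return SLURM_ERROR. */
static int fail(uint32_t *error_code, char **error_msg,
                uint32_t code, const char *text)
{
    char *copy = malloc(strlen(text) + 1);

    if (copy != NULL)
        strcpy(copy, text);
    *error_code = code;
    *error_msg = copy;       /* the caller releases this with free() */
    return SLURM_ERROR;
}

int slurm_ckpt_op(uint32_t job_id, uint32_t step_id,
                  struct step_record *step_ptr, uint16_t op, uint16_t data,
                  char *image_dir, time_t *event_time,
                  uint32_t *error_code, char **error_msg)
{
    (void) job_id; (void) step_id; (void) step_ptr;
    (void) data; (void) image_dir;
    assert(event_time != NULL && error_code != NULL && error_msg != NULL);

    switch (op) {
    case CHECK_DISABLE:
        ckpt_disabled = 1;
        return SLURM_SUCCESS;
    case CHECK_ENABLE:
        ckpt_disabled = 0;
        return SLURM_SUCCESS;
    case CHECK_ABLE:
    case CHECK_CREATE:
    case CHECK_VACATE:
        if (ckpt_disabled)
            return fail(error_code, error_msg, 1,
                        "checkpoints are disabled");
        if (op != CHECK_ABLE)
            *event_time = time(NULL);  /* a real plugin starts the checkpoint here */
        return SLURM_SUCCESS;
    default:
        return fail(error_code, error_msg, 2, "operation not supported");
    }
}
```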
int slurm_ckpt_comp ( struct step_record * step_ptr, time_t event_time, uint32_t error_code, char *error_msg );
Description: Note the completion of a checkpoint operation.
Arguments:
step_ptr (input/output) identifies the job step to be operated upon.
event_time (input) identifies the time that the checkpoint operation
began.
error_code (input) checkpoint-specific error code associated
with an operation.
error_msg (input) checkpoint-specific error message associated
with an operation.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
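One way slurm_ckpt_comp could record a completion is sketched below. The matching rule (a notice is accepted only if its event_time equals the start time stored for the step) is an assumption made for this example, not behavior stated in this guide, and both struct definitions are local stand-ins: the real struct step_record is internal to SLURM and does not look like this.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SLURM_SUCCESS 0     /* stand-in for the SLURM header value */
#define SLURM_ERROR  (-1)   /* stand-in for the SLURM header value */

/* Hypothetical per-step checkpoint record. */
struct check_job_info {
    time_t   time_stamp;       /* start of the operation in progress, or 0 */
    uint32_t error_code;
    char     error_msg[128];
};

/* Stand-in for SLURM's internal step record. */
struct step_record {
    struct check_job_info check_job;
};

int slurm_ckpt_comp(struct step_record *step_ptr, time_t event_time,
                    uint32_t error_code, char *error_msg)
{
    struct check_job_info *info;

    assert(step_ptr != NULL);
    info = &step_ptr->check_job;

    /* Assumed rule: reject a notice that belongs to another operation. */
    if (event_time != info->time_stamp)
        return SLURM_ERROR;

    info->error_code = error_code;
    info->error_msg[0] = '\0';
    if (error_msg != NULL) {
        strncpy(info->error_msg, error_msg, sizeof(info->error_msg) - 1);
        info->error_msg[sizeof(info->error_msg) - 1] = '\0';
    }
    info->time_stamp = 0;      /* no operation in progress any more */
    return SLURM_SUCCESS;
}
```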
int slurm_ckpt_stepd_prefork ( void *slurmd_job );
Description: Perform any preparation needed for checkpoint/restart support. This function is called by slurmstepd before forking the user tasks.
Arguments:
slurmd_job (input) pointer to job structure internal to slurmstepd.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int slurm_ckpt_signal_tasks ( void *slurmd_job, char *image_dir );
Description: Forward the checkpoint request to tasks managed by slurmstepd.
Arguments:
slurmd_job (input) pointer to job structure internal to slurmstepd.
image_dir (input) directory to be used to save or restore state.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
int slurm_ckpt_restart_task ( void *slurmd_job, char *image_dir, int gtid);
Description: Restart the execution of a task from a checkpoint image. This function is called by slurmstepd.
Arguments:
slurmd_job (input) pointer to job structure internal to slurmstepd.
image_dir (input) directory to be used to save or restore state.
gtid (input) global task ID to be restarted.
Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
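Since every function must appear even when a plugin does not implement it, the three slurmstepd functions above could be stubbed roughly as follows. Which return value suits a stub is a judgment call that this guide does not settle; the choice here (success for the no-op preparation step, failure for requests the plugin cannot honor) is only one reasonable option. SLURM_SUCCESS and SLURM_ERROR are local stand-ins for the SLURM header values.

```c
#include <assert.h>
#include <stddef.h>

#define SLURM_SUCCESS 0     /* stand-in for the SLURM header value */
#define SLURM_ERROR  (-1)   /* stand-in for the SLURM header value */

/* Stub: nothing to prepare before the user tasks are forked. */
int slurm_ckpt_stepd_prefork(void *slurmd_job)
{
    (void) slurmd_job;
    return SLURM_SUCCESS;
}

/* Stub: this plugin cannot forward checkpoint requests to tasks. */
int slurm_ckpt_signal_tasks(void *slurmd_job, char *image_dir)
{
    (void) slurmd_job;
    (void) image_dir;
    return SLURM_ERROR;
}

/* Stub: this plugin cannot restart a task from an image. */
int slurm_ckpt_restart_task(void *slurmd_job, char *image_dir, int gtid)
{
    (void) slurmd_job;
    (void) image_dir;
    assert(gtid >= 0);      /* sanity check: task IDs are not negative */
    return SLURM_ERROR;
}
```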
Versioning
This document describes version 100 of the SLURM checkpoint API. Future releases of SLURM may revise this API. A checkpoint plugin conveys its ability to implement a particular API version using the mechanism outlined for SLURM plugins.
Last modified 10 March 2009