psij.executors.batch package

Submodules

psij.executors.batch.batch_scheduler_executor module

class BatchSchedulerExecutor(url=None, config=None)[source]

Bases: JobExecutor

A base class for batch scheduler executors.

This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.

The submit script is generated using generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. Concrete implementations of a batch scheduler executor can instantiate this class and delegate submit script generation to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an open file to which the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:

{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <list of str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}

The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.

The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.

The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.

The submit script must perform two essential actions:

1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.

2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.

Additionally, where appropriate, the submit script should set the environment variable named PSIJ_NODEFILE to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.
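As an illustration of the two required actions, the following is a minimal sketch of a script generator written without templates. It is not the library's implementation: the function name write_submit_script, the job_id_var parameter, and the SCHED_JOB_ID variable are hypothetical, and the sketch assumes that the queuing system exposes the native id to the running script through an environment variable.

```python
import io
import shlex
from typing import IO, Any, Dict, List


def write_submit_script(context: Dict[str, Any], job_id_var: str, out: IO[str]) -> None:
    """Write a bash script that runs the launch command and records its result.

    job_id_var is the (hypothetical) name of the environment variable in which
    the queuing system stores the native id of the running job.
    """
    launch_command: List[str] = context['psij']['launch_command']
    script_dir: str = context['psij']['script_dir']
    # Quote every argument so that spaces and special characters survive.
    command = ' '.join(shlex.quote(arg) for arg in launch_command)
    # Expands at run time to <script_dir>/<native_id>.
    base = f'{script_dir}/${{{job_id_var}}}'
    out.write('#!/bin/bash\n')
    # Action 1: redirect the output of the executable part to <native_id>.out.
    out.write(f'{command} > "{base}.out" 2>&1\n')
    # Action 2: store the exit code of the launch command in <native_id>.ec.
    out.write(f'echo "$?" > "{base}.ec"\n')


example_context = {
    'job': None,  # a real context carries the Job being submitted
    'psij': {
        'lib': {},
        'launch_command': ['/bin/echo', 'hello world'],
        'script_dir': '/tmp/psij',
    },
}
buffer = io.StringIO()
write_submit_script(example_context, 'SCHED_JOB_ID', buffer)
script_text = buffer.getvalue()
```

The sketch does not set PSIJ_NODEFILE, since how the node list is obtained depends on the queuing system.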

Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.

The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
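The handling of a job that no longer appears in the status output can be sketched as follows. This is an illustration under stated assumptions rather than the executor's actual code: the function name resolve_missing_job and the plain string states are hypothetical, and the choice to report completion when no usable exit code file exists is an assumption of the sketch.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_missing_job(script_dir: Path, native_id: str) -> Tuple[str, Optional[str]]:
    """Return (state, message) for a job absent from the status output."""
    ec_file = script_dir / f'{native_id}.ec'
    out_file = script_dir / f'{native_id}.out'
    try:
        exit_code = int(ec_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Assumption of this sketch: without a usable exit code, report completion.
        return 'COMPLETED', None
    if exit_code == 0:
        return 'COMPLETED', None
    # Non-zero exit code: try to read an error message from the script output file.
    message = out_file.read_text().strip() if out_file.exists() else None
    return 'FAILED', message


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / '17.ec').write_text('0\n')
    (work / '18.ec').write_text('3\n')
    (work / '18.out').write_text('boom\n')
    ok = resolve_missing_job(work, '17')
    failed = resolve_missing_job(work, '18')
```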

attach(job, native_id)[source]

Attaches a job to a native job.

Attempts to connect job to a native job with native_id such that the job correctly reflects updates to the status of the native job. If the native job was previously submitted using this executor (hence having an exit code file and a script output file), the executor will attempt to retrieve the exit code and errors from the job. Otherwise, it may be impossible for the executor to distinguish between a failed and successfully completed job.

Parameters
  • job (Job) – The PSI/J job to attach.

  • native_id (str) – The id of the batch scheduler job to attach to.

Return type

None

cancel(job)[source]

Cancels a job if it has not otherwise completed.

A command is constructed using get_cancel_command() and executed in order to cancel the job. Also see JobExecutor.cancel().

Parameters

job (Job) – The job to cancel.

Return type

None

abstract generate_submit_script(job, context, submit_file)[source]

Called to generate a submit script for a job.

Concrete implementations of batch scheduler executors must override this method in order to generate a submit script for a job.

Parameters
  • job (Job) – The job to be submitted.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see the description of this class.

  • submit_file (IO[str]) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

abstract get_cancel_command(native_id)[source]

Constructs a command to cancel a batch scheduler job.

Concrete implementations of batch scheduler executors must override this method.

Parameters

native_id (str) – The native id of the job being cancelled.

Returns

A list of strings representing the command and arguments to execute in order to cancel the job, e.g., ['qdel', native_id].

Return type

List[str]

abstract get_list_command()[source]

Constructs a command to retrieve the list of jobs known to the local resource manager (LRM) for the current user.

Concrete implementations of batch scheduler executors must override this method. Upon running the command, the output can be parsed with parse_list_output().

Returns

A list of strings representing the executable and arguments to invoke in order to obtain the list of jobs the LRM knows for the current user.

Return type

List[str]

abstract get_status_command(native_ids)[source]

Constructs a command to retrieve the status of a list of jobs.

Concrete implementations of batch scheduler executors must override this method. In order to prevent overloading the queuing system, concrete implementations are strongly encouraged to return a command that only queries for the status of the indicated jobs. The command returned by this method should produce an output that is understood by parse_status_output().

Parameters

native_ids (Collection[str]) – A collection of native ids corresponding to the jobs whose status is sought.

Returns

A list of strings representing the command and arguments to execute in order to get the status of the jobs.

Return type

List[str]
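A minimal sketch of a status command restricted to the registered jobs is shown below. The toyq-stat executable and its --jobs flag are hypothetical and stand in for whatever the actual queuing system provides.

```python
from typing import Collection, List


def get_status_command(native_ids: Collection[str]) -> List[str]:
    """Build a status command that queries only the given jobs."""
    # 'toyq-stat' and '--jobs' are made up for illustration; a real executor
    # would use the status command and flags of its queuing system.
    return ['toyq-stat', '--jobs', ','.join(sorted(native_ids))]


status_command = get_status_command({'12', '7'})
```

Passing the ids explicitly keeps the query proportional to the number of registered jobs instead of the size of the whole queue.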

abstract get_submit_command(job, submit_file_path)[source]

Constructs a command to submit a job to a batch scheduler.

Concrete implementations of batch scheduler executors must override this method.

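Parameters
  • job (Job) – The job being submitted.

  • submit_file_path (Path) – The path to the generated submit script.
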
Returns

A list of strings representing the command and arguments to execute in order to submit the job, such as ['qsub', str(submit_file_path)].

Return type

List[str]

abstract job_id_from_submit_output(out)[source]

Extracts a native job id from the output of the submit command.

Concrete implementations of batch scheduler executors must override this method. This method is only invoked if the submit command completes with a zero exit code, so implementations of this method do not need to determine whether the output reflects an error from the submit command.

Parameters

out (str) – The output from the submit command.

Returns

A string representing the native id of the newly submitted job.

Return type

str
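As a hedged example, the sketch below extracts the id from submit output whose last token is numeric, such as "Submitted batch job 12345". Other queuing systems print different output, so the regular expression is an assumption that a concrete executor would adapt.

```python
import re

# Assumption: the native id is the run of digits at the end of the output.
_TRAILING_ID = re.compile(r'(\d+)\s*$')


def job_id_from_submit_output(out: str) -> str:
    """Return the trailing numeric id found in the submit command output."""
    match = _TRAILING_ID.search(out)
    if match is None:
        raise ValueError(f'No job id found in submit output: {out!r}')
    return match.group(1)


native_id = job_id_from_submit_output('Submitted batch job 12345\n')
```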

list()[source]

Returns a list of jobs known to the underlying implementation.

See JobExecutor.list(). The returned list is a list of native_id strings representing jobs known to the underlying batch scheduler implementation, whether submitted through this executor or not. Implementations are encouraged to restrict the results to jobs accessible by the current user.

Return type

List[str]

parse_list_output(out)[source]

Parses the output of the command obtained from get_list_command().

The default implementation of this method assumes that the output has no header and consists of native IDs, one per line, possibly surrounded by whitespace. Concrete implementations should override this method if a different format is expected.

Parameters

out (str) – The output from the “list” command as returned by get_list_command().

Returns

A list of strings representing the native IDs of the jobs known to the LRM for the current user.

Return type

List[str]
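The format assumed by the default implementation can be illustrated with the sketch below. It is written from the description above and not copied from the library; skipping blank lines is an assumption of the sketch.

```python
from typing import List


def parse_list_output(out: str) -> List[str]:
    """Parse header-less output containing one native ID per line."""
    # Strip surrounding whitespace and skip blank lines (assumption).
    return [line.strip() for line in out.splitlines() if line.strip()]


ids = parse_list_output('  101\n102  \n\n103\n')
```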

abstract parse_status_output(exit_code, out)[source]

Parses the output of a job status command.

Concrete implementations of batch scheduler executors must override this method. The output is meant to have been produced by the command generated by get_status_command().

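Parameters
  • exit_code (int) – The exit code from the status command.

  • out (str) – The output from the status command.
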
Returns

A dictionary mapping native job ids to JobStatus objects. The implementation of this method need not process the exit code file or the script output file since it is done by the base BatchSchedulerExecutor implementation.

Return type

Dict[str, JobStatus]
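The sketch below shows the general shape of such a parser for a hypothetical output format with one "<native_id> <state>" pair per line. To stay self-contained it maps ids to plain state strings; a real implementation would return JobStatus objects.

```python
from typing import Dict


def parse_status_output(exit_code: int, out: str) -> Dict[str, str]:
    """Map native ids to state names for '<native_id> <state>' lines."""
    if exit_code != 0:
        raise RuntimeError(
            f'Status command failed with exit code {exit_code}: {out.strip()}')
    states: Dict[str, str] = {}
    for line in out.splitlines():
        fields = line.split()
        # Ignore blank or malformed lines (assumption of this sketch).
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


parsed = parse_status_output(0, '12 RUNNING\n7 QUEUED\n\n')
```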

abstract process_cancel_command_output(exit_code, out)[source]

Handles output from a failed cancel command.

The main purpose of this method is to help distinguish between the cancel command failing due to an invalid job state (such as the job having completed before the cancel command was invoked) and other types of errors. Since job state errors are ignored, there are two options:

1. Instruct the cancel command to not fail on invalid state errors and have this method always raise a SubmitException, since it is only invoked on “other” errors.

2. Have the cancel command fail on both invalid state errors and other errors and interpret the output from the cancel command to distinguish between the two and raise the appropriate exception.

Parameters
  • exit_code (int) – The exit code from the cancel command.

  • out (str) – The output from the cancel command.

Raises
  • InvalidJobStateError – Raised if the job cancellation has failed because the job was in a completed or failed state at the time when the cancellation command was invoked.

  • SubmitException – Raised for all other reasons.

Return type

None
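The second option can be sketched as follows. The two exception classes are local stand-ins for the PSI/J exceptions so that the example runs on its own, and the "job already finished" message is a hypothetical marker; a concrete executor would match whatever its cancel command actually prints.

```python
class InvalidJobStateError(Exception):
    """Stand-in for the exception of the same name in this module."""


class SubmitException(Exception):
    """Stand-in for the PSI/J SubmitException."""


def process_cancel_command_output(exit_code: int, out: str) -> None:
    """Translate the output of a failed cancel command into an exception."""
    # Hypothetical marker text; real schedulers use their own wording.
    if 'job already finished' in out.lower():
        raise InvalidJobStateError()
    raise SubmitException(
        f'Cancel command failed with exit code {exit_code}: {out.strip()}')


def raised_by(exit_code: int, out: str) -> str:
    """Return the name of the exception raised for the given output."""
    try:
        process_cancel_command_output(exit_code, out)
    except Exception as ex:
        return type(ex).__name__
    return 'none'
```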

submit(job)[source]

See JobExecutor.submit().

Parameters

job (Job) – The job to submit.

Return type

None

class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: JobExecutorConfig

A base configuration class for BatchSchedulerExecutor implementations.

When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, a small value can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

exception InvalidJobStateError[source]

Bases: Exception

An exception that signals that a job cannot be cancelled due to it being already done.

check_status_exit_code(command, exit_code, out)[source]

Check if exit_code is nonzero and, if so, raise a RuntimeError.

This function produces a somewhat user-friendly exception message that combines the command that was run with its output.

Parameters
  • command (str) – The command that was run. This is only used to format the error message.

  • exit_code (int) – The exit code returned by running the command.

  • out (str) – The output produced by command.

Return type

None
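A function with this behaviour could look like the sketch below. The exact wording of the error message is an assumption; only the combination of command and output in a RuntimeError follows from the description.

```python
def check_status_exit_code(command: str, exit_code: int, out: str) -> None:
    """Raise a RuntimeError describing the failure if exit_code is nonzero."""
    if exit_code != 0:
        # Message format is illustrative only.
        raise RuntimeError(
            f'Command "{command}" failed with exit code {exit_code}:\n{out}')


def error_message(command: str, exit_code: int, out: str) -> str:
    """Return the error message produced for the given result, or ''."""
    try:
        check_status_exit_code(command, exit_code, out)
    except RuntimeError as ex:
        return str(ex)
    return ''
```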

psij.executors.batch.cobalt module

Defines a JobExecutor for the Cobalt resource manager.

class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Cobalt executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, a small value can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class CobaltJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.

Return type

None

generate_submit_script(job, context, submit_file)[source]

See BatchSchedulerExecutor.generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (IO[str]) –

Return type

None

get_cancel_command(native_id)[source]

See BatchSchedulerExecutor.get_cancel_command().

Parameters

native_id (str) –

Return type

List[str]

get_list_command()[source]

See BatchSchedulerExecutor.get_list_command().

Return type

List[str]

get_status_command(native_ids)[source]

See BatchSchedulerExecutor.get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See BatchSchedulerExecutor.get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See BatchSchedulerExecutor.job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See BatchSchedulerExecutor.parse_status_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See BatchSchedulerExecutor.process_cancel_command_output().

This should be unnecessary because qdel only seems to fail on non-integer job IDs.

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.escape_functions module

bash_escape(o)[source]

Escape object to bash string.

Renders and escapes an object to a string such that its value is preserved when substituted in a bash script between double quotes. Numeric values are simply rendered without any escaping. Path objects are converted to absolute path and escaped. All other objects are converted to string and escaped.

Parameters

o (object) – The object to escape.

Returns

An escaped representation of the object that can be substituted in bash scripts.

Return type

str
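The behaviour described above can be approximated by the sketch below. It is not the library's implementation: the name bash_escape_sketch and the exact set of escaped characters (backslash, double quote, dollar sign, and backtick, which are the characters that keep a special meaning between double quotes in bash) are assumptions.

```python
from pathlib import Path

# Characters that remain special inside a double-quoted bash string.
# The backslash comes first so that added backslashes are not escaped again.
_SPECIAL = ('\\', '"', '$', '`')


def bash_escape_sketch(o: object) -> str:
    """Render o so that its value survives between double quotes in bash."""
    if isinstance(o, (int, float)):
        # Numeric values are rendered as they are.
        return str(o)
    if isinstance(o, Path):
        # Paths are made absolute before escaping.
        o = o.absolute()
    text = str(o)
    for ch in _SPECIAL:
        text = text.replace(ch, '\\' + ch)
    return text


escaped = bash_escape_sketch('say "hi" for $5')
```

The result is meant to be placed between double quotes in the script, for example as "{escaped}" in a generated command line.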

psij.executors.batch.lsf module

Defines the LsfJobExecutor class and its config class.

class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the LSF executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, a small value can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class LsfJobExecutor(url, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the ‘bsub’, ‘bjobs’, and ‘bkill’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.

generate_submit_script(job, context, submit_file)[source]

See BatchSchedulerExecutor.generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (IO[str]) –

Return type

None

get_cancel_command(native_id)[source]

See BatchSchedulerExecutor.get_cancel_command().

bkill exits with an error if the job does not exist or has already finished.

Parameters

native_id (str) –

Return type

List[str]

get_list_command()[source]

See BatchSchedulerExecutor.get_list_command().

Return type

List[str]

get_status_command(native_ids)[source]

See BatchSchedulerExecutor.get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See BatchSchedulerExecutor.get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See BatchSchedulerExecutor.job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See BatchSchedulerExecutor.parse_status_output().

Iterate through the RECORDS entry, grabbing JOBID and STAT entries, as well as any state-change reasons if present.

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See BatchSchedulerExecutor.process_cancel_command_output().

Check if the error was raised only because a job already exited.

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.pbspro module

psij.executors.batch.script_generator module

class SubmitScriptGenerator(config)[source]

Bases: ABC

A base class representing a submit script generator.

A submit script generator is used to render a Job (together with all its properties, including JobSpec, ResourceSpec, etc.) into a submit script specific to a certain batch scheduler.

Parameters

config (Optional[JobExecutorConfig]) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.

Return type

None

generate_submit_script(job, context, out)[source]

Generates a job submit script.

Concrete implementations of submit script generators must implement this method. Its purpose is to generate the content of the submit script. For an extensive explanation of the mechanism behind this process, see BatchSchedulerExecutor.

Parameters
  • job (Job) – The job for which the submit script is to be generated.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see BatchSchedulerExecutor.

  • out (IO[str]) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)[source]

Bases: SubmitScriptGenerator

A submit script generator based on Mustache templates.

This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).

Parameters
  • config (Optional[JobExecutorConfig]) – A configuration, which is passed to the base class.

  • template_path (Path) – The path to a Mustache template.

  • escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.

Return type

None

generate_submit_script(job, context, out)[source]

See SubmitScriptGenerator.generate_submit_script().

Renders a submit script using the template specified when this generator was constructed.

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • out (IO[str]) –

Return type

None

psij.executors.batch.slurm module

class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Slurm executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, a small value can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class SlurmJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.
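The mapping from prefixed custom attributes to long-form directives can be sketched in a few lines. The helper render_directives is hypothetical and only mirrors the behaviour described above; the executor itself renders directives through its submit script generator.

```python
from typing import Dict, List


def render_directives(custom_attributes: Dict[str, object],
                      prefix: str = 'slurm.',
                      marker: str = '#SBATCH') -> List[str]:
    """Render attributes carrying the given prefix as long-form directives."""
    lines: List[str] = []
    for key, value in custom_attributes.items():
        # Attributes meant for other executors are skipped.
        if key.startswith(prefix):
            name = key[len(prefix):]
            lines.append(f'{marker} --{name}={value}')
    return lines


directives = render_directives({'slurm.qos': 'debug', 'lsf.core_isolation': '0'})
```

Calling the same helper with prefix='cobalt.' and marker='#COBALT' yields directives in the long form described for the Cobalt executor.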

generate_submit_script(job, context, submit_file)[source]

See BatchSchedulerExecutor.generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (IO[str]) –

Return type

None

get_cancel_command(native_id)[source]

See BatchSchedulerExecutor.get_cancel_command().

Parameters

native_id (str) –

Return type

List[str]

get_list_command()[source]

See BatchSchedulerExecutor.get_list_command().

Return type

List[str]

get_status_command(native_ids)[source]

See BatchSchedulerExecutor.get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See BatchSchedulerExecutor.get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See BatchSchedulerExecutor.job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See BatchSchedulerExecutor.parse_status_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See BatchSchedulerExecutor.process_cancel_command_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.template_function_library module

ALL: Dict[str, Callable[[...], Any]] = {'walltime_to_minutes': <function walltime_to_minutes>}

A dictionary of all template-accessible functions for the batch executor templating mechanism.

The dictionary maps function names to their implementations. All public functions in this module are present in this dictionary and their corresponding keys are the same as their names.

walltime_to_minutes(walltime)[source]

Converts a walltime object to a number of minutes.

The walltime can be a Python timedelta, an integer (interpreted directly as a number of minutes), or a string in one of the formats HH:MM:SS, HH:MM, or MM.

Parameters

walltime (Union[timedelta, int, str]) – The walltime to convert.

Returns

The number of minutes represented by the walltime parameter.

Return type

int
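The accepted input forms can be illustrated with the sketch below, written from the description above and not taken from the library. How leftover seconds are handled is not stated, so the sketch drops them; that choice is an assumption.

```python
from datetime import timedelta
from typing import Union


def walltime_to_minutes(walltime: Union[timedelta, int, str]) -> int:
    """Convert a timedelta, a minute count, or a time string to minutes."""
    if isinstance(walltime, timedelta):
        # Assumption: partial minutes are dropped.
        return int(walltime.total_seconds()) // 60
    if isinstance(walltime, int):
        # Integers are already a number of minutes.
        return walltime
    parts = [int(part) for part in walltime.split(':')]
    if len(parts) == 1:          # MM
        return parts[0]
    if len(parts) == 2:          # HH:MM
        return parts[0] * 60 + parts[1]
    if len(parts) == 3:          # HH:MM:SS; seconds are dropped (assumption)
        return parts[0] * 60 + parts[1]
    raise ValueError(f'Invalid walltime: {walltime!r}')


minutes = walltime_to_minutes('01:30:00')
```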

Module contents

A package containing infrastructure for implementing batch scheduler executors.