Classes

AprunLauncher

class AprunLauncher(config=None)

Bases: MultipleLauncher

Launches a job using Cobalt’s aprun.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

BatchSchedulerExecutor

class BatchSchedulerExecutor(url=None, config=None)

Bases: JobExecutor

A base class for batch scheduler executors.

This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.

The submit script is generated using generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. This class can be instantiated by concrete implementations of a batch scheduler executor and the submit script generation can be delegated to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an opened file which points to where the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:

{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}

The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.

The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.

The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.
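A mapping of this shape can be assembled with the standard inspect module. The following is only a sketch: build_function_library is a hypothetical helper, and the standard-library textwrap module stands in for template_function_library, so it does not show how PSI/J itself builds the dictionary.

```python
import inspect
import textwrap
from types import ModuleType
from typing import Callable, Dict


def build_function_library(module: ModuleType) -> Dict[str, Callable]:
    """Map the name of every public function in `module` to the function."""
    return {
        name: func
        for name, func in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    }


# textwrap is only a stand-in for a module of template helper functions.
lib = build_function_library(textwrap)
print(sorted(lib))
```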

The submit script must perform two essential actions:

1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.

2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.

Additionally, where appropriate, the submit script should set the environment variable named PSIJ_NODEFILE to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.
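The two required actions can be sketched as a small script renderer. This is an illustration under stated assumptions, not the template that any PSI/J executor ships: render_submit_script is a hypothetical function, native_id_var names an environment variable that the queuing system is assumed to set to the native job id, and stderr is folded into the output file for simplicity. Setting PSIJ_NODEFILE is left out.

```python
import shlex
from typing import List


def render_submit_script(launch_command: List[str], script_dir: str,
                         native_id_var: str) -> str:
    """Render a bash script that writes <native_id>.out and <native_id>.ec."""
    command = " ".join(shlex.quote(arg) for arg in launch_command)
    return "\n".join([
        "#!/bin/bash",
        f"SCRIPT_DIR={shlex.quote(script_dir)}",
        # 1. redirect the output of the executable part to <native_id>.out
        f'{command} > "$SCRIPT_DIR/${{{native_id_var}}}.out" 2>&1',
        # 2. store the exit code of the launch command in <native_id>.ec
        f'echo "$?" > "$SCRIPT_DIR/${{{native_id_var}}}.ec"',
        "",
    ])


# DEMO_JOB_ID is a made-up variable name used only for this example.
print(render_submit_script(["/bin/echo", "hello world"], "/tmp/work", "DEMO_JOB_ID"))
```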

Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.

The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
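The rule for jobs that are missing from the status output can be sketched as follows. The function name, the plain-string state names, and the assumption that the exit code file already exists are all choices made for this example; the executor's actual bookkeeping is not shown in this reference.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def resolve_states(registered: List[str], parsed: Dict[str, str],
                   script_dir: Path) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {native_id: (state, message)} for every registered job."""
    result: Dict[str, Tuple[str, Optional[str]]] = {}
    for native_id in registered:
        if native_id in parsed:
            # The scheduler still knows the job: use the reported state.
            result[native_id] = (parsed[native_id], None)
            continue
        # Not reported any more: assume the job is done and decide between
        # completed and failed from the exit code file.
        exit_code = int((script_dir / f"{native_id}.ec").read_text().strip())
        if exit_code == 0:
            result[native_id] = ("COMPLETED", None)
        else:
            out_file = script_dir / f"{native_id}.out"
            message = out_file.read_text().strip() if out_file.exists() else None
            result[native_id] = ("FAILED", message)
    return result
```

Here parsed plays the role of the value returned by parse_status_output(), reduced to a mapping from native id to a state name.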

Parameters

BatchSchedulerExecutorConfig

class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)

Bases: JobExecutorConfig

A base configuration class for BatchSchedulerExecutor implementations.

When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

CobaltExecutorConfig

class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)

Bases: BatchSchedulerExecutorConfig

A configuration class for the Cobalt executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

CobaltJobExecutor

class CobaltJobExecutor(url=None, config=None)

Bases: BatchSchedulerExecutor

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.
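The mapping from prefixed custom attributes to long-form directives can be sketched as below. render_long_form_directives is a hypothetical helper written for this example; the executor renders its directives through a submit script template, which is not reproduced here.

```python
from typing import Dict, List


def render_long_form_directives(custom_attributes: Dict[str, object],
                                prefix: str, directive: str) -> List[str]:
    """Turn '<prefix><name>' attributes into '<directive> --<name>=<value>'."""
    lines = []
    for key, value in custom_attributes.items():
        if key.startswith(prefix):
            name = key[len(prefix):]
            lines.append(f"{directive} --{name}={value}")
    return lines


attrs = {"cobalt.m": "co", "unrelated.key": "ignored"}
print(render_long_form_directives(attrs, "cobalt.", "#COBALT"))
```

Attributes without the prefix are skipped, so a single dictionary can carry settings for several schedulers.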

Parameters
Return type

None

Descriptor

class Descriptor(name, version, cls, aliases=None, nice_name=None)

Bases: object

This class is used to enable PSI/J to discover and register executors and/or launchers.

Executors or launchers wanting to register with PSI/J must place an instance of this class in a global module list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__, respectively, in a module placed in the psij-descriptors namespace package. In other words, in order to automatically register an executor or launcher, a Python file should be created inside a psij-descriptors package, such as:

<project_root>/
    src/
        psij-descriptors/
            descriptors_for_project.py

It is essential that the psij-descriptors package not contain an __init__.py file in order for Python to treat the package as a namespace package. This allows Python to combine multiple psij-descriptors directories into one, which, in turn, allows PSI/J to detect and load all descriptors that can be found in Python’s library search path.

The contents of descriptors_for_project.py could then be as follows:

from packaging.version import Version
from psij.descriptor import Descriptor

__PSI_J_EXECUTORS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

__PSI_J_LAUNCHERS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.
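The namespace-package behaviour that makes this work can be demonstrated with the standard library alone. The sketch below creates two throwaway project trees that each contain a package directory of the same name and no __init__.py, then imports a module from each. The package name demo_descriptors and the ENTRIES lists are stand-ins invented for this example; it does not show how PSI/J scans psij-descriptors.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import List, Tuple

PACKAGE = "demo_descriptors"  # hypothetical, importable stand-in name


def make_project(root: Path, project: str, module: str, entry: str) -> Path:
    """Create <root>/<project>/src/<PACKAGE>/<module>.py without __init__.py."""
    src = root / project / "src"
    package_dir = src / PACKAGE
    package_dir.mkdir(parents=True)
    (package_dir / f"{module}.py").write_text(f"ENTRIES = [{entry!r}]\n")
    return src


def collect_entries() -> Tuple[List[str], int]:
    """Import one module from each project through the merged package."""
    with tempfile.TemporaryDirectory() as tmp:
        roots = [
            make_project(Path(tmp), "project_a", "descriptors_a", "executor-a"),
            make_project(Path(tmp), "project_b", "descriptors_b", "launcher-b"),
        ]
        sys.path[:0] = [str(root) for root in roots]
        try:
            importlib.invalidate_caches()
            mod_a = importlib.import_module(f"{PACKAGE}.descriptors_a")
            mod_b = importlib.import_module(f"{PACKAGE}.descriptors_b")
            # The namespace package records every directory it was merged from.
            merged_dirs = len(list(sys.modules[PACKAGE].__path__))
            return mod_a.ENTRIES + mod_b.ENTRIES, merged_dirs
        finally:
            # Undo the changes to the interpreter state made for the demo.
            for root in roots:
                sys.path.remove(str(root))
            for name in [n for n in sys.modules if n.split(".")[0] == PACKAGE]:
                del sys.modules[name]
```

If either directory contained an __init__.py, Python would treat it as a regular package and the module from the other project would not be importable under the same package name.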

Parameters
  • name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().

  • version (Version) – The version of the executor/launcher. Multiple versions can be registered under a single name.

  • cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.

  • aliases (Optional[List[str]]) – An optional list of alternative names. The executor is also made available under each alias, as if the alias were its name.

  • nice_name (Optional[str]) – An optional string to use whenever a user-friendly name needs to be displayed to a user. For example, a nice name for pbs would be PBS or Portable Batch System. If not specified, the nice_name defaults to the value of the name parameter.

Return type

None

FunctionJobStatusCallback

GenericPBSJobExecutor

InvalidJobException

class InvalidJobException(message, exception=None)

Bases: Exception

An exception describing a problem with a job specification.

Parameters
Return type

None

InvalidJobStateError

class InvalidJobStateError

Bases: Exception

An exception that signals that a job cannot be cancelled due to it being already done.

Job

JobAttributes

JobExecutor

JobExecutorConfig

JobSpec

JobState

JobStateOrder

JobStatus

JobStatusCallback

JsrunLauncher

class JsrunLauncher(config=None)

Bases: MultipleLauncher

Launches a job using LSF’s jsrun.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

Launcher

LocalJobExecutor

class LocalJobExecutor(url=None, config=None)

Bases: JobExecutor

A job executor that runs jobs locally using subprocess.Popen.

This job executor is intended to be used either to run jobs directly on the same machine as the PSI/J library or for testing purposes.

Note

In Linux, attached jobs always appear to complete with a zero exit code regardless of the actual exit code.

Warning

Instantiation of a local executor from both parent process and a fork()-ed process is not guaranteed to work. In general, using fork() and multi-threading in Linux is unsafe, as suggested by the fork() man page. While PSI/J attempts to minimize problems that can arise when fork() is combined with threads (which are used by PSI/J), no guarantees can be made and the chances of unexpected behavior are high. Please do not use PSI/J with fork(). If you do, please be mindful that support for using PSI/J with fork() will be limited.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (JobExecutorConfig) – The LocalJobExecutor does not have any configuration options.

Return type

None

LsfExecutorConfig

class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)

Bases: BatchSchedulerExecutorConfig

A configuration class for the LSF executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

LsfJobExecutor

class LsfJobExecutor(url, config=None)

Bases: BatchSchedulerExecutor

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the ‘bsub’, ‘bjobs’, and ‘bkill’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.

Parameters

MPILauncher

class MPILauncher(config=None)

Bases: MultipleLauncher

Launches jobs using mpirun.

mpirun is a tool provided by MPI implementations, such as Open MPI.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

MultipleLauncher

class MultipleLauncher(script_path=PosixPath('/home/runner/work/psij-python/psij-python/src/psij/launchers/scripts/multi_launch.sh'), config=None)

Bases: ScriptBasedLauncher

A launcher that launches multiple identical copies of the executable.

The exit code of the job corresponds to the first non-zero exit code encountered in one of the executable copies or zero if all invocations of the executable succeed.
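The exit code rule can be written in one line. combined_exit_code is a hypothetical name, and the sketch assumes the exit codes are supplied in the order in which they were encountered; the launcher itself implements this in its launch script.

```python
from typing import Iterable


def combined_exit_code(exit_codes: Iterable[int]) -> int:
    """Return the first non-zero exit code, or zero if every copy succeeded."""
    return next((code for code in exit_codes if code != 0), 0)


print(combined_exit_code([0, 3, 0, 7]))
```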

Parameters

PBSClassicJobExecutor

class PBSClassicJobExecutor(url=None, config=None)

Bases: GenericPBSJobExecutor

A JobExecutor for classic PBS systems.

This executor uses resource specifications specific to Open PBS. Specifically, this executor uses the -l nodes=n:ppn=m way of specifying nodes, which differs from the scheme used by PBS Pro.

Parameters

PBSExecutorConfig

PBSJobExecutor

class PBSJobExecutor(url=None, config=None)

Bases: GenericPBSJobExecutor

A JobExecutor for PBS Pro and friends.

This executor uses resource specifications specific to PBS Pro.

Parameters

RPJobExecutor

class RPJobExecutor(url=None, config=None)

Bases: JobExecutor

A job executor that runs jobs via the RADICAL Pilot system.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (Optional[JobExecutorConfig]) – The RPJobExecutor does not have any configuration options.

Return type

None

ResourceSpec

ResourceSpecV1

ScriptBasedLauncher

class ScriptBasedLauncher(script_path, config=None)

Bases: Launcher

A launcher that uses a script to start the job, possibly by wrapping it in other tools.

This launcher is an abstract base class for launchers that wrap the job in a script. The script must be a bash script, and its first four parameters are:

  • the job ID

  • a launcher log file, which is taken from the launcher_log_file configuration setting and defaults to /dev/null

  • the pre- and post- launcher scripts, or empty strings if they are not specified

Additional positional arguments to the script can be specified by subclasses by overriding the get_additional_args() method.

The remaining arguments to the script are the job executable and arguments.
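The resulting argument order can be sketched as a plain list. build_script_args and its parameter names are hypothetical; in the launcher, the extra arguments come from get_additional_args() and the remaining values from the job and the configuration.

```python
from typing import List, Optional, Sequence


def build_script_args(job_id: str, executable: str, arguments: Sequence[str],
                      launcher_log_file: Optional[str] = None,
                      pre_launch: Optional[str] = None,
                      post_launch: Optional[str] = None,
                      additional_args: Sequence[str] = ()) -> List[str]:
    """Lay out the launcher script arguments in the documented order."""
    return [
        job_id,                            # 1: the job ID
        launcher_log_file or "/dev/null",  # 2: the launcher log file
        pre_launch or "",                  # 3: the pre-launch script
        post_launch or "",                 # 4: the post-launch script
        *additional_args,                  # subclass-specific arguments
        executable,                        # the job executable ...
        *arguments,                        # ... and its arguments
    ]


print(build_script_args("job-1", "/bin/date", ["-u"], additional_args=["4"]))
```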

A simple script library is provided in scripts/launcher_lib.sh. Its use is optional and it is intended to be included at the beginning of a main launcher script using source $(dirname "$0")/launcher_lib.sh. It does the following:

  • sets ‘-e’ mode (exit on error)

  • sets the variables _PSI_J_JOB_ID, _PSI_J_LOG_FILE, _PSI_J_PRE_LAUNCH, and _PSI_J_POST_LAUNCH from the first arguments, as specified above.

  • saves the current stdout and stderr in descriptors 3 and 4, respectively

  • redirects stdout and stderr to the log file, while prepending a timestamp and the job ID to each line

  • defines the commands “pre_launch” and “post_launch”, which can be invoked by the main script.

When invoking the job executable (either directly or through a launch command), it is recommended that the stdout and stderr of the job process be redirected to descriptors 3 and 4, respectively, such that they can be captured by the entity invoking the launcher rather than ending up in the launcher log file.

A successful completion of the launcher should be signalled by the launcher by printing the string “_PSI_J_LAUNCHER_DONE” to stdout. The launcher can then exit with the exit code returned by the launched command. This allows the executor to distinguish between a non-zero exit code due to application failure and one due to a premature launcher failure.
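How the marker separates the two failure modes can be sketched as follows. classify_launcher_result and the returned labels are hypothetical, and the sketch assumes the marker appears on a line of its own in the captured stdout; the executors' actual handling is not described in this reference.

```python
LAUNCHER_DONE = "_PSI_J_LAUNCHER_DONE"


def classify_launcher_result(stdout: str, exit_code: int) -> str:
    """Tell a launcher failure apart from an application failure."""
    lines = [line.strip() for line in stdout.splitlines()]
    if LAUNCHER_DONE not in lines:
        # The launcher never reached its end, so the exit code says
        # nothing about the application itself.
        return "launcher_failure"
    return "completed" if exit_code == 0 else "application_failure"


print(classify_launcher_result("output\n_PSI_J_LAUNCHER_DONE\n", 2))
```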

The actual launcher scripts, as well as the library, are deployed at run-time into the work directory, where submit scripts are also generated. This directory is meant to be accessible by both the node submitting the job as well as the node launching the job.

Parameters
  • script_path (Path) – A path to a script that is invoked as described above.

  • config (Optional[JobExecutorConfig]) – An optional configuration.

Return type

None

SingleLauncher

class SingleLauncher(config=None)

Bases: ScriptBasedLauncher

A launcher that launches a single copy of the executable. This is the default launcher.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

SingletonThread

class SingletonThread(name=None, daemon=False)

Bases: Thread

A convenience class to return a thread that is guaranteed to be unique to this process.

This is intended to work with fork() to ensure that each os.getpid() value is associated with at most one thread. This is not safe. The safe thing, as pointed out by the fork() man page, is to not use fork() with threads. However, this is here in an attempt to make it slightly safer for when users really really want to take the risk against all advice.

This class is meant as an abstract class and should be used by subclassing and implementing the run method.

Instantiation of this class or one of its subclasses should be done through the get_instance() method rather than directly.
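The per-process singleton idea can be sketched with the standard threading module. PerProcessThread and Heartbeat are hypothetical classes written for this example, and starting the thread inside get_instance() is an assumption of the sketch rather than documented behaviour of SingletonThread.

```python
import os
import threading
from typing import Dict, Optional, Tuple


class PerProcessThread(threading.Thread):
    """Keep at most one instance per (subclass, process id) pair."""

    _instances: Dict[Tuple[type, int], "PerProcessThread"] = {}
    _lock = threading.RLock()

    @classmethod
    def get_instance(cls) -> "PerProcessThread":
        key = (cls, os.getpid())
        with cls._lock:
            instance = cls._instances.get(key)
            if instance is None:
                # First request in this process: create and start the thread.
                instance = cls(name=f"{cls.__name__}-{os.getpid()}", daemon=True)
                cls._instances[key] = instance
                instance.start()
            return instance


class Heartbeat(PerProcessThread):
    def __init__(self, name: Optional[str] = None, daemon: bool = False) -> None:
        super().__init__(name=name, daemon=daemon)
        self.ran = threading.Event()

    def run(self) -> None:
        # A real subclass would loop here; the sketch only records the call.
        self.ran.set()
```

Because the key includes os.getpid(), a child created with fork() would get a new thread on its first call instead of reusing the parent's instance, which no longer runs in the child.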

Parameters
  • name (Optional[str]) – An optional name for this thread.

  • daemon (bool) – A daemon thread does not prevent the process from exiting.

Return type

None

SlurmExecutorConfig

class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)

Bases: BatchSchedulerExecutorConfig

A configuration class for the Slurm executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

SlurmJobExecutor

class SlurmJobExecutor(url=None, config=None)

Bases: BatchSchedulerExecutor

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.

Parameters

SrunLauncher

class SrunLauncher(config=None)

Bases: MultipleLauncher

Launches a job using Slurm’s srun.

See the Slurm Workload Manager.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

SubmitException

class SubmitException(message, exception=None, transient=False)

Bases: Exception

An exception representing job submission issues.

This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.

Parameters
Return type

None

SubmitScriptGenerator

class SubmitScriptGenerator(config)

Bases: ABC

A base class representing a submit script generator.

A submit script generator is used to render a Job (together with all its properties, including JobSpec, ResourceSpec, etc.) into a submit script specific to a certain batch scheduler.

Parameters

config (JobExecutorConfig) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.

Return type

None

TemplatedScriptGenerator

class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)

Bases: SubmitScriptGenerator

A submit script generator based on Mustache templates.

This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).

Parameters
  • config (JobExecutorConfig) – A configuration, which is passed to the base class.

  • template_path (Path) – The path to a Mustache template.

  • escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.

Return type

None

Functions

bash_escape

bash_escape(o)

Escape object to bash string.

Renders and escapes an object to a string such that its value is preserved when substituted in a bash script between double quotes. Numeric values are simply rendered without any escaping. Path objects are converted to absolute paths and escaped. All other objects are converted to strings and escaped.
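The behaviour described above can be approximated as follows. escape_for_double_quotes is a hypothetical re-implementation, and the set of escaped characters (backslash, double quote, dollar sign, and backtick) is an assumption based on what bash treats specially inside double quotes; it is not taken from the bash_escape source.

```python
import re
from pathlib import Path

# Characters that keep a special meaning between double quotes in bash.
_SPECIAL = re.compile(r'([\\"$`])')


def escape_for_double_quotes(o: object) -> str:
    """Render `o` so that "<result>" in a bash script preserves its value."""
    if isinstance(o, (int, float)):
        return str(o)  # numbers need no escaping
    if isinstance(o, Path):
        o = o.absolute()
    # Prefix every special character with a backslash.
    return _SPECIAL.sub(r"\\\1", str(o))


print(escape_for_double_quotes('say "hi" to $USER'))
```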

Parameters

o (object) – The object to escape.

Returns

An escaped representation of the object that can be substituted in bash scripts.

Return type

str

check_status_exit_code

check_status_exit_code(command, exit_code, out)

Check if exit_code is nonzero and, if so, raise a RuntimeError.

This function produces a somewhat user-friendly exception message that combines the command that was run with its output.
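A function with this contract could look like the sketch below. raise_on_failure is a hypothetical stand-in, and the wording of the message is invented for the example.

```python
def raise_on_failure(command: str, exit_code: int, out: str) -> None:
    """Raise a RuntimeError describing the command if it exited non-zero."""
    if exit_code != 0:
        raise RuntimeError(
            f"Command '{command}' failed with exit code {exit_code}. "
            f"Output follows:\n{out}"
        )


raise_on_failure("qstat", 0, "")  # exit code zero: returns silently
```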

Parameters
  • command (str) – The command that was run. This is only used to format the error message.

  • exit_code (int) – The exit code returned by running the command.

  • out (str) – The output produced by command.

Return type

None

walltime_to_minutes

walltime_to_minutes(walltime)

Converts a walltime object to a number of minutes.

The walltime can either be a Python timedelta, an integer, in which case it is interpreted directly as a number of minutes, or a string with a format of either HH:MM:SS, HH:MM, or MM.
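The accepted formats can be illustrated with a small converter. to_minutes is a hypothetical re-implementation, and truncating partial minutes (for a timedelta or an HH:MM:SS string with leftover seconds) is an assumption of this sketch; the rounding used by walltime_to_minutes is not stated here.

```python
from datetime import timedelta
from typing import Union


def to_minutes(walltime: Union[timedelta, int, str]) -> int:
    """Convert a walltime to whole minutes, dropping any partial minute."""
    if isinstance(walltime, timedelta):
        return int(walltime.total_seconds() // 60)
    if isinstance(walltime, int):
        return walltime  # already a number of minutes
    parts = [int(part) for part in walltime.split(":")]
    if len(parts) == 1:    # MM
        return parts[0]
    if len(parts) == 2:    # HH:MM
        hours, minutes = parts
        return hours * 60 + minutes
    if len(parts) == 3:    # HH:MM:SS
        hours, minutes, seconds = parts
        return hours * 60 + minutes + seconds // 60
    raise ValueError(f"Unsupported walltime format: {walltime!r}")


print(to_minutes("01:30:00"), to_minutes("02:15"), to_minutes("45"))
```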

Parameters

walltime (Union[timedelta, int, str]) – the walltime to convert

Returns

The number of minutes represented by the walltime parameter.

Return type

int