The PSI/J API

The most important classes in this library are Job and JobExecutor, followed by Launcher.

The Job Class and Its Modifiers

The Job-related classes listed in this section (Job, JobSpec, ResourceSpec, and JobAttributes) are independent of executor implementations. The authors strongly recommend that users program against these classes rather than adding executor-specific configuration options, to the extent possible.

class Job(spec=None)[source]

Bases: object

This class represents a PSI/J job.

It encapsulates all of the information needed to run a job as well as the job’s state.

When constructed, a job is in the NEW state.

Parameters

spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.

Return type

None

cancel()[source]

Cancels this job.

The job is canceled by calling cancel() on the job executor that was used to submit this job.

Raises

SubmitException – if the job has not yet been submitted.

Return type

None

property id: str

A read-only property containing the PSI/J job ID.

The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.

property native_id: Optional[str]

A read-only property containing the native ID of the job.

The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor, in which case the value of this property is None.

set_job_status_callback(cb)[source]

Registers a status callback with this job.

The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.

The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by calling this method with None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job, job_status of type JobStatus, and returning nothing.

Return type

None
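As a rough illustration of the callable form of the callback, the sketch below defines a small recorder object that accepts the two documented arguments. The StatusRecorder class and the stand-in objects are hypothetical and exist only so the example runs without PSI/J installed; only the id and state attributes described in this section are assumed.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


class StatusRecorder:
    """Hypothetical callback that collects (job id, state) pairs."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []

    def __call__(self, job: Any, job_status: Any) -> None:
        # Only job.id and job_status.state are read, matching the documented
        # (Job, JobStatus) callback arguments.
        self.events.append((job.id, str(job_status.state)))


recorder = StatusRecorder()

# With PSI/J installed, registration would look like:
#     job.set_job_status_callback(recorder)
# Here the callback is driven by hand with stand-in objects instead.
fake_job = SimpleNamespace(id="job-1")
for state in ("QUEUED", "ACTIVE", "COMPLETED"):
    recorder(fake_job, SimpleNamespace(state=state))

print(recorder.events)
```

Using a callable object rather than a bare function keeps the collected history attached to the callback itself, which can be convenient in tests.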

spec

The job specification of this job.

property status: JobStatus

Contains the current status of the job.

It is guaranteed that the status returned by this property is monotonic in time with respect to the partial ordering of JobState values. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.

Returns

the current state of this job

wait(timeout=None, target_states=None)[source]

Waits for the job to reach certain states.

This method returns when the job reaches one of the target_states, a state following one of the target_states, or a final state, or when the amount of time indicated by the timeout parameter, if specified, has passed. It returns the JobStatus object that has one of the desired states, or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.

Parameters
  • timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.

  • target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.

Returns

The JobStatus object that caused this call to complete, or None if the timeout is specified and reached.

Return type

Optional[JobStatus]
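The rule for when wait() returns can be modeled in a few lines. The sketch below is not the library's implementation: it uses a stand-in State enum restricted to the six states named in the example above, and it assumes a normal progression of NEW, QUEUED, ACTIVE, COMPLETED with COMPLETED, FAILED, and CANCELED as final states.

```python
from enum import Enum
from typing import Iterable, Set


class State(Enum):
    """Stand-in for JobState, limited to the states in the example above."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


# Assumed normal progression of a job and assumed set of final states.
CHAIN = [State.NEW, State.QUEUED, State.ACTIVE, State.COMPLETED]
FINAL = {State.COMPLETED, State.FAILED, State.CANCELED}


def states_ending_wait(target_states: Iterable[State]) -> Set[State]:
    """Return every state that would end the wait under this model."""
    ending = set(FINAL)  # a final state always ends the wait
    for target in target_states:
        ending.add(target)
        if target in CHAIN:
            # States that follow the target in the normal progression.
            ending.update(CHAIN[CHAIN.index(target) + 1:])
    return ending


print(sorted(s.name for s in states_ending_wait([State.QUEUED])))
```

Under these assumptions, waiting for QUEUED is satisfied by the same five states listed in the description of wait().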

class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]

Bases: object

A class containing details about job transitions to new states.

Parameters
  • state (JobState) – The JobState of this status.

  • time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.

  • message (Optional[str]) – An optional message associated with the transition.

  • exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.

  • metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.

Return type

None

All constructor parameters are accessible as properties.

property final: bool

Returns the final property of the underlying state.

Returns

True if the state is final and False otherwise.

class JobState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: bytes, Enum

An enumeration holding the possible job states.

The possible states are: NEW, QUEUED, STAGE_IN, ACTIVE, STAGE_OUT, CLEANUP, COMPLETED, FAILED, and CANCELED.

ACTIVE = 3

This state represents an actively running job.

CANCELED = 8

Represents a job that was canceled by a call to cancel().

CLEANUP = 5

This state indicates that cleanup is actively being done for this job.

COMPLETED = 6

This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.

FAILED = 7

Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.

NEW = 0

This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.

QUEUED = 1

This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.

STAGE_IN = 2

This state indicates that the job is staging files in, in preparation for execution.

STAGE_OUT = 4

This state indicates that the executable has finished running and that files are being staged out.

property final: bool

Returns True if this state is final.

A state is final when no other state transition can occur after that state has been reached.

Returns

True if this is a final state and False otherwise

static from_name(name)[source]

Returns a JobState object corresponding to its string representation.

This method is such that state == JobState.from_name(str(state)).

Parameters

name (str) –

Return type

JobState

is_greater_than(other)[source]

Defines a (strict) partial ordering on the states.

Not all states are comparable. State transitions cannot violate this ordering.

Parameters

other (JobState) – the other JobState to compare to

Returns

if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.

Return type

Optional[bool]
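Because the result is a three-valued Optional[bool], callers need to distinguish False from None. The following toy model shows one way such a strict partial order could behave. It is a sketch under stated assumptions, not the PSI/J implementation: the predecessor table is invented for illustration, the three final states are assumed to be mutually incomparable, and equal states are assumed to yield False.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Stand-in for JobState with a reduced set of members."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


# Assumption: each state maps to the set of states that may precede it.
_PREDECESSORS = {
    State.NEW: set(),
    State.QUEUED: {State.NEW},
    State.ACTIVE: {State.NEW, State.QUEUED},
    State.COMPLETED: {State.NEW, State.QUEUED, State.ACTIVE},
    State.FAILED: {State.NEW, State.QUEUED, State.ACTIVE},
    State.CANCELED: {State.NEW, State.QUEUED, State.ACTIVE},
}


def is_greater_than(a: State, b: State) -> Optional[bool]:
    """True if a can come after b, False if it cannot, None if incomparable."""
    if a == b:
        return False  # strict ordering: a state does not follow itself
    if b in _PREDECESSORS[a]:
        return True
    if a in _PREDECESSORS[b]:
        return False
    return None  # e.g. two different final states


print(is_greater_than(State.ACTIVE, State.QUEUED))
print(is_greater_than(State.FAILED, State.CANCELED))
```

When consuming a result like this, testing with "is True", "is False", and "is None" avoids accidentally treating None as False.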

Job Modifiers

A lot of configuration information can go into each resource manager job: its walltime, partition/queue, the number of nodes it needs, what kind of nodes, what quality of service the job requires, and so on.

PSI/J splits those attributes into three groups: one for generic POSIX information, one for resource information, and one for resource manager scheduling policies.

class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None, stage_in=None, stage_out=None, cleanup=None, cleanup_flags=StageOutFlags.ALWAYS)[source]

Bases: object

A class that describes the details of a job.

Parameters
  • executable (Optional[str]) – An executable, such as “/bin/date”.

  • arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.

  • directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run.

  • name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.

  • inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by the environment property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.

  • environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.

  • stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.

  • stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.

  • stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.

  • resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.

  • attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.

  • pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

  • post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

  • launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.

  • stage_in (Optional[Set[StageIn]]) – Specifies a set of files to be staged in before the job is launched.

  • stage_out (Optional[Set[StageOut]]) – Specifies a set of files to be staged out after the job terminates.

  • cleanup (Optional[Set[Union[str, Path]]]) – Specifies a set of files to remove after the stage out process.

  • cleanup_flags (StageOutFlags) – Specifies the conditions under which the files in cleanup should be removed, such as when the job completes successfully. The flag StageOutFlags.IF_PRESENT is ignored and no error condition is triggered if a file specified by the cleanup argument is not present.

All constructor parameters are accessible as properties.
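The interaction between inherit_environment and environment can be pictured with a small helper. This is a hedged sketch rather than PSI/J code: effective_environment is a hypothetical function, and it assumes that explicitly specified variables take precedence over inherited ones and that integer values are converted to strings.

```python
from typing import Dict, Mapping, Union


def effective_environment(
    inherited: Mapping[str, str],
    specified: Mapping[str, Union[str, int]],
    inherit_environment: bool = True,
) -> Dict[str, str]:
    """Model the variables a job would see under the stated assumptions."""
    # Start from the inherited variables only when inheritance is enabled.
    env: Dict[str, str] = dict(inherited) if inherit_environment else {}
    # Assumption: specified variables override inherited ones, and
    # integer values are stored as strings.
    env.update({name: str(value) for name, value in specified.items()})
    return env


inherited = {"PATH": "/usr/bin", "HOME": "/home/foo"}
specified = {"OMP_NUM_THREADS": 4, "PATH": "/opt/bin"}

print(effective_environment(inherited, specified))
print(effective_environment(inherited, specified, inherit_environment=False))
```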

Note

A note about paths.

It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.

Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.

For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.

When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which, relative to the working directory of the launching application, would resolve to /home/foo/bar.txt. However, the job may start in /tmp on the compute node, and its standard output will then be redirected to /tmp/bar.txt.

Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
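The difference between the two resolutions described in this note can be reproduced with pathlib alone. The sketch below only models path arithmetic; the resolve helper is hypothetical, and the directories are the ones used in the example above.

```python
from pathlib import PurePosixPath


def resolve(working_directory: str, path: str) -> PurePosixPath:
    """Resolve path against a working directory; absolute paths win."""
    return PurePosixPath(working_directory) / path


# The same relative path, seen from the launching application and from
# a job that starts in a different directory on the compute node.
submit_side = resolve("/home/foo", "bar.txt")
compute_side = resolve("/tmp", "bar.txt")
print(submit_side, compute_side)

# An absolute path names the same location regardless of the working
# directory (though not necessarily the same file on different machines).
absolute = resolve("/tmp", "/home/foo/bar.txt")
print(absolute)
```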

property cleanup: Optional[Set[Path]]

An optional set of cleanup directives.

property directory: Optional[Path]

The directory, on the compute side, in which the executable is to be run.

property environment: Optional[Dict[str, str]]

Return the environment dict.

property name: Optional[str]

Returns the name of the job.

property post_launch: Optional[Path]

An optional path to a post-launch script.

The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

property pre_launch: Optional[Path]

An optional path to a pre-launch script.

The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

property stderr_path: Optional[Path]

A path to a file in which to place the standard error stream of the job.

property stdin_path: Optional[Path]

A path to a file whose contents will be sent to the job’s standard input.

property stdout_path: Optional[Path]

A path to a file in which to place the standard output stream of the job.

class ResourceSpec[source]

Bases: ABC

A base class for resource specifications.

The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.

static get_instance(version)[source]

Creates an instance of a ResourceSpec of the specified version.

Parameters

version (int) – The version of ResourceSpec to instantiate. For example, if version == 1, this method will return a new instance of ResourceSpecV1.

Return type

ResourceSpec

abstract property version: int

Returns the version of this resource specification class.

class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]

Bases: object

A class containing ancillary job information that describes how a job is to be run.

Parameters
  • duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.

  • queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.

  • account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the executor must map this attribute to the accounting/billing option in the underlying execution mechanism.

  • reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.

  • custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).

  • project_name (Optional[str]) – Deprecated. Please use the account attribute.

Return type

None

All constructor parameters are accessible as properties.
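To make the prefix convention for custom attributes concrete, the sketch below filters a dictionary by executor prefix and formats the remaining entries as directive lines. It is an illustration only: render_directives is a hypothetical helper, the value "knl" is made up, and the actual submit-script rendering performed by the executors may differ in detail.

```python
from typing import Dict, List


def render_directives(custom_attributes: Dict[str, object], prefix: str,
                      marker: str, long_form: bool) -> List[str]:
    """Format attributes named '<prefix>.<name>' as directive lines."""
    lines = []
    for key, value in custom_attributes.items():
        if not key.startswith(prefix + "."):
            continue  # attribute is addressed to a different executor
        name = key[len(prefix) + 1:]
        if long_form:
            lines.append(f"{marker} --{name}={value}")
        else:
            lines.append(f"{marker} -{name} {value}")
    return lines


attrs = {"slurm.constraint": "knl", "lsf.core_isolation": "0"}
print(render_directives(attrs, "slurm", "#SBATCH", long_form=True))
print(render_directives(attrs, "lsf", "#BSUB", long_form=False))
```

Because each executor only looks at its own prefix, one attribute dictionary can carry settings for several schedulers without conflict.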

property custom_attributes: Optional[Dict[str, object]]

Returns a dictionary with the custom attributes.

get_custom_attribute(name)[source]

Retrieves the value of a custom attribute.

Parameters

name (str) –

Return type

Optional[object]

static parse_walltime(walltime)[source]

Parses a walltime string into a timedelta.

The accepted walltime string formats are:
  • hh:mm:ss

  • hh:mm

  • mm

  • ns*[y|M|d|h|ms]

Parameters

walltime (str) – A string in one of the above formats representing a time duration

Returns

A timedelta representing the same time duration as the walltime parameter.

Return type

timedelta
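For the colon-separated and plain-minute formats, the conversion to a timedelta can be sketched as follows. This is an independent illustration and not the code behind parse_walltime(); the function name parse_colon_walltime is hypothetical, and the unit-suffixed format is deliberately left out.

```python
from datetime import timedelta


def parse_colon_walltime(walltime: str) -> timedelta:
    """Parse 'hh:mm:ss', 'hh:mm', or 'mm' into a timedelta."""
    parts = [int(p) for p in walltime.split(":")]
    if len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2:
        hours, minutes, seconds = parts[0], parts[1], 0
    elif len(parts) == 1:
        # A bare number is interpreted as minutes.
        hours, minutes, seconds = 0, parts[0], 0
    else:
        raise ValueError(f"unsupported walltime format: {walltime!r}")
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(parse_colon_walltime("01:30:00"))
print(parse_colon_walltime("45"))
```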

property project_name: Optional[str]

Deprecated. Please use the account attribute.

set_custom_attribute(name, value)[source]

Sets a custom attribute.

Return type

None

Executors

Executors are concrete implementations of mechanisms that execute jobs. To get an instance of a specific executor, call JobExecutor.get_instance(name), with name being one of the installed executor names. Alternatively, directly instantiate the executor, e.g.:

from psij.executors.flux import FluxJobExecutor

ex = FluxJobExecutor()

Rather than:

import psij

ex = psij.JobExecutor.get_instance('flux')

Executors can be installed from multiple sources, so the precise list of executors available to a specific installation of the PSI/J Python library can vary. In order to get a list of available executors, you can run, in a terminal:

$ python -m psij plugins

JobExecutor Base Class

The psij.JobExecutor class is abstract, but offers concrete static methods for registering, fetching, and listing subclasses of itself.

class JobExecutor(url=None, config=None)[source]

Bases: ABC

An abstract base class for all JobExecutor implementations.

Parameters
  • url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.

  • config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional such that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.

The concrete executor implementations provided by this version of PSI/J Python are:

Cobalt

class CobaltJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.

Return type

None

Flux

class FluxJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A JobExecutor for the Flux scheduler.

The Flux resource manager framework is deployed and used on a per-user basis at many sites, and is slated to become the system-level resource manager at LLNL.

Uses Flux’s Python library/bindings to submit, monitor, and manipulate jobs.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (Optional[JobExecutorConfig]) – The FluxJobExecutor does not have any configuration options.

Return type

None

LSF

class LsfJobExecutor(url, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the ‘bsub’, ‘bjobs’, and ‘bkill’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.


PBS

Slurm

class SlurmJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.


Local

class LocalJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A job executor that runs jobs locally using subprocess.Popen.

This job executor is intended to be used either to run jobs directly on the same machine as the PSI/J library or for testing purposes.

Note

In Linux, attached jobs always appear to complete with a zero exit code regardless of the actual exit code.

Warning

Instantiation of a local executor from both a parent process and a fork()-ed process is not guaranteed to work. In general, using fork() and multi-threading in Linux is unsafe, as suggested by the fork() man page. While PSI/J attempts to minimize problems that can arise when fork() is combined with threads (which are used by PSI/J), no guarantees can be made and the chances of unexpected behavior are high. Please do not use PSI/J with fork(). If you do, please be mindful that support for using PSI/J with fork() will be limited.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (JobExecutorConfig) – The LocalJobExecutor does not have any configuration options.

Return type

None

Radical Pilot

class RPJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A job executor that runs jobs via the RADICAL Pilot system.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (Optional[JobExecutorConfig]) – The RPJobExecutor does not have any configuration options.

Return type

None

Launchers

Launchers are mechanisms to start the actual jobs on batch schedulers once a set of nodes has been allocated for the job. In essence, launchers are wrappers around the job executable which can provide additional features, such as setting up an MPI environment, starting a copy of the job executable on each allocated node, etc. To get a launcher instance, call Launcher.get_instance(name) with name being the name of a launcher. Like job executors, launchers are plugins and can come from various places. To obtain a list of launchers, you can run:

$ python -m psij plugins

Launcher Base Class

Like the executor, the Launcher base class is abstract, but offers concrete static methods for registering and fetching subclasses of itself.

class Launcher(config=None)[source]

Bases: ABC

An abstract base class for all launchers.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.

Return type

None

The PSI/J Python library comes with a core set of launchers, which are:

aprun

class AprunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using Cobalt’s aprun.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

jsrun

class JsrunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using LSF’s jsrun.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

srun

class SrunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using Slurm’s srun.

See the Slurm Workload Manager.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

mpirun

class MPILauncher(config=None)[source]

Bases: MultipleLauncher

Launches jobs using mpirun.

mpirun is a tool provided by MPI implementations, such as Open MPI.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

single

class SingleLauncher(config=None)[source]

Bases: ScriptBasedLauncher

A launcher that launches a single copy of the executable. This is the default launcher.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

multiple

class MultipleLauncher(script_path=PosixPath('/home/runner/work/psij-python/psij-python/src/psij/launchers/scripts/multi_launch.sh'), config=None)[source]

Bases: ScriptBasedLauncher

A launcher that launches multiple identical copies of the executable.

The exit code of the job corresponds to the first non-zero exit code encountered in one of the executable copies or zero if all invocations of the executable succeed.
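The exit-code rule stated above amounts to a short reduction over the exit codes of the individual copies. The helper below is a hypothetical sketch of that rule; it assumes the codes are supplied in the order in which they are encountered.

```python
from typing import Iterable


def combined_exit_code(exit_codes: Iterable[int]) -> int:
    """Return the first non-zero exit code, or zero if all copies succeeded."""
    for code in exit_codes:
        if code != 0:
            return code
    return 0


print(combined_exit_code([0, 3, 0, 5]))
print(combined_exit_code([0, 0, 0]))
```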

get_additional_args(job)[source]

See get_additional_args().

Parameters

job (Job) –

Return type

List[str]

Other Package Contents

A collection of exceptions used by PSI/J.

exception InvalidJobException(message, exception=None)[source]

Bases: Exception

An exception describing a problem with a job specification.

Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

exception SubmitException(message, exception=None, transient=False)[source]

Bases: Exception

An exception representing job submission issues.

This exception is raised when the submit() call fails for a reason that is independent of the job that is being submitted.

Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

transient

Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
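Client code might act on the transient flag with a bounded retry loop along the following lines. This is a sketch, not part of PSI/J: submit_with_retry and FakeSubmitError are hypothetical names, the exception is inspected by attribute so the example runs without the library, and the retry count and delay are arbitrary.

```python
import time
from typing import Any, Callable, List


def submit_with_retry(submit: Callable[[Any], None], job: Any,
                      attempts: int = 3, delay_seconds: float = 5.0) -> int:
    """Call submit(job), retrying only failures flagged as transient.

    Returns the number of attempts that were made.
    """
    for attempt in range(1, attempts + 1):
        try:
            submit(job)
            return attempt
        except Exception as exc:
            transient = getattr(exc, "transient", False)
            if not transient or attempt == attempts:
                raise  # permanent failure, or out of attempts
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


class FakeSubmitError(Exception):
    """Stand-in for SubmitException carrying a transient flag."""

    def __init__(self, message: str, transient: bool = False) -> None:
        super().__init__(message)
        self.transient = transient


calls: List[Any] = []


def flaky_submit(job: Any) -> None:
    # Fails twice with a transient error, then succeeds.
    calls.append(job)
    if len(calls) < 3:
        raise FakeSubmitError("temporary DNS failure", transient=True)


used = submit_with_retry(flaky_submit, "job-1", attempts=5, delay_seconds=0)
print(used)
```

With PSI/J installed, the first argument would typically be the submit method of an executor instance. Since success is never guaranteed, keeping the number of attempts small and re-raising the last exception preserves the original error for the caller.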