psij package

Submodules

psij.descriptor module

Executor/Launcher descriptor module.

class Descriptor(name, version, cls, aliases=None, nice_name=None)[source]

Bases: object

This class is used to enable PSI/J to discover and register executors and/or launchers.

Executors wanting to register with PSI/J must place an instance of this class in a global module list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__ in a module placed in the psij-descriptors namespace package. In other words, in order to automatically register an executor or launcher, a python file should be created inside a psij-descriptors package, such as:

<project_root>/
    src/
        psij-descriptors/
            descriptors_for_project.py

It is essential that the psij-descriptors package not contain an __init__.py file in order for Python to treat the package as a namespace package. This allows Python to combine multiple psij-descriptors directories into one, which, in turn, allows PSI/J to detect and load all descriptors that can be found in Python’s library search path.

The contents of descriptors_for_project.py could then be as follows:

from packaging.version import Version
from psij.descriptor import Descriptor

__PSI_J_EXECUTORS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

__PSI_J_LAUNCHERS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.

Parameters
  • name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().

  • version (Version) – The version of the executor/launcher. Multiple versions can be registered under a single name.

  • cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.

  • aliases (Optional[List[str]]) – An optional list of alternative names under which the executor is also made available, as if each alias were its name.

  • nice_name (Optional[str]) – An optional string to use whenever a user-friendly name needs to be displayed to a user. For example, a nice name for pbs would be PBS or Portable Batch System. If not specified, the nice_name defaults to the value of the name parameter.

Return type

None
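The two defaults described above (nice_name falling back to name, and aliases acting as extra lookup keys) can be illustrated with a minimal sketch. The Descriptor class below is a stand-in written for this example, not the PSI/J class; build_index, the torque alias, and the example.PBSExecutor class name are hypothetical, and plain strings are used for versions to keep the sketch free of third-party imports.

```python
class Descriptor:
    """Stand-in with the documented constructor signature (illustration only)."""

    def __init__(self, name, version, cls, aliases=None, nice_name=None):
        self.name = name
        self.version = version  # a plain string here; the real class takes a Version
        self.cls = cls
        self.aliases = aliases or []
        # Documented default: nice_name falls back to name.
        self.nice_name = nice_name if nice_name is not None else name


def build_index(descriptors):
    """Hypothetical helper: map each name and alias to its descriptors."""
    index = {}
    for d in descriptors:
        for key in [d.name, *d.aliases]:
            # Several versions may be registered under the same key.
            index.setdefault(key, []).append(d)
    return index
```

How PSI/J actually stores registered descriptors is not covered by this reference; the dictionary above only mirrors the behavior the parameter descriptions promise.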

psij.exceptions module

A collection of exceptions used by PSI/J.

exception InvalidJobException(message, exception=None)[source]

Bases: Exception

An exception describing a problem with a job specification.

Parameters
Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

exception SubmitException(message, exception=None, transient=False)[source]

Bases: Exception

An exception representing job submission issues.

This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.

Parameters
Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

transient

Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
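One way client code might act on the transient flag is a bounded retry loop. The following is a minimal sketch under stated assumptions: the SubmitException class is a local stand-in that mirrors the documented signature (real code would import it from psij.exceptions), and submit_with_retry, attempts, and delay are hypothetical names introduced for illustration.

```python
import time


class SubmitException(Exception):
    """Stand-in mirroring the documented signature (illustration only)."""

    def __init__(self, message, exception=None, transient=False):
        super().__init__(message)
        self.message = message
        self.exception = exception
        self.transient = transient


def submit_with_retry(submit, job, attempts=3, delay=0.0):
    """Call submit(job), retrying only while the failure is marked transient.

    `submit` is any callable, such as the bound submit method of an executor.
    `attempts` must be at least 1. Returns the number of attempts used.
    """
    for attempt in range(1, attempts + 1):
        try:
            submit(job)
            return attempt
        except SubmitException as e:
            # Non-transient failures and the last attempt are re-raised as-is.
            if not e.transient or attempt == attempts:
                raise
            time.sleep(delay)
```

Bounding the number of attempts matters because, as noted above, a transient condition only suggests that a retry may succeed.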

psij.job module

class FunctionJobStatusCallback(fn)[source]

Bases: JobStatusCallback

A JobStatusCallback that wraps a function.

Initializes a FunctionJobStatusCallback.

Parameters

fn (Callable[[Job, JobStatus], None]) –

job_status_changed(job, job_status)[source]

See JobStatusCallback.job_status_changed().

Parameters
Return type

None

class Job(spec=None)[source]

Bases: object

This class represents a PSI/J job.

It encapsulates all of the information needed to run a job as well as the job’s state.

When constructed, a job is in the NEW state.

Parameters

spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.

Return type

None

cancel()[source]

Cancels this job.

The job is canceled by calling cancel() on the job executor that was used to submit this job.

Raises

SubmitException – if the job has not yet been submitted.

Return type

None

property id: str

A read-only property containing the PSI/J job ID.

The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.

property native_id: Optional[str]

A read-only property containing the native ID of the job.

The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor, in which case the value of this property is None.

set_job_status_callback(cb)[source]

Registers a status callback with this job.

The callback can either be an instance of a JobStatusCallback subclass or a callable accepting two arguments: a Job and a JobStatus.

The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by setting this property to None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job, job_status of type JobStatus, and returning nothing.

Return type

None

spec

The job specification of this job.

property status: JobStatus

Contains the current status of the job.

It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.

Returns

the current state of this job

wait(timeout=None, target_states=None)[source]

Waits for the job to reach certain states.

This method returns either when the job reaches one of the target_states, a state following one of the target_states, a final state, or when an amount of time indicated by the timeout parameter, if specified, passes. Returns the JobStatus object that has one of the desired states or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.

Parameters
  • timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.

  • target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.

Returns

The JobStatus object that caused this call to complete, or None if the timeout is specified and reached.

Return type

Optional[JobStatus]
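The return condition described for wait() can be modeled as a small predicate. This is a sketch of the rule only, not the library's implementation: State is a stand-in enum limited to the states named in the example above, its numeric values are arbitrary, and wait_would_return is a hypothetical name.

```python
from enum import Enum


class State(Enum):
    """Stand-in for JobState, limited to the states in the wait() example."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


FINAL = {State.COMPLETED, State.FAILED, State.CANCELED}


def wait_would_return(current, target_states=None):
    """True if a wait() on target_states would return in state `current`."""
    # Any final state ends the wait, whether or not it was requested.
    if current in FINAL:
        return True
    targets = FINAL if target_states is None else target_states
    # A non-final state satisfies the wait if it is a target or follows one.
    return any(current.value >= t.value for t in targets)
```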

class JobStatusCallback[source]

Bases: ABC

An interface used to listen to job status change events.

abstract job_status_changed(job, job_status)[source]

This method is invoked when a status change occurs on a job.

Client code interested in receiving status notifications must implement this method. It is entirely possible that psij.Job.status when referenced from the body of this method would return something different from the status passed to this callback. This is because the status of the job can be updated during the execution of the body of this method and, in particular, before the potential dereference to psij.Job.status is made.

Client code implementing this method must return quickly and cannot be used for lengthy processing. Furthermore, client code implementing this method should not throw exceptions.

Parameters
  • job (Job) – The job whose status has changed.

  • job_status (JobStatus) – The new status of the job.

Return type

None
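Because the callback must return quickly and should not raise, one common pattern is to have it only enqueue the event and to do the real work on a separate thread. The sketch below assumes this pattern; events, on_status, and drain are hypothetical names, and the callback relies only on the documented two-argument form.

```python
import queue
import threading

events = queue.Queue()


def on_status(job, job_status):
    # Only enqueue; never block or raise inside the callback itself.
    events.put((job, job_status))


def drain(handler):
    """Worker loop: process queued events until a None sentinel arrives."""
    while True:
        item = events.get()
        if item is None:
            break
        try:
            handler(*item)
        except Exception:
            pass  # keep the worker alive; real code would log the error
```

A function such as on_status could then be passed to set_job_status_callback(), with drain running in a threading.Thread started by the application.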

psij.job_attributes module

class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]

Bases: object

A class containing ancillary job information that describes how a job is to be run.

Parameters
  • duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.

  • queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.

  • account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the scheduler must map this attribute to the accounting/billing option in the underlying execution mechanism.

  • reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.

  • custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).

  • project_name (Optional[str]) – Deprecated. Please use the account attribute.

Return type

None

All constructor parameters are accessible as properties.
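The prefix convention for custom attributes can be sketched as a simple filter over the dictionary. This is an illustration, not the BatchSchedulerExecutor code: directives_for is a hypothetical helper, and the single-dash output layout is an assumption modeled on the PBS and LSF examples above; each real executor decides how an option is rendered.

```python
def directives_for(prefix, directive, custom_attributes):
    """Render attributes named '<prefix>.<option>' as scheduler directive lines."""
    lines = []
    for key, value in sorted((custom_attributes or {}).items()):
        name, dot, option = key.partition(".")
        # Attributes meant for other executors are skipped.
        if name == prefix and dot and option:
            lines.append(f"{directive} -{option} {value}")
    return lines
```

For instance, calling the helper with the prefix pbs and the directive #PBS on a dictionary that also holds slurm.* keys yields only the PBS lines.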

property custom_attributes: Optional[Dict[str, object]]

Returns a dictionary with the custom attributes.

get_custom_attribute(name)[source]

Retrieves the value of a custom attribute.

Parameters

name (str) –

Return type

Optional[object]

static parse_walltime(walltime)[source]

Parses a walltime string into a timedelta.

The accepted walltime string formats are:
  • hh:mm:ss
  • hh:mm
  • mm
  • ns*[y|M|d|h|ms]

Parameters

walltime (str) – A string in one of the above formats representing a time duration

Returns

A timedelta representing the same time duration as the walltime parameter.

Return type

timedelta
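To make the colon-separated formats concrete, here is a minimal standalone parser. It is an illustrative re-implementation, not the code behind parse_walltime(): parse_colon_walltime is a hypothetical name, it covers only the hh:mm:ss, hh:mm, and mm forms, and it assumes that a bare number means minutes, as the mm notation suggests.

```python
from datetime import timedelta


def parse_colon_walltime(walltime):
    """Illustrative parser for the hh:mm:ss, hh:mm and mm forms only."""
    parts = [int(p) for p in walltime.split(":")]
    if len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2:
        hours, minutes, seconds = parts[0], parts[1], 0
    elif len(parts) == 1:
        hours, minutes, seconds = 0, parts[0], 0
    else:
        raise ValueError(f"unrecognized walltime: {walltime!r}")
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```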

property project_name: Optional[str]

Deprecated. Please use the account attribute.

set_custom_attribute(name, value)[source]

Sets a custom attribute.

Parameters
Return type

None

psij.job_executor module

class JobExecutor(url=None, config=None)[source]

Bases: ABC

An abstract base class for all JobExecutor implementations.

Parameters
  • url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.

  • config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional such that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.

abstract attach(job, native_id)[source]

Attaches a job to a native job.

Parameters
  • job (Job) – A job to attach. The job must be in the NEW state.

  • native_id (str) – The native ID to attach to as returned by native_id.

Return type

None

abstract cancel(job)[source]

Cancels a job that has been submitted to the underlying executor implementation.

A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(target_states=JobState.CANCELED) or even job.wait(), since the latter would wait for all final states, including JobState.CANCELED. In fact, it is recommended that job.wait() be used because it is entirely possible for the job to complete before the cancellation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job will never enter the CANCELED state and job.wait(target_states=JobState.CANCELED) would hang indefinitely.

Parameters

job (Job) – The job to be canceled.

Raises

SubmitException – Thrown if the request cannot be sent to the underlying implementation.

Return type

None
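The recommendation to wait for any final state after a cancellation request can be wrapped in a small helper. This is a sketch: cancel_and_wait is a hypothetical function, and it relies only on the cancel(job) and wait(timeout=...) methods documented in this reference.

```python
def cancel_and_wait(executor, job, timeout=None):
    """Request cancellation, then wait for whichever final state comes first.

    `executor` and `job` are expected to provide the documented
    cancel(job) and wait(timeout=...) methods.
    """
    executor.cancel(job)
    # No target_states: the wait ends on COMPLETED, FAILED or CANCELED alike,
    # so a job that finished before the request took effect cannot cause a hang.
    return job.wait(timeout=timeout)
```

The returned status should then be inspected, since the job may have ended in a final state other than CANCELED.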

static get_executor_names()[source]

Returns a set of registered executor names.

Names returned by this method can be passed to get_instance() as the name parameter.

Returns

A set of executor names corresponding to the known executors.

Return type

Set[str]

static get_instance(name, version_constraint=None, url=None, config=None)[source]

Returns an instance of a JobExecutor.

Parameters
  • name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(). If the value of the name parameter is not one of the valid values returned by get_executor_names(), ValueError is raised.

  • version_constraint (Optional[str]) – A version constraint for the executor in the form '(' <op> <version>[, <op> <version>[, …]] ')', such as "( > 0.0.2, != 0.0.4)".

  • url (Optional[str]) – An optional URL to pass to the JobExecutor instance.

  • config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.

Returns

A JobExecutor.

Return type

JobExecutor

abstract list()[source]

List native IDs of all jobs known to the backend.

This method is meant to return a list of native IDs for jobs submitted to the backend by any means, not necessarily through this executor or through PSI/J.

Return type

List[str]

property name: str

Returns the name of this executor.

static register_executor(desc, root)[source]

Registers a JobExecutor class through a Descriptor.

The class can then be later instantiated using get_instance().

Parameters
  • desc (Descriptor) – A Descriptor with information about the executor to be registered.

  • root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.

Return type

None

set_job_status_callback(cb)[source]

Registers a status callback with this executor.

The callback can either be an instance of a JobStatusCallback subclass or a callable accepting two arguments: a Job and a JobStatus.

The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters: job of type Job and job_status of type JobStatus.

Return type

None

abstract submit(job)[source]

Submits a Job to the underlying implementation.

Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.

A successful return of this method guarantees that the job’s native_id property is set.

Raises
  • InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.

  • SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.

Parameters

job (Job) –

Return type

None

property version: packaging.version.Version

Returns the version of this executor.

psij.job_executor_config module

class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]

Bases: object

An abstract configuration class for JobExecutor instances.

Parameters
  • launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post-launch scripts) will be directed to this file.

  • work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

Return type

None

DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>

A default JobExecutorConfig used when none is specified.

DEFAULT_WORK_DIRECTORY = PosixPath('/home/runner/.psij/work')

The default work directory when a work directory is not explicitly specified.

property launcher_log_file: Optional[Path]

Configure the executor’s launcher log file.

Parameters

launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post-launch scripts) will be directed to this file.

property work_directory: Path

Configure the executor’s work directory.

Parameters

work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

psij.job_launcher module

This module contains the core classes of the launchers infrastructure.

class Launcher(config=None)[source]

Bases: ABC

An abstract base class for all launchers.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.

Return type

None

DEFAULT_LAUNCHER_NAME = 'single'

static get_instance(name, version_constraint=None, config=None)[source]

Returns an instance of a launcher optionally configured using a certain configuration.

The returned instance may or may not be a singleton object.

Parameters
Returns

A launcher instance.

Return type

Launcher

abstract get_launch_command(job)[source]

Constructs a command to launch a job given a job specification.

Parameters

job (Job) – The job to launch.

Returns

A list of strings representing the launch command and all of its arguments.

Return type

List[str]

static get_launcher_names()[source]

Returns a set of registered launcher names.

Names returned by this method can be passed to get_instance() as the name parameter.

Returns

A set of launcher names corresponding to the known launchers.

Return type

Set[str]

static register_launcher(desc, root)[source]

Registers a launcher class.

The registered class can then be instantiated using get_instance().

Parameters
  • desc (Descriptor) – A Descriptor with information about the launcher to register.

  • root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.

Return type

None

psij.job_spec module

class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None, stage_in=None, stage_out=None, cleanup=None, cleanup_flags=StageOutFlags.ALWAYS)[source]

Bases: object

A class that describes the details of a job.

Parameters
  • executable (Optional[str]) – An executable, such as “/bin/date”.

  • arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.

  • directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run.

  • name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.

  • inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by this property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.

  • environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.

  • stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.

  • stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.

  • stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.

  • resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.

  • attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.

  • pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

  • post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

  • launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.

  • stage_in (Optional[Set[StageIn]]) – Specifies a set of files to be staged in before the job is launched.

  • stage_out (Optional[Set[StageOut]]) – Specifies a set of files to be staged out after the job terminates.

  • cleanup (Optional[Set[Union[str, Path]]]) – Specifies a set of files to remove after the stage out process.

  • cleanup_flags (StageOutFlags) – Specifies the conditions under which the files in cleanup should be removed, such as when the job completes successfully. The flag StageOutFlags.IF_PRESENT is ignored and no error condition is triggered if a file specified by the cleanup argument is not present.

All constructor parameters are accessible as properties.
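The interaction between inherit_environment and environment can be pictured as a dictionary merge. The sketch below is an illustration under two assumptions not confirmed by this reference: that values given in the specification take precedence over inherited ones, and that integer values are converted to strings (which is consistent with the Dict[str, str] type of the environment property). effective_environment is a hypothetical name.

```python
def effective_environment(inherited, environment=None, inherit_environment=True):
    """Compose the variables a job would see (illustrative model only)."""
    # Start from the inherited variables, or from nothing if inheritance is off.
    base = dict(inherited) if inherit_environment else {}
    for key, value in (environment or {}).items():
        # Assumption: explicit values win and are stored as strings.
        base[key] = str(value)
    return base
```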

Note

A note about paths.

It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.

Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.

For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.

When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt. However, the job may start in /tmp on the compute node, and its standard output will be redirected to /tmp/bar.txt.

Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
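The effect of the working directory on a relative path can be reproduced with pathlib alone. The sketch below reuses the /home/foo and /tmp directories from the example in this note; resolve is a hypothetical helper, and PurePosixPath is used so that nothing touches the filesystem.

```python
from pathlib import PurePosixPath


def resolve(working_directory, path):
    """Resolve `path` the way a process started in `working_directory` would."""
    return PurePosixPath(working_directory) / path


stdout_path = PurePosixPath("bar.txt")

# The same relative path names two different files.
on_login_node = resolve("/home/foo", stdout_path)
on_compute_node = resolve("/tmp", stdout_path)

# An absolute path is unaffected by the working directory.
absolute = resolve("/tmp", "/home/foo/bar.txt")
```

This is why absolute paths are the safer default, with relative paths reserved for the case where the job directory is chosen by the scheduler.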

property cleanup: Optional[Set[Path]]

An optional set of cleanup directives.

property directory: Optional[Path]

The directory, on the compute side, in which the executable is to be run.

property environment: Optional[Dict[str, str]]

Return the environment dict.

property name: Optional[str]

Returns the name of the job.

property post_launch: Optional[Path]

An optional path to a post-launch script.

The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

property pre_launch: Optional[Path]

An optional path to a pre-launch script.

The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

property stderr_path: Optional[Path]

A path to a file in which to place the standard error stream of the job.

property stdin_path: Optional[Path]

A path to a file whose contents will be sent to the job’s standard input.

property stdout_path: Optional[Path]

A path to a file in which to place the standard output stream of the job.

psij.job_state module

class JobState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: bytes, Enum

An enumeration holding the possible job states.

The possible states are: NEW, QUEUED, STAGE_IN, ACTIVE, STAGE_OUT, CLEANUP, COMPLETED, FAILED, and CANCELED.

ACTIVE = 3

This state represents an actively running job.

CANCELED = 8

Represents a job that was canceled by a call to cancel().

CLEANUP = 5

This state indicates that cleanup is actively being done for this job.

COMPLETED = 6

This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.

FAILED = 7

Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.

NEW = 0

This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.

QUEUED = 1

This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.

STAGE_IN = 2

This state indicates that the job is staging files in, in preparation for execution.

STAGE_OUT = 4

This state indicates that the executable has finished running and that files are being staged out.

property final: bool

Returns True if this state is final.

A state is final when no other state transition can occur after that state has been reached.

Returns

True if this is a final state and False otherwise

static from_name(name)[source]

Returns a JobState object corresponding to its string representation.

This method is such that state == JobState.from_name(str(state)).

Parameters

name (str) –

Return type

JobState

is_greater_than(other)[source]

Defines a (strict) partial ordering on the states.

Not all states are comparable. State transitions cannot violate this ordering.

Parameters

other (JobState) – the other JobState to compare to

Returns

If this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.

Return type

Optional[bool]
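As a rough illustration of the contract described above, the following stand-alone sketch models a strict partial order that returns None for incomparable states. The State enum is a hypothetical stand-in for JobState, and the rule that any final state may follow any non-final state is an assumption made for this example; it is not the implementation used by the library.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical stand-in for JobState, using the documented values."""
    NEW = 0
    QUEUED = 1
    STAGE_IN = 2
    ACTIVE = 3
    STAGE_OUT = 4
    CLEANUP = 5
    COMPLETED = 6
    FAILED = 7
    CANCELED = 8


# Assumption: these three states are the final ones.
FINAL = {State.COMPLETED, State.FAILED, State.CANCELED}


def is_greater_than(a: State, b: State) -> Optional[bool]:
    """Return True if a can come after b, False if not, None if incomparable."""
    if a is b:
        return False  # the ordering is strict
    if a in FINAL and b in FINAL:
        return None   # no transition is possible between two final states
    if a in FINAL:
        return True   # assumption: a final state may follow any non-final state
    if b in FINAL:
        return False
    return a.value > b.value  # non-final states form a simple chain here
```

Callers should test the result with "is None" before treating it as a boolean, since None and False mean different things.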

class JobStateOrder[source]

Bases: object

A class that can be used to reconstruct missing states.

static prev(state)[source]

Returns the state previous to the given state.

The “previous” state is a state that must have occurred immediately prior to this state given the state transition diagram if such a state is unique. Not all states have a previous state. For example, the FAILED state does not have a previous state, since it can be reached from multiple states.

Parameters

state (JobState) –

Return type

Optional[JobState]
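The idea of reconstructing missing states can be sketched with a simple predecessor table. The table below is an assumption based on the order of the documented state values, and missing_states is a hypothetical helper; neither is taken from the library.

```python
from typing import Dict, List, Optional

# Assumed predecessor table. States reachable from several states
# (FAILED, CANCELED) and the initial state have no unique predecessor.
PREV: Dict[str, Optional[str]] = {
    "NEW": None,
    "QUEUED": "NEW",
    "STAGE_IN": "QUEUED",
    "ACTIVE": "STAGE_IN",
    "STAGE_OUT": "ACTIVE",
    "CLEANUP": "STAGE_OUT",
    "COMPLETED": "CLEANUP",
    "FAILED": None,
    "CANCELED": None,
}


def missing_states(last_seen: str, new: str) -> List[str]:
    """Return the states skipped between last_seen and new, oldest first."""
    skipped: List[str] = []
    current = PREV[new]
    # Walk backwards until the last reported state (or the chain start) is hit.
    while current is not None and current != last_seen:
        skipped.append(current)
        current = PREV[current]
    skipped.reverse()
    return skipped
```

A backend that only reports QUEUED and then COMPLETED would, under these assumptions, have the four intermediate states filled in, while a jump to FAILED yields nothing to reconstruct.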

psij.job_status module

class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]

Bases: object

A class containing details about job transitions to new states.

Parameters
  • state (JobState) – The JobState of this status.

  • time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.

  • message (Optional[str]) – An optional message associated with the transition.

  • exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.

  • metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.

Return type

None

All constructor parameters are accessible as properties.

property final: bool

Returns the final property of the underlying state.

Returns

True if the state is final and False otherwise.

psij.launcher module

psij.resource_spec module

class ResourceSpec[source]

Bases: ABC

A base class for resource specifications.

The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.

static get_instance(version)[source]

Creates an instance of a ResourceSpec of the specified version.

Parameters

version (int) – The version of ResourceSpec to instantiate. For example, if version == 1, this method will return a new instance of ResourceSpecV1.

Return type

ResourceSpec

abstract property version: int

Returns the version of this resource specification class.

class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=False, memory=None)[source]

Bases: ResourceSpec

This class implements V1 of the PSI/J resource specification.

Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.

Parameters
  • node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.

  • process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.

  • processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.

  • cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.

  • gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.

  • exclusive_node_use (bool) – If this parameter is set to True, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, which is the default, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs total less than the amount available on the node.

  • memory (Optional[int]) – The total amount, in bytes, of memory requested for the job.

Return type

None

All constructor parameters are accessible as properties.

property computed_node_count: int

Returns or calculates a node count.

If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.

Returns

An integer value with the specified or calculated node count.

property computed_process_count: int

Returns or calculates a process count.

If the process_count property is specified, this method returns it, otherwise it returns 1.

Returns

An integer value with either the value of process_count or one if the former is not specified.

property computed_processes_per_node: int

Returns or calculates the number of processes per node.

If the processes_per_node property is specified, this method returns it, otherwise calculates it based on process_count and node_count if possible, or defaults to 1.

Returns

An integer value with either the value of processes_per_node or one if the former cannot be determined.
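The inference rules described for the three computed properties can be sketched as a single stand-alone function. This is only an approximation under stated assumptions: infer_counts is a hypothetical helper, rounding up when the division is not exact is a choice made for this example, and the library's actual behavior may differ.

```python
import math
from typing import Optional, Tuple


def infer_counts(node_count: Optional[int] = None,
                 process_count: Optional[int] = None,
                 processes_per_node: Optional[int] = None) -> Tuple[int, int, int]:
    """Return (node_count, process_count, processes_per_node) with gaps filled."""
    if None not in (node_count, process_count, processes_per_node):
        # All three given: the constraint must hold exactly.
        if process_count != node_count * processes_per_node:
            raise ValueError("process_count != node_count * processes_per_node")
    if process_count is None:
        if node_count is not None and processes_per_node is not None:
            process_count = node_count * processes_per_node
        else:
            process_count = 1  # documented default
    if processes_per_node is None:
        if node_count is not None:
            # Assumption: round up when the division is not exact.
            processes_per_node = math.ceil(process_count / node_count)
        else:
            processes_per_node = 1  # documented fallback
    if node_count is None:
        node_count = math.ceil(process_count / processes_per_node)
    return node_count, process_count, processes_per_node
```

For instance, infer_counts(node_count=2, processes_per_node=4) fills in a process count of 8, and calling it with no arguments yields one process on one node.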

property memory_kb: Optional[int]

Returns the memory limit specified by the memory property, but in KB.

Returns

If the memory property is set on this object, returns memory // 1024. If the memory property is None, this method returns None.

property version: int

Returns the version of this ResourceSpec, which is 1 for this class.

psij.serialize module

class JSONSerializer[source]

Bases: Serializer

A JSON serializer.

class Serializer[source]

Bases: ABC

A base class for serializers.

This class takes care of converting a JobSpec instance, including all its properties, into an intermediate representation consisting of a tree of standard dictionaries and lists, where dictionary keys are guaranteed to be strings and values are limited to dictionaries, lists, str, int, and bool. It also takes care of making the reverse conversion. Concrete implementations of serializers should extend this class and implement the _dump_dict and _load_dict methods, which convert the intermediate representation to the actual serialized format.

Serializer implementations can also directly override the dump, dumps, load, and loads methods to bypass the intermediate representations and implement (de)serialization directly.
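The division of labor between the base class and a concrete serializer can be sketched as follows. This stand-alone model starts directly from the intermediate dictionary and skips the JobSpec conversion; the class names are hypothetical, and the signatures given to _dump_dict and _load_dict are assumptions, since they are not documented here.

```python
import io
import json
from abc import ABC, abstractmethod
from typing import IO, Any, Dict


class SketchSerializer(ABC):
    """Hypothetical base class operating on the intermediate dictionary."""

    @abstractmethod
    def _dump_dict(self, d: Dict[str, Any], stream: IO[str]) -> None:
        """Write the intermediate representation in the concrete format."""

    @abstractmethod
    def _load_dict(self, stream: IO[str]) -> Dict[str, Any]:
        """Read the concrete format back into the intermediate representation."""

    def dump(self, d: Dict[str, Any], stream: IO[str]) -> None:
        self._dump_dict(d, stream)

    def dumps(self, d: Dict[str, Any]) -> str:
        buffer = io.StringIO()
        self.dump(d, buffer)
        return buffer.getvalue()

    def load(self, stream: IO[str]) -> Dict[str, Any]:
        return self._load_dict(stream)

    def loads(self, s: str) -> Dict[str, Any]:
        return self.load(io.StringIO(s))


class SketchJSONSerializer(SketchSerializer):
    """A text-based format only needs the two format-specific hooks."""

    def _dump_dict(self, d: Dict[str, Any], stream: IO[str]) -> None:
        json.dump(d, stream)

    def _load_dict(self, stream: IO[str]) -> Dict[str, Any]:
        return json.load(stream)
```

Because the intermediate representation is limited to dictionaries, lists, str, int, and bool, a round trip through JSON preserves it without custom encoders.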

dump(spec, stream)[source]

Serialize the given JobSpec and write the results to stream.

Parameters
  • spec (JobSpec) – The JobSpec to serialize.

  • stream (IO) – A stream to write the serialized JobSpec to. Concrete serializers may require that the stream be a binary or text stream.

Return type

None

dumps(spec)[source]

Serialize the given JobSpec to a string.

Serializer implementations that use a binary protocol must override this method and raise an error.

Parameters

spec (JobSpec) – The JobSpec to serialize.

Returns

A string representation of the spec.

Return type

str

load(stream)[source]

Deserialize the contents of a stream to an instance of JobSpec.

Parameters

stream (IO) – A stream to read the serialized JobSpec from. Concrete serializers may require that the stream be a binary or text stream.

Returns

The deserialized JobSpec instance.

Return type

JobSpec

loads(s)[source]

Deserialize a JobSpec from a string.

Serializer implementations that use a binary protocol must override this method and raise an error.

Parameters

s (str) – The string containing the serialized representation of a JobSpec.

Returns

The deserialized JobSpec instance.

Return type

JobSpec

psij.utils module

class SingletonThread(name=None, daemon=False)[source]

Bases: Thread

A convenience class to return a thread that is guaranteed to be unique to this process.

This is intended to work with fork() to ensure that each os.getpid() value is associated with at most one thread. This is not safe. The safe thing, as pointed out by the fork() man page, is to not use fork() with threads. However, this is here in an attempt to make it slightly safer for when users really really want to take the risk against all advice.

This class is meant as an abstract class and should be used by subclassing and implementing the run method.

Instantiation of this class or one of its subclasses should be done through the get_instance() method rather than directly.

Parameters
  • name (Optional[str]) – An optional name for this thread.

  • daemon (bool) – A daemon thread does not prevent the process from exiting.

Return type

None

classmethod get_instance()[source]

Returns a started instance of this thread.

The instance is guaranteed to be unique for this process. This method also guarantees that a forked process will get a separate instance of this thread from the parent.

Return type

SingletonThread
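One way to obtain a thread that is unique per process is to key the instances by os.getpid(), so that a forked child, which has a different PID, creates its own thread on first use. The sketch below illustrates that idea only; PerProcessThread and Heartbeat are hypothetical names, the code is not the library's implementation, and it inherits all the caveats about mixing fork() with threads mentioned above.

```python
import os
import threading
from typing import Dict, Tuple


class PerProcessThread(threading.Thread):
    """Hypothetical base class: at most one started instance per class and PID."""

    _instances: Dict[Tuple[type, int], "PerProcessThread"] = {}
    _lock = threading.RLock()

    @classmethod
    def get_instance(cls) -> "PerProcessThread":
        key = (cls, os.getpid())
        with cls._lock:
            instance = cls._instances.get(key)
            if instance is None:
                # A child created by fork() has a new PID and lands here again.
                instance = cls(daemon=True)
                cls._instances[key] = instance
                instance.start()
            return instance


class Heartbeat(PerProcessThread):
    """Example subclass; real work would go into run()."""

    def run(self) -> None:
        self.started_in = os.getpid()
```

Calling Heartbeat.get_instance() twice in the same process returns the same object, and the thread is already started when it is returned.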

psij.version module

This module stores the current version of this library.

Module contents

The package containing the jobs module of this PSI implementation.

exception InvalidJobException(message, exception=None)[source]

Bases: Exception

An exception describing a problem with a job specification.

Parameters
Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

class Job(spec=None)[source]

Bases: object

This class represents a PSI/J job.

It encapsulates all of the information needed to run a job as well as the job’s state.

When constructed, a job is in the NEW state.

Parameters

spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.

Return type

None

cancel()[source]

Cancels this job.

The job is canceled by calling cancel() on the job executor that was used to submit this job.

Raises

SubmitException – if the job has not yet been submitted.

Return type

None

executor: Optional[JobExecutor]

property id: str

A read-only property containing the PSI/J job ID.

The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.

property native_id: Optional[str]

A read-only property containing the native ID of the job.

The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor, in which case the value of this property is None.

set_job_status_callback(cb)[source]

Registers a status callback with this job.

The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.

The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by setting this property to None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job, job_status of type JobStatus, and returning nothing.

Return type

None

spec

The job specification of this job.

property status: JobStatus

Contains the current status of the job.

It is guaranteed that the status returned by this property is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.

Returns

the current state of this job

wait(timeout=None, target_states=None)[source]

Waits for the job to reach certain states.

This method returns when the job reaches one of the target_states, a state following one of the target_states, or a final state, or when the amount of time indicated by the timeout parameter, if specified, has passed. It returns the JobStatus object that has one of the desired states, or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.

Parameters
  • timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.

  • target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.

Returns

The JobStatus object that caused this call to complete, or None if the timeout is specified and reached.

Return type

Optional[JobStatus]
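The wake-up rule for wait() can be modeled with a condition variable. In this stand-alone sketch, states are plain strings, the chain in ORDER is an assumption based on the documented state values, and Waitable is a hypothetical class that returns the state name rather than a JobStatus.

```python
import threading
from datetime import timedelta
from typing import List, Optional, Sequence

# Assumed chain of states; FAILED and CANCELED sit outside of it.
ORDER: List[str] = ["NEW", "QUEUED", "STAGE_IN", "ACTIVE",
                    "STAGE_OUT", "CLEANUP", "COMPLETED"]
FINAL = {"COMPLETED", "FAILED", "CANCELED"}


def ends_wait(state: str, targets: Optional[Sequence[str]]) -> bool:
    """True if being in `state` should end a wait on `targets`."""
    if state in FINAL:
        return True  # a final state always ends the wait
    if not targets:
        return False  # no targets given: only final states count
    position = ORDER.index(state)
    return any(t in ORDER and position >= ORDER.index(t) for t in targets)


class Waitable:
    """Hypothetical job-like object holding only a state name."""

    def __init__(self) -> None:
        self.state = "NEW"
        self._cond = threading.Condition()

    def set_state(self, state: str) -> None:
        with self._cond:
            self.state = state
            self._cond.notify_all()

    def wait(self, timeout: Optional[timedelta] = None,
             target_states: Optional[Sequence[str]] = None) -> Optional[str]:
        seconds = timeout.total_seconds() if timeout is not None else None
        with self._cond:
            reached = self._cond.wait_for(
                lambda: ends_wait(self.state, target_states), seconds)
            return self.state if reached else None
```

In this model, a waiter on CANCELED is released when the job reaches FAILED, because every final state ends the wait.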

class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]

Bases: object

A class containing ancillary job information that describes how a job is to be run.

Parameters
  • duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.

  • queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.

  • account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the scheduler must map this attribute to the accounting/billing option in the underlying execution mechanism.

  • reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.

  • custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).

  • project_name (Optional[str]) – Deprecated. Please use the account attribute.

Return type

None

All constructor parameters are accessible as properties.
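The prefix convention for custom attributes can be illustrated with a small filter. The helper attributes_for is hypothetical and only shows how an executor might pick out the attributes addressed to it; how each executor actually turns them into scheduler directives is not modeled.

```python
from typing import Dict


def attributes_for(prefix: str, custom: Dict[str, object]) -> Dict[str, object]:
    """Select the attributes addressed to one executor and strip its prefix."""
    marker = prefix + "."
    return {
        name[len(marker):]: value
        for name, value in custom.items()
        if name.startswith(marker)
    }
```

Given a dictionary with slurm.constraint, pbs.c, and lsf.core_isolation keys, attributes_for("slurm", ...) keeps only the constraint entry, so one specification can carry hints for several schedulers at once.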

property custom_attributes: Optional[Dict[str, object]]

Returns a dictionary with the custom attributes.

get_custom_attribute(name)[source]

Retrieves the value of a custom attribute.

Parameters

name (str) –

Return type

Optional[object]

static parse_walltime(walltime)[source]

Parses a walltime string into a timedelta.

The accepted walltime string formats are:
  • hh:mm:ss
  • hh:mm
  • mm
  • ns*[y|M|d|h|ms]

Parameters

walltime (str) – A string in one of the above formats representing a time duration

Returns

A timedelta representing the same time duration as the walltime parameter.

Return type

timedelta
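The three colon-separated formats can be handled by a very small parser. The sketch below is an independent illustration, not the library's implementation; parse_colon_walltime is a hypothetical name, and the unit-suffixed format is deliberately left out.

```python
from datetime import timedelta


def parse_colon_walltime(walltime: str) -> timedelta:
    """Parse 'hh:mm:ss', 'hh:mm', or 'mm' into a timedelta."""
    parts = walltime.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unsupported walltime: {walltime!r}")
    numbers = [int(p) for p in parts]
    if len(numbers) == 1:  # mm
        return timedelta(minutes=numbers[0])
    if len(numbers) == 2:  # hh:mm
        return timedelta(hours=numbers[0], minutes=numbers[1])
    # hh:mm:ss
    return timedelta(hours=numbers[0], minutes=numbers[1], seconds=numbers[2])
```

Note that a bare number is read as minutes, so "90" and "01:30" describe the same duration.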

property project_name: Optional[str]

Deprecated. Please use the account attribute.

set_custom_attribute(name, value)[source]

Sets a custom attribute.

Parameters
Return type

None

class JobExecutor(url=None, config=None)[source]

Bases: ABC

An abstract base class for all JobExecutor implementations.

Parameters
  • url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.

  • config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional such that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.

abstract attach(job, native_id)[source]

Attaches a job to a native job.

Parameters
  • job (Job) – A job to attach. The job must be in the NEW state.

  • native_id (str) – The native ID to attach to as returned by native_id.

Return type

None

abstract cancel(job)[source]

Cancels a job that has been submitted to the underlying executor implementation.

A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(JobState.CANCELED) or even job.wait(), since the latter would wait for all final states, including JobState.CANCELED. In fact, it is recommended that job.wait() be used because it is entirely possible for the job to complete before the cancellation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job will never enter the CANCELED state and job.wait(JobState.CANCELED) would hang indefinitely.

Parameters

job (Job) – The job to be canceled.

Raises

SubmitException – Thrown if the request cannot be sent to the underlying implementation.

Return type

None

static get_executor_names()[source]

Returns a set of registered executor names.

Names returned by this method can be passed to get_instance() as the name parameter.

Returns

A set of executor names corresponding to the known executors.

Return type

Set[str]

static get_instance(name, version_constraint=None, url=None, config=None)[source]

Returns an instance of a JobExecutor.

Parameters
  • name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(). If the value of the name parameter is not one of the valid values returned by get_executor_names(), ValueError is raised.

  • version_constraint (Optional[str]) – A version constraint for the executor in the form '(' <op> <version>[, <op> <version>[, …]] ')', such as “( > 0.0.2, != 0.0.4)”.

  • url (Optional[str]) – An optional URL to pass to the JobExecutor instance.

  • config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.

Returns

A JobExecutor.

Return type

JobExecutor
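To make the version_constraint syntax concrete, the following stand-alone sketch evaluates such a constraint string against a version. It is a simplified illustration: satisfies is a hypothetical helper, the set of operators is an assumption, and versions are compared as tuples of integers rather than with the rules PSI/J actually applies.

```python
import operator
import re
from typing import Callable, Dict, Tuple

Version = Tuple[int, ...]

# Assumed operator set; the operators accepted by the library may differ.
_OPS: Dict[str, Callable[[Version, Version], bool]] = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}
_CLAUSE = re.compile(r"\s*(==|!=|>=|<=|>|<)\s*(\d+(?:\.\d+)*)\s*")


def _version(text: str) -> Version:
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, constraint: str) -> bool:
    """Check a dotted numeric version against '( <op> <version>, ... )'."""
    body = constraint.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("constraint must be enclosed in parentheses")
    for clause in body[1:-1].split(","):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"bad clause: {clause!r}")
        op, required = match.group(1), match.group(2)
        if not _OPS[op](_version(version), _version(required)):
            return False  # every clause must hold
    return True
```

With the constraint from the parameter description, version 0.0.3 is accepted while 0.0.2 and 0.0.4 are rejected.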

abstract list()[source]

List native IDs of all jobs known to the backend.

This method is meant to return a list of native IDs for jobs submitted to the backend by any means, not necessarily through this executor or through PSI/J.

Return type

List[str]

property name: str

Returns the name of this executor.

static register_executor(desc, root)[source]

Registers a JobExecutor class through a Descriptor.

The class can then be later instantiated using get_instance().

Parameters
  • desc (Descriptor) – A Descriptor with information about the executor to be registered.

  • root (str) – A filesystem path under which the implementation of the executor must be located. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.

Return type

None

set_job_status_callback(cb)[source]

Registers a status callback with this executor.

The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.

The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters: job of type Job and job_status of type JobStatus.

Return type

None

abstract submit(job)[source]

Submits a Job to the underlying implementation.

Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.

A successful return of this method guarantees that the job’s native_id property is set.

Raises
  • InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.

  • SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.

Parameters

job (Job) –

Return type

None
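Since only one of the two exceptions can have a transient cause, client code may want to retry one and not the other. The sketch below shows that pattern with locally defined stand-in exception classes; submit_with_retry and both class names are hypothetical, and whether a particular submission failure is worth retrying remains a judgment call.

```python
import time
from typing import Callable, TypeVar

J = TypeVar("J")


class InvalidJobError(Exception):
    """Stand-in for InvalidJobException: resubmitting the same job cannot help."""


class TransientSubmitError(Exception):
    """Stand-in for SubmitException: the cause may be temporary."""


def submit_with_retry(submit: Callable[[J], None], job: J,
                      attempts: int = 3, delay: float = 0.0) -> int:
    """Call submit(job), retrying submission errors; return the attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            submit(job)
            return attempt
        except TransientSubmitError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")
```

An InvalidJobError is not caught at all, so it reaches the caller on the first attempt.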

property version: packaging.version.Version

Returns the version of this executor.

class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]

Bases: object

An abstract configuration class for JobExecutor instances.

Parameters
  • launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.

  • work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

Return type

None

DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>

A default JobExecutorConfig used when none is specified.

DEFAULT_WORK_DIRECTORY = PosixPath('/home/runner/.psij/work')

The default work directory when a work directory is not explicitly specified.

property launcher_log_file: Optional[Path]

Configure the executor’s launcher log file.

Parameters

launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.

property work_directory: Path

Configure the executor’s work directory.

Parameters

work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None, stage_in=None, stage_out=None, cleanup=None, cleanup_flags=StageOutFlags.ALWAYS)[source]

Bases: object

A class that describes the details of a job.

Parameters
  • executable (Optional[str]) – An executable, such as “/bin/date”.

  • arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.

  • directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run

  • name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.

  • inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by this property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.

  • environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.

  • stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.

  • stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.

  • stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.

  • resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.

  • attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.

  • pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

  • post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

  • launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.

  • stage_in (Optional[Set[StageIn]]) – Specifies a set of files to be staged in before the job is launched.

  • stage_out (Optional[Set[StageOut]]) – Specifies a set of files to be staged out after the job terminates.

  • cleanup (Optional[Set[Union[str, Path]]]) – Specifies a set of files to remove after the stage out process.

  • cleanup_flags (StageOutFlags) – Specifies the conditions under which the files in cleanup should be removed, such as when the job completes successfully. The flag StageOutFlags.IF_PRESENT is ignored and no error condition is triggered if a file specified by the cleanup argument is not present.

All constructor parameters are accessible as properties.
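The interaction between inherit_environment and environment can be sketched as a small function. This is an illustration under assumptions: effective_environment is a hypothetical helper, variables from the specification are assumed to override inherited ones, and integer values are assumed to be converted to strings.

```python
from typing import Dict, Mapping, Optional, Union


def effective_environment(inherit: bool,
                          extra: Optional[Mapping[str, Union[str, int]]],
                          inherited: Mapping[str, str]) -> Dict[str, str]:
    """Combine inherited variables with those given in the specification."""
    # With inherit=False the job starts from an empty environment.
    env: Dict[str, str] = dict(inherited) if inherit else {}
    for name, value in (extra or {}).items():
        env[name] = str(value)  # assumption: spec values win and become strings
    return env
```

With inherit set to False, only the variables passed in extra are visible to the job.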

Note

A note about paths.

It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.

Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.

For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.

When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt. However, the job may start in /tmp on the compute node, and its standard output will be redirected to /tmp/bar.txt.

Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
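The effect of the working directory on relative paths can be reproduced with pathlib alone. The helper resolve below is a hypothetical name used only to show how the same relative path lands in different places, while an absolute path is unaffected.

```python
from pathlib import PurePosixPath


def resolve(working_directory: str, path: str) -> PurePosixPath:
    """Resolve path the way a process running in working_directory would."""
    # Joining with an absolute path discards the working directory.
    return PurePosixPath(working_directory) / path
```

For example, resolve("/home/foo", "bar.txt") and resolve("/tmp", "bar.txt") give different files, whereas an absolute path such as /scratch/out.txt resolves to itself from either directory.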

property cleanup: Optional[Set[Path]]

An optional set of cleanup directives.

property directory: Optional[Path]

The directory, on the compute side, in which the executable is to be run.

property environment: Optional[Dict[str, str]]

Return the environment dict.

property name: Optional[str]

Returns the name of the job.

property post_launch: Optional[Path]

An optional path to a post-launch script.

The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

property pre_launch: Optional[Path]

An optional path to a pre-launch script.

The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

property stderr_path: Optional[Path]

A path to a file in which to place the standard error stream of the job.

property stdin_path: Optional[Path]

A path to a file whose contents will be sent to the job’s standard input.

property stdout_path: Optional[Path]

A path to a file in which to place the standard output stream of the job.

class JobState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: bytes, Enum

An enumeration holding the possible job states.

The possible states are: NEW, QUEUED, STAGE_IN, ACTIVE, STAGE_OUT, CLEANUP, COMPLETED, FAILED, and CANCELED.

ACTIVE = 3

This state represents an actively running job.

CANCELED = 8

Represents a job that was canceled by a call to cancel().

CLEANUP = 5

This state indicates that cleanup is actively being done for this job.

COMPLETED = 6

This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.

FAILED = 7

Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.

NEW = 0

This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.

QUEUED = 1

This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.

STAGE_IN = 2

This state indicates that the job is staging files in, in preparation for execution.

STAGE_OUT = 4

This state indicates that the executable has finished running and that files are being staged out.

property final: bool

Returns True if this state is final.

A state is final when no other state transition can occur after that state has been reached.

Returns

True if this is a final state and False otherwise
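As an illustration of the final property, the following minimal sketch uses a stand-in enumeration (SketchState, a hypothetical name, not the psij class). The numeric values are copied from the listing above; the choice of COMPLETED, FAILED, and CANCELED as the final states is an assumption of this sketch.

```python
from enum import Enum


class SketchState(Enum):
    """Stand-in for illustration only; this is not psij.JobState."""

    NEW = 0
    QUEUED = 1
    STAGE_IN = 2
    ACTIVE = 3
    STAGE_OUT = 4
    CLEANUP = 5
    COMPLETED = 6
    FAILED = 7
    CANCELED = 8

    @property
    def final(self) -> bool:
        # Assumption: only these three states admit no further transition.
        return self in (SketchState.COMPLETED, SketchState.FAILED,
                        SketchState.CANCELED)


# Collect the names of all states that this sketch treats as final.
final_states = [s.name for s in SketchState if s.final]
```

Client code would typically use such a property to decide when to stop waiting on a job.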

static from_name(name)[source]

Returns a JobState object corresponding to its string representation.

This method is such that state == JobState.from_name(str(state)).

Parameters

name (str) –

Return type

JobState
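The round-trip property state == JobState.from_name(str(state)) can be sketched with a stand-in enumeration. SketchState is a hypothetical name, and both the name-based string representation and the lookup by member name are assumptions of this sketch rather than a description of the psij implementation.

```python
from enum import Enum


class SketchState(Enum):
    """Stand-in for illustration only; this is not psij.JobState."""

    NEW = 0
    QUEUED = 1
    ACTIVE = 3

    def __str__(self) -> str:
        # Assumption: the string representation is the bare member name.
        return self.name

    @staticmethod
    def from_name(name: str) -> "SketchState":
        # Enum classes support lookup by member name through indexing.
        return SketchState[name]


# Every member should survive a str() / from_name() round trip.
round_trip_ok = all(s == SketchState.from_name(str(s)) for s in SketchState)
```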

is_greater_than(other)[source]

Defines a (strict) partial ordering on the states.

Not all states are comparable. State transitions cannot violate this ordering.

Parameters

other (JobState) – the other JobState to compare to

Returns

If this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.

Return type

Optional[bool]
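To make the three-valued result concrete, here is a minimal sketch of a strict partial order derived from a transition graph. The SketchState enumeration, the PREDECESSORS table, and the helper comes_after are all hypothetical and chosen for this example; the actual ordering used by psij may differ. In this sketch a state compared with itself yields None.

```python
from enum import Enum
from typing import Dict, Optional, Set


class SketchState(Enum):
    """Stand-in for illustration only; this is not psij.JobState."""

    NEW = 0
    QUEUED = 1
    ACTIVE = 3
    COMPLETED = 6
    FAILED = 7
    CANCELED = 8


S = SketchState

# Hypothetical "can come directly before" relation, chosen for this sketch.
PREDECESSORS: Dict[SketchState, Set[SketchState]] = {
    S.NEW: set(),
    S.QUEUED: {S.NEW},
    S.ACTIVE: {S.QUEUED},
    S.COMPLETED: {S.ACTIVE},
    S.FAILED: {S.NEW, S.QUEUED, S.ACTIVE},
    S.CANCELED: {S.NEW, S.QUEUED, S.ACTIVE},
}


def comes_after(a: SketchState, b: SketchState) -> bool:
    """True if a can be reached from b through one or more transitions."""
    stack, seen = list(PREDECESSORS[a]), set()
    while stack:
        s = stack.pop()
        if s == b:
            return True
        if s not in seen:
            seen.add(s)
            stack.extend(PREDECESSORS[s])
    return False


def is_greater_than(a: SketchState, b: SketchState) -> Optional[bool]:
    """True or False when a and b are ordered, None when they are not."""
    if comes_after(a, b):
        return True
    if comes_after(b, a):
        return False
    return None
```

With this table, ACTIVE is greater than NEW, while COMPLETED and FAILED are not comparable because neither can follow the other.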

class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]

Bases: object

A class containing details about job transitions to new states.

Parameters
  • state (JobState) – The JobState of this status.

  • time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.

  • message (Optional[str]) – An optional message associated with the transition.

  • exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.

  • metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.

Return type

None

All constructor parameters are accessible as properties.

property final: bool

Returns the final property of the underlying state.

Returns

True if the state is final and False otherwise.
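The defaulting of time and the delegation of final can be sketched as follows. SketchStatus is a hypothetical stand-in that stores the state as a plain string, and the set of final state names is an assumption of this sketch.

```python
import time as _time
from typing import Optional

# Assumption: these are the state names that count as final.
FINAL_STATE_NAMES = {"COMPLETED", "FAILED", "CANCELED"}


class SketchStatus:
    """Stand-in for illustration only; this is not psij.JobStatus."""

    def __init__(self, state: str, time: Optional[float] = None,
                 message: Optional[str] = None,
                 exit_code: Optional[int] = None) -> None:
        self.state = state
        # If no time is given, record the moment of instantiation.
        self.time = _time.time() if time is None else time
        self.message = message
        self.exit_code = exit_code

    @property
    def final(self) -> bool:
        # Delegates to the (stand-in) notion of a final state.
        return self.state in FINAL_STATE_NAMES


before = _time.time()
running = SketchStatus("ACTIVE")
failed = SketchStatus("FAILED", time=1.0, exit_code=1)
```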

class JobStatusCallback[source]

Bases: ABC

An interface used to listen to job status change events.

abstract job_status_changed(job, job_status)[source]

This method is invoked when a status change occurs on a job.

Client code interested in receiving status notifications must implement this method. It is entirely possible that psij.Job.status when referenced from the body of this method would return something different from the status passed to this callback. This is because the status of the job can be updated during the execution of the body of this method and, in particular, before the potential dereference to psij.Job.status is made.

Implementations of this method must return quickly and must not perform lengthy processing. Furthermore, implementations should not throw exceptions.

Parameters
  • job (Job) – The job whose status has changed.

  • job_status (JobStatus) – The new status of the job.

Return type

None
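One way to satisfy the requirement that the callback return quickly is to hand each event to a queue and do the slow work in a separate thread. The sketch below illustrates that pattern with stand-ins: SketchCallback, QueueingCallback, and drain are hypothetical names, and the job and status values are placeholder strings rather than psij objects.

```python
import queue
import threading
from abc import ABC, abstractmethod


class SketchCallback(ABC):
    """Stand-in for the documented interface; not the psij class."""

    @abstractmethod
    def job_status_changed(self, job, job_status) -> None:
        ...


class QueueingCallback(SketchCallback):
    """Hands each event to a queue so that the callback returns quickly."""

    def __init__(self) -> None:
        self.events: "queue.Queue" = queue.Queue()

    def job_status_changed(self, job, job_status) -> None:
        try:
            # Use the status passed in rather than re-reading it from the
            # job, since the job's status may already have moved on.
            self.events.put_nowait((job, job_status))
        except Exception:
            # Never let an exception escape into the caller.
            pass


def drain(cb: QueueingCallback, handled: list) -> None:
    """Worker loop: performs the slow processing outside the callback."""
    while True:
        item = cb.events.get()
        if item is None:  # sentinel used by this sketch to stop the worker
            break
        handled.append(item)


cb = QueueingCallback()
handled: list = []
worker = threading.Thread(target=drain, args=(cb, handled))
worker.start()
cb.job_status_changed("job-1", "QUEUED")  # placeholder values
cb.job_status_changed("job-1", "ACTIVE")
cb.events.put(None)
worker.join()
```

The design choice here is that the callback itself only enqueues; anything that might block or fail happens in the worker.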

class Launcher(config=None)[source]

Bases: ABC

An abstract base class for all launchers.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.

Return type

None

DEFAULT_LAUNCHER_NAME = 'single'

static get_instance(name, version_constraint=None, config=None)[source]

Returns an instance of a launcher optionally configured using a certain configuration.

The returned instance may or may not be a singleton object.

Parameters
Returns

A launcher instance.

Return type

Launcher

abstract get_launch_command(job)[source]

Constructs a command to launch a job given a job specification.

Parameters

job (Job) – The job to launch.

Returns

A list of strings representing the launch command and all of its arguments.

Return type

List[str]
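The shape of the return value can be illustrated with a small sketch. SketchSpec, SketchJob, and SketchMpiLauncher are hypothetical stand-ins, not psij classes, and wrapping the executable in mpirun -n is only one possible launch strategy assumed for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchSpec:
    """Hypothetical stand-in for a job specification."""

    executable: str
    arguments: List[str] = field(default_factory=list)
    process_count: int = 1


@dataclass
class SketchJob:
    """Hypothetical stand-in for a job that carries a specification."""

    spec: SketchSpec


class SketchMpiLauncher:
    """Builds a launch command as a list of strings, one per argument."""

    def get_launch_command(self, job: SketchJob) -> List[str]:
        spec = job.spec
        # The command and each of its arguments are separate list items,
        # so no shell quoting is needed.
        return ["mpirun", "-n", str(spec.process_count),
                spec.executable, *spec.arguments]


job = SketchJob(SketchSpec("/bin/hostname", ["-f"], process_count=4))
cmd = SketchMpiLauncher().get_launch_command(job)
```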

static get_launcher_names()[source]

Returns a set of registered launcher names.

Names returned by this method can be passed to get_instance() as the name parameter.

Returns

A set of launcher names corresponding to the known launchers.

Return type

Set[str]

static register_launcher(desc, root)[source]

Registers a launcher class.

The registered class can then be instantiated using get_instance().

Parameters
  • desc (Descriptor) – A Descriptor with information about the launcher to register.

  • root (str) – A filesystem path under which the implementation of the launcher must be located. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.

Return type

None

class ResourceSpec[source]

Bases: ABC

A base class for resource specifications.

The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.

static get_instance(version)[source]

Creates an instance of a ResourceSpec of the specified version.

Parameters

version (int) – The version of ResourceSpec to instantiate. For example, if version == 1, this method will return a new instance of ResourceSpecV1.

Return type

ResourceSpec

abstract property version: int

Returns the version of this resource specification class.

class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=False, memory=None)[source]

Bases: ResourceSpec

This class implements V1 of the PSI/J resource specification.

Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.

Parameters
  • node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.

  • process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.

  • processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.

  • cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.

  • gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.

  • exclusive_node_use (bool) – If this parameter is set to True, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, which is the default, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs total less than the amount available on the node.

  • memory (Optional[int]) – The total amount, in bytes, of memory requested for the job.

Return type

None

All constructor parameters are accessible as properties.
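The constraint process_count = node_count * processes_per_node, together with the inference of a missing value, can be sketched as a standalone function. resolve_counts is a hypothetical helper written for this example, and rejecting values that do not divide evenly is an assumption of the sketch, not documented behavior.

```python
from typing import Optional, Tuple

Counts = Tuple[Optional[int], Optional[int], Optional[int]]


def resolve_counts(node_count: Optional[int] = None,
                   process_count: Optional[int] = None,
                   processes_per_node: Optional[int] = None) -> Counts:
    """Return (node_count, process_count, processes_per_node) after inference."""
    n, p, ppn = node_count, process_count, processes_per_node
    if n is not None and p is not None and ppn is not None:
        # All three given: they must satisfy the constraint.
        if p != n * ppn:
            raise ValueError("process_count != node_count * processes_per_node")
    elif n is not None and ppn is not None:
        p = n * ppn
    elif p is not None and ppn is not None:
        # Assumption: an uneven split is treated as an error in this sketch.
        if p % ppn != 0:
            raise ValueError("process_count not divisible by processes_per_node")
        n = p // ppn
    elif p is not None and n is not None:
        if p % n != 0:
            raise ValueError("process_count not divisible by node_count")
        ppn = p // n
    # With fewer than two values given, nothing can be inferred.
    return n, p, ppn
```

For instance, resolve_counts(node_count=2, processes_per_node=4) infers a process count of 8, while resolve_counts(2, 7, 4) raises an error because 7 is not 2 * 4.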

property computed_node_count: int

Returns or calculates a node count.

If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.

Returns

An integer value with the specified or calculated node count.

property computed_process_count: int

Returns or calculates a process count.

If the process_count property is specified, this method returns it, otherwise it returns 1.

Returns

An integer value with either the value of process_count or one if the former is not specified.

property computed_processes_per_node: int

Returns or calculates the number of processes per node.

If the processes_per_node property is specified, this method returns it, otherwise calculates it based on process_count and node_count if possible, or defaults to 1.

Returns

An integer value with either the value of processes_per_node or one if the former cannot be determined.

property memory_kb: Optional[int]

Returns the memory limit specified by the memory property, but in KB.

Returns

If the memory property is set on this object, returns memory // 1024. If the memory property is None, this method returns None.
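The conversion is plain integer (floor) division, so any remainder below 1024 bytes is dropped. A short worked example, using a hypothetical helper to_kb that mirrors the rule stated above:

```python
from typing import Optional


def to_kb(memory: Optional[int]) -> Optional[int]:
    """Mirror of the documented rule: memory // 1024, or None if unset."""
    if memory is None:
        return None
    return memory // 1024


two_gib_kb = to_kb(2 * 1024 ** 3)   # exactly 2 GiB expressed in KB
rounded_down = to_kb(1_500_000)     # 1,500,000 bytes is about 1464.8 KB
unset = to_kb(None)
```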

property version: int

Returns the version of this ResourceSpec, which is 1 for this class.

exception SubmitException(message, exception=None, transient=False)[source]

Bases: Exception

An exception representing job submission issues.

This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.

Parameters
Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

transient

Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
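Client code might act on the transient flag with a bounded retry loop. The following is a minimal sketch under stated assumptions: SketchSubmitError is a stand-in that mirrors the documented attributes (it is not the psij exception), and submit_with_retry, its attempt limit, and its exponential backoff are hypothetical choices made for this example.

```python
import time


class SketchSubmitError(Exception):
    """Stand-in mirroring the documented attributes; not the psij class."""

    def __init__(self, message, exception=None, transient=False):
        super().__init__(message)
        self.message = message
        self.exception = exception
        self.transient = transient


def submit_with_retry(submit, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call submit(), retrying only when the failure is marked transient."""
    for attempt in range(attempts):
        try:
            return submit()
        except SketchSubmitError as e:
            # Give up on non-transient errors and on the last attempt.
            if not e.transient or attempt == attempts - 1:
                raise
            # Back off exponentially before trying again.
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_submit():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise SketchSubmitError("temporary network failure", transient=True)
    return "submitted"


result = submit_with_retry(flaky_submit, sleep=lambda seconds: None)
```

A non-transient error, such as one caused by invalid credentials, is re-raised immediately, since retrying it would not be expected to help.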