psij package¶
Subpackages¶
- psij.executors package
- Subpackages
- psij.executors.batch package
- Submodules
- psij.executors.batch.batch_scheduler_executor module
- psij.executors.batch.cobalt module
- psij.executors.batch.escape_functions module
- psij.executors.batch.lsf module
- psij.executors.batch.pbspro module
- psij.executors.batch.script_generator module
- psij.executors.batch.slurm module
- psij.executors.batch.template_function_library module
- Module contents
- psij.executors.batch package
- Submodules
- psij.executors.flux module
- psij.executors.local module
- psij.executors.rp module
- Module contents
- Subpackages
- psij.launchers package
Submodules¶
psij.descriptor module¶
Executor/Launcher descriptor module.
- class Descriptor(name, version, cls, aliases=None, nice_name=None)[source]¶
Bases: object
This class is used to enable PSI/J to discover and register executors and/or launchers.
Executors wanting to register with PSI/J must place an instance of this class in a global module list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__ in a module placed in the psij-descriptors namespace package. In other words, in order to automatically register an executor or launcher, a python file should be created inside a psij-descriptors package, such as:
<project_root>/
    src/
        psij-descriptors/
            descriptors_for_project.py

It is essential that the psij-descriptors package not contain an __init__.py file in order for Python to treat the package as a namespace package. This allows Python to combine multiple psij-descriptors directories into one, which, in turn, allows PSI/J to detect and load all descriptors that can be found in Python’s library search path.
The contents of descriptors_for_project.py could then be as follows:
    from packaging.version import Version
    from psij.descriptor import Descriptor

    __PSI_J_EXECUTORS__ = [
        Descriptor(name=<name>, version=Version(<version_str>), cls=<fqn_str>),
        ...
    ]

    __PSI_J_LAUNCHERS__ = [
        Descriptor(name=<name>, version=Version(<version_str>), cls=<fqn_str>),
        ...
    ]

where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.
- Parameters
name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().
version (Version) – The version of the executor/launcher. Multiple versions can be registered under a single name.
cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.
aliases (Optional[List[str]]) – An optional list of alternative names under which the executor or launcher is also made available, as if each alias were its name.
nice_name (Optional[str]) – An optional string to use whenever a user-friendly name needs to be displayed to a user. For example, a nice name for pbs would be PBS or Portable Batch System. If not specified, the nice_name defaults to the value of the name parameter.
- Return type
None
psij.exceptions module¶
A collection of exceptions used by PSI/J.
- exception InvalidJobException(message, exception=None)[source]¶
Bases: Exception
An exception describing a problem with a job specification.
- Parameters
- Return type
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- exception SubmitException(message, exception=None, transient=False)[source]¶
Bases: Exception
An exception representing job submission issues.
This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.
- Parameters
- Return type
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- transient¶
Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case.
For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
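The retry suggestion above can be sketched as follows. This is a minimal illustration, not PSI/J code: the SubmitException class below is a stand-in that only mirrors the documented message, exception, and transient fields, and submit_with_retry, flaky_submit, and the attempts/delay parameters are hypothetical names introduced for the example.

```python
import time


class SubmitException(Exception):
    """Stand-in for psij's SubmitException, reduced to the documented fields."""

    def __init__(self, message, exception=None, transient=False):
        super().__init__(message)
        self.message = message
        self.exception = exception
        self.transient = transient


def submit_with_retry(submit, job, attempts=3, delay=0.0):
    """Call submit(job), retrying only when the failure is marked transient."""
    for attempt in range(1, attempts + 1):
        try:
            submit(job)
            return attempt  # number of attempts that were needed
        except SubmitException as ex:
            # Non-transient failures and the last attempt are re-raised as-is.
            if not ex.transient or attempt == attempts:
                raise
            time.sleep(delay)


# Hypothetical backend that fails twice with a transient error, then succeeds.
calls = []


def flaky_submit(job):
    calls.append(job)
    if len(calls) < 3:
        raise SubmitException("DNS resolution failed", transient=True)


attempts_used = submit_with_retry(flaky_submit, "job-1")
```

A non-transient failure, such as invalid credentials, propagates on the first attempt, since retrying it would not be expected to help.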
psij.job module¶
- class FunctionJobStatusCallback(fn)[source]¶
Bases: JobStatusCallback
A JobStatusCallback that wraps a function.
Initializes a _FunctionJobStatusCallback.
- job_status_changed(job, job_status)[source]¶
See JobStatusCallback.job_status_changed().
- class Job(spec=None)[source]¶
Bases: object
This class represents a PSI/J job.
It encapsulates all of the information needed to run a job as well as the job’s state.
When constructed, a job is in the NEW state.
- Parameters
spec (Optional[JobSpec]) – An optional JobSpec that describes the details of the job.
- Return type
None
- cancel()[source]¶
Cancels this job.
The job is canceled by calling cancel() on the job executor that was used to submit this job.
- Raises
SubmitException – if the job has not yet been submitted.
- Return type
None
- property id: str¶
A read-only property containing the PSI/J job ID.
The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.
- property native_id: Optional[str]¶
A read-only property containing the native ID of the job.
The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor, in which case the value of this property is None.
- set_job_status_callback(cb)[source]¶
Registers a status callback with this job.
The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.
The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by setting this property to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, and returning nothing.
- Return type
None
- spec¶
The job specification of this job.
- property status: JobStatus¶
Contains the current status of the job.
It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.
- Returns
the current state of this job
- wait(timeout=None, target_states=None)[source]¶
Waits for the job to reach certain states.
This method returns either when the job reaches one of the target_states, a state following one of the target_states, a final state, or when an amount of time indicated by the timeout parameter, if specified, passes. Returns the JobStatus object that has one of the desired states or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.
- Parameters
timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.
target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.
- Returns
The JobStatus object that caused this call to complete, or None if the timeout is specified and reached.
- Return type
Optional[JobStatus]
- class JobStatusCallback[source]¶
Bases: ABC
An interface used to listen to job status change events.
- abstract job_status_changed(job, job_status)[source]¶
This method is invoked when a status change occurs on a job.
Client code interested in receiving status notifications must implement this method. It is entirely possible that psij.Job.status, when referenced from the body of this method, would return something different from the status passed to this callback. This is because the status of the job can be updated during the execution of the body of this method and, in particular, before the potential dereference to psij.Job.status is made.
Client code implementing this method must return quickly and cannot be used for lengthy processing. Furthermore, client code implementing this method should not throw exceptions.
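One way to satisfy the "return quickly" and "do not throw" requirements is to have the callback only enqueue the notification and let a separate thread do the slow work. The sketch below uses only the standard library. The names on_status_changed, worker, events, and results are hypothetical, and plain strings stand in for the Job and JobStatus objects.

```python
import queue
import threading

events = queue.Queue()


def on_status_changed(job, job_status):
    """Callback body: record the event and return immediately."""
    try:
        events.put_nowait((job, job_status))
    except Exception:
        # Callbacks should not raise; drop the event rather than propagate.
        pass


def worker(results):
    """Drain the queue on a separate thread, where slow work is acceptable."""
    while True:
        item = events.get()
        if item is None:  # sentinel: stop the worker
            break
        job, job_status = item
        results.append(f"{job}: {job_status}")


results = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()

# Simulate two notifications; the strings stand in for Job and JobStatus objects.
on_status_changed("job-1", "QUEUED")
on_status_changed("job-1", "ACTIVE")

events.put(None)
thread.join()
```

Because the callback receives the status as an argument, the worker processes the status that was reported rather than re-reading the job's current status, which may already have changed.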
psij.job_attributes module¶
- class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]¶
Bases: object
A class containing ancillary job information that describes how a job is to be run.
- Parameters
duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.
queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.
account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the executor must map this attribute to the accounting/billing option in the underlying execution mechanism.
reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.
custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).
project_name (Optional[str]) – Deprecated. Please use the account attribute.
- Return type
None
All constructor parameters are accessible as properties.
- property custom_attributes: Optional[Dict[str, object]]¶
Returns a dictionary with the custom attributes.
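The prefix convention for custom attributes described above can be illustrated with a small, self-contained sketch. The helper attributes_for is hypothetical and is not part of PSI/J. It only shows how keys such as slurm.constraint could be selected by executor name and stripped of their prefix, and the attribute values are made up for the example.

```python
def attributes_for(prefix, custom_attributes):
    """Return the custom attributes addressed to one executor, prefix removed."""
    if not custom_attributes:
        return {}
    dotted = prefix + "."
    return {
        key[len(dotted):]: value
        for key, value in custom_attributes.items()
        if key.startswith(dotted)
    }


# Illustrative values only; the keys follow the examples given in the text.
custom = {
    "slurm.constraint": "knl",
    "pbs.c": "n",
    "lsf.core_isolation": 2,
}

slurm_only = attributes_for("slurm", custom)
```

Attributes addressed to other executors are simply ignored by this selection, so one dictionary can carry options for several backends.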
psij.job_executor module¶
- class JobExecutor(url=None, config=None)[source]¶
Bases: ABC
An abstract base class for all JobExecutor implementations.
- Parameters
url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.
config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional such that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.
- abstract cancel(job)[source]¶
Cancels a job that has been submitted to the underlying executor implementation.
A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(target_states=JobState.CANCELED) or even job.wait(), since the latter would wait for all final states, including JobState.CANCELED. In fact, it is recommended that job.wait() be used because it is entirely possible for the job to complete before the cancellation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job will never enter the CANCELED state and job.wait(target_states=JobState.CANCELED) would hang indefinitely.
- Parameters
job (Job) – The job to be canceled.
- Raises
SubmitException – Thrown if the request cannot be sent to the underlying implementation.
- Return type
None
- static get_executor_names()[source]¶
Returns a set of registered executor names.
Names returned by this method can be passed to get_instance() as the name parameter.
- static get_instance(name, version_constraint=None, url=None, config=None)[source]¶
Returns an instance of a JobExecutor.
- Parameters
name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(). If the value of the name parameter is not one of the valid values returned by get_executor_names(), ValueError is raised.
version_constraint (Optional[str]) – A version constraint for the executor in the form ‘(’ <op> <version>[, <op> <version>[, …]] ‘)’, such as “( > 0.0.2, != 0.0.4)”.
url (Optional[str]) – An optional URL to pass to the JobExecutor instance.
config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.
- Returns
A JobExecutor.
- Return type
JobExecutor
- abstract list()[source]¶
List native IDs of all jobs known to the backend.
This method is meant to return a list of native IDs for jobs submitted to the backend by any means, not necessarily through this executor or through PSI/J.
- static register_executor(desc, root)[source]¶
Registers a JobExecutor class through a Descriptor.
The class can then be later instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the executor to be registered.
root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- set_job_status_callback(cb)[source]¶
Registers a status callback with this executor.
The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.
The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters: job of type Job and job_status of type JobStatus.
- Return type
None
- abstract submit(job)[source]¶
Submits a Job to the underlying implementation.
Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.
A successful return of this method guarantees that the job’s native_id property is set.
- Raises
InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.
SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.
- Parameters
job (Job) – The job to be submitted.
- Return type
None
- property version: packaging.version.Version¶
Returns the version of this executor.
psij.job_executor_config module¶
- class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]¶
Bases:
Bases: object
An abstract configuration class for JobExecutor instances.
- Parameters
launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- Return type
None
- DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>¶
A default JobExecutorConfig used when none is specified.
- DEFAULT_WORK_DIRECTORY = PosixPath('/home/runner/.psij/work')¶
The default work directory when a work directory is not explicitly specified.
- property launcher_log_file: Optional[Path]¶
Configure the executor’s launcher log file.
- Parameters
launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
- property work_directory: Path¶
Configure the executor’s work directory.
- Parameters
work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
psij.job_launcher module¶
This module contains the core classes of the launchers infrastructure.
- class Launcher(config=None)[source]¶
Bases: ABC
An abstract base class for all launchers.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, JobExecutorConfig.DEFAULT is used.
- Return type
None
- DEFAULT_LAUNCHER_NAME = 'single'¶
- static get_instance(name, version_constraint=None, config=None)[source]¶
Returns an instance of a launcher optionally configured using a certain configuration.
The returned instance may or may not be a singleton object.
- abstract get_launch_command(job)[source]¶
Constructs a command to launch a job given a job specification.
- static get_launcher_names()[source]¶
Returns a set of registered launcher names.
Names returned by this method can be passed to get_instance() as the name parameter.
- static register_launcher(desc, root)[source]¶
Registers a launcher class.
The registered class can then be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the launcher to register.
root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.
- Return type
None
psij.job_spec module¶
- class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None, stage_in=None, stage_out=None, cleanup=None, cleanup_flags=StageOutFlags.ALWAYS)[source]¶
Bases: object
A class that describes the details of a job.
- Parameters
executable (Optional[str]) – An executable, such as “/bin/date”.
arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.
directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run.
name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.
inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by this property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.
environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.
stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.
stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.
stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.
resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.
attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.
pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.
stage_in (Optional[Set[StageIn]]) – Specifies a set of files to be staged in before the job is launched.
stage_out (Optional[Set[StageOut]]) – Specifies a set of files to be staged out after the job terminates.
cleanup (Optional[Set[Union[str, Path]]]) – Specifies a set of files to remove after the stage out process.
cleanup_flags (StageOutFlags) – Specifies the conditions under which the files in cleanup should be removed, such as when the job completes successfully. The flag StageOutFlags.IF_PRESENT is ignored and no error condition is triggered if a file specified by the cleanup argument is not present.
All constructor parameters are accessible as properties.
Note
A note about paths.
It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.
Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.
For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.
When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt on the submission side. However, the job may start in /tmp on the compute node, and its standard output will be redirected to /tmp/bar.txt.
Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
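The /home/foo versus /tmp example from the note can be reproduced with pathlib alone. The sketch below does not touch the filesystem or PSI/J. It uses PurePosixPath to show how the same relative path resolves against two different working directories, and the variable names are illustrative.

```python
from pathlib import PurePosixPath

relative = PurePosixPath("bar.txt")

# Working directory of the application calling PSI/J (submission side).
submit_side_cwd = PurePosixPath("/home/foo")
# Working directory in which the job happens to start (compute side).
job_side_cwd = PurePosixPath("/tmp")

seen_by_application = submit_side_cwd / relative
seen_by_job = job_side_cwd / relative

# An absolute path is unaffected by either working directory.
absolute = PurePosixPath("/home/foo/bar.txt")
same_everywhere = job_side_cwd / absolute
```

Note that an absolute path only removes the dependence on the working directory. Whether it names the same file on both machines still depends on the filesystem being shared and mounted at the same location.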
- property directory: Optional[Path]¶
The directory, on the compute side, in which the executable is to be run.
- property post_launch: Optional[Path]¶
An optional path to a post-launch script.
The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
- property pre_launch: Optional[Path]¶
An optional path to a pre-launch script.
The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
- property stderr_path: Optional[Path]¶
A path to a file in which to place the standard error stream of the job.
psij.job_state module¶
- class JobState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
-
An enumeration holding the possible job states.
The possible states are: NEW, QUEUED, STAGE_IN, ACTIVE, STAGE_OUT, CLEANUP, COMPLETED, FAILED, and CANCELED.
- ACTIVE = 3¶
This state represents an actively running job.
- CLEANUP = 5¶
This state indicates that cleanup is actively being done for this job.
- COMPLETED = 6¶
This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.
- FAILED = 7¶
Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.
- NEW = 0¶
This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.
- QUEUED = 1¶
This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.
- STAGE_IN = 2¶
This state indicates that the job is staging files in, in preparation for execution.
- STAGE_OUT = 4¶
This state indicates that the executable has finished running and that files are being staged out.
- property final: bool¶
Returns True if this state is final.
A state is final when no other state transition can occur after that state has been reached.
- Returns
True if this is a final state and False otherwise
- static from_name(name)[source]¶
Returns a JobState object corresponding to its string representation.
This method is such that state == JobState.from_name(str(state)).
- is_greater_than(other)[source]¶
Defines a (strict) partial ordering on the states.
Not all states are comparable. State transitions cannot violate this ordering.
- Parameters
other (JobState) – the other JobState to compare to
- Returns
if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.
- Return type
Optional[bool]
- class JobStateOrder[source]¶
Bases: object
A class that can be used to reconstruct missing states.
- static prev(state)[source]¶
Returns the state previous to the given state.
The “previous” state is a state that must have occurred immediately prior to this state given the state transition diagram if such a state is unique. Not all states have a previous state. For example, the FAILED state does not have a previous state, since it can be reached from multiple states.
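The ideas of a strict partial ordering and of a unique previous state can be sketched with a reduced, hypothetical state set. The State enum, the _CHAIN ordering, and both functions below are assumptions made for illustration. They are not the psij.JobState definitions, and the real transition diagram (which also includes STAGE_IN, STAGE_OUT, and CLEANUP) may differ.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical, reduced state set; not the psij.JobState definition."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


# Assumed successful path along which states are totally ordered.
_CHAIN = [State.NEW, State.QUEUED, State.ACTIVE, State.COMPLETED]
_FINAL = {State.COMPLETED, State.FAILED, State.CANCELED}


def is_greater_than(a: State, b: State) -> Optional[bool]:
    """Return True or False if a and b are comparable, None otherwise."""
    if a is b:
        return False  # the ordering is strict
    if a in _CHAIN and b in _CHAIN:
        return _CHAIN.index(a) > _CHAIN.index(b)
    if a in _FINAL and b in _FINAL:
        return None  # distinct final states cannot follow one another
    # Exactly one of the two is FAILED or CANCELED; it comes after the other.
    return a in _FINAL


def prev(state: State) -> Optional[State]:
    """Return the unique predecessor on the assumed chain, or None."""
    if state in _CHAIN and state is not State.NEW:
        return _CHAIN[_CHAIN.index(state) - 1]
    return None  # NEW has no predecessor; FAILED/CANCELED have several
```

Under these assumptions, FAILED and COMPLETED are not comparable, and FAILED has no previous state because it can be reached from several states.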
psij.job_status module¶
- class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]¶
Bases: object
A class containing details about job transitions to new states.
- Parameters
time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.
message (Optional[str]) – An optional message associated with the transition.
exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.
metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.
- Return type
None
All constructor parameters are accessible as properties.
psij.launcher module¶
psij.resource_spec module¶
- class ResourceSpec[source]¶
Bases: ABC
A base class for resource specifications.
The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.
- class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=False, memory=None)[source]¶
Bases: ResourceSpec
This class implements V1 of the PSI/J resource specification.
Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.
- Parameters
node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.
process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.
processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.
cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.
gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.
exclusive_node_use (bool) – If this parameter is set to True, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, which is the default, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs total less than the amount available on the node.
memory (Optional[int]) – The total amount, in bytes, of memory requested for the job.
- Return type
None
All constructor parameters are accessible as properties.
- property computed_node_count: int¶
Returns or calculates a node count.
If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.
- Returns
An integer value with the specified or calculated node count.
- property computed_process_count: int¶
Returns or calculates a process count.
If the process_count property is specified, this method returns it, otherwise it returns 1.
- Returns
An integer value with either the value of process_count or one if the former is not specified.
- property computed_processes_per_node: int¶
Returns or calculates the number of processes per node.
If the processes_per_node property is specified, this method returns it, otherwise calculates it based on process_count and node_count if possible, or defaults to 1.
- Returns
An integer value with either the value of processes_per_node or one if the former cannot be determined.
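The constraint process_count = node_count * processes_per_node described above can be sketched as follows. This is only an illustration of the inference rule, not the code used by ResourceSpecV1: the helper name infer_counts is hypothetical, and the divisibility checks are an assumption about how a non-integer result would be handled.

```python
from typing import Optional, Tuple


def infer_counts(
    node_count: Optional[int] = None,
    process_count: Optional[int] = None,
    processes_per_node: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Fill in the missing term of process_count = node_count * processes_per_node.

    Hypothetical helper for illustration; it is not part of PSI/J.
    Returns (node_count, process_count, processes_per_node).
    """
    n, p, ppn = node_count, process_count, processes_per_node
    if n is not None and p is not None and ppn is not None:
        # All three values given: they must agree with the constraint.
        if p != n * ppn:
            raise ValueError("process_count != node_count * processes_per_node")
    elif n is not None and ppn is not None:
        p = n * ppn
    elif p is not None and ppn is not None:
        # Assumption: a process count that does not divide evenly is rejected.
        if p % ppn != 0:
            raise ValueError("process_count is not a multiple of processes_per_node")
        n = p // ppn
    elif p is not None and n is not None:
        if p % n != 0:
            raise ValueError("process_count is not a multiple of node_count")
        ppn = p // n
    # With fewer than two values given, nothing can be inferred.
    return n, p, ppn
```

For instance, infer_counts(node_count=2, processes_per_node=4) fills in a process count of 8, while infer_counts(node_count=2, process_count=7, processes_per_node=4) raises an error because the three values contradict the constraint.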
psij.serialize module¶
- class JSONSerializer[source]¶
Bases: Serializer
A JSON serializer.
- class Serializer[source]¶
Bases: ABC
A base class for serializers.
This class takes care of converting a JobSpec instance, including all its properties, into an intermediate representation consisting of a tree of standard dictionaries and lists, where dictionary keys are guaranteed to be strings and values are limited to dictionaries, lists, str, int, and bool. It also takes care of making the reverse conversion. Concrete implementations of serializers should extend this class and implement the _dump_dict and _load_dict methods, which convert the intermediate representation to the actual serialized format.
Serializer implementations can also directly override the dump, dumps, load, and loads methods to bypass the intermediate representations and implement (de)serialization directly.
- dumps(spec)[source]¶
Serialize the given JobSpec to a string.
Serializer implementations that use a binary protocol must override this method and raise an error.
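The split between the intermediate representation and the serialized format can be illustrated with a small stand-alone sketch. The classes below are hypothetical stand-ins, not the PSI/J classes: they start from an already-built dictionary tree rather than a JobSpec, and the string-based signatures given to _dump_dict and _load_dict are assumptions made for brevity.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchSerializer(ABC):
    """Hypothetical stand-in for the base class described above."""

    def dumps(self, tree: Dict[str, Any]) -> str:
        # The real base class would first convert a JobSpec into `tree`;
        # this sketch starts from the intermediate representation.
        return self._dump_dict(tree)

    def loads(self, text: str) -> Dict[str, Any]:
        return self._load_dict(text)

    @abstractmethod
    def _dump_dict(self, tree: Dict[str, Any]) -> str:
        """Convert the intermediate tree to the serialized format."""

    @abstractmethod
    def _load_dict(self, text: str) -> Dict[str, Any]:
        """Convert the serialized format back to the intermediate tree."""


class SketchJSONSerializer(SketchSerializer):
    """Only the format-specific steps live in the concrete class."""

    def _dump_dict(self, tree: Dict[str, Any]) -> str:
        return json.dumps(tree, sort_keys=True)

    def _load_dict(self, text: str) -> Dict[str, Any]:
        return json.loads(text)
```

A tree such as {"executable": "/bin/date", "arguments": ["-u"]} survives a dumps followed by a loads unchanged, because it contains only the value types the intermediate representation allows.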
psij.utils module¶
- class SingletonThread(name=None, daemon=False)[source]¶
Bases: Thread
A convenience class to return a thread that is guaranteed to be unique to this process.
This is intended to work with fork() to ensure that each os.getpid() value is associated with at most one thread. This is not safe. The safe thing, as pointed out by the fork() man page, is to not use fork() with threads. However, this is here in an attempt to make it slightly safer for when users really really want to take the risk against all advice.
This class is meant as an abstract class and should be used by subclassing and implementing the run method.
Instantiation of this class or one of its subclasses should be done through the get_instance() method rather than directly.
- Parameters
- Return type
None
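The idea of associating each os.getpid() value with at most one thread can be sketched with the standard library alone. The class below is a hypothetical illustration, not the SingletonThread implementation; in particular, starting the thread inside get_instance() and keying the registry on the subclass as well as the process ID are assumptions.

```python
import os
import threading
from typing import Dict, Optional, Tuple


class PerProcessThread(threading.Thread):
    """Hypothetical sketch: at most one instance per (subclass, process ID)."""

    _registry: Dict[Tuple[type, int], "PerProcessThread"] = {}
    _registry_lock = threading.RLock()

    @classmethod
    def get_instance(cls) -> "PerProcessThread":
        # After a fork(), os.getpid() changes, so the child process gets
        # its own instance instead of reusing the parent's thread object.
        key = (cls, os.getpid())
        with cls._registry_lock:
            instance = cls._registry.get(key)
            if instance is None:
                instance = cls(name=cls.__name__, daemon=True)
                cls._registry[key] = instance
                instance.start()
            return instance

    def run(self) -> None:
        raise NotImplementedError("subclasses implement run()")


class Heartbeat(PerProcessThread):
    """Example subclass that only records that it has started."""

    def __init__(self, name: Optional[str] = None, daemon: bool = False) -> None:
        super().__init__(name=name, daemon=daemon)
        self.started_event = threading.Event()

    def run(self) -> None:
        self.started_event.set()
```

Calling Heartbeat.get_instance() twice in the same process returns the same object. As the text above warns, this does not make combining fork() with threads safe; it only limits the number of such threads per process.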
psij.version module¶
This module stores the current version of this library.
Module contents¶
The package containing the jobs module of this PSI implementation.
- exception InvalidJobException(message, exception=None)[source]¶
Bases: Exception
An exception describing a problem with a job specification.
- Parameters
- Return type
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- class Job(spec=None)[source]¶
Bases: object
This class represents a PSI/J job.
It encapsulates all of the information needed to run a job as well as the job’s state.
When constructed, a job is in the NEW state.
- Parameters
spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.
- Return type
None
- cancel()[source]¶
Cancels this job.
The job is canceled by calling cancel() on the job executor that was used to submit this job.
- Raises
SubmitException – if the job has not yet been submitted.
- Return type
None
- executor: Optional[JobExecutor]¶
- property id: str¶
A read-only property containing the PSI/J job ID.
The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.
- property native_id: Optional[str]¶
A read-only property containing the native ID of the job.
The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor. If it is not yet available, the value of this property is None.
- set_job_status_callback(cb)[source]¶
Registers a status callback with this job.
The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.
The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by setting this property to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, and returning nothing.
- Return type
None
- spec¶
The job specification of this job.
- property status: JobStatus¶
Contains the current status of the job.
It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.
- Returns
the current state of this job
- wait(timeout=None, target_states=None)[source]¶
Waits for the job to reach certain states.
This method returns either when the job reaches one of the target_states, a state following one of the target_states, a final state, or when an amount of time indicated by the timeout parameter, if specified, passes. Returns the JobStatus object that has one of the desired states or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.
- Parameters
timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.
target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.
- Returns
returns the JobStatus object that caused this call to complete or None if the timeout is specified and reached.
- Return type
- class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]¶
Bases: object
A class containing ancillary job information that describes how a job is to be run.
- Parameters
duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.
queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.
account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the scheduler must map this attribute to the accounting/billing option in the underlying execution mechanism.
reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.
custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).
project_name (Optional[str]) – Deprecated. Please use the account attribute.
- Return type
None
All constructor parameters are accessible as properties.
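The prefix convention for custom attributes can be illustrated with a short sketch. The function below is hypothetical and is not how BatchSchedulerExecutor subclasses are implemented (the module list suggests they generate scripts from templates). It only shows the selection by prefix and the mapping of a slurm.-prefixed key to a long-option Slurm directive.

```python
from typing import Dict, List, Optional


def slurm_directives(custom_attributes: Optional[Dict[str, object]]) -> List[str]:
    """Turn 'slurm.<option>' entries into '#SBATCH --<option>=<value>' lines.

    Hypothetical helper for illustration; it is not part of PSI/J.
    """
    lines: List[str] = []
    for key, value in (custom_attributes or {}).items():
        prefix, _, option = key.partition(".")
        if prefix != "slurm" or not option:
            # Attributes addressed to other executors are left alone.
            continue
        lines.append(f"#SBATCH --{option}={value}")
    return lines
```

Given {"slurm.constraint": "knl", "pbs.c": "n"}, only the first entry is translated, since the second one is addressed to a different executor.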
- property custom_attributes: Optional[Dict[str, object]]¶
Returns a dictionary with the custom attributes.
- class JobExecutor(url=None, config=None)[source]¶
Bases: ABC
An abstract base class for all JobExecutor implementations.
- Parameters
url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.
config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional such that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.
- abstract cancel(job)[source]¶
Cancels a job that has been submitted to the underlying executor implementation.
A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(target_states=JobState.CANCELED) or even job.wait(), since the latter would wait for any final state, including JobState.CANCELED. In fact, it is recommended that job.wait() be used because it is entirely possible for the job to complete before the cancellation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job will never enter the CANCELED state and job.wait(target_states=JobState.CANCELED) would hang indefinitely.
- Parameters
job (Job) – The job to be canceled.
- Raises
SubmitException – Thrown if the request cannot be sent to the underlying implementation.
- Return type
None
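The recommendation to wait for any final state after requesting cancellation can be sketched as a small helper. The helper uses only the cancel() and wait() methods documented for Job. The stub class is a hypothetical stand-in that imitates a job which completed before the cancellation request took effect, so that the sketch runs without a backend.

```python
from datetime import timedelta
from typing import Any, Optional


def cancel_and_wait(job: Any, timeout: Optional[timedelta] = None) -> Any:
    """Request cancellation, then wait for any final state."""
    job.cancel()
    # No target_states are passed, so a job that finished before the
    # cancellation was processed still ends this wait.
    return job.wait(timeout=timeout)


class StubJob:
    """Hypothetical stand-in for a job that completed before being canceled."""

    def __init__(self) -> None:
        self.cancel_requested = False

    def cancel(self) -> None:
        self.cancel_requested = True

    def wait(self, timeout: Optional[timedelta] = None, target_states: Any = None) -> Any:
        # The stub never reaches CANCELED: only a wait for any final
        # state (target_states=None) produces a result.
        return "COMPLETED" if target_states is None else None
```

With the stub, cancel_and_wait(StubJob()) returns the final status immediately, whereas a wait restricted to the CANCELED state would never be satisfied.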
- static get_executor_names()[source]¶
Returns a set of registered executor names.
Names returned by this method can be passed to get_instance() as the name parameter.
- static get_instance(name, version_constraint=None, url=None, config=None)[source]¶
Returns an instance of a JobExecutor.
- Parameters
name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(). If the value of the name parameter is not one of the valid values returned by get_executor_names(), ValueError is raised.
version_constraint (Optional[str]) – A version constraint for the executor in the form ‘(’ <op> <version>[, <op> <version>[, …]] ‘)’, such as “( > 0.0.2, != 0.0.4)”.
url (Optional[str]) – An optional URL to pass to the JobExecutor instance.
config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.
- Returns
A JobExecutor.
- Return type
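How a constraint such as “( > 0.0.2, != 0.0.4)” restricts acceptable versions can be sketched in plain Python. This is not the parser used by get_instance(): the function name is hypothetical, the set of operators is an assumption, and versions are limited to dotted integers compared as tuples.

```python
import operator
import re
from typing import Tuple

# Assumed operator set; the text above only shows '>' and '!='.
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}


def _parse(version: str) -> Tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, constraint: str) -> bool:
    """Check a dotted numeric version against '( <op> <version>[, ...] )'."""
    inner = constraint.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    for clause in inner.split(","):
        match = re.fullmatch(r"\s*(==|!=|>=|<=|>|<)\s*([0-9.]+)\s*", clause)
        if match is None:
            raise ValueError(f"bad constraint clause: {clause!r}")
        op, wanted = match.groups()
        # Every clause must hold for the version to be accepted.
        if not _OPS[op](_parse(version), _parse(wanted)):
            return False
    return True
```

Under these assumptions, version 0.0.3 satisfies the example constraint, while 0.0.2 and 0.0.4 do not.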
- abstract list()[source]¶
List native IDs of all jobs known to the backend.
This method is meant to return a list of native IDs for jobs submitted to the backend by any means, not necessarily through this executor or through PSI/J.
- static register_executor(desc, root)[source]¶
Registers a JobExecutor class through a Descriptor.
The class can then be later instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the executor to be registered.
root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- set_job_status_callback(cb)[source]¶
Registers a status callback with this executor.
The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.
The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters: job of type Job and job_status of type JobStatus.
- Return type
None
- abstract submit(job)[source]¶
Submits a Job to the underlying implementation.
Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.
A successful return of this method guarantees that the job’s native_id property is set.
- Raises
InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.
SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.
- Parameters
job (Job) –
- Return type
None
- property version: packaging.version.Version¶
Returns the version of this executor.
- class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]¶
Bases:
objectAn abstract configuration class for
JobExecutorinstances.- Parameters
launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- Return type
None
- DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>¶
A default JobExecutorConfig used when none is specified.
- DEFAULT_WORK_DIRECTORY = PosixPath('/home/runner/.psij/work')¶
The default work directory when a work directory is not explicitly specified.
- property launcher_log_file: Optional[Path]¶
Configure the executor’s launcher log file.
- Parameters
launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
- property work_directory: Path¶
Configure the executor’s work directory.
- Parameters
work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None, stage_in=None, stage_out=None, cleanup=None, cleanup_flags=StageOutFlags.ALWAYS)[source]¶
Bases: object
A class that describes the details of a job.
- Parameters
executable (Optional[str]) – An executable, such as “/bin/date”.
arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.
directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run
name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.
inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by this property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.
environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.
stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.
stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.
stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.
resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.
attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.
pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.
stage_in (Optional[Set[StageIn]]) – Specifies a set of files to be staged in before the job is launched.
stage_out (Optional[Set[StageOut]]) – Specifies a set of files to be staged out after the job terminates.
cleanup (Optional[Set[Union[str, Path]]]) – Specifies a set of files to remove after the stage out process.
cleanup_flags (StageOutFlags) – Specifies the conditions under which the files in cleanup should be removed, such as when the job completes successfully. The flag StageOutFlags.IF_PRESENT is ignored and no error condition is triggered if a file specified by the cleanup argument is not present.
All constructor parameters are accessible as properties.
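Two of the conventions above, that arguments starts at argv[1] and that inherit_environment decides whether the specified environment is layered over an inherited one or over an empty one, can be sketched as follows. The function is a hypothetical illustration of these rules and does not reflect how any particular executor builds its command. Converting integer values to strings is an assumption based on the Union[str, int] value type.

```python
from typing import Dict, List, Mapping, Optional, Tuple, Union


def build_invocation(
    executable: str,
    arguments: Optional[List[str]],
    inherit_environment: bool,
    environment: Optional[Mapping[str, Union[str, int]]],
    parent_environment: Mapping[str, str],
) -> Tuple[List[str], Dict[str, str]]:
    """Return the (argv, env) pair implied by the parameters above.

    Hypothetical helper for illustration; it is not part of PSI/J.
    """
    # The executable occupies argv[0]; the arguments list starts at argv[1].
    argv = [executable] + list(arguments or [])
    # Start from the inherited variables, or from nothing at all.
    env: Dict[str, str] = dict(parent_environment) if inherit_environment else {}
    for name, value in (environment or {}).items():
        env[name] = str(value)
    return argv, env
```

With inherit_environment set to False, the resulting environment contains only the variables listed in environment, regardless of what the parent environment holds.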
Note
A note about paths.
It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.
Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.
For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.
When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt. However, the job may start in /tmp on the compute node, and its standard output will be redirected to /tmp/bar.txt.
Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
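The effect of the working directory on a relative path, as in the bar.txt example above, can be reproduced with pathlib. The resolve helper is a hypothetical name used only for this illustration.

```python
from pathlib import PurePosixPath


def resolve(path: str, working_directory: str) -> PurePosixPath:
    """Resolve a path the way a process started in working_directory would."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        # Absolute paths do not depend on the working directory.
        return candidate
    return PurePosixPath(working_directory) / candidate


# Same relative path, two different working directories.
on_login_node = resolve("bar.txt", "/home/foo")
on_compute_node = resolve("bar.txt", "/tmp")
```

Here on_login_node is /home/foo/bar.txt and on_compute_node is /tmp/bar.txt, whereas an absolute path such as /home/foo/bar.txt resolves identically in both places.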
- property directory: Optional[Path]¶
The directory, on the compute side, in which the executable is to be run.
- property post_launch: Optional[Path]¶
An optional path to a post-launch script.
The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
- property pre_launch: Optional[Path]¶
An optional path to a pre-launch script.
The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
- property stderr_path: Optional[Path]¶
A path to a file in which to place the standard error stream of the job.
- class JobState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
An enumeration holding the possible job states.
The possible states are: NEW, QUEUED, STAGE_IN, ACTIVE, STAGE_OUT, CLEANUP, COMPLETED, FAILED, and CANCELED.
- ACTIVE = 3¶
This state represents an actively running job.
- CLEANUP = 5¶
This state indicates that cleanup is actively being done for this job.
- COMPLETED = 6¶
This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.
- FAILED = 7¶
Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.
- NEW = 0¶
This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.
- QUEUED = 1¶
This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.
- STAGE_IN = 2¶
This state indicates that the job is staging files in, in preparation for execution.
- STAGE_OUT = 4¶
This state indicates that the executable has finished running and that files are being staged out.
- property final: bool¶
Returns True if this state is final.
A state is final when no other state transition can occur after that state has been reached.
- Returns
True if this is a final state and False otherwise
- static from_name(name)[source]¶
Returns a JobState object corresponding to its string representation.
This method is such that state == JobState.from_name(str(state)).
- is_greater_than(other)[source]¶
Defines a (strict) partial ordering on the states.
Not all states are comparable. State transitions cannot violate this ordering.
- Parameters
other (JobState) – the other JobState to compare to
- Returns
if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.
- Return type
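A strict partial ordering that returns None for incomparable states can be sketched with a reduced set of states. The enumeration below is a hypothetical simplification, not the JobState implementation: it omits the staging and cleanup states, and it assumes that non-final states are ordered linearly while final states are mutually incomparable. The reached helper mirrors the rule that a wait for a target state also ends on any later or final state.

```python
from enum import Enum
from typing import Optional


class SketchState(Enum):
    """Hypothetical, reduced state set used only for this illustration."""

    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5

    @property
    def final(self) -> bool:
        return self in _FINAL

    def is_greater_than(self, other: "SketchState") -> Optional[bool]:
        if self is other:
            return False  # the ordering is strict
        if self in _FINAL and other in _FINAL:
            # Assumption: final states are alternatives to one another,
            # so neither can come after the other.
            return None
        return self.value > other.value


_FINAL = frozenset({SketchState.COMPLETED, SketchState.FAILED, SketchState.CANCELED})


def reached(current: SketchState, target: SketchState) -> bool:
    """True if a wait for `target` would end in state `current`."""
    return current is target or current.final or bool(current.is_greater_than(target))
```

In this sketch, ACTIVE is greater than QUEUED, COMPLETED and FAILED are not comparable, and a wait for QUEUED is satisfied by ACTIVE or FAILED but not by NEW.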
- class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]¶
Bases: object
A class containing details about job transitions to new states.
- Parameters
time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.
message (Optional[str]) – An optional message associated with the transition.
exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.
metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.
- Return type
None
All constructor parameters are accessible as properties.
- class JobStatusCallback[source]¶
Bases: ABC
An interface used to listen to job status change events.
- abstract job_status_changed(job, job_status)[source]¶
This method is invoked when a status change occurs on a job.
Client code interested in receiving status notifications must implement this method. It is entirely possible that psij.Job.status, when referenced from the body of this method, would return something different from the status passed to this callback. This is because the status of the job can be updated during the execution of the body of this method and, in particular, before the potential dereference of psij.Job.status is made.
Client code implementing this method must return quickly and cannot be used for lengthy processing. Furthermore, client code implementing this method should not throw exceptions.
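One way to satisfy the requirements above (return quickly, do not throw) is to have the callback hand each event to a queue that another thread drains. The class below is a duck-typed sketch that does not import PSI/J. In real code it would presumably extend JobStatusCallback, or its bound job_status_changed method could be registered as the two-argument callable.

```python
import queue
from typing import Any, Tuple


class QueueingCallback:
    """Hypothetical callback that defers all processing to a consumer."""

    def __init__(self) -> None:
        # Unbounded queue, so enqueueing does not block the caller.
        self.events: "queue.Queue[Tuple[Any, Any]]" = queue.Queue()

    def job_status_changed(self, job: Any, job_status: Any) -> None:
        try:
            # Store the status that was passed in, rather than reading the
            # job's status property, which may already have moved on.
            self.events.put_nowait((job, job_status))
        except Exception:
            # The callback should not raise into the caller.
            pass
```

A separate worker can then call events.get() in a loop and perform any lengthy processing outside the notification path.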
- class Launcher(config=None)[source]¶
Bases: ABC
An abstract base class for all launchers.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.
- Return type
None
- DEFAULT_LAUNCHER_NAME = 'single'¶
- static get_instance(name, version_constraint=None, config=None)[source]¶
Returns an instance of a launcher optionally configured using a certain configuration.
The returned instance may or may not be a singleton object.
- abstract get_launch_command(job)[source]¶
Constructs a command to launch a job given a job specification.
- static get_launcher_names()[source]¶
Returns a set of registered launcher names.
Names returned by this method can be passed to get_instance() as the name parameter.
- static register_launcher(desc, root)[source]¶
Registers a launcher class.
The registered class can then be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the launcher to register.
root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- class ResourceSpec[source]¶
Bases: ABC
A base class for resource specifications.
The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.
- class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=False, memory=None)[source]¶
Bases: ResourceSpec
This class implements V1 of the PSI/J resource specification.
Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.
- Parameters
node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.
process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.
processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.
cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.
gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.
exclusive_node_use (bool) – If this parameter is set to True, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, which is the default, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs total less than the amount available on the node.
memory (Optional[int]) – The total amount, in bytes, of memory requested for the job.
- Return type
None
All constructor parameters are accessible as properties.
- property computed_node_count: int¶
Returns or calculates a node count.
If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.
- Returns
An integer value with the specified or calculated node count.
- property computed_process_count: int¶
Returns or calculates a process count.
If the process_count property is specified, this method returns it, otherwise it returns 1.
- Returns
An integer value with either the value of process_count or one if the former is not specified.
- property computed_processes_per_node: int¶
Returns or calculates the number of processes per node.
If the processes_per_node property is specified, this method returns it; otherwise, it calculates it based on process_count and node_count if possible, or defaults to 1.
- Returns
An integer value with either the value of processes_per_node or one if the former cannot be determined.
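The fallback rules of the three computed properties can be sketched as follows. `ResourceSketch` is a hypothetical stand-in written for illustration and is not the PSI/J implementation. The defaults of 1 for computed_process_count and computed_processes_per_node follow the descriptions above; rounding up in the divisions and the single-node fallback are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceSketch:
    """Hypothetical stand-in for illustration; not the psij class."""

    node_count: Optional[int] = None
    process_count: Optional[int] = None
    processes_per_node: Optional[int] = None

    @property
    def computed_process_count(self) -> int:
        # Documented behavior: the specified value, or 1 if unspecified.
        return self.process_count if self.process_count is not None else 1

    @property
    def computed_node_count(self) -> int:
        if self.node_count is not None:
            return self.node_count
        if self.processes_per_node is not None:
            # Assumption: round up so that every process has a slot.
            ppn = self.processes_per_node
            return (self.computed_process_count + ppn - 1) // ppn
        # Assumption: with nothing to calculate from, use a single node.
        return 1

    @property
    def computed_processes_per_node(self) -> int:
        if self.processes_per_node is not None:
            return self.processes_per_node
        if self.node_count is not None and self.process_count is not None:
            # Assumption: round up when the division is not exact.
            return (self.process_count + self.node_count - 1) // self.node_count
        # Documented behavior: default to 1 when it cannot be determined.
        return 1
```

Under these assumptions, `ResourceSketch(process_count=10, processes_per_node=4).computed_node_count` evaluates to 3, and an empty `ResourceSketch()` reports 1 for all three computed properties.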
- exception SubmitException(message, exception=None, transient=False)[source]¶
Bases: Exception
An exception representing job submission issues.
This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.
- Parameters
- Return type
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- transient¶
Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
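Client code could use the transient flag to decide whether a failed submission is worth re-attempting. The following is a minimal sketch under stated assumptions: the `SubmitException` class defined here is a local stand-in that mirrors the constructor signature shown above so that the example runs on its own, and `submit_with_retry`, its parameters, and the exponential back-off policy are hypothetical. In real code, the exception documented above would be caught instead of the stand-in.

```python
import time
from typing import Callable, Optional


class SubmitException(Exception):
    """Local stand-in mirroring the documented attributes; not the psij class."""

    def __init__(self, message: str, exception: Optional[Exception] = None,
                 transient: bool = False) -> None:
        super().__init__(message)
        self.message = message
        self.exception = exception
        self.transient = transient


def submit_with_retry(
    submit: Callable[[], None],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit() until it succeeds; return the number of attempts used."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            submit()
            return attempt
        except SubmitException as ex:
            # Non-transient failures, or the last attempt, are re-raised as-is.
            if not ex.transient or attempt == attempts:
                raise
            # Assumed policy: wait twice as long after each failed attempt.
            sleep(delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Since success is not guaranteed even for transient conditions, the sketch bounds the number of attempts and lets the final exception propagate to the caller.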