PSI/J Core¶
Job¶
- class Job(spec=None)[source]¶
Bases:
object
This class represents a PSI/J job.
It encapsulates all of the information needed to run a job as well as the job’s state.
When constructed, a job is in the NEW state.
- Parameters
spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.
- Return type
None
- cancel()[source]¶
Cancels this job.
The job is canceled by calling cancel() on the job executor that was used to submit this job.
- Raises
SubmitException – if the job has not yet been submitted.
- Return type
None
- property id: str¶
A read-only property containing the PSI/J job ID.
The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.
- property native_id: Optional[str]¶
A read-only property containing the native ID of the job.
The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor, in which case the value of this property is None.
- set_job_status_callback(cb)[source]¶
Registers a status callback with this job.
The callback can either be an instance of a JobStatusCallback subclass or a procedure accepting two arguments: a Job and a JobStatus.
The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by setting it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, returning nothing.
- Return type
None
- spec¶
The job specification of this job.
- property status: JobStatus¶
Contains the current status of the job.
It is guaranteed that the status returned by this property is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.
- Returns
the current state of this job
- wait(timeout=None, target_states=None)[source]¶
Waits for the job to reach certain states.
This method returns when the job reaches one of the target_states, a state following one of the target_states, or a final state, or when the amount of time indicated by the timeout parameter, if specified, passes. Returns the JobStatus object that has one of the desired states or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.
- Parameters
timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.
target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.
- Returns
the JobStatus object that caused this call to complete, or None if the timeout is specified and reached.
- Return type
Optional[JobStatus]
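The rule for which states end a wait() call can be sketched as follows. This is only an illustration of the behavior described above, not the PSI/J implementation: the State enum, the RANK table, and the ends_wait helper are hypothetical stand-ins that do not import psij.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical stand-in for JobState, used only in this sketch."""
    NEW = auto()
    QUEUED = auto()
    ACTIVE = auto()
    COMPLETED = auto()
    FAILED = auto()
    CANCELED = auto()


FINAL = {State.COMPLETED, State.FAILED, State.CANCELED}
# Assumed position of each non-final state in the state model.
RANK = {State.NEW: 0, State.QUEUED: 1, State.ACTIVE: 2}


def ends_wait(state, target_states=None):
    """Return True if reaching `state` would end a wait on `target_states`."""
    if state in FINAL:
        # A final state always ends the wait.
        return True
    if target_states is None:
        # With no targets given, only final states are waited for.
        return False
    # A non-final state ends the wait if it is a target or follows one.
    return any(
        t not in FINAL and RANK[state] >= RANK[t] for t in target_states
    )


ending = [s.name for s in State if ends_wait(s, [State.QUEUED])]
```

With target_states=[State.QUEUED], the ending list contains every state except NEW, which matches the example given in the description of wait().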
JobSpec¶
- class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None)[source]¶
Bases:
object
A class that describes the details of a job.
- Parameters
executable (Optional[str]) – An executable, such as “/bin/date”.
arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.
directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run.
name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.
inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by the environment parameter. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.
environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.
stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.
stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.
stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.
resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.
attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.
pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.
All constructor parameters are accessible as properties.
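The interaction between inherit_environment and environment can be illustrated with a small sketch. The build_environment helper and its base argument are hypothetical and stand in for whatever the executor does on the compute side. The sketch also assumes that explicitly specified variables override inherited ones and that integer values are converted to strings, neither of which is stated above.

```python
import os
from typing import Dict, Mapping, Optional, Union


def build_environment(
    environment: Optional[Mapping[str, Union[str, int]]] = None,
    inherit_environment: bool = True,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Compute the variables a job would see (illustrative only)."""
    # `base` models the environment in which the job is run.
    base = os.environ if base is None else base
    env = dict(base) if inherit_environment else {}
    for name, value in (environment or {}).items():
        # Assumption: explicit values win and are passed as strings.
        env[name] = str(value)
    return env


parent = {"HOME": "/home/foo", "THREADS": "1"}
inherited = build_environment({"THREADS": 4}, True, parent)
isolated = build_environment({"THREADS": 4}, False, parent)
```

Here inherited keeps HOME from the parent environment, while isolated contains only THREADS.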
Note
A note about paths.
It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.
Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.
For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.
When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt on the submission side. However, the job may start in /tmp on the compute node, and its standard output will then be redirected to /tmp/bar.txt.
Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
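The effect described in this note can be reproduced with pathlib alone. The resolve helper below is hypothetical and only models how a relative path is interpreted against a working directory. The directories are the ones used in the example above.

```python
from pathlib import PurePosixPath


def resolve(working_directory: str, path: str) -> PurePosixPath:
    """Resolve `path` as a process started in `working_directory` would."""
    # Joining with an absolute path discards the working directory.
    return PurePosixPath(working_directory) / path


# The same relative path seen from the launching application and from the job.
submit_side = resolve("/home/foo", "bar.txt")
compute_side = resolve("/tmp", "bar.txt")

# An absolute path names the same location regardless of working directory.
absolute = resolve("/tmp", "/home/foo/bar.txt")
```

The relative path ends up in two different places, whereas the absolute path does not depend on where the job starts. It still only refers to the same file if the filesystem is shared and mounted identically on both sides.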
- property directory: Optional[Path]¶
The directory, on the compute side, in which the executable is to be run.
- property post_launch: Optional[Path]¶
An optional path to a post-launch script.
The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
- property pre_launch: Optional[Path]¶
An optional path to a pre-launch script.
The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
- property stderr_path: Optional[Path]¶
A path to a file in which to place the standard error stream of the job.
JobAttributes¶
- class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]¶
Bases:
object
A class containing ancillary job information that describes how a job is to be run.
- Parameters
duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.
queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.
account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the executor implementation must map this attribute to the accounting/billing option in the underlying execution mechanism.
reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.
custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).
project_name (Optional[str]) – Deprecated. Please use the account attribute.
- Return type
None
All constructor parameters are accessible as properties.
- property custom_attributes: Optional[Dict[str, object]]¶
Returns a dictionary with the custom attributes.
ResourceSpec¶
- class ResourceSpec[source]¶
Bases:
ABC
A base class for resource specifications.
The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.
ResourceSpecV1¶
- class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=False)[source]¶
Bases:
ResourceSpec
This class implements V1 of the PSI/J resource specification.
Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.
- Parameters
node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.
process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.
processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.
cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.
gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.
exclusive_node_use (bool) – If this parameter is set to True, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, which is the default, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs total less than the amount available on the node.
- Return type
None
All constructor parameters are accessible as properties.
- property computed_node_count: int¶
Returns or calculates a node count.
If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.
- Returns
An integer value with the specified or calculated node count.
- property computed_process_count: int¶
Returns or calculates a process count.
If the process_count property is specified, this method returns it, otherwise it returns 1.
- Returns
An integer value with either the value of process_count or one if the former is not specified.
- property computed_processes_per_node: int¶
Returns or calculates the number of processes per node.
If the processes_per_node property is specified, this method returns it, otherwise calculates it based on process_count and node_count if possible, or defaults to 1.
- Returns
An integer value with either the value of processes_per_node or one if the former cannot be determined.
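The constraint process_count = node_count * processes_per_node described for ResourceSpecV1 can be sketched as a small inference function. The infer_counts helper is hypothetical and is not the PSI/J implementation. In particular, raising an error when the counts do not divide evenly is an assumption of this sketch, and positive values are assumed throughout.

```python
from typing import Optional, Tuple

Counts = Tuple[Optional[int], Optional[int], Optional[int]]


def infer_counts(
    node_count: Optional[int] = None,
    process_count: Optional[int] = None,
    processes_per_node: Optional[int] = None,
) -> Counts:
    """Return (node_count, process_count, processes_per_node) after inference."""
    n, p, ppn = node_count, process_count, processes_per_node
    if n is not None and p is not None and ppn is not None:
        # All three given: they must satisfy the constraint.
        if p != n * ppn:
            raise ValueError("process_count != node_count * processes_per_node")
    elif n is not None and ppn is not None:
        p = n * ppn
    elif p is not None and ppn is not None:
        if p % ppn != 0:
            raise ValueError("process_count is not a multiple of processes_per_node")
        n = p // ppn
    elif n is not None and p is not None:
        if p % n != 0:
            raise ValueError("process_count is not a multiple of node_count")
        ppn = p // n
    # With fewer than two values given, nothing can be inferred.
    return n, p, ppn


two_nodes = infer_counts(node_count=2, processes_per_node=4)
from_processes = infer_counts(process_count=8, processes_per_node=4)
```

Both calls describe the same allocation: two nodes running four processes each, for eight processes in total.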
JobStatus¶
- class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]¶
Bases:
object
A class containing details about job transitions to new states.
- Parameters
time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.
message (Optional[str]) – An optional message associated with the transition.
exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.
metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.
- Return type
None
All constructor parameters are accessible as properties.
JobState¶
- class JobState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
-
An enumeration holding the possible job states.
The possible states are: NEW, QUEUED, ACTIVE, COMPLETED, FAILED, and CANCELED.
- ACTIVE = 2¶
This state represents an actively running job.
- COMPLETED = 3¶
This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.
- FAILED = 4¶
Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.
- NEW = 0¶
This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.
- QUEUED = 1¶
This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.
- property final: bool¶
Returns True if this state is final.
A state is final when no other state transition can occur after that state has been reached.
- Returns
True if this is a final state and False otherwise
- is_greater_than(other)[source]¶
Defines a (strict) partial ordering on the states.
Not all states are comparable. State transitions cannot violate this ordering.
- Parameters
other (JobState) – the other JobState to compare to
- Returns
if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.
- Return type
Optional[bool]
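A strict partial ordering of this kind can be sketched with a stand-in enumeration. The State class below is hypothetical and does not import psij. The sketch assumes that the three final states are not comparable with each other, that every final state comes after every non-final state, and that CANCELED has the value 5. None of these details is stated above.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical stand-in for JobState, used only in this sketch."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5  # value assumed for illustration

    @property
    def final(self) -> bool:
        return self in (State.COMPLETED, State.FAILED, State.CANCELED)

    def is_greater_than(self, other: "State") -> Optional[bool]:
        """Return True/False if comparable with `other`, None otherwise."""
        if self is other:
            # The ordering is strict: no state comes after itself.
            return False
        if self.final and other.final:
            # Assumption: distinct final states are not comparable.
            return None
        if self.final:
            return True
        if other.final:
            return False
        return self.value > other.value


after_queue = State.ACTIVE.is_greater_than(State.QUEUED)
finals = State.FAILED.is_greater_than(State.COMPLETED)
```

Callers need to distinguish None from False explicitly, for example with an `is None` check, because both values are falsy.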
Miscellaneous¶
psij.utils module¶
- class SingletonThread(name=None, daemon=False)[source]¶
Bases:
Thread
A convenience class to return a thread that is guaranteed to be unique to this process.
This is intended to work with fork() to ensure that each os.getpid() value is associated with at most one thread. This is not safe. The safe thing, as pointed out by the fork() man page, is to not use fork() with threads. However, this is here in an attempt to make it slightly safer for when users really really want to take the risk against all advice.
This class is meant as an abstract class and should be used by subclassing and implementing the run method.
Instantiation of this class or one of its subclasses should be done through the get_instance() method rather than directly.
- Parameters
- Return type
None
psij.version module¶
This module stores the current version of this library.
Descriptor¶
- class Descriptor(name, version, cls, aliases=None, nice_name=None)[source]¶
Bases:
object
This class is used to enable PSI/J to discover and register executors and/or launchers.
Executors wanting to register with PSI/J must place an instance of this class in a global module list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__ in a module placed in the psij-descriptors namespace package. In other words, in order to automatically register an executor or launcher, a python file should be created inside a psij-descriptors package, such as:
<project_root>/
    src/
        psij-descriptors/
            descriptors_for_project.py
It is essential that the psij-descriptors package not contain an __init__.py file in order for Python to treat the package as a namespace package. This allows Python to combine multiple psij-descriptors directories into one, which, in turn, allows PSI/J to detect and load all descriptors that can be found in Python’s library search path.
The contents of descriptors_for_project.py could then be as follows:
from packaging.version import Version
from psij.descriptor import Descriptor

__PSI_J_EXECUTORS__ = [
    Descriptor(name=<name>, version=Version(<version_str>), cls=<fqn_str>),
    ...
]

__PSI_J_LAUNCHERS__ = [
    Descriptor(name=<name>, version=Version(<version_str>), cls=<fqn_str>),
    ...
]
where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.
- Parameters
name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().
version (Version) – The version of the executor/launcher. Multiple versions can be registered under a single name.
cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.
aliases (Optional[List[str]]) – An optional list of alternative names under which the executor or launcher is also made available, as if each alias were its name.
nice_name (Optional[str]) – An optional string to use whenever a user-friendly name needs to be displayed to a user. For example, a nice name for pbs would be PBS or Portable Batch System. If not specified, the nice_name defaults to the value of the name parameter.
- Return type
None
Exceptions¶
psij.exceptions module¶
A collection of exceptions used by PSI/J.
- exception InvalidJobException(message, exception=None)[source]¶
Bases:
Exception
An exception describing a problem with a job specification.
- Parameters
- Return type
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- exception SubmitException(message, exception=None, transient=False)[source]¶
Bases:
Exception
An exception representing job submission issues.
This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.
- Parameters
- Return type
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- transient¶
Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
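One way client code might act on the transient flag is a bounded retry loop. The sketch below is an assumption about usage rather than a documented pattern: SubmitError is a hypothetical stand-in that mirrors the message, exception, and transient attributes of SubmitException, and submit_with_retry and flaky_submit are illustrative names.

```python
import time


class SubmitError(Exception):
    """Hypothetical stand-in mirroring the documented SubmitException fields."""

    def __init__(self, message, exception=None, transient=False):
        super().__init__(message)
        self.message = message
        self.exception = exception
        self.transient = transient


def submit_with_retry(submit, attempts=3, delay=0.0):
    """Call submit(); retry only when the failure is flagged as transient."""
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except SubmitError as err:
            # Give up on non-transient errors or once attempts run out.
            if not err.transient or attempt == attempts:
                raise
            time.sleep(delay)


calls = []


def flaky_submit():
    """Fail twice with a transient error, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise SubmitError("name resolution failed", transient=True)
    return "submitted"


result = submit_with_retry(flaky_submit)
```

A non-transient error, such as invalid credentials, is re-raised on the first attempt, since retrying would not be expected to help.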
Executors¶
The concrete executor implementations provided by this version of PSI/J Python are:
Cobalt¶
- class CobaltJobExecutor(url=None, config=None)[source]¶
Bases:
BatchSchedulerExecutor
A JobExecutor for the Cobalt Workload Manager.
The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.
Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.
Creates a batch script with #COBALT directives when submitting a job.
Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.
- Parameters
url (Optional[str]) – This parameter is not used and is only provided for compatibility reasons.
config (Optional[CobaltExecutorConfig]) – An optional configuration for this executor.
- Return type
None
CobaltExecutorConfig¶
- class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases:
BatchSchedulerExecutorConfig
A configuration class for the Cobalt executor.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
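The role of queue_polling_error_threshold can be pictured as a counter of consecutive failed polls. The PollFailureTracker class below is a hypothetical sketch, not the executor's implementation. It assumes that failures are reported once the number of consecutive failed polls reaches the threshold and that a successful poll resets the count.

```python
class PollFailureTracker:
    """Track consecutive queue poll failures (illustrative only)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one poll outcome; return True if failures should be reported."""
        if ok:
            # Assumption: a successful poll resets the failure count.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


tracker = PollFailureTracker(threshold=2)
# One failure, a success, then two failures in a row.
outcomes = [tracker.record(ok) for ok in (False, True, False, False)]
```

Only the last poll in this sequence triggers a report, because the earlier isolated failure is cleared by the successful poll that follows it.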
Flux¶
- class FluxJobExecutor(url=None, config=None)[source]¶
Bases:
JobExecutor
A JobExecutor for the Flux scheduler.
The Flux resource manager framework is deployed and used on a per-user basis at many sites, and is slated to become the system-level resource manager at LLNL.
Uses Flux’s python library/bindings to submit, monitor, and manipulate jobs.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[JobExecutorConfig]) – The FluxJobExecutor does not have any configuration options.
- Return type
None
Local¶
- class LocalJobExecutor(url=None, config=None)[source]¶
Bases:
JobExecutor
A job executor that runs jobs locally using subprocess.Popen.
This job executor is intended to be used either to run jobs directly on the same machine as the PSI/J library or for testing purposes.
Note
In Linux, attached jobs always appear to complete with a zero exit code regardless of the actual exit code.
Warning
Instantiation of a local executor from both parent process and a fork()-ed process is not guaranteed to work. In general, using fork() and multi-threading in Linux is unsafe, as suggested by the fork() man page. While PSI/J attempts to minimize problems that can arise when fork() is combined with threads (which are used by PSI/J), no guarantees can be made and the chances of unexpected behavior are high. Please do not use PSI/J with fork(). If you do, please be mindful that support for using PSI/J with fork() will be limited.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (JobExecutorConfig) – The LocalJobExecutor does not have any configuration options.
- Return type
None
LSF¶
- class LsfJobExecutor(url, config=None)[source]¶
Bases:
BatchSchedulerExecutor
A JobExecutor for the LSF Workload Manager.
The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.
Uses the bsub, bjobs, and bkill commands, respectively, to submit, monitor, and cancel jobs.
Creates a batch script with #BSUB directives when submitting a job.
Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[LsfExecutorConfig]) – An optional configuration for this executor.
LsfExecutorConfig¶
- class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases:
BatchSchedulerExecutorConfig
A configuration class for the LSF executor.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
PBS Pro¶
- class PBSJobExecutor(url=None, config=None)[source]¶
Bases:
GenericPBSJobExecutor
A JobExecutor for PBS Pro and friends.
This executor uses resource specifications specific to PBS Pro.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[PBSExecutorConfig]) – An optional configuration for this executor.
PBS Classic¶
- class PBSClassicJobExecutor(url=None, config=None)[source]¶
Bases:
GenericPBSJobExecutor
A JobExecutor for classic PBS systems.
This executor uses resource specifications specific to Open PBS. Specifically, this executor uses the -l nodes=n:ppn=m way of specifying nodes, which differs from the scheme used by PBS Pro.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[PBSExecutorConfig]) – An optional configuration for this executor.
Radical Pilot¶
- class RPJobExecutor(url=None, config=None)[source]¶
Bases:
JobExecutor
A job executor that runs jobs via the RADICAL Pilot system.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[JobExecutorConfig]) – The RPJobExecutor does not have any configuration options.
- Return type
None
Slurm¶
- class SlurmJobExecutor(url=None, config=None)[source]¶
Bases:
BatchSchedulerExecutor
A JobExecutor for the Slurm Workload Manager.
The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.
Uses the sbatch, squeue, and scancel commands, respectively, to submit, monitor, and cancel jobs.
Creates a batch script with #SBATCH directives when submitting a job.
Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[SlurmExecutorConfig]) – An optional configuration for this executor.
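The translation of prefixed custom attributes into long-form directives can be sketched as follows. The render_directives function is hypothetical and only mirrors the behavior described for the Slurm executor. The actual executor generates its submit script by its own means, which are not shown here.

```python
from typing import Dict, List


def render_directives(
    custom_attributes: Dict[str, object],
    prefix: str = "slurm",
    directive: str = "#SBATCH",
) -> List[str]:
    """Render attributes named '<prefix>.<option>' as long-form directives."""
    name_prefix = prefix + "."
    lines = []
    for key, value in custom_attributes.items():
        # Attributes meant for other executors are ignored in this sketch.
        if key.startswith(name_prefix):
            option = key[len(name_prefix):]
            lines.append(f"{directive} --{option}={value}")
    return lines


lines = render_directives({"slurm.qos": "debug", "lsf.core_isolation": "0"})
```

Only the slurm.qos entry is rendered in this call, which yields the #SBATCH --qos=debug directive from the example above.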
SlurmExecutorConfig¶
- class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases:
BatchSchedulerExecutorConfig
A configuration class for the Slurm executor.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
Executor Infrastructure¶
JobExecutor¶
- class JobExecutor(url=None, config=None)[source]¶
Bases:
ABC
An abstract base class for all JobExecutor implementations.
- Parameters
url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.
config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional so that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.
- abstract cancel(job)[source]¶
Cancels a job that has been submitted to the underlying executor implementation.
A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to
CANCELED
. User code can synchronously wait until the
CANCELED
state is reached using job.wait(JobState.CANCELED) or even job.wait(), since the latter would wait for all final states, including JobState.CANCELED. In fact, it is recommended that job.wait() be used because it is entirely possible for the job to complete before the cancellation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job will never enter the CANCELED state and job.wait(JobState.CANCELED) would hang indefinitely.
- Parameters
job (Job) – The job to be canceled.
- Raises
SubmitException – Thrown if the request cannot be sent to the underlying implementation.
- Return type
None
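The recommendation to wait for any final state rather than for CANCELED specifically can be illustrated with a toy model. The sketch below does not use PSI/J; ToyJob, State, and FINAL_STATES are hypothetical stand-ins, and a timeout is used so that the problematic wait returns instead of hanging.

```python
import threading
from enum import Enum
from typing import Optional, Set


class State(Enum):
    ACTIVE = 1
    COMPLETED = 2
    FAILED = 3
    CANCELED = 4


# Stand-in for "all final states" in this toy model.
FINAL_STATES = {State.COMPLETED, State.FAILED, State.CANCELED}


class ToyJob:
    """A hypothetical job whose state changes are signaled via a condition."""

    def __init__(self) -> None:
        self.state = State.ACTIVE
        self._cond = threading.Condition()

    def set_state(self, state: State) -> None:
        with self._cond:
            self.state = state
            self._cond.notify_all()

    def wait(self, targets: Optional[Set[State]] = None,
             timeout: Optional[float] = None) -> bool:
        """Return True if one of the target states was reached in time."""
        targets = targets or FINAL_STATES
        with self._cond:
            return self._cond.wait_for(lambda: self.state in targets, timeout)


job = ToyJob()
# The job completes before the cancellation request takes effect.
job.set_state(State.COMPLETED)

reached_canceled = job.wait({State.CANCELED}, timeout=0.1)  # never satisfied
reached_final = job.wait(timeout=0.1)                       # satisfied at once
print(reached_canceled, reached_final)
```

Without the timeout, the first wait in this model would block forever, which mirrors the hang described above.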
- static get_executor_names()[source]¶
Returns a set of registered executor names.
Names returned by this method can be passed to
get_instance()
as the name parameter.
- static get_instance(name, version_constraint=None, url=None, config=None)[source]¶
Returns an instance of a JobExecutor.
- Parameters
name (str) – The name of the executor to return. This must be one of the values returned by
get_executor_names()
. If the value of the name parameter is not one of the valid values returned by
get_executor_names()
, ValueError is raised.
version_constraint (Optional[str]) – A version constraint for the executor in the form ‘(’ <op> <version>[, <op> <version>[, …]] ‘)’, such as “( > 0.0.2, != 0.0.4)”.
url (Optional[str]) – An optional URL to pass to the JobExecutor instance.
config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.
- Returns
A JobExecutor.
- Return type
- abstract list()[source]¶
List native IDs of all jobs known to the backend.
This method is meant to return a list of native IDs for jobs submitted to the backend by any means, not necessarily through this executor or through PSI/J.
- static register_executor(desc, root)[source]¶
Registers a JobExecutor class through a
Descriptor
.
The class can later be instantiated using
get_instance()
.
- Parameters
desc (Descriptor) – A
Descriptor
with information about the executor to be registered.
root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- set_job_status_callback(cb)[source]¶
Registers a status callback with this executor.
The callback can either be a subclass of
JobStatusCallback
or a procedure accepting two arguments: a
Job
and a
JobStatus
.
The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of
JobStatusCallback
or a callable with two parameters: job of type
Job
and job_status of type
JobStatus
.
- Return type
None
- abstract submit(job)[source]¶
Submits a Job to the underlying implementation.
Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.
A successful return of this method guarantees that the job’s native_id property is set.
- Raises
InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.
SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.
- Parameters
job (Job) –
- Return type
None
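Because InvalidJobException is fatal while SubmitException may be transient, client code might retry only the latter. The following is a sketch of that idea under stated assumptions: the two exception classes are local stand-ins for the PSI/J exceptions of the same name, and submit_with_retry, its parameters, and the flaky_submit function are hypothetical.

```python
import time
from typing import Callable, List, Optional


class InvalidJobException(Exception):
    """Stand-in for the PSI/J exception of the same name."""


class SubmitException(Exception):
    """Stand-in for the PSI/J exception of the same name."""


def submit_with_retry(submit: Callable[[object], None], job: object,
                      attempts: int = 3, delay: float = 0.0) -> None:
    """Call submit(job), retrying only on SubmitException."""
    last_error: Optional[SubmitException] = None
    for _ in range(attempts):
        try:
            submit(job)
            return
        except InvalidJobException:
            # Fatal: resubmitting the same job would fail the same way.
            raise
        except SubmitException as ex:
            # Possibly transient: remember the error and try again.
            last_error = ex
            time.sleep(delay)
    if last_error is not None:
        raise last_error


calls: List[object] = []


def flaky_submit(job: object) -> None:
    """Hypothetical submit function that fails twice, then succeeds."""
    calls.append(job)
    if len(calls) < 3:
        raise SubmitException("backend temporarily unavailable")


submit_with_retry(flaky_submit, "job-1")
print(len(calls))
```

Since a failed submit leaves the job status unchanged and fires no notifications, retrying the same job object is consistent with the contract described above.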
- property version: packaging.version.Version¶
Returns the version of this executor.
JobExecutorConfig¶
- class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]¶
Bases:
object
An abstract configuration class for
JobExecutor
instances.
- Parameters
launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- Return type
None
- DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>¶
A default JobExecutorConfig used when none is specified.
- DEFAULT_WORK_DIRECTORY = PosixPath('/home/runner/.psij/work')¶
The default work directory when a work directory is not explicitly specified.
- property launcher_log_file: Optional[Path]¶
Configure the executor’s launcher log file.
- Parameters
launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
- property work_directory: Path¶
Configure the executor’s work directory.
- Parameters
work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
psij.executors.batch.batch_scheduler_executor module¶
- class BatchSchedulerExecutor(url=None, config=None)[source]¶
Bases:
JobExecutor
A base class for batch scheduler executors.
This class implements a generic
JobExecutor
that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.
The submit script is generated using
generate_submit_script()
. An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in
TemplatedScriptGenerator
. This class can be instantiated by concrete implementations of a batch scheduler executor and the submit script generation can be delegated to that instance, which has a method whose signature matches that of
generate_submit_script()
. Besides an opened file which points to where the contents of the submit script are to be written, the parameters to
generate_submit_script()
are the
Job
that is being submitted and a context, which is a dictionary with the following structure:
{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}
The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.
The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper
Launcher
.
The function library is a dictionary mapping function names to functions for all public functions in the
template_function_library
module.
The submit script must perform two essential actions:
1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.
2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.
Additionally, where appropriate, the submit script should set the environment variable named
PSIJ_NODEFILE
to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.
Once the submit script is generated, the executor renders the submit command using
get_submit_command()
and executes it. Its output is then parsed using
job_id_from_submit_output()
to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.
The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using
get_status_command()
, which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using
parse_status_output()
and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
- Parameters
url (Optional[str]) – An optional URL pointing to a specific backend
config (Optional[BatchSchedulerExecutorConfig]) – A configuration for this executor instance; if none is specified, a default configuration is used.
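The handling of a job that no longer appears in the queue status output can be sketched as follows. This is a simplified illustration rather than the base class's code: resolve_missing_job is a hypothetical helper, state names are plain strings instead of JobStatus objects, and the example assumes the <native_id>.ec and <native_id>.out files described above exist in the script directory.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_missing_job(script_dir: Path,
                        native_id: str) -> Tuple[str, Optional[str]]:
    """Decide the final state of a job that vanished from the queue.

    Reads the exit code file; on a non-zero exit code, tries to read
    an error message from the script output file.
    """
    ec_file = script_dir / f"{native_id}.ec"
    out_file = script_dir / f"{native_id}.out"
    exit_code = int(ec_file.read_text().strip())
    if exit_code == 0:
        return "COMPLETED", None
    message = out_file.read_text().strip() if out_file.exists() else None
    return "FAILED", message


# Simulate the auxiliary files left behind by a failed job with native id 42.
with tempfile.TemporaryDirectory() as tmp:
    script_dir = Path(tmp)
    (script_dir / "42.ec").write_text("1\n")
    (script_dir / "42.out").write_text("executable not found\n")
    result = resolve_missing_job(script_dir, "42")

print(result)
```

This also shows why the work directory must be on a shared filesystem in a cluster: the process that polls the queue has to be able to read files written by the job.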
- attach(job, native_id)[source]¶
Attaches a job to a native job.
Attempts to connect job to a native job with native_id such that the job correctly reflects updates to the status of the native job. If the native job was previously submitted using this executor (hence having an exit code file and a script output file), the executor will attempt to retrieve the exit code and errors from the job. Otherwise, it may be impossible for the executor to distinguish between a failed and successfully completed job.
- cancel(job)[source]¶
Cancels a job if it has not otherwise completed.
A command is constructed using
get_cancel_command()
and executed in order to cancel the job. Also see
cancel()
.
- Parameters
job (Job) –
- Return type
None
- abstract generate_submit_script(job, context, submit_file)[source]¶
Called to generate a submit script for a job.
Concrete implementations of batch scheduler executors must override this method in order to generate a submit script for a job.
- Parameters
job (Job) – The job to be submitted.
context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see the description of this class.
submit_file (IO[str]) – An opened file-like object to which the contents of the submit script should be written.
- Return type
None
- abstract get_cancel_command(native_id)[source]¶
Constructs a command to cancel a batch scheduler job.
Concrete implementations of batch scheduler executors must override this method.
- abstract get_list_command()[source]¶
Constructs a command to retrieve the list of jobs known to the LRM for the current user.
Concrete implementations of batch scheduler executors must override this method. Upon running the command, the output can be parsed with
parse_list_output()
.
- abstract get_status_command(native_ids)[source]¶
Constructs a command to retrieve the status of a list of jobs.
Concrete implementations of batch scheduler executors must override this method. In order to prevent overloading the queueing system, concrete implementations are strongly encouraged to return a command that only queries for the status of the indicated jobs. The command returned by this method should produce an output that is understood by
parse_status_output()
.
- Parameters
native_ids (Collection[str]) – A collection of native ids corresponding to the jobs whose status is sought.
- Returns
A list of strings representing the command and arguments to execute in order to get the status of the jobs.
- Return type
- abstract get_submit_command(job, submit_file_path)[source]¶
Constructs a command to submit a job to a batch scheduler.
Concrete implementations of batch scheduler executors must override this method.
- Parameters
job (Job) – The job being submitted.
submit_file_path (Path) – The path to a submit script generated using
generate_submit_script()
.
- Returns
A list of strings representing the command and arguments to execute in order to submit the job, such as [‘qsub’, str(submit_file_path)].
- Return type
- abstract job_id_from_submit_output(out)[source]¶
Extracts a native job id from the output of the submit command.
Concrete implementations of batch scheduler executors must override this method. This method is only invoked if the submit command completes with a zero exit code, so implementations of this method do not need to determine whether the output reflects an error from the submit command.
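As an illustration, Slurm's sbatch typically reports a successful submission with a line of the form "Submitted batch job <id>". The sketch below extracts the id from such output; it is a standalone function written for this example and is not necessarily how any PSI/J executor implements the method.

```python
import re

# Pattern for the usual sbatch success message.
_SUBMIT_RE = re.compile(r"Submitted batch job (\d+)")


def job_id_from_submit_output(out: str) -> str:
    """Extract the native job id from sbatch-style submit output."""
    match = _SUBMIT_RE.search(out)
    if match is None:
        raise ValueError(f"Unrecognized submit output: {out!r}")
    return match.group(1)


native_id = job_id_from_submit_output("Submitted batch job 123456\n")
print(native_id)
```

Because the method is only invoked after a zero exit code, the ValueError branch here is a defensive check for unexpected output rather than a way of detecting submit failures.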
- list()[source]¶
Returns a list of jobs known to the underlying implementation.
See
list()
. The returned list is a list of native_id strings representing jobs known to the underlying batch scheduler implementation, whether submitted through this executor or not. Implementations are encouraged to restrict the results to jobs accessible by the current user.
- parse_list_output(out)[source]¶
Parses the output of the command obtained from
get_list_command()
.
The default implementation of this method assumes that the output has no header and consists of native IDs, one per line, possibly surrounded by whitespace. Concrete implementations should override this method if a different format is expected.
- Parameters
out (str) – The output from the “list” command as returned by
get_list_command()
.
- Returns
A list of strings representing the native IDs of the jobs known to the LRM for the current user.
- Return type
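The format assumed by the default implementation (no header, one native ID per line, optional surrounding whitespace) can be parsed with a few lines of Python. This is a sketch of that behavior as described, written as a free function; skipping blank lines is an assumption of this example.

```python
from typing import List


def parse_list_output(out: str) -> List[str]:
    """Return the native IDs found in header-less, one-ID-per-line output."""
    # Blank lines are skipped (an assumption made for this sketch).
    return [line.strip() for line in out.splitlines() if line.strip()]


ids = parse_list_output("  101 \n102\n\n   103\n")
print(ids)
```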
- abstract parse_status_output(exit_code, out)[source]¶
Parses the output of a job status command.
Concrete implementations of batch scheduler executors must override this method. The output is meant to have been produced by the command generated by
get_status_command()
.
- Parameters
out (str) – The string output of the status command as prescribed by
get_status_command()
.
exit_code (int) –
- Returns
A dictionary mapping native job ids to
JobStatus
objects. The implementation of this method need not process the exit code file or the script output file since it is done by the base BatchSchedulerExecutor implementation.
- Return type
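A status parser might look like the sketch below, which assumes the status command prints one "<native_id> <state_code>" pair per line using Slurm's compact state codes. For simplicity it maps ids to state names as plain strings rather than to JobStatus objects; the function name, the mapping table, and the "UNKNOWN" placeholder are all hypothetical.

```python
from typing import Dict

# Assumed mapping from Slurm compact state codes to PSI/J state names.
_STATE_MAP = {
    "PD": "QUEUED",     # pending
    "R": "ACTIVE",      # running
    "CG": "ACTIVE",     # completing
    "CD": "COMPLETED",  # completed
    "F": "FAILED",      # failed
    "CA": "CANCELED",   # canceled
}


def parse_status_lines(out: str) -> Dict[str, str]:
    """Map native ids to state names from '<id> <code>' lines."""
    states: Dict[str, str] = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        native_id, code = parts
        # "UNKNOWN" is a placeholder for codes this sketch does not map.
        states[native_id] = _STATE_MAP.get(code, "UNKNOWN")
    return states


statuses = parse_status_lines("101 PD\n102 R\n103 CD\n")
print(statuses)
```

Jobs missing from the returned dictionary are the ones the base class resolves through the exit code file, so the parser itself does not need to handle them.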
- abstract process_cancel_command_output(exit_code, out)[source]¶
Handle output from a failed cancel command.
The main purpose of this method is to help distinguish between the cancel command failing due to an invalid job state (such as the job having completed before the cancel command was invoked) and other types of errors. Since job state errors are ignored, there are two options:
1. Instruct the cancel command to not fail on invalid state errors and have this method always raise a
SubmitException
, since it is only invoked on “other” errors.
2. Have the cancel command fail on both invalid state errors and other errors and interpret the output from the cancel command to distinguish between the two and raise the appropriate exception.
- Parameters
- Raises
InvalidJobStateError – Raised if the job cancellation has failed because the job was in a completed or failed state at the time when the cancellation command was invoked.
SubmitException – Raised for all other reasons.
- Return type
None
- class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases:
JobExecutorConfig
A base configuration class for
BatchSchedulerExecutor
implementations.
When subclassing
BatchSchedulerExecutor
, specific configuration classes inheriting from this class should be defined, even if empty.
- Parameters
launcher_log_file (Optional[Path]) – See
JobExecutorConfig
.
work_directory (Optional[Path]) – See
JobExecutorConfig
.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, a short delay can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail before the executor reports them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
- exception InvalidJobStateError[source]¶
Bases:
Exception
An exception that signals that a job cannot be canceled because it is already done.
- check_status_exit_code(command, exit_code, out)[source]¶
Check if exit_code is nonzero and, if so, raise a RuntimeError.
This function produces a somewhat user-friendly exception message that combines the command that was run with its output.
SubmitScriptGenerator¶
- class SubmitScriptGenerator(config)[source]¶
Bases:
ABC
A base class representing a submit script generator.
A submit script generator is used to render a
Job
(together with all its properties, including
JobSpec
,
ResourceSpec
, etc.) into a submit script specific to a certain batch scheduler.
- Parameters
config (JobExecutorConfig) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.
- Return type
None
- generate_submit_script(job, context, out)[source]¶
Generates a job submit script.
Concrete implementations of submit script generators must implement this method. Its purpose is to generate the content of the submit script. For an extensive explanation of the mechanism behind this process, see
BatchSchedulerExecutor
.
- Parameters
job (Job) – The job for which the submit script is to be generated.
context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see
BatchSchedulerExecutor
.
out (IO[str]) – An opened file-like object to which the contents of the submit script should be written.
- Return type
None
TemplatedScriptGenerator¶
- class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)[source]¶
Bases:
SubmitScriptGenerator
A submit script generator based on Mustache templates.
This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).
- Parameters
config (JobExecutorConfig) – A configuration, which is passed to the base class.
template_path (Path) – The path to a Mustache template.
escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.
- Return type
None
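The role of the escape parameter can be illustrated without Pystache. The sketch below is a deliberately simplified placeholder substitution, not Mustache itself: it handles only plain {{name}} tags, and it uses shlex.quote from the standard library as a stand-in for the bash escaping function mentioned above. The render function and the template text are assumptions made for this example.

```python
import re
import shlex
from typing import Callable, Dict

# Matches simple {{name}} placeholders only; real Mustache supports much more.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def shell_escape(value: object) -> str:
    """Stand-in escape function: quote a value for safe use in a shell script."""
    return shlex.quote(str(value))


def render(template: str, values: Dict[str, object],
           escape: Callable[[object], str] = shell_escape) -> str:
    """Replace each {{name}} with the escaped value of values[name]."""
    return _PLACEHOLDER.sub(lambda m: escape(values[m.group(1)]), template)


rendered = render("echo {{ message }}\n", {"message": "hello world; ls"})
```

Because every substituted value passes through the escape function, a value containing shell metacharacters ends up as a single quoted word rather than as extra commands.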
psij.executors.batch.template_function_library module¶
- ALL: Dict[str, Callable[[...], Any]] = {'walltime_to_minutes': <function walltime_to_minutes>}¶
A dictionary of all template-accessible functions for the batch executor templating mechanism.
The dictionary maps function names to their implementations. All public functions in this module are present in this dictionary, keyed by their own names.
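The name-to-function mapping described here can be sketched as follows. The body of walltime_to_minutes below is an assumption for illustration (it takes a timedelta and rounds up to whole minutes); the actual signature and rounding behaviour of the library function are not stated in this section.

```python
import math
from datetime import timedelta
from typing import Any, Callable, Dict


def walltime_to_minutes(walltime: timedelta) -> int:
    """Assumed behaviour: convert a duration to whole minutes, rounding up."""
    return math.ceil(walltime.total_seconds() / 60)


# Each key is the function's own name, as the description above requires.
ALL: Dict[str, Callable[..., Any]] = {
    func.__name__: func for func in (walltime_to_minutes,)
}

minutes = ALL["walltime_to_minutes"](timedelta(hours=1, seconds=30))
```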
Launchers¶
aprun¶
- class AprunLauncher(config=None)[source]¶
Bases:
MultipleLauncher
Launches a job using Cobalt’s aprun.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration.
jsrun¶
- class JsrunLauncher(config=None)[source]¶
Bases:
MultipleLauncher
Launches a job using LSF’s jsrun.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration.
mpirun¶
- class MPILauncher(config=None)[source]¶
Bases:
MultipleLauncher
Launches jobs using mpirun. mpirun is a tool provided by MPI implementations, such as Open MPI.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration.
multiple¶
- class MultipleLauncher(script_path=PosixPath('/home/runner/work/psij-python/psij-python/src/psij/launchers/scripts/multi_launch.sh'), config=None)[source]¶
Bases:
ScriptBasedLauncher
A launcher that launches multiple identical copies of the executable.
The exit code of the job corresponds to the first non-zero exit code encountered in one of the executable copies or zero if all invocations of the executable succeed.
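The exit code rule stated above can be written as a small function. This is only a sketch of the rule, not the launcher script itself; combined_exit_code is a hypothetical name, and the order of the input is assumed to be the order in which the copies' exit codes are encountered.

```python
from typing import Iterable


def combined_exit_code(exit_codes: Iterable[int]) -> int:
    """Return the first non-zero exit code, or zero if every copy succeeded."""
    for code in exit_codes:
        if code != 0:
            return code
    return 0
```

For example, exit codes 0, 3, 0, 7 combine to 3, while a run in which every copy exits with 0 combines to 0.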
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration.
script_path (Path) –
single¶
- class SingleLauncher(config=None)[source]¶
Bases:
ScriptBasedLauncher
A launcher that launches a single copy of the executable. This is the default launcher.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration.
srun¶
- class SrunLauncher(config=None)[source]¶
Bases:
MultipleLauncher
Launches a job using Slurm’s srun. See the Slurm Workload Manager.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration.
Launcher Infrastructure¶
Launcher¶
- class Launcher(config=None)[source]¶
Bases:
ABC
An abstract base class for all launchers.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.
- Return type
None
- static get_instance(name, version_constraint=None, config=None)[source]¶
Returns an instance of a launcher optionally configured using a certain configuration.
The returned instance may or may not be a singleton object.
- abstract get_launch_command(job)[source]¶
Constructs a command to launch a job given a job specification.
- abstract get_launcher_failure_message(output)[source]¶
Extracts the launcher error message from the output of this launcher’s invocation.
It is understood that the value of the output parameter is such that is_launcher_failure() returns True on it.
- static get_launcher_names()[source]¶
Returns a set of registered launcher names.
Names returned by this method can be passed to get_instance() as the name parameter.
- abstract is_launcher_failure(output)[source]¶
Determines whether the launcher invocation output contains a launcher failure or not.
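One way to picture how is_launcher_failure() and get_launcher_failure_message() fit together is sketched below. It assumes the completion-marker convention that ScriptBasedLauncher documents, and further assumes that the marker is the final line of the captured output. Both functions are free-standing illustrations, not the library's implementations.

```python
DONE_MARKER = "_PSI_J_LAUNCHER_DONE"


def is_launcher_failure(output: str) -> bool:
    """Assumption: the launcher succeeded only if its last line is the marker."""
    lines = output.rstrip("\n").split("\n")
    return lines[-1] != DONE_MARKER


def get_launcher_failure_message(output: str) -> str:
    """Return the launcher output as the error message.

    Intended to be called only when is_launcher_failure(output) is True.
    """
    return output.strip()
```

With this convention, a non-zero exit code accompanied by the marker points to an application failure, whereas missing marker output points to the launcher itself.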
- static register_launcher(desc, root)[source]¶
Registers a launcher class.
The registered class can then be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the launcher to register.
root (str) – A filesystem path from under which the launcher implementation is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method raises an exception.
- Return type
None
ScriptBasedLauncher¶
- class ScriptBasedLauncher(script_path, config=None)[source]¶
Bases:
Launcher
A launcher that uses a script to start the job, possibly by wrapping it in other tools.
This launcher is an abstract base class for launchers that wrap the job in a script. The script must be a bash script and is invoked with the first four parameters as:
the job ID
a launcher log file, which is taken from the launcher_log_file configuration setting and defaults to /dev/null
the pre- and post- launcher scripts, or empty strings if they are not specified
Additional positional arguments to the script can be specified by subclasses by overriding the get_additional_args() method.
The remaining arguments to the script are the job executable and arguments.
A simple script library is provided in scripts/launcher_lib.sh. Its use is optional and it is intended to be included at the beginning of a main launcher script using source $(dirname "$0")/launcher_lib.sh. It does the following:
sets ‘-e’ mode (exit on error)
sets the variables _PSI_J_JOB_ID, _PSI_J_LOG_FILE, _PSI_J_PRE_LAUNCH, and _PSI_J_POST_LAUNCH from the first arguments, as specified above.
saves the current stdout and stderr in descriptors 3 and 4, respectively
redirects stdout and stderr to the log file, while prepending a timestamp and the job ID to each line
defines the commands “pre_launch” and “post_launch”, which can be invoked by the main script.
When invoking the job executable (either directly or through a launch command), it is recommended that the stdout and stderr of the job process be redirected to descriptors 3 and 4, respectively, so that they can be captured by the entity invoking the launcher rather than ending up in the launcher log file.
The launcher should signal successful completion by printing the string “_PSI_J_LAUNCHER_DONE” to stdout. The launcher can then exit with the exit code returned by the launched command. This allows the executor to distinguish between a non-zero exit code caused by an application failure and one caused by a premature launcher failure.
The actual launcher scripts, as well as the library, are deployed at run-time into the work directory, where submit scripts are also generated. This directory is meant to be accessible by both the node submitting the job as well as the node launching the job.
- Parameters
script_path (Path) – A path to a script that is invoked as described above.
config (Optional[JobExecutorConfig]) – An optional configuration.
- Return type
None
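The argument order described for the launcher script can be sketched as a function that assembles the command line. build_launch_command and its parameter names are hypothetical; the text does not state whether the script is run directly or through an interpreter, so this sketch simply places the script path first.

```python
from typing import List, Optional, Sequence


def build_launch_command(script_path: str,
                         job_id: str,
                         executable: str,
                         arguments: Sequence[str] = (),
                         log_file: Optional[str] = None,
                         pre_launch: Optional[str] = None,
                         post_launch: Optional[str] = None,
                         additional_args: Sequence[str] = ()) -> List[str]:
    """Assemble the launcher invocation in the documented positional order."""
    return [
        script_path,
        job_id,                    # 1: the job ID
        log_file or "/dev/null",   # 2: launcher log file, /dev/null by default
        pre_launch or "",          # 3: pre-launch script, or empty string
        post_launch or "",         # 4: post-launch script, or empty string
        *additional_args,          # subclass-specific positional arguments
        executable,                # the job executable ...
        *arguments,                # ... followed by its arguments
    ]


command = build_launch_command("multi_launch.sh", "job-1", "/bin/date", ["-u"],
                               additional_args=["4"])
```

Here the value "4" stands for whatever a subclass would return from get_additional_args(); its meaning is an assumption of this example.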
- get_additional_args(job)[source]¶
Returns any additional arguments to be passed to the script after the first four mandatory ones.
- get_launch_command(job, log_file=None)[source]¶
See Launcher.get_launch_command().