psij.executors.batch package¶
Submodules¶
psij.executors.batch.batch_scheduler_executor module¶
- class BatchSchedulerExecutor(url=None, config=None)[source]¶
Bases: JobExecutor
A base class for batch scheduler executors.
This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.
The submit script is generated using generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. This class can be instantiated by concrete implementations of a batch scheduler executor and the submit script generation can be delegated to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an opened file which points to where the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:
{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}
The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.
The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.
The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.
The submit script must perform two essential actions:
1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.
2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.
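The two required actions can be illustrated with a minimal sketch. The function name write_submit_script and the use of the $SLURM_JOB_ID environment variable to obtain the native id inside the script are assumptions made for illustration (a Slurm-like scheduler is assumed); this is not the script that any PSI/J executor actually generates.

```python
import io
import shlex
from typing import IO, List


def write_submit_script(launch_command: List[str], script_dir: str,
                        native_id_expr: str, out: IO[str]) -> None:
    """Write a minimal bash submit script (hypothetical helper, illustration only)."""
    # Quote each argument so that the launch command survives shell parsing.
    command = ' '.join(shlex.quote(arg) for arg in launch_command)
    out.write('#!/bin/bash\n')
    # Action 1: redirect the output of the executable part to <script_dir>/<native_id>.out
    out.write(f'{command} > "{script_dir}/{native_id_expr}.out" 2>&1\n')
    # Action 2: store the exit code of the launch command in <script_dir>/<native_id>.ec
    out.write(f'echo "$?" > "{script_dir}/{native_id_expr}.ec"\n')


buf = io.StringIO()
# '$SLURM_JOB_ID' is an assumption: the variable a Slurm job would use for its own id.
write_submit_script(['/bin/echo', 'hello world'], '/tmp/psij-work', '$SLURM_JOB_ID', buf)
print(buf.getvalue())
```

The exit code is captured immediately after the launch command so that no intervening command overwrites "$?".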
Additionally, where appropriate, the submit script should set the environment variable named PSIJ_NODEFILE to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.
Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.
The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
- Parameters
url (Optional[str]) – An optional URL pointing to a specific backend
config (Optional[BatchSchedulerExecutorConfig]) – A configuration for this executor instance; if none is specified, a default configuration is used.
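The handling of a registered job that no longer appears in the queue status output, as described in the class overview, can be sketched as follows. The function name resolve_missing_job and the plain-string state names are hypothetical stand-ins, and the sketch assumes that the exit code file exists; the real executor works with JobStatus objects and its internal logic may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_missing_job(script_dir: Path, native_id: str) -> Tuple[str, Optional[str]]:
    """Decide the final state of a job that is absent from the status output.

    Hypothetical helper: returns a (state name, error message) pair.
    """
    ec_file = script_dir / f'{native_id}.ec'
    out_file = script_dir / f'{native_id}.out'
    # The exit code file holds the exit code of the launch command.
    exit_code = int(ec_file.read_text().strip())
    if exit_code == 0:
        return 'COMPLETED', None
    # Non-zero exit code: try to read an error message from the script output file.
    message = out_file.read_text().strip() if out_file.exists() else None
    return 'FAILED', message


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / '42.ec').write_text('1\n')
    (work / '42.out').write_text('segmentation fault\n')
    print(resolve_missing_job(work, '42'))
```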
- attach(job, native_id)[source]¶
Attaches a job to a native job.
Attempts to connect job to a native job with native_id such that the job correctly reflects updates to the status of the native job. If the native job was previously submitted using this executor (hence having an exit code file and a script output file), the executor will attempt to retrieve the exit code and errors from the job. Otherwise, it may be impossible for the executor to distinguish between a failed and successfully completed job.
- cancel(job)[source]¶
Cancels a job if it has not otherwise completed.
A command is constructed using get_cancel_command() and executed in order to cancel the job. Also see JobExecutor.cancel().
- Parameters
job (Job) –
- Return type
None
- abstract generate_submit_script(job, context, submit_file)[source]¶
Called to generate a submit script for a job.
Concrete implementations of batch scheduler executors must override this method in order to generate a submit script for a job.
- Parameters
job (Job) – The job to be submitted.
context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see the description of this class.
submit_file (IO[str]) – An opened file-like object to which the contents of the submit script should be written.
- Return type
None
- abstract get_cancel_command(native_id)[source]¶
Constructs a command to cancel a batch scheduler job.
Concrete implementations of batch scheduler executors must override this method.
- abstract get_list_command()[source]¶
Constructs a command to retrieve the list of jobs known to the LRM for the current user.
Concrete implementations of batch scheduler executors must override this method. Upon running the command, the output can be parsed with parse_list_output().
- abstract get_status_command(native_ids)[source]¶
Constructs a command to retrieve the status of a list of jobs.
Concrete implementations of batch scheduler executors must override this method. In order to prevent overloading the queueing system, concrete implementations are strongly encouraged to return a command that only queries for the status of the indicated jobs. The command returned by this method should produce an output that is understood by parse_status_output().
- Parameters
native_ids (Collection[str]) – A collection of native ids corresponding to the jobs whose status is sought.
- Returns
A list of strings representing the command and arguments to execute in order to get the status of the jobs.
- Return type
List[str]
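As an illustration of a status command restricted to the indicated jobs, the following sketch builds a Slurm squeue invocation. It is a standalone function written for this example and is not necessarily the command that the Slurm executor in this package issues; the output format "%i %t" (job id and compact state) is a choice made here.

```python
from typing import Collection, List


def get_status_command(native_ids: Collection[str]) -> List[str]:
    """Build a status command that only queries the given jobs (illustrative sketch)."""
    # squeue accepts a comma-separated list of job ids with -j.
    ids = ','.join(native_ids)
    # --noheader suppresses the header line; -o selects the output columns.
    return ['squeue', '--noheader', '-o', '%i %t', '-j', ids]


print(get_status_command(['1001', '1002']))
```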
- abstract get_submit_command(job, submit_file_path)[source]¶
Constructs a command to submit a job to a batch scheduler.
Concrete implementations of batch scheduler executors must override this method.
- Parameters
job (Job) – The job being submitted.
submit_file_path (Path) – The path to a submit script generated using generate_submit_script().
- Returns
A list of strings representing the command and arguments to execute in order to submit the job, such as ['qsub', str(submit_file_path)].
- Return type
List[str]
- abstract job_id_from_submit_output(out)[source]¶
Extracts a native job id from the output of the submit command.
Concrete implementations of batch scheduler executors must override this method. This method is only invoked if the submit command completes with a zero exit code, so implementations of this method do not need to determine whether the output reflects an error from the submit command.
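A minimal sketch of such an extraction, assuming a Slurm-style submit command whose output has the form "Submitted batch job <id>". The regular expression and the choice of raising ValueError on unexpected output are assumptions of this example.

```python
import re

# Assumed output format of the submit command: "Submitted batch job 12345"
_SUBMIT_OUTPUT = re.compile(r'Submitted batch job (\d+)')


def job_id_from_submit_output(out: str) -> str:
    """Extract the native job id from the submit command output (illustrative sketch)."""
    match = _SUBMIT_OUTPUT.search(out)
    if match is None:
        # Only reached if the output does not have the assumed format.
        raise ValueError(f'Unrecognized submit output: {out!r}')
    return match.group(1)


print(job_id_from_submit_output('Submitted batch job 12345\n'))
```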
- list()[source]¶
Returns a list of jobs known to the underlying implementation.
See JobExecutor.list(). The returned list is a list of native_id strings representing jobs known to the underlying batch scheduler implementation, whether submitted through this executor or not. Implementations are encouraged to restrict the results to jobs accessible by the current user.
- parse_list_output(out)[source]¶
Parses the output of the command obtained from get_list_command().
The default implementation of this method assumes that the output has no header and consists of native IDs, one per line, possibly surrounded by whitespace. Concrete implementations should override this method if a different format is expected.
- Parameters
out (str) – The output from the “list” command as returned by get_list_command().
- Returns
A list of strings representing the native IDs of the jobs known to the LRM for the current user.
- Return type
List[str]
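The default format described above (no header, one native ID per line, optional surrounding whitespace) could be parsed as in the following sketch. It is one plausible reading of that description rather than the library's actual code; skipping blank lines is an assumption.

```python
from typing import List


def parse_list_output(out: str) -> List[str]:
    """Return the native IDs found in header-less, one-ID-per-line output (sketch)."""
    # Strip surrounding whitespace and ignore empty lines (the latter is an assumption).
    return [line.strip() for line in out.splitlines() if line.strip()]


print(parse_list_output('  101\n102  \n\n103\n'))
```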
- abstract parse_status_output(exit_code, out)[source]¶
Parses the output of a job status command.
Concrete implementations of batch scheduler executors must override this method. The output is meant to have been produced by the command generated by get_status_command().
- Parameters
out (str) – The string output of the status command as prescribed by get_status_command().
exit_code (int) – The exit code of the status command.
- Returns
A dictionary mapping native job ids to JobStatus objects. The implementation of this method need not process the exit code file or the script output file since it is done by the base BatchSchedulerExecutor implementation.
- Return type
Dict[str, JobStatus]
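A sketch of a status parser for output consisting of "<native id> <state code>" lines. To keep the example self-contained, it maps to plain state-name strings instead of JobStatus objects; the state code table and the decision to raise on a non-zero exit code are assumptions of this example.

```python
from typing import Dict

# Assumed mapping from compact scheduler state codes to generic state names.
_STATE_MAP = {
    'PD': 'QUEUED',
    'R': 'ACTIVE',
    'CD': 'COMPLETED',
    'F': 'FAILED',
    'CA': 'CANCELED',
}


def parse_status_output(exit_code: int, out: str) -> Dict[str, str]:
    """Map native job ids to state names (stand-in for JobStatus objects)."""
    if exit_code != 0:
        # Assumption: a failed status command is reported to the caller as an error.
        raise RuntimeError(f'Status command failed with exit code {exit_code}: {out}')
    statuses: Dict[str, str] = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        native_id, code = fields
        statuses[native_id] = _STATE_MAP.get(code, 'UNKNOWN')
    return statuses


print(parse_status_output(0, '1001 R\n1002 PD\n'))
```

Jobs that are missing from the returned dictionary are left to the base class, which consults the exit code file as described in the class overview.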
- abstract process_cancel_command_output(exit_code, out)[source]¶
Handle output from a failed cancel command.
The main purpose of this method is to help distinguish between the cancel command failing due to an invalid job state (such as the job having completed before the cancel command was invoked) and other types of errors. Since job state errors are ignored, there are two options:
1. Instruct the cancel command to not fail on invalid state errors and have this method always raise a SubmitException, since it is only invoked on “other” errors.
2. Have the cancel command fail on both invalid state errors and other errors and interpret the output from the cancel command to distinguish between the two and raise the appropriate exception.
- Parameters
- Raises
InvalidJobStateError – Raised if the job cancellation has failed because the job was in a completed or failed state at the time when the cancellation command was invoked.
SubmitException – Raised for all other reasons.
- Return type
None
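The second option can be sketched as follows. The two exception classes are local stand-ins so that the example runs on its own, and the marker text "already finished" is a hypothetical error message; a real implementation would match whatever the scheduler's cancel command actually prints.

```python
class InvalidJobStateError(Exception):
    """Stand-in for the exception of the same name in this module."""


class SubmitException(Exception):
    """Stand-in for the PSI/J submit exception."""


# Hypothetical text that the cancel command prints when the job is already done.
_ALREADY_DONE_MARKER = 'already finished'


def process_cancel_command_output(exit_code: int, out: str) -> None:
    """Interpret the output of a failed cancel command (option 2, illustrative)."""
    if _ALREADY_DONE_MARKER in out.lower():
        # The job completed before the cancel command ran; callers ignore this error.
        raise InvalidJobStateError()
    # Any other failure is reported as a generic error.
    raise SubmitException(f'Cancel command failed (exit code {exit_code}): {out}')
```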
- class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases: JobExecutorConfig
A base configuration class for BatchSchedulerExecutor implementations.
When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
- exception InvalidJobStateError[source]¶
Bases: Exception
An exception that signals that a job cannot be cancelled due to it being already done.
psij.executors.batch.cobalt module¶
Defines a JobExecutor for the Cobalt resource manager.
- class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases: BatchSchedulerExecutorConfig
A configuration class for the Cobalt executor.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
- class CobaltJobExecutor(url=None, config=None)[source]¶
Bases: BatchSchedulerExecutor
A JobExecutor for the Cobalt Workload Manager.
The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.
Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.
Creates a batch script with #COBALT directives when submitting a job.
Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.
- Parameters
url (Optional[str]) – This parameter is not used and is only provided for compatibility reasons.
config (Optional[CobaltExecutorConfig]) – An optional configuration for this executor.
- Return type
None
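The rendering of prefixed custom attributes into long-form directives can be sketched as follows. The helper name render_long_form_directives is hypothetical; in the package itself this rendering is part of submit script generation, and the details may differ.

```python
from typing import Dict, List


def render_long_form_directives(custom_attributes: Dict[str, str],
                                prefix: str, marker: str) -> List[str]:
    """Render attributes named '<prefix><name>' as '<marker> --<name>=<value>' lines."""
    lines = []
    for key, value in custom_attributes.items():
        if key.startswith(prefix):
            name = key[len(prefix):]  # drop the scheduler prefix, e.g. 'cobalt.'
            lines.append(f'{marker} --{name}={value}')
    return lines


# Attributes without the 'cobalt.' prefix are ignored.
print(render_long_form_directives({'cobalt.m': 'co', 'other.x': '1'}, 'cobalt.', '#COBALT'))
```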
- get_cancel_command(native_id)[source]¶
See BatchSchedulerExecutor.get_cancel_command().
- get_list_command()[source]¶
See BatchSchedulerExecutor.get_list_command().
- get_status_command(native_ids)[source]¶
See BatchSchedulerExecutor.get_status_command().
- Parameters
native_ids (Collection[str]) –
- Return type
- get_submit_command(job, submit_file_path)[source]¶
See BatchSchedulerExecutor.get_submit_command().
- process_cancel_command_output(exit_code, out)[source]¶
See BatchSchedulerExecutor.process_cancel_command_output().
This should be unnecessary because qdel only seems to fail on non-integer job IDs.
psij.executors.batch.escape_functions module¶
- bash_escape(o)[source]¶
Escape object to bash string.
Renders and escapes an object to a string such that its value is preserved when substituted in a bash script between double quotes. Numeric values are simply rendered without any escaping. Path objects are converted to absolute path and escaped. All other objects are converted to string and escaped.
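The behavior described above can be approximated by the following sketch. The name bash_escape_sketch and the exact set of escaped characters (backslash, double quote, dollar sign, backtick) are assumptions; this is not the library's implementation.

```python
import numbers
from pathlib import Path


def bash_escape_sketch(o: object) -> str:
    """Render a value for use between double quotes in a bash script (sketch)."""
    if isinstance(o, numbers.Number):
        # Numeric values are rendered without any escaping.
        return str(o)
    if isinstance(o, Path):
        # Path objects are converted to absolute paths before escaping.
        o = o.absolute()
    s = str(o)
    # Escape the backslash first so that later escapes are not escaped again.
    for ch in ('\\', '"', '$', '`'):
        s = s.replace(ch, '\\' + ch)
    return s


print(bash_escape_sketch('say "hi" to $USER'))
```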
psij.executors.batch.lsf module¶
Defines the LsfJobExecutor class and its config class.
- class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases: BatchSchedulerExecutorConfig
A configuration class for the LSF executor.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
- class LsfJobExecutor(url, config=None)[source]¶
Bases: BatchSchedulerExecutor
A JobExecutor for the LSF Workload Manager.
The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.
Uses the bsub, bjobs, and bkill commands, respectively, to submit, monitor, and cancel jobs.
Creates a batch script with #BSUB directives when submitting a job.
Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[LsfExecutorConfig]) – An optional configuration for this executor.
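Unlike the long-form style, the LSF directive in the example above uses a single dash and a space between name and value. A minimal sketch of that rendering, with the hypothetical helper name render_lsf_directives:

```python
from typing import Dict, List

_LSF_PREFIX = 'lsf.'


def render_lsf_directives(custom_attributes: Dict[str, str]) -> List[str]:
    """Render 'lsf.<name>' attributes as '#BSUB -<name> <value>' lines (sketch)."""
    return [
        f'#BSUB -{key[len(_LSF_PREFIX):]} {value}'
        for key, value in custom_attributes.items()
        if key.startswith(_LSF_PREFIX)
    ]


print(render_lsf_directives({'lsf.core_isolation': '0'}))
```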
- get_cancel_command(native_id)[source]¶
See BatchSchedulerExecutor.get_cancel_command().
bkill will exit with an error set if the job does not exist or has already finished.
- get_list_command()[source]¶
See BatchSchedulerExecutor.get_list_command().
- get_status_command(native_ids)[source]¶
See BatchSchedulerExecutor.get_status_command().
- Parameters
native_ids (Collection[str]) –
- Return type
- get_submit_command(job, submit_file_path)[source]¶
See BatchSchedulerExecutor.get_submit_command().
- parse_status_output(exit_code, out)[source]¶
Iterate through the RECORDS entry, grabbing JOBID and STAT entries, as well as any state-change reasons if present.
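A sketch of such an iteration, assuming that the status command produces JSON with a top-level RECORDS list whose entries carry JOBID and STAT keys. The overall JSON shape is an assumption of this example, and state-change reasons are left out because their field names are not given here.

```python
import json
from typing import Dict


def parse_bjobs_records(out: str) -> Dict[str, str]:
    """Map each JOBID in the RECORDS entry to its STAT value (illustrative sketch)."""
    data = json.loads(out)
    # A missing RECORDS entry is treated as "no jobs" (an assumption).
    return {str(record['JOBID']): record['STAT'] for record in data.get('RECORDS', [])}


sample = '{"RECORDS": [{"JOBID": "101", "STAT": "RUN"}, {"JOBID": "102", "STAT": "PEND"}]}'
print(parse_bjobs_records(sample))
```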
- process_cancel_command_output(exit_code, out)[source]¶
See BatchSchedulerExecutor.process_cancel_command_output().
Check if the error was raised only because a job already exited.
psij.executors.batch.pbspro module¶
psij.executors.batch.script_generator module¶
- class SubmitScriptGenerator(config)[source]¶
Bases: ABC
A base class representing a submit script generator.
A submit script generator is used to render a Job (together with all its properties, including JobSpec, ResourceSpec, etc.) into a submit script specific to a certain batch scheduler.
- Parameters
config (Optional[JobExecutorConfig]) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.
- Return type
None
- generate_submit_script(job, context, out)[source]¶
Generates a job submit script.
Concrete implementations of submit script generators must implement this method. Its purpose is to generate the content of the submit script. For an extensive explanation of the mechanism behind this process, see BatchSchedulerExecutor.
- Parameters
job (Job) – The job for which the submit script is to be generated.
context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see BatchSchedulerExecutor.
out (IO[str]) – An opened file-like object to which the contents of the submit script should be written.
- Return type
None
- class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)[source]¶
Bases: SubmitScriptGenerator
A submit script generator based on Mustache templates.
This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).
- Parameters
config (Optional[JobExecutorConfig]) – A configuration, which is passed to the base class.
template_path (Path) – The path to a Mustache template.
escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.
- Return type
None
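The idea of rendering a template with an escape function can be illustrated with a toy renderer. It is a deliberately small stand-in that only substitutes {{dotted.name}} tags from nested dictionaries; it is neither Pystache nor a full Mustache implementation, and the names render_template and escape_quotes are hypothetical.

```python
import re
from typing import Any, Callable, Dict

# Matches simple variable tags such as {{job.name}}.
_TAG = re.compile(r'\{\{\s*([\w.]+)\s*\}\}')


def render_template(template: str, context: Dict[str, Any],
                    escape: Callable[[object], str] = str) -> str:
    """Substitute {{a.b}} tags with escaped values from nested dicts (toy sketch)."""
    def lookup(match: 're.Match[str]') -> str:
        value: Any = context
        for part in match.group(1).split('.'):
            value = value[part]  # walk the nested dictionaries
        return escape(value)
    return _TAG.sub(lookup, template)


def escape_quotes(value: object) -> str:
    # Minimal escape function for this example: protect double quotes only.
    return str(value).replace('"', '\\"')


template = '#!/bin/bash\n#SBATCH --job-name="{{job.name}}"\n'
print(render_template(template, {'job': {'name': 'my "test" job'}}, escape_quotes))
```

Passing the escape function as a parameter mirrors the escape argument of the class, so that a different target shell could be supported by swapping the function.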
psij.executors.batch.slurm module¶
- class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶
Bases: BatchSchedulerExecutorConfig
A configuration class for the Slurm executor.
- Parameters
launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
- class SlurmJobExecutor(url=None, config=None)[source]¶
Bases: BatchSchedulerExecutor
A JobExecutor for the Slurm Workload Manager.
The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.
Uses the sbatch, squeue, and scancel commands, respectively, to submit, monitor, and cancel jobs.
Creates a batch script with #SBATCH directives when submitting a job.
Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.
- Parameters
url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[SlurmExecutorConfig]) – An optional configuration for this executor.
- get_cancel_command(native_id)[source]¶
See BatchSchedulerExecutor.get_cancel_command().
- get_list_command()[source]¶
See BatchSchedulerExecutor.get_list_command().
- get_status_command(native_ids)[source]¶
See BatchSchedulerExecutor.get_status_command().
- Parameters
native_ids (Collection[str]) –
- Return type
- get_submit_command(job, submit_file_path)[source]¶
See BatchSchedulerExecutor.get_submit_command().
psij.executors.batch.template_function_library module¶
- ALL: Dict[str, Callable[[...], Any]] = {'walltime_to_minutes': <function walltime_to_minutes>}¶
A dictionary of all template-accessible functions for the batch executor templating mechanism.
The dictionary maps function names to their implementations. All public functions in this module are present in this dictionary, and their keys are the same as their names.
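The name-to-function convention can be sketched as follows. The signature and body of walltime_to_minutes shown here are hypothetical (a timedelta converted to whole minutes); only the function's name and the structure of the dictionary come from the description above.

```python
from datetime import timedelta
from typing import Any, Callable, Dict, List


def walltime_to_minutes(walltime: timedelta) -> int:
    """Hypothetical implementation: convert a duration to whole minutes."""
    return int(walltime.total_seconds() // 60)


# Public functions exposed to templates; each key equals the function's name.
_PUBLIC_FUNCTIONS: List[Callable[..., Any]] = [walltime_to_minutes]
ALL: Dict[str, Callable[..., Any]] = {f.__name__: f for f in _PUBLIC_FUNCTIONS}

print(ALL['walltime_to_minutes'](timedelta(hours=1, minutes=30)))
```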
Module contents¶
A package containing infrastructure for implementing batch scheduler executors.