The ExaWorks project is a community-led effort under the umbrella of the Exascale Computing Project that brings together a number of high-level HPC tools developed by ExaWorks members. We noticed that most of these tools, as well as many community projects, implement a software layer or library for interacting with HPC schedulers, so that core functionality is insulated from the details of how jobs are specified for each scheduler. We also noticed that each of these libraries was mostly limited to the schedulers running on resources that its team had access to. We used our combined knowledge to design a single API and library for this purpose, one that would be tested on all resources that the ExaWorks teams have access to. We then shared this API and library so that all high-level HPC tools can benefit from it.
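To make the idea of such a scheduler-insulating layer concrete, the following is a minimal sketch in Python. It is not the PSI/J API: the names `JobDescription`, `render_slurm`, `render_pbs`, and `render` are hypothetical and exist only for illustration, and the directives shown are a small, simplified subset of Slurm and PBS/Torque-style syntax.

```python
from dataclasses import dataclass


@dataclass
class JobDescription:
    """Scheduler-neutral job description (hypothetical, for illustration only)."""
    executable: str
    nodes: int = 1
    walltime: str = "00:10:00"  # HH:MM:SS


def render_slurm(job: JobDescription) -> str:
    """Translate the neutral description into a Slurm-style batch script."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --time={job.walltime}",
        job.executable,
    ])


def render_pbs(job: JobDescription) -> str:
    """Translate the same description into a PBS/Torque-style batch script."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -l nodes={job.nodes}",
        f"#PBS -l walltime={job.walltime}",
        job.executable,
    ])


# The core tool only refers to a scheduler by name; the syntax details
# live entirely in the per-scheduler renderers above.
RENDERERS = {"slurm": render_slurm, "pbs": render_pbs}


def render(job: JobDescription, scheduler: str) -> str:
    """Produce a submission script for the named scheduler."""
    try:
        return RENDERERS[scheduler](job)
    except KeyError:
        raise ValueError(f"unsupported scheduler: {scheduler!r}") from None


if __name__ == "__main__":
    job = JobDescription(executable="/bin/hostname", nodes=2, walltime="00:30:00")
    print(render(job, "slurm"))
    print(render(job, "pbs"))
```

In this sketch, supporting another scheduler means adding one renderer and one dictionary entry, while the code that builds `JobDescription` objects stays unchanged. That separation is what each tool's layer provides on its own, and what a shared library can provide once for all of them.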
A major difficulty in maintaining HPC software tools is that access to HPC resources is generally limited to a small number of clusters local to each team. Additionally, HPC resources vary widely depending on the institution that maintains them. Consequently, software that is tested only on the resources that an HPC tool development team has access to is likely to encounter problems on other HPC resources. A first step in addressing this problem is to pool the teams' respective resources for testing purposes. PSI/J takes this a step further by exposing an infrastructure that allows any PSI/J user to contribute test results to PSI/J easily and automatically. This is mutually beneficial: the PSI/J community gains assurance that PSI/J functions correctly on a wide range of resources, while users who contribute tests have a mechanism to make the PSI/J team aware of potential problems specific to their resources, so that the team can address them and PSI/J continues to function correctly on those resources.
SAGA, an Open Grid Forum (OGF) standard, abstracts away the specifics of diverse queue systems, offering a consistent representation of jobs and of the capabilities required to submit them to resources. RADICAL-SAGA implements a subset of the SAGA standard, exposing a homogeneous programming interface to the batch systems of HPC and HTC resources and enabling job submission and file transfer across resources. RADICAL-Pilot uses RADICAL-SAGA to acquire resources and to stage data to and from HPC platforms. PSI/J will replace RADICAL-SAGA, offering streamlined capabilities and a modern implementation.
The PSI/J API is in no small part inspired by the Java CoG Kit Abstraction API. The Swift workflow system made extensive use of, and improved on, the Java CoG Kit Abstraction API. While the original Swift (Swift/K) is no longer maintained, its two spiritual successors, Swift/T and Parsl, are.