Modern Workload Management APIs (A New Era of Workload Management)

Fritz Ferstl
edited August 2022 in Altair HPCWorks

Fritz Ferstl, fferstl@altair.com

image

HPC IT Modernization and APIs

IT infrastructure modernization is sweeping through High Performance Computing (HPC) data centers. It is driven by a new class of HPC hardware and software components, such as Graphics Processing Units (GPUs) and containers, as well as by modern system management paradigms introduced by cloud computing, hyperconverged infrastructure and configuration management. Application Programming Interfaces (APIs) are an integral part of that change and need to evolve together with the infrastructure. This is true for HPC APIs in general and certainly for one of the most crucial HPC components, the workload management system (WLMS).

RESTless Peace with GraphQL and Streaming

RESTful APIs have been the de facto standard for interfacing with web-based services, and HPC workload management systems are no exception. Many HPC workload managers offer RESTful interfaces, including the industry-leading Altair workload management products. Yet it is now time to overcome some of the nagging deficiencies of REST APIs:

  • They require polling for information. REST alone has no event notification capability.
  • They are wasteful and inefficient. To find a small subset of information in a large dataset, you typically have to load the full dataset and filter it in the client, which leads to unnecessary data transfers and poor performance in the face of ever-increasing scale requirements.
  • Specific queries for information (e.g. for a single job) require dedicated API endpoints: the more queries, the more endpoints.
  • The syntax available for passing query details in the REST API URL is unwieldy and inflexible.

GraphQL and Streaming APIs address these issues elegantly.

Query and Mutate with GraphQL

GraphQL gives the client full control over the data transferred across the wire. A GraphQL client gets exactly the data it asks for, in the shape it expects. Even when the underlying dataset is huge, very specific information can be retrieved efficiently.

GraphQL provides a multitude of querying and filter options, such as:

  • You can query for characteristics of attributes in the data schema, e.g., whether an attribute has a certain value (say, a certain project name has been set in a project attribute) or whether an attribute exceeds a certain size (e.g. a memory requirement is larger than a desired value)
  • You can filter the transferred data so that only the attributes you are interested in are returned, such as job ID, job name, job owner and submit time (see the output in the middle of the screenshot above for an example)
  • It is easy to support pagination, i.e., to break down a large list of query matches into chunks that are transferred and processed one by one (e.g. displayed in a UI as you switch between pages)
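To make the three options above concrete, here is a minimal sketch of a client that filters on attributes, selects only four fields, and pages through the matches with a cursor. The schema is an assumption made for illustration: the `jobs` field, its `filter` argument, `memoryGbGreaterThan`, and the `nodes`/`pageInfo` layout are hypothetical names, not the API of any Altair product. The `make_fake_executor` helper is likewise only an in-memory stand-in for a server so that the sketch runs on its own.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical schema: field, argument and attribute names are illustrative only.
JOBS_QUERY = """
query Jobs($project: String!, $minMemGb: Int!, $first: Int!, $after: String) {
  jobs(filter: {project: $project, memoryGbGreaterThan: $minMemGb},
       first: $first, after: $after) {
    nodes { id name owner submitTime }
    pageInfo { hasNextPage endCursor }
  }
}
"""

SELECTED_FIELDS = ("id", "name", "owner", "submitTime")

Payload = Dict[str, Any]
Executor = Callable[[Payload], Dict[str, Any]]


def build_payload(project: str, min_mem_gb: int, first: int,
                  after: Optional[str] = None) -> Payload:
    """Build the JSON body a client would POST to a GraphQL endpoint."""
    return {
        "query": JOBS_QUERY,
        "variables": {"project": project, "minMemGb": min_mem_gb,
                      "first": first, "after": after},
    }


def iter_jobs(execute: Executor, project: str, min_mem_gb: int,
              page_size: int = 2) -> Iterator[Dict[str, Any]]:
    """Yield matching jobs page by page, following the cursor until exhausted."""
    after: Optional[str] = None
    while True:
        payload = build_payload(project, min_mem_gb, page_size, after)
        page = execute(payload)["data"]["jobs"]
        yield from page["nodes"]
        if not page["pageInfo"]["hasNextPage"]:
            return
        after = page["pageInfo"]["endCursor"]


def make_fake_executor(all_jobs: List[Dict[str, Any]]) -> Executor:
    """In-memory stand-in for a server; the cursor is a plain list offset."""
    def execute(payload: Payload) -> Dict[str, Any]:
        v = payload["variables"]
        matches = [j for j in all_jobs
                   if j["project"] == v["project"] and j["memoryGb"] > v["minMemGb"]]
        start = int(v["after"]) if v["after"] is not None else 0
        end = start + v["first"]
        # Only the selected attributes leave the "server".
        nodes = [{k: j[k] for k in SELECTED_FIELDS} for j in matches[start:end]]
        return {"data": {"jobs": {
            "nodes": nodes,
            "pageInfo": {"hasNextPage": end < len(matches), "endCursor": str(end)},
        }}}
    return execute
```

Against a real server, `execute` would instead send the payload as a JSON HTTP POST and return the decoded response. The paging loop stays the same.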

You can also use GraphQL to make changes to the data on the server (i.e., to “mutate” the data). In particular, you can add, modify or remove more than one element with a single request. This allows efficient bulk changes, something that typical RESTful interfaces, with one request per resource, do not offer.
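As a rough illustration of such a bulk change, the sketch below builds a single request that changes the priority of several jobs at once. The `updateJobs` mutation, its `ids` and `changes` arguments, and the `priority` attribute are hypothetical names chosen for this example. The actual schema of a given workload manager will differ.

```python
import json
from typing import Any, Dict, List

# Hypothetical mutation: one field that accepts a list of job IDs.
SET_PRIORITY_MUTATION = """
mutation SetPriority($ids: [ID!]!, $priority: Int!) {
  updateJobs(ids: $ids, changes: {priority: $priority}) {
    id
    priority
  }
}
"""


def build_bulk_mutation(job_ids: List[str], priority: int) -> Dict[str, Any]:
    """Build one request body that modifies all listed jobs in a single call."""
    return {
        "query": SET_PRIORITY_MUTATION,
        "variables": {"ids": list(job_ids), "priority": priority},
    }


# One JSON document, however many jobs are affected.
body = json.dumps(build_bulk_mutation(["101", "102", "103"], 50))
```

The mutation also selects `id` and `priority` in its response, so the client sees the resulting state of every changed job without issuing a follow-up query.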

Stay on Top with GraphQL and Streaming

The queries described above still require polling for information. In many cases polling is not efficient, and one would rather be notified as soon as information of interest changes. This is possible with both GraphQL and Streaming APIs, each of which has its sweet spot for particular use cases.

GraphQL provides so-called Subscriptions, which allow you to register for a change event and receive the desired data (again filtered and trimmed appropriately) when that event occurs. Suppose you want to know where your jobs are running as soon as they start. This is exactly the type of information a GraphQL subscription delivers crisply, without any polling.
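The job-start example could look roughly like the sketch below. It makes two assumptions the article does not state: that the server delivers subscriptions over WebSocket using the graphql-transport-ws message format (`subscribe`, `next`, `complete`, `error`), and that the schema offers a `jobStarted` subscription field with an `owner` argument. Both are hypothetical here. The sketch only builds and interprets messages. Opening the socket and the initial `connection_init` handshake are left to whatever WebSocket library you use.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# Hypothetical subscription field and attributes.
JOB_STARTED_SUBSCRIPTION = """
subscription JobStarted($owner: String!) {
  jobStarted(owner: $owner) { id name host startTime }
}
"""


def subscribe_message(op_id: str, owner: str) -> str:
    """Serialize the message that registers the subscription with the server."""
    return json.dumps({
        "id": op_id,
        "type": "subscribe",
        "payload": {"query": JOB_STARTED_SUBSCRIPTION,
                    "variables": {"owner": owner}},
    })


def started_jobs(raw_messages: Iterable[str], op_id: str) -> Iterator[Dict[str, Any]]:
    """Yield one job record per event pushed by the server for this subscription."""
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("id") != op_id:
            continue  # acks, pings, or messages for other subscriptions
        if msg["type"] == "next":
            yield msg["payload"]["data"]["jobStarted"]
        elif msg["type"] in ("complete", "error"):
            return  # the server ended this subscription
```

In use, `raw_messages` would be the stream of text frames read from the socket, so the loop blocks until the next job starts instead of polling.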

Streaming, on the other hand, provides a different avenue for staying on top of changes as they happen. You can register for different types of streams, each of which delivers a constant flow of messages pertaining to its stream type. You could, for example, subscribe to a job status change stream and calculate statistics such as average wait time or runtime across all jobs, along with their trend lines. Or you could record such job events in a time series database for later analysis.
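The average wait time example can be sketched as a small consumer that processes every message in a job status stream. The event layout (`jobId`, `status`, `timestamp`) and the status names `QUEUED` and `RUNNING` are assumptions made for this illustration. A real stream will define its own message format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class WaitTimeStats:
    """Running average of queue wait time, updated one stream event at a time."""
    submitted: Dict[str, float] = field(default_factory=dict)
    started: int = 0
    total_wait: float = 0.0

    def consume(self, event: Dict[str, Any]) -> None:
        # Hypothetical event layout: {"jobId": ..., "status": ..., "timestamp": ...}
        job_id, status, ts = event["jobId"], event["status"], event["timestamp"]
        if status == "QUEUED":
            self.submitted[job_id] = ts
        elif status == "RUNNING" and job_id in self.submitted:
            # Wait time is the gap between submission and start.
            self.total_wait += ts - self.submitted.pop(job_id)
            self.started += 1

    @property
    def average_wait(self) -> Optional[float]:
        """Average wait in the stream's time unit, or None before any job starts."""
        return self.total_wait / self.started if self.started else None
```

Because the statistics are updated incrementally, the consumer keeps state only for jobs that are still waiting. This suits a stream that never ends.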

The use cases for the two approaches overlap but, generally speaking, event subscriptions are best suited to reacting to specific events, while stream subscriptions are best used to process most if not all of the data arriving in the stream.

Integration Heaven

It is great to be empowered to implement a lot of functionality on top of APIs as flexible as GraphQL and Streaming. It is better still if you do not have to write any code to get the functionality you are looking for. A plethora of third-party tools can directly consume well-formatted data from GraphQL and Streaming APIs and deliver value to you immediately.

By way of example, let us just look at Grafana combined with time series databases like Prometheus. Integrated with a solid GraphQL and Streaming API implementation, these alone allow you to create insightful dashboards, potentially without any coding.

Security

Security in and around modern workload managers deserves its own article in this series. Suffice it to say here that well-suited implementation frameworks for GraphQL and Streaming APIs are easily integrated with standards like Open Identity and natively support the certification and encryption of the requests and payloads going across the wire. They are, therefore, an important building block of a modern HPC infrastructure security architecture.

GraphQL and Streaming in Altair HPC Products

At Altair we have begun shipping GraphQL and Streaming APIs in our leading-edge workload management systems:

  • Altair® Accelerator™ -- best-in-class workload, workflow and license management for the Chip Design and Verification industry
  • Altair® Grid Engine® -- versatile and scalable workload management proven in industries such as Life Sciences, Oil and Gas or Finance
  • Altair® PBS Professional® -- leading-edge workload management for the Manufacturing, Weather and Climate Simulation, Education and Government sectors

We have also been providing GraphQL APIs as part of our cloud management product suite, Altair® NavOps™. Best of all, we are providing the exact same APIs across our workload managers. Thus, it will be possible to interface with all our workload managers from a single code base, for example in a dashboard tool or a job submission and monitoring portal like Altair® Access™.

Over time we will add more and more functionality to these APIs to ensure they can encompass every bit of the key interaction profiles our customers have with our products. This will include rewriting the command-line interfaces to sit on top of these APIs, thereby using the same scalable implementation and proven security infrastructure no matter how you access our workload managers.

If they have not already done so, all HPC data centers should consider getting started on their infrastructure modernization. Altair can assist with the products and technologies described in this article and in further articles of this series.