Enhanced Data Streams from the Latest Accelerator Release
Altair® Accelerator™ (2021.1.0 release) introduced the Streaming Data Service (SDS). This is a Kafka-compliant data stream that allows Accelerator to be connected directly to a Kafka message broker. That first release published real-time scheduler metrics, enabling remote monitoring of the Accelerator main dispatch loop.
The new release (2021.2.0) builds on that by adding metrics for the compute hosts (taskers). This data is available in a new Kafka topic and can be charted alongside the existing scheduler metrics.
Collected data include the requested, available, used, and total values for cores, RAM, and slots. In the current architecture, the main server collects the data from the taskers and then publishes it as a series of per-tasker messages on the Kafka pipe. Another approach would be to have the taskers push the data directly to Kafka, and this may be adopted in a future release, but the central publish model is easier to configure.
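To make the shape of a per-tasker message concrete, here is a minimal sketch of decoding one in Python. The JSON encoding, the field names (`tasker`, `cores`, `ram`, `slots`) and the nesting are assumptions made for illustration only; the actual SDS message schema is not described here and should be checked against the Accelerator documentation.

```python
import json
from dataclasses import dataclass

# The three resources and four values named in the text.
RESOURCES = ("cores", "ram", "slots")
KINDS = ("requested", "available", "used", "total")


@dataclass
class TaskerSample:
    tasker: str
    metrics: dict  # e.g. {"cores": {"requested": 4.0, ...}, ...}


def parse_tasker_message(raw: bytes) -> TaskerSample:
    """Decode one per-tasker message (hypothetical JSON layout)."""
    doc = json.loads(raw)
    metrics = {
        res: {kind: float(doc[res][kind]) for kind in KINDS}
        for res in RESOURCES
    }
    return TaskerSample(tasker=doc["tasker"], metrics=metrics)


# A made-up message, standing in for what a Kafka consumer would receive.
example = json.dumps({
    "tasker": "host01",
    "cores": {"requested": 4, "available": 12, "used": 3, "total": 16},
    "ram": {"requested": 8192, "available": 57344, "used": 4096, "total": 65536},
    "slots": {"requested": 2, "available": 6, "used": 2, "total": 8},
}).encode()

sample = parse_tasker_message(example)
print(sample.tasker, sample.metrics["ram"]["total"])
```

In practice the `raw` bytes would come from whichever Kafka client is in use; the parsing step is kept separate so it can be tested without a broker.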
Here are a few examples of possible charts. We’re using the Altair® Panopticon™ visualization system to do this. Panopticon connects directly to Kafka, so we don’t need to worry about databases unless we’re interested in longer-term storage of the data stream.
This is a tree map chart showing tasker status. Each rectangle represents a tasker, and the color denotes its state. For example, yellow is a full tasker, green is empty, and pink is suspended. We’ve mapped RAMTOTAL to the tile area, which gives us an immediate visual status of our largest (by RAM) compute hosts and shows us how we’re using big-memory systems. It’s possible to switch to a 'cores' or 'slots' view, too, or to have multiple tree maps side by side.
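The encoding behind a tree map like this can be sketched in a few lines: each tasker's share of the total area is its RAMTOTAL divided by the sum across all taskers, and its color comes from a lookup on state. This is only an illustration of the idea, not how Panopticon implements it; the `name` and `state` keys, the state strings, and the grey fallback are assumptions.

```python
# Color mapping taken from the description above; the state strings are assumed.
STATE_COLORS = {"full": "yellow", "empty": "green", "suspended": "pink"}


def tile_layout(taskers, size_key="RAMTOTAL"):
    """Return each tasker's share of the tree map area and its tile color."""
    total = sum(t[size_key] for t in taskers)
    return [
        {
            "name": t["name"],
            "area_share": t[size_key] / total,
            "color": STATE_COLORS.get(t["state"], "grey"),  # fallback is an assumption
        }
        for t in taskers
    ]


# Made-up taskers; RAMTOTAL in GB.
taskers = [
    {"name": "bigmem01", "state": "full", "RAMTOTAL": 256},
    {"name": "host02", "state": "empty", "RAMTOTAL": 128},
    {"name": "host03", "state": "suspended", "RAMTOTAL": 128},
]

tiles = tile_layout(taskers)
for tile in tiles:
    print(tile)
```

Switching to a 'cores' or 'slots' view corresponds to passing a different `size_key`, provided the records carry that field.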
Taskers are naturally represented by tables and Panopticon can help with that too.
The table keeps track of the key tasker metrics but augments them with some visual elements. Here we see that we have a good number of jobs running but we’re not packing the hosts.
Combining Multiple Metrics
The data stream contains both the abstract job resources and what is actually being consumed on the tasker. For example, jobs may request 4G but use only 2G. Similarly, 4 cores could be requested but the load average may be 8, indicating an over-commitment of the tasker or an underestimate of the cores needed.
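The two examples above reduce to a ratio of consumed to requested. A minimal sketch follows; the function names and the 25% tolerance band are assumptions chosen for illustration, not values from Accelerator.

```python
def consumption_ratio(consumed: float, requested: float) -> float:
    """Ratio of what is consumed on the tasker to what the jobs requested."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return consumed / requested


def describe(ratio: float, tolerance: float = 0.25) -> str:
    """Label a ratio; the tolerance band is an arbitrary illustrative choice."""
    if ratio > 1 + tolerance:
        return "over-committed"   # consuming more than was requested
    if ratio < 1 - tolerance:
        return "over-requested"   # requested more than is being consumed
    return "well matched"


# The examples from the text: 4G requested, 2G used; 4 cores requested, load average 8.
ram_ratio = consumption_ratio(consumed=2.0, requested=4.0)
core_ratio = consumption_ratio(consumed=8.0, requested=4.0)

print(ram_ratio, describe(ram_ratio))
print(core_ratio, describe(core_ratio))
```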
While this data can be tracked at the job level, there’s an advantage to doing it at the tasker level, too. It may be more efficient for some jobs to underestimate their core needs, because they may be packed onto a host where other running jobs overestimated their needs (or have dropped back to a more single-threaded mode).
This chart shows RAM, L1-load, and cores used as three separate charts on a common time axis. Each is a stacked chart in which each strip represents a single tasker. In the top chart we can see that most taskers have over-requested RAM (light blue), with some estimating accurately (dark blue), and we can see how this shifts over time. The single red strip identifies a tasker that had a memory overcommit. We can drill into this strip to identify the tasker, then use the tasker name and time window to query the jobs database and identify which jobs were present.
Similarly, for the bottom chart we color the strips according to how closely the load average matches the requested cores. We use Panopticon’s ability to process time series data to generate a new stream, Load Factor (the ratio of L5 load to cores requested), and use that to color the chart.
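The Load Factor derivation can be sketched outside Panopticon as a simple transformation over a time series. The tuple layout, the function names, and the 0.75 threshold for coloring a strip red are assumptions for illustration; in the chart itself the calculation is done by Panopticon.

```python
def load_factor_stream(samples):
    """Yield (timestamp, load_factor) from (timestamp, l5_load, cores_requested).

    Samples with no cores requested are skipped to avoid dividing by zero.
    """
    for timestamp, l5_load, cores_requested in samples:
        if cores_requested <= 0:
            continue
        yield timestamp, l5_load / cores_requested


def strip_color(load_factor: float, under_commit_below: float = 0.75) -> str:
    """Red when far fewer cores are busy than were requested, blue otherwise.

    The threshold is a made-up value for this sketch.
    """
    return "red" if load_factor < under_commit_below else "blue"


# Made-up samples for one tasker: the load stays at 8 while the request drops from 16 to 8.
samples = [(0, 8.0, 16), (60, 8.0, 16), (120, 8.0, 0), (180, 8.0, 8)]

colored = [(ts, lf, strip_color(lf)) for ts, lf in load_factor_stream(samples)]
for row in colored:
    print(row)
```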
We’re generally more tolerant of core over-commit, so we’ve used red to denote under-commit. On the far right we can see the transition from a workload that requested more cores than it needed (100 jobs requesting 4 cores each, the red peak) to one that used the correct request of 2 cores. The second peak is half the size (~200 cores requested rather than ~400) and the shading is blue.
Horizon charts are another good way of looking at multiple time series and seeing correlations across those streams. This chart has L5 load plotted for each tasker. We can see where and when things get busy.
Of course, we can do some fun stuff too. Here’s a chart showing taskers that have had a recent change in load with jobs.