Advanced Application & Systems Performance Analysis Tools

Objective 

To produce an application that monitors workload usage of hardware components.

Summary of work undertaken 

The assets from the Cloud Road-testing for UKRI Workloads work package were extended so that every platform includes monitoring, giving visibility into how well a workload uses the hardware assigned to that platform.

The Jupyter Notebook and Linux machine platforms each ran an isolated Prometheus Node Exporter and Grafana stack. A similar stack was run on the Slurm platform, extended with Slurm-specific information such as the jobs currently running.
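
As an illustration of the kind of signal such a stack exposes, the sketch below estimates CPU utilisation from two scrapes of the Node Exporter counter node_cpu_seconds_total. It is a minimal example under stated assumptions, not the project's implementation: the helper names parse_cpu_seconds and busy_fraction are hypothetical, and the sample values are invented for illustration.

```python
# Sketch only: estimate CPU utilisation from Node Exporter counters.
# The scrape contents below are made-up sample data, not project results.

SCRAPE_1 = """
node_cpu_seconds_total{cpu="0",mode="idle"} 100.0
node_cpu_seconds_total{cpu="0",mode="user"} 20.0
node_cpu_seconds_total{cpu="0",mode="system"} 5.0
"""

SCRAPE_2 = """
node_cpu_seconds_total{cpu="0",mode="idle"} 106.0
node_cpu_seconds_total{cpu="0",mode="user"} 23.0
node_cpu_seconds_total{cpu="0",mode="system"} 6.0
"""


def parse_cpu_seconds(text):
    """Sum node_cpu_seconds_total per mode from Prometheus text-format output."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("node_cpu_seconds_total{"):
            continue
        # Split "<name>{labels}" from the trailing sample value.
        labels_part, value = line.rsplit(" ", 1)
        labels = labels_part[labels_part.index("{") + 1 : labels_part.rindex("}")]
        mode = None
        for pair in labels.split(","):
            key, _, val = pair.partition("=")
            if key.strip() == "mode":
                mode = val.strip().strip('"')
        if mode is not None:
            totals[mode] = totals.get(mode, 0.0) + float(value)
    return totals


def busy_fraction(before, after):
    """Fraction of CPU time not spent idle between two scrapes."""
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 1.0 - deltas.get("idle", 0.0) / total


if __name__ == "__main__":
    usage = busy_fraction(parse_cpu_seconds(SCRAPE_1), parse_cpu_seconds(SCRAPE_2))
    print(f"CPU busy fraction: {usage:.2f}")
```

In a running stack, Prometheus would normally compute this ratio itself and Grafana would plot it; the arithmetic is spelled out here only to show what "how well a workload uses the hardware" means for one metric.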

Outputs

OpenStack Cloud Dashboard – Azimuth was modified to link to a Grafana instance that provides insight into the user's current usage and resource allocations. Some links are still missing, however, before this forms a full end-to-end prototype.

DiRAC-wide Dashboard – there is no working prototype for a DiRAC-wide dashboard, but it was shown architecturally how the same technologies could be adopted at each site and then aggregated centrally.
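
One way such central aggregation could work, assuming each site runs its own Prometheus server, is Prometheus federation, in which a central server scrapes selected series from each site's /federate endpoint. The sketch below only builds the scrape URLs. The site names, hostnames and the job="node" selector are hypothetical placeholders, and the report does not state that federation was the mechanism considered.

```python
# Sketch only: build Prometheus federation URLs for a set of sites.
# Site names, hostnames and the selector are hypothetical placeholders.
from urllib.parse import urlencode

SITES = {
    "site-a": "https://prometheus.site-a.example",
    "site-b": "https://prometheus.site-b.example",
}


def federate_url(base_url, selectors):
    """Return the /federate URL that selects the given series from one site."""
    # The endpoint accepts one match[] parameter per series selector.
    query = urlencode({"match[]": list(selectors)}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"


def federation_targets(sites, selectors):
    """Map each site name to the URL a central server would scrape."""
    return {name: federate_url(url, selectors) for name, url in sites.items()}


if __name__ == "__main__":
    for name, url in federation_targets(SITES, ['{job="node"}']).items():
        print(name, url)
```

A central Grafana could then chart the aggregated series with a per-site label, so each site keeps its own local stack while the DiRAC-wide view reads from a single place.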

Categories: DFED1