M Lovell @cosma-support, Durham
Introduction
Accessing storage efficiently, in a way that lets users store and manipulate their data with minimal interference from other users requesting the same resource, is a crucial element of modern HPC systems. In practice, heavily subscribed systems commonly receive many large read/write requests simultaneously, which places considerable pressure on the file system, with negative consequences for performance. In this blog post we describe our system for monitoring our Lustre file systems. It uses a series of programs originally written in Go by HPE and now maintained by GSI in Darmstadt, with some small additions by the COSMA support team.
The COSMA Lustre file systems
The COSMA family of machines hosts five storage arrays: cosma8 (~18 PB), snap8 (1.1 PB), cosma7 (3.5 PB), snap7 (0.43 PB) and cosma5 (1.55 PB). The cosma8, snap8, snap7 and cosma7 storage are DiRAC-funded, while cosma5 is for the exclusive use of Durham astrophysicists and their collaborators. All five systems have the same Lustre monitoring setup but are processed entirely separately.
The Monitoring Map
Our goal is to monitor how jobs on compute nodes (broadly defined) make requests of the storage servers. This includes finding out which users are associated with each job, and then displaying the results in the Grafana web browser app. The process is summarised by the following map, in which the types of server are indicated by colour (compute in green, storage in red, monitoring host in orange) and the various processes are shown as diamonds with an abbreviation of the process name inside.

Figure 1: map of the Prometheus lustre monitoring system. Rounded boxes denote servers (all bare metal) and grey diamonds denote processes. The processes key is given in the grey box.
The processes, broadly defined, are as follows:
- jobid_var (JV). Not strictly a process, this is the variable set on the compute nodes that tells the storage server to store each job's details for 15 minutes.
- lustre_exporter (LE-M, LE-O). This is a GSI-developed process, with one instance running on each storage server. The input flags are modified for metadata servers (LE-M) versus the object storage servers (LE-O). Monitoring statistics are made available to a P instance running on a dedicated monitoring server.
- Prometheus (P). This instance has four functions: i) receiving and storing the data from the storage servers; ii) providing data to the PCE (see below); iii) receiving and storing the results of PCE; iv) providing the results of both PCE and the LE-M/O to G.
- Prometheus_cluster_exporter (PCE). The results of LE-M/O as stored on P are read in and compared to the SLURM and GETENT databases to link users to jobs. The results are then fed back to P.
- Grafana (G). G reads data from P and displays it in a web app. It also sends email warnings during times of high use.
Now we describe the details for each of the five stages:
jobid_var
Computations are performed on login nodes, dedicated compute nodes, and a small number of dedicated interactive nodes. Each of these will make requests of the storage. In order to transmit information about the job to the storage servers, we set:

on all of our user-facing servers. Where a process is run as part of SLURM, which is the case for compute servers in our SLURM queues, the server will send the SLURM_JOB_ID; for other servers (most often the login nodes), it will send the user's UID plus the process name. The results are stored in the job_stats file within Lustre on the storage server.
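The command itself appears only as an image in the original post, so it is not reproduced here. A hedged sketch of a typical Lustre client configuration that produces the behaviour just described (the parameter values are illustrative, not necessarily COSMA's verbatim settings) is:

```shell
# Use the SLURM job ID as the Lustre job identifier where it is set.
lctl set_param jobid_var=SLURM_JOB_ID

# Fallback format used on recent Lustre versions when the environment
# variable is absent, e.g. on login nodes: "<executable name>.<uid>".
lctl set_param jobid_name=%e.%u
```

`lctl set_param` applies only until reboot; `lctl set_param -P` on the MGS is the usual way to make such a setting persistent.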
lustre_exporter
A systemd instance of LE-M/LE-O runs on each of the storage servers, parsing the job_stats file. The cadence of the service is 15 seconds. We have modified the GSI-supplied software to also log the storage server load averages; the load average is not part of Lustre, but is often a crucial component in diagnosing why access to the storage is poor.
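For reference, each entry in the job_stats file is a small YAML-like record per job ID. A hedged illustration of the kind of record LE-M/LE-O parses (the values are invented for illustration, and the exact fields differ between metadata and object storage targets) looks like:

```yaml
job_stats:
- job_id:        4181932   # SLURM_JOB_ID, or "<proc>.<uid>" for non-SLURM work
  snapshot_time: 1693826401
  read_bytes:    { samples: 512, unit: bytes, min: 4096, max: 1048576, sum: 268435456 }
  write_bytes:   { samples: 128, unit: bytes, min: 4096, max: 1048576, sum: 67108864 }
```

The exporter turns these counters into Prometheus metrics labelled by job ID, which is what later allows PCE to attach a user to each series.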
Prometheus
The P instance performs multiple roles: requesting data from LE-M/LE-O, listening for requests from PCE, requesting the results back from PCE, storing the database, and making the database contents available to Grafana. Data are stored for 10 days, which limits the size of the largest database, that of COSMA8, to ~15 GB.
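The scrape side of such a setup can be expressed in a few lines of prometheus.yml. The job names, hostnames and ports below are assumptions for the sketch, not our production values:

```yaml
scrape_configs:
  - job_name: lustre_exporter          # LE-M/LE-O instances on the storage servers
    scrape_interval: 15s               # matches the exporter cadence described above
    static_configs:
      - targets: ['mds1:9169', 'oss1:9169', 'oss2:9169']   # illustrative targets
  - job_name: prometheus_cluster_exporter
    static_configs:
      - targets: ['localhost:9846']    # PCE runs alongside P; port illustrative
```

The 10-day retention mentioned above would be set separately, via Prometheus's `--storage.tsdb.retention.time` flag.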
prometheus_cluster_exporter
PCE runs on the same server as P. It retrieves the information scraped by LE-M/LE-O from P, then runs:

and

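The two commands are shown as images in the original post and are not reproduced here. Conceptually, the mapping from a Lustre job ID to a user can be done with standard SLURM and NSS lookups along these lines (a hedged sketch of the idea, not PCE's actual invocation):

```shell
# Map a SLURM job ID to the submitting user via the SLURM accounting database...
sacct -j "$JOBID" --noheader --format=User

# ...and resolve the numeric UID from a "<proc>.<uid>" job ID to a username
# via the getent passwd database.
getent passwd "$NUMERIC_UID" | cut -d: -f1
```

PCE attaches the resulting usernames as labels to the job metrics before P scrapes the results back.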

Figure 2: map of current Grafana dashboards. The vertical dashed lines in the Job Metadata Operations Panel (top-left) denote regions in time where the associated warning has been activated.
Grafana
G receives the contents of the five P databases — one for each file system — and displays the database contents in its own dashboard. The five dashboards are only made available to systems administrators and not to users. We describe the contents of these dashboards below, alongside a discussion of how they can be used to diagnose abuse of the system.
Each dashboard features nine panels. The first three panels in the left-hand column cover the metadata operation rate, read throughput and write throughput, respectively, of SLURM jobs, which the PCE sums by user. The counterpart right-hand panels show the same properties for non-SLURM jobs, labelled 'Proc', which are mostly interactive jobs run on the login nodes; these panels differentiate between processes as well as users. These six panels cover the full output of PCE.
The bottom three panels instead show data from LE-M/LE-O. The bottom-left panel shows the number of operations performed by each process, both SLURM and non-SLURM. The panel to its right shows the I/O rate summed by storage server. The final panel shows the 1-minute load averages of the storage servers, for which we adapted the LE-M/LE-O script.
The primary cause of disruption on the system has been excessive SLURM job metadata operations. We have therefore enabled a warning on the associated panel that sends an email alert when the number of operations over a period of time exceeds a given threshold. We can then identify which users' jobs are contributing to the heavy load and take action to relieve the pressure if necessary. We do not have alerts on any of the other panels, but can use them to provide context when other file system access issues are reported.
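An alert of this kind ultimately reduces to a threshold on a PromQL expression evaluated by Grafana. A hedged sketch of such a rule (the metric name and threshold are purely illustrative, not the ones we use) might be:

```
# Fire when the per-user metadata operation rate, summed across targets,
# exceeds 10,000 ops/s over the evaluation window.
sum by (user) (rate(lustre_job_metadata_operations_total[5m])) > 10000
```

Because the expression is grouped by user, the alert itself identifies which users' jobs are responsible at the moment it fires.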


