Foundation HPC-Skills Training Course Oct-Nov 2023

This course will cover the fundamental skills needed to use a DiRAC system, including a version control system (Git), a workload manager (Slurm), and an introduction to concepts in software engineering.

Objective 

The learner will be able to:

  • Use the basic tools of the Unix environment, file management, and common editors
  • Implement a command script
  • Use the Git tool
  • Understand the principles of software design and testing
  • Use tools to demonstrate good networking practice
  • Submit a simple job script (a minimal submission sketch follows this list)
  • Understand the principles of code scaling
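
By way of illustration, submitting a simple job script of the kind covered in the course might look like the following minimal sketch. The script contents, resource requests, and the use of Python to drive sbatch are illustrative assumptions rather than course material.

    # Minimal sketch: write and submit a simple Slurm job script.
    # The script contents and resource requests are illustrative assumptions.
    import subprocess

    job_script = """\
    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --ntasks=1
    #SBATCH --time=00:05:00

    echo "Hello from $(hostname)"
    """

    with open("hello.sh", "w") as f:
        f.write(job_script)

    # On success, sbatch prints a line such as "Submitted batch job 12345".
    result = subprocess.run(["sbatch", "hello.sh"],
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())
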
Summary of work undertaken 

The course will run from the 10th of October to the 14th of November, in the afternoons from 12:30 to 16:30 BST (11:30–15:30 UTC) each day.

Registration 

Applications will close on the 29th of September. Please note that new DiRAC users will be prioritised, and places will be allocated on a first-come, first-served basis.

MPI Library

Objective 

Produce a digital repository for the sharing and archiving of benchmarking data for key DiRAC codes.

Summary of work undertaken 

A wiki was created within the DiRAC instance of Confluence, currently hosted at the University of Edinburgh, as a long-term repository for benchmarking data from the DiRAC systems. The wiki is intended to be updated as and when new benchmark results become available, for example during procurement activities or when new versions of applications are introduced. The repository has space for detailed information and comments from the benchmark runners to highlight special features of a run, as well as space for MPI profiling information from the run.

Initial data from existing benchmark runs has been loaded into the wiki.

Outputs

Creation of an MPI Library – Final Report

Automated Benchmarks (ReFrame)

Objective 

To utilise ReFrame as a single wrapper for the suite of existing DiRAC and UCL Tier 2 benchmarks, with the aim of providing a single set of benchmarks that can be run as needed following system upgrades.

Summary of work undertaken 

The following were successfully added to ReFrame (a minimal test sketch follows the list):

  • The benchmarks for Swift and Grid
  • The benchmarks for CP2K
  • The benchmarks for HPGMG, IMB, and Sombrero (the latter is a mini-app for Swift)
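
For illustration, a ReFrame regression test wrapping one of these benchmarks might look like the following minimal sketch; the class name, executable, and output patterns are assumptions rather than the project's actual tests.

    # Minimal ReFrame test sketch. The benchmark executable and the output
    # patterns matched below are hypothetical, not the project's actual tests.
    import reframe as rfm
    import reframe.utility.sanity as sn


    @rfm.simple_test
    class ExampleBenchmark(rfm.RunOnlyRegressionTest):
        valid_systems = ['*']            # run on any configured system
        valid_prog_environs = ['*']      # ...with any programming environment
        executable = './example_bench'   # hypothetical prebuilt benchmark binary

        @sanity_function
        def run_completed(self):
            # The test passes only if the completion marker appears in stdout.
            return sn.assert_found(r'\[RESULT\]', self.stdout)

        @performance_function('Gflops/s')
        def flop_rate(self):
            # Extract the figure of merit reported on the [RESULT] line.
            return sn.extractsingle(r'\[RESULT\]\s+(\S+)\s+Gflops/s',
                                    self.stdout, 1, float)

A test like this is then discovered and run with the reframe command-line driver, e.g. reframe -c example_bench.py -r.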

Work on the following benchmarks has progressed but is not yet complete, due to technical challenges:

  • Ramses, Sphng, and Trove

Outputs

Implementation of ReFrame for Benchmarks – Final Report

Artificial Intelligence/Machine Learning Benchmark

Objective 

To create a self-contained AI Benchmark/workflow in the domain of synthetic brain imaging.

Summary of work undertaken 

Training epochs were run for three provided model configurations on the UCL AI platform, on both a single GPU and multiple GPU devices. Several multi-day runs of ~100 epochs were completed. As the training scripts were configured to run for 100,000 epochs and each epoch takes around an hour or more, ‘full’ runs of the model were not performed.

The Python requirements and the package associated with the code were installed on the Cambridge HPC service (following the same set-up process documented in the repository README below).

Outputs

A public GitHub repository containing the open-source (GPL v3) release of the research code developed by King's College London. The GitHub repository (https://github.com/r-gray/3d_very_deep_vae) includes a GPL v3 licence file.

The README file in the repository contains full details of how to install the dependencies and Python package, and includes platform-dependent requirement specifications with pinned versions for the supported operating system and Python version combinations. There is also documentation on how to run the model training with the example configurations provided.

Development of an Artificial Intelligence-Machine Learning Benchmark – Final Report

Advanced Application & Systems Performance Analysis Tools

Objective 

To produce an application that monitors workload usage of hardware components.

Summary of work undertaken 

The assets from the Cloud Road-testing for UKRI Workloads work package were extended to ensure that every platform has monitoring that gives visibility into how well a workload is making use of the hardware assigned to it.

The Jupyter Notebook and Linux machine platforms both ran an isolated Prometheus Node Exporter and Grafana stack. A similar stack was run on Slurm, alongside Slurm-specific information such as the jobs currently running.
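
As a sketch of the visibility this provides, per-node hardware metrics exported by Node Exporter can be pulled from the standard Prometheus HTTP query API; the server address below is an assumption, while the endpoint and metric name are standard.

    # Sketch: query Prometheus for per-node CPU utilisation collected by
    # Node Exporter. The server address is an assumption; the /api/v1/query
    # endpoint and the node_cpu_seconds_total metric are standard.
    import requests

    PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address

    # Fraction of CPU time not spent idle, per node, over the last 5 minutes.
    query = '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'

    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        instance = series["metric"]["instance"]
        _timestamp, value = series["value"]
        print(f"{instance}: {float(value):.1%} CPU busy")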

Outputs

OpenStack Cloud Dashboard – Azimuth was modified to link to a Grafana instance that can provide insights into a user's current usage and resource allocations (although there are still some missing links in making a full end-to-end prototype).

DiRAC-wide Dashboard – there is no working prototype for a DiRAC-wide dashboard, but it was shown architecturally how dashboards could be deployed at each site and then aggregated centrally using the same technologies.

Authorisation Module

Objective 

To re-engineer an existing authorisation application to be suitable for use within the DiRAC ecosystem.

Summary of work undertaken 

A Data Access Controller was developed, with the intention that each HPC service hosts an instance of this service. It is responsible for querying instances of the Information Governance (IG) App, enabling local users to prove that they have permission to access a locally stored dataset by virtue of their participation in the project owning it.

An end-to-end workflow was successfully demonstrated, whereby users were able to register datasets from the IG App and create new shared directories on the local filesystem, with permissions on those directories automatically updated in response to changes in the IG App.
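
The shape of the permission-update step might be sketched as follows; the IG App endpoint, response format, and the mapping from projects to Unix groups are all assumptions for illustration, as the actual interfaces are described in the final report.

    # Sketch of the permission-sync step: ask the IG App which project group
    # may access a dataset, then update the group ownership of its shared
    # directory. The endpoint, response field, and group-naming scheme are
    # hypothetical.
    import shutil
    import requests

    IG_APP_URL = "https://ig-app.example.org/api"  # hypothetical endpoint

    def sync_dataset_permissions(dataset_id: str, directory: str) -> None:
        # Hypothetical query: which project group owns this dataset?
        resp = requests.get(f"{IG_APP_URL}/datasets/{dataset_id}")
        resp.raise_for_status()
        project_group = resp.json()["project_group"]  # assumed response field

        # Hand the directory to the project's Unix group so that project
        # members gain access via the group permission bits.
        shutil.chown(directory, group=project_group)

    sync_dataset_permissions("dataset-0001", "/data/shared/dataset-0001")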

Outputs

Investigation of an Authorisation Module – Final Report

oneAPI Assessment

Objective 

To explore and document the experiences of porting some representative codes to the oneAPI programming model.

Summary of work undertaken 

The project documented the experiences of porting some representative codes to one or other of two promising programming models: SYCL or OpenMP offload. Both programming models are supported by Intel oneAPI and by other commercial and open-source compilers.

Five candidate codes (OpenQCD, OpenMM, HemeLB, dGpoly3D, and AREPO) were selected and profiled, and kernels from them were ported. (An absolute performance comparison between programming models was not a goal of this work.)

The work also examined the experience of a group of research software engineers, most of whom were novices in SYCL or OpenMP GPU-offload programming.

Outputs

The final report from this piece of work is expected in Autumn 2022.

Closer

Objective 

To understand the multiple dimensions of prediction of concepts in social and biomedical science questionnaires.

Summary of work undertaken 

This work package extended the scope of the research tackled in the RCNIC project to:

  • Dive deeper into questions related to the size and quality of the training data, and how this affects the performance of the designed ML models.
  • Assess the performance of the trained ML models for automated tagging of question texts with the 16 top-level concept topics from existing thesauri, such as the European Language Social Science Thesaurus (ELSST), in ‘inference mode’, i.e. with new, unseen questionnaires that were not part of the training and validation sets.
  • Investigate new ML models (such as hierarchical approaches) for tagging question texts (and response domains) with the 120 second-level topics from ELSST (a minimal flat-tagging sketch follows this list).
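
A minimal version of the flat (non-hierarchical) tagging task might be sketched as follows; the example questions, topic labels, and model choice are illustrative stand-ins rather than the models evaluated in the report.

    # Sketch of flat topic tagging for question texts: TF-IDF features plus
    # a linear classifier. The questions and topic labels below are toy
    # stand-ins, not the project's training data or models.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    questions = [
        "How many hours per week do you usually work?",
        "In general, how would you rate your health?",
        "How often do you feel lonely?",
        "What is the highest qualification you hold?",
    ]
    topics = ["LABOUR", "HEALTH", "PSYCHOLOGY", "EDUCATION"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(questions, topics)

    # 'Inference mode': tag a question from an unseen questionnaire.
    print(model.predict(["Have you visited a doctor in the last 12 months?"]))
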
Outputs

Applying machine learning models to social or biomedical science questionnaires – Final Report

Decentralised Cloud Storage

Objective 

Investigate the potential of using the Storj Decentralised Cloud Storage (DCS) service for scientific data sharing and as a storage element within the Rucio scientific data management framework.

Summary of work undertaken 

– We shared scientific datasets using Storj DCS through Zenodo (e.g. https://doi.org/10.5281/zenodo.6369178).

– Upload and download performance for data on Storj DCS using Rclone was monitored (a timing sketch follows this list).

– We investigated how to configure Storj DCS as a storage element in Rucio, and performed a synthetic data transfer from a remote Rucio client.
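
The upload-timing part of the performance monitoring could be sketched as follows; the remote name and bucket are assumptions, and the rclone remote for Storj must already have been configured (e.g. via rclone config).

    # Sketch: time an upload to Storj DCS via rclone. The remote name
    # ("storj-dcs") and bucket are assumptions; "rclone copy" itself is the
    # standard invocation.
    import subprocess
    import time

    def timed_upload(local_path: str,
                     remote: str = "storj-dcs:benchmark-bucket") -> float:
        start = time.perf_counter()
        subprocess.run(["rclone", "copy", local_path, remote], check=True)
        return time.perf_counter() - start

    elapsed = timed_upload("dataset.tar")
    print(f"Upload took {elapsed:.1f} s")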

Outputs

– Workflow and scripts to share data on Storj DCS.

– Performance report (https://www.storj.io/resource/university-of-edinburgh-performance-report)

– Documentation of configuration for Storj as a Rucio storage element (https://app.archbee.com/doc/Nf9hb4wvtgW8RxTodjWQ6/8IATlmHE_1fC09IMsHzRN)