Prototype Data Discovery Service – Analysis

Objective 

An investigation into the components for a prototype data discovery analysis service (PDDAS) that could be provided by DiRAC.

Summary of work undertaken 

An assessment of the possibilities for utilising:

  • Atempo tape archives at the Durham and Edinburgh sites
  • Virgo cosmological database at Durham
  • SciServer instance under preparation at Durham
  • Hecuba

A summary of high-impact PI requirements for such a service.

Outputs

Investigation of a Prototype Data Discovery Service – Final Report

Meta Data Addition – Edinburgh

Objective 

An assessment of interfacing the DiRAC DCS infrastructure with the ongoing efforts to renew the International Lattice Data Grid (ILDG). The aim of this work was to consolidate the UK lattice community's involvement in the wider international data-curation efforts for lattice QCD data.

Summary of work undertaken 

INCOMING

Outputs

Meta Data Addition – Durham

Objective 

To investigate the challenges in adding metadata to large data sets. To do so, the following two tools were initially set up as a proof of concept:

  1. A tool to automatically assign physical quantities, dimensions and units to data produced by an application.
  2. A tool to read data from files with binary formats and add metadata in a wrapper to these files.
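
As an illustration of the wrapper approach taken by the second tool, the sketch below (all names are hypothetical; this is not the Durham implementation) prepends a JSON metadata header to an opaque binary payload, so the original bytes survive unchanged and can be recovered exactly:

```python
import json
import struct

MAGIC = b"META"  # hypothetical marker identifying a wrapped file

def wrap(payload: bytes, metadata: dict) -> bytes:
    """Prepend a JSON metadata header to an opaque binary payload.
    Layout: magic | 4-byte big-endian header length | header | payload."""
    header = json.dumps(metadata).encode("utf-8")
    return MAGIC + struct.pack(">I", len(header)) + header + payload

def unwrap(blob: bytes) -> tuple:
    """Recover the metadata dict and the untouched original payload."""
    assert blob[:4] == MAGIC, "not a wrapped file"
    (hlen,) = struct.unpack(">I", blob[4:8])
    metadata = json.loads(blob[8:8 + hlen].decode("utf-8"))
    return metadata, blob[8 + hlen:]

# Example: annotate a raw binary snapshot with physical units
raw = b"\x00\x01\x02\x03"
meta = {"quantity": "density", "units": "kg m^-3", "dimensions": "M L^-3"}
blob = wrap(raw, meta)
recovered_meta, recovered_raw = unwrap(blob)
```

Because the payload is never parsed, a wrapper of this kind works for any binary format, at the cost of producing files that downstream tools must unwrap before reading.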
Summary of work undertaken 

Both of the above tools have been created; full details are given in the report below.

Outputs

Investigation of Meta Data Addition Durham – Final Report

FAIR Assessment

Objective 

An assessment of the FAIR compliance (Findability, Accessibility, Interoperability, and Reusability) of sets of (meta)data and digital research objects used within three exemplar DiRAC projects.

Summary of work undertaken 

A FAIR assessment was carried out against three projects within the DiRAC umbrella. Exemplar demonstrator projects from the ExoMol, Virgo, and United Kingdom Quantum Chromodynamics (UKQCD) collaborations were assessed, scoring 100%, 95% and 80% FAIR respectively using the DANS SATIFYD tool. The results of these assessments will inform recommendations to the projects on what they need to implement to make their assets more FAIR, and will inform guidance and documentation for future DiRAC Federation users, to be delivered in a later stage of the project.

Data Curation

Objective 

This work package explored several of the challenges around the longer-term storage, accessibility and usability of the extremely large data sets typically generated within DiRAC.

Summary of work undertaken 

This work package consisted of the following discrete sub-work packages, each of which explored one aspect of data curation:

Kickstart Your HPC Journey (Cluster Challenge)

Objective 

To produce a guide and supporting suite of resources as a toolkit enabling any institution in the community to set up and run an HPC cluster challenge.

Summary of work undertaken 

A 2-day student cluster challenge event was hosted at University College London. Through this, all resources needed to facilitate a cluster challenge were created, including advertising and promotional material, an example agenda and timetable, and technical challenges at two levels of difficulty. Practical co-ordination advice (training prerequisites, hardware, helpers, etc.) was gathered, and a lessons-learned document was created.

Outputs

A GitHub repository of all “Kickstart your HPC Journey” resources needed for planning and hosting a cluster challenge. These are expected to be released to the community in late 2022/early 2023.

Innovation Placements

Eight student placements were funded by this project, which delivered the following pieces of research:

Solar Flare Prediction
Objective

To build a tool for space weather and solar flare prediction.

Summary of work undertaken

In this project, a tool was created for space weather analysis and, in particular, solar flare forecasting. More precisely, code was developed that takes solar magnetogram observations (of magnetic fields emerging into the solar atmosphere) as input and calculates a series of measures related to the topology of the solar magnetic field. Particular signatures of these measures can be used for flare forecasting and compared to other satellite data.

More information can be found here.
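
To give a flavour of the simplest such measures (a toy sketch, not the project's code), the total unsigned and net magnetic flux of a magnetogram can be computed directly from the per-pixel line-of-sight field values:

```python
def unsigned_flux(magnetogram, pixel_area=1.0):
    """Total unsigned flux: sum of |B_z| over all pixels, times pixel area."""
    return pixel_area * sum(abs(b) for row in magnetogram for b in row)

def net_flux(magnetogram, pixel_area=1.0):
    """Net (signed) flux; a large imbalance can flag emerging flux regions."""
    return pixel_area * sum(b for row in magnetogram for b in row)

# Toy 2x2 magnetogram of line-of-sight field values (arbitrary units)
mag = [[100.0, -50.0],
       [-25.0, 75.0]]
total = unsigned_flux(mag)  # 250.0
net = net_flux(mag)         # 100.0
```

Topological measures of the kind described above go beyond these simple flux sums, but follow the same pattern of reducing a 2D magnetogram to scalar diagnostics that can be tracked over time.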

Natural Language Processing for Work Order Classification
Objective

To apply Natural Language Processing to classify work orders, automating the review of work-order information carried out by Senior Engineers on two London Underground rail lines.

Summary of work undertaken

Investigated the handling of large data sets and the application of methods such as text pre-processing and lemmatization. Used MLflow to experiment with and optimise logistic regression models on this data, achieving significant performance improvements through various methods of data cleaning and manipulation.
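
A minimal sketch of the text pre-processing step described above (the stopword list and example text are illustrative; the actual pipeline also used lemmatization and MLflow-tracked models):

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller set
STOPWORDS = {"the", "a", "an", "of", "on", "and", "to", "is", "at"}

def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Hypothetical work-order free text
cleaned = preprocess("Inspection of the track at the station")
# -> ['inspection', 'track', 'station']
```

Cleaned token lists like this are then vectorised (e.g. as bag-of-words counts) before being fed to a classifier such as logistic regression.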

Natural Language Processing Applied to Engineering Team Documents
Objective

To use Natural Language Processing on an Industry Partner’s asset data to provide a better understanding of asset performance.

Summary of work undertaken

Used deep-learning natural-language classifiers and transfer learning to label the text generated by the partner’s engineering teams, providing a better understanding of asset performance and of the costs involved in maintaining those assets.

Hospital Episode Statistics (HES) Database
Objective

To remove systematic errors from the NHS Hospital Episode Statistics (HES) database.

Summary of work undertaken

Machine learning was used to identify systematic errors and biases in the HES database. These were removed, and a cleaner version of the database was produced and made available for analysis and interpretation.

Deprivation Indicators on Asthma in Young Adults
Objective

To measure the effect of deprivation indicators on asthma in young adults.

Summary of work undertaken

The project used data from the IWCH data vault to develop a patient-centred analytics platform, applying analytics and ML methods to determine whether the main features affecting young adult asthma sufferers are linked to deprivation indicators.

Learning Health System for Children 

NHS Institute for Women’s and Children’s Health (IWCH)

Objective

To improve the learning health system for children.

Summary of work undertaken

The project built on a 10-year programme of work in the Children and Young People’s Health Partnership (CYPHP) to deliver story boards for platform and dashboard construction to enable population health management for children with long term conditions.

Data flows for the afferent and efferent arms of the Learning Health System were mapped and catalogued. Generic dashboards for clinical and population health management of children, adaptable for specific long-term conditions, were produced, along with place-based data maps illustrating areas of health need and risk.

Enabling Machine Learning Hybrid Simulations and Uncertainty Quantification 
Objective

To produce a new Scientific Machine Learning Benchmark.

Summary of work undertaken

A new Radiative Transfer Machine Learning Neural Network Application was produced and added to the SciML machine learning benchmark suite.

Quantum Computing 
Objective

To study the algorithm suggested by Jordan et al. (https://arxiv.org/abs/1112.4833) and to write a first implementation for the QLM using myQLM.

Summary of work undertaken

Following a learning phase to gain familiarity with the details of Jordan’s algorithm and with the basics of quantum computing and myQLM, the algorithm was implemented for the simplest case of a scalar field theory. Training materials for an Introduction to Quantum Computing workshop were developed, and a first iteration of the course was delivered in September 2022.
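
One ingredient of such an implementation is discretising the scalar field value at each lattice site onto a register of qubits. The sketch below is illustrative only (not the myQLM code produced in the placement): it maps a field value to the bit-string label of the nearest basis state:

```python
def field_to_basis_state(phi, phi_max, n_qubits):
    """Map phi in [-phi_max, phi_max] onto the nearest of 2**n_qubits
    uniformly spaced field levels, returned as a qubit bit string."""
    levels = 2 ** n_qubits
    step = 2 * phi_max / (levels - 1)
    index = round((phi + phi_max) / step)
    index = max(0, min(levels - 1, index))  # clamp out-of-range values
    return format(index, f"0{n_qubits}b")

# With 3 qubits there are 8 field levels between -1 and +1
lo = field_to_basis_state(-1.0, 1.0, 3)  # '000'
hi = field_to_basis_state(1.0, 1.0, 3)   # '111'
```

The register size controls the trade-off between field resolution and qubit count, which is one reason the scalar-theory case is the natural starting point.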

Community Workshops for Common Workflows

Objective

As a component of DiRAC’s Federation Project, two workshops took place in London in February 2022, in which DiRAC reached out to computational scientists in other fields to share experiences and explore the extent to which common workflows could form the basis for defining UKRI-wide computing services in the future UKRI Digital Research Infrastructure.

Summary of work undertaken

Each workshop took place at the Royal College of Physicians:

The first, on 9th February, focused on Memory Intensive (MI) workflows (i.e. memory-bound problems such as computational cosmology, currently served by the COSMA facility in Durham).

The second, on 23rd February, focused on Extreme Scaling (ES) workflows (i.e. problems requiring tightly coupled systems of CPUs/GPUs, such as lattice QCD, currently served by the Tursa facility in Edinburgh).

The workshops featured presentations and discussions prompted by the question: “What kind of machine is ideally suited to your problem?”

Output

Two reports were generated from the in-person workshops and can be found below:

The workshop presentation slides can be viewed below:

Workshop on Memory Intensive Workflows in Scientific Computing

Simon Hands, Mark Wilkinson, Ed Bennett

Alastair Basden, Aidan Chalk, Matthieu Schaller,
Debora Sijacki

Ian Bush, Peter Coveney, Sergei Dudarev, Alin-Marin Elena, Phil Hasnip, Scott Woodley

Ben Rogers, Stephen Longshaw, Spencer Sherwin

Parashkev Nachev, Robert Gray

Nils Wedi, Tobias Weinzierl

Rob Akers, Andy Davis, Shaun DeWitt, Stan Pamela

Workshop on Extreme Scaling Workflows in Scientific Computing

Simon Hands, Mark Wilkinson, Ed Bennett

Luigi del Debbio, Biagio Lucini,
Antonin Portelli, James Richings

Vassil Alexandrov, Alin-Marin Elena,
Dimitar Pashov, Andrea Townsend-Nicholson, Scott Woodley

Charles Laughton

Pier Luigi Vidale, Nils Wedi

Max Boleininger, James Cook, Andy Davis,
Shaun DeWitt, Leo Ma, Joseph Parker

Artificial Intelligence/Machine Learning Training Materials

Objective 

To create a body of practical hands-on training examples devoted to AI/ML methodologies and their application to real-world science problems. The target audience for the materials was the entire spectrum of DiRAC users, and the aim was to expose them to, and assist them in incorporating, these techniques into their coding to produce new science.

Summary of work undertaken 

SciML were engaged to work alongside DiRAC’s Training Team to produce a bank of practical hands-on worked examples demonstrating the implementation of various AI/ML methodologies and their application to real-world science problems.

The worked examples covered:

  • DiRAC Science areas (STFC Frontier Physics)
  • Materials Modelling
  • Fusion
  • National Health Service (simulated data sets)

with each using real or simulated data sets appropriate to the field.

Outputs

A bank of 26 practical worked examples (Python notebooks and data sets) and a PowerPoint slide for each example, which will be used to demonstrate the AI/ML methodologies in a taught lecture component.

The first iteration of this course is expected in late 2022/early 2023.

Foundation HPC-Skills Training Course

Objective 

To update the existing generic DiRAC HPC-Skills (Essentials Level) course to provide DiRAC-specific learning material, accompanying instructor-led teaching, and a bank of multiple-choice questions suitable for an online final assessment.

This work provided DiRAC-specific HPC-Skills training materials that are better structured for delivery to, and consumption by both new users and the wider science community.

Summary of work undertaken 

The Software Sustainability Institute (SSI) were engaged to work alongside DiRAC’s Training Team to select and tailor existing, open-source material (from the Software Carpentry Foundation) to DiRAC’s compute resources.

Outputs 

A series of six modules consisting of lecture-based and self-paced materials for the DiRAC core HPC-Skills training portfolio, with accompanying introductory videos and a question bank for assessment.

The first presentation of this course to users is expected in late 2022/early 2023.