Nvidia Hackathon

Accelerating your code for the future

On the 29th & 30th of June, DiRAC is holding a 2-day virtual hackathon, supported by Nvidia. Participants will have access to Nvidia’s latest A100 cards, with at least 4 cards per team in a single node.

As A100 cards will soon be available as part of the new DiRAC-3 services, this is an ideal opportunity to be ready to take advantage of them as soon as they come online this October at Cambridge and Edinburgh.

The hackathon will provide you with access to Nvidia’s expertise and allow you to explore the latest GPU hardware. With 6912 cores and up to 19.5 TFLOPS per card, the A100 GPU cards have huge potential for research.

It is an opportunity to assess the potential of GPUs for your research in a safe and supportive environment, and to gather the evidence needed for applications to the imminent 14th RAC call for DiRAC allocations.

We welcome newbies and old hands, and each team will have a DiRAC Research Software Engineer to support their progress. There will be an opportunity for training if required.

Further information about DiRAC Hackathons can be found here.

We would encourage any research team considering GPU acceleration for their software to attend.


This will be a virtual event, so all participants can work from the comfort of their own home, or anywhere in the world.


The Hackathon will be held over 2 days on the 29th & 30th of June. Basic CUDA C/C++ training can be made available prior, if required.


If you are interested, please complete an application form and submit it to richard.regan@durham.ac.uk by Friday 11th of June:


With thanks to Birmingham University for granting us access to the Baskerville Tier 2 HPC:


4x A100 40GB cards per node. The cards are on the HGX-100 board, which provides NVLink between the GPUs and PCIe-4 back to the host system. The host has 512GB RAM and there is some local NVMe scratch on each host as well.

All nodes are equipped with an HDR port (full 200Gb, as PCIe-4) and 25Gb Ethernet.

AMD GPU Hackathon

Accelerating your code for the future

AMD and DiRAC have come together to put on a 3-day event to explore the potential of AMD GPUs for research. The event will focus on porting and optimising your existing GPU code using the newest AMD development suite.


This will comprise a morning of presentations, followed by in-depth, supplemental, self-paced online materials. Topics covered will be a basic introduction to ROCm, porting CUDA code, and using multiple GPU cards.


This will be a virtual event, so all participants can work from the comfort of their own home, or anywhere in the world. You will be given access to AMD’s brand-new, up-to-date cluster.

Each team will be allocated their own GPU server with up to six of the latest AMD GPU cards. AMD experts from around the world will be on hand to help you get the most out of your code.

“MI100 accelerators supported by AMD ROCm™, the industry’s first open software platform, offer customers an open platform that helps eliminate vendor lock-in, enabling developers to enhance existing GPU codes to run everywhere. Combined with the award winning AMD EPYC™ processors and AMD Infinity Fabric™ technology, MI100-powered systems provide scientists and researchers platforms that propel discoveries today and prepare them for exascale tomorrow.” AMD.com


The event will be held as a half-day training session on the 21st of January, followed a week later, on the 28th-29th, by a 2-day hackathon with support from AMD experts. There will be access to computer systems after the training day so that everyone can prepare prior to the event.


If interested, please fill in the application form and submit it to richard.regan@durham.ac.uk by Monday the 4th of January.

Intel Hackathon – Training Day

This full-day workshop will focus on performance analysis. A short overview of Intel Parallel Studio will be given. Compiler options are crucial for optimal performance, so a short overview of optimization flags will be provided. Performance bottlenecks, and how to detect them, will be discussed in preparation for the hands-on sessions.

The training day will be on the 2nd of September, and will be open to all attendees of the Intel hackathon.


Students will learn about analysis types for HPC, threading analysis and microarchitectural analysis. We will also address general tuning methodologies, common parallel bottlenecks and how to solve them.


Basic understanding of parallel programming paradigms and C/C++ or Fortran programming


9:45 - 10:15  Intel® Parallel Studio 2020 Overview – outlook on oneAPI
10:15 - 11:30  Intel® Compiler Overview
Application Performance Snapshot (APS)
First step on code analysis – points user to other tools
12:00 - 13:00  Lunch Break


Intel MPI
New features including extended support for MPI and Threads.


Intel® Advisor Introduction including Roofline Analysis
Vectorization Analysis and estimate on optimization potential
14:45 - 15:15  Coffee Break
15:15 - 15:45  Demo Advisor Roofline
Estimating optimization potential


Intel® VTune™ Amplifier
Most powerful analysis tool for profiling on Intel Hardware. Some simplified features also available in APS.
17:00 - 17:30  Wrap-Up – Questions and Answers

We look forward to seeing you there.

Intel 2020 Hackathon

Optimisation and workflows for the future.

This year’s DiRAC Day will be preceded by a hackathon sponsored by Intel. The event will focus on how to optimise your existing code using the newest Intel development suite, including Intel’s new oneAPI, which corresponds to the open industry specification set by oneapi.com.

So whether you just want to get the best out of your C/C++ code, or you are looking to offload parts of your code to an accelerator, this is the place to be.

The event will be held as a 1-day training session on the 2nd of September, followed a week later, on the 8th-9th, by a 2-day hackathon with support from Intel experts. There will be access to computer systems from the training day until sometime after the event.

If interested, please fill in the application form and submit it to richard.regan@durham.ac.uk by Tuesday 25th of August (extended deadline).

ARM/Mellanox Hackathon

This was our first joint DiRAC-ARM Mellanox Hackathon, held in September 2019 at the University of Leicester (prior to the DiRAC Day activities). It was open to all users but targeted at the DiRAC groups with the greatest readiness to investigate the possibilities of Mellanox’s BlueField technology. This 3-day event was structured as a mixture of expert presentations and user code development time, and provided an introduction to BlueField and the ARM development environment. The event ended with the teams presenting their results at the DiRAC Day conference.

In total six teams from the DiRAC community attended. It was a great success, with all teams reporting that they would participate in any future hackathon. Details of the teams and results are below.


Led by Rosie Talbot: Rt421@cam.ac.uk

AREPO is a massively parallel, cosmological magnetohydrodynamics simulation code that includes an N-body solver for gravity, with hydrodynamics solved on a moving mesh. The code can simulate a wide range of astrophysical systems, from modelling representative volumes of the Universe down to understanding planet formation.

AREPO is written in C and includes communication instructions using standardised MPI. Additionally, it uses open source libraries: GSL, GMP, FFTW and HDF5.

AREPO has been heavily optimized; tests of the scaling capabilities and parallelization of the code show excellent performance. In benchmarking by the DiRAC facility, it is one of the best performers among cosmological codes in terms of speed, and it has highly parallel I/O which accounts for < 0.5% of the total run time. The code has highly optimized domain decomposition and work balance algorithms and has been shown to scale well up to thousands of cores.

We are open to exploring various possibilities to improve code performance through Mellanox technologies. The code has a large number of MPI calls/functions that could instantly benefit from Mellanox SHARP, while it may also be possible to offload subgrid physics modules or a subset of functions to the BlueField chips.

Speeding up the code will have an impact across all of our research areas, which span from understanding star formation in dwarf galaxies to modelling the cosmic evolution of large-scale structure.

We are open to using both. Closed source, but info available: http://www.mpa-garching.mpg.de/~volker/arepo/


Led by Thomas Guillet: T.A.Guillet@exeter.ac.uk

The AREPO code simulates general self-gravitating fluid dynamics for astrophysics, and is used to study a wide range of problems, including the formation and evolution of galaxies, turbulence in the interstellar medium, the formation of the first stars or the interaction of black holes with their surrounding galactic environment. AREPO-DG is a specific experimental solver of the AREPO code implementing a high-order discontinuous Galerkin (DG) method on adaptive mesh refinement (AMR) grids, applied to ideal compressible magnetohydrodynamics (MHD) flows (http://adsabs.harvard.edu/abs/2019MNRAS.485.4209G).

The AREPO code is used by a number of research groups in Germany (Munich, Heidelberg, Potsdam), the UK (Cardiff, Durham), and the US (Harvard, MIT, Chicago). AREPO-DG is experimental at this stage, and primarily used and developed by the primary contact.

AREPO-DG is compute-intensive due to its high-order scheme. Most operations are small dense linear algebra operations, benefitting from vector units. However, memory bandwidth is also important, because at intermediate scheme orders (most useful in practice in astrophysics) the arithmetic intensity of the scheme is around 5-10 FP ops/byte, which is still stressing the DRAM bandwidth on most x86 architectures. Some numerical ingredients in the code are also more sensitive to DRAM bandwidth.

I would expect the ThunderX2 architecture to help make the code closer to compute bound, thanks to its increased architectural byte/FLOP, after some cache optimizations in the code. I am also interested in exploring how the Mellanox Bluefield technology can improve extreme scaling of the code.
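The memory-bound versus compute-bound argument above can be illustrated with a simple roofline estimate, in which attainable performance is the smaller of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity. The Python sketch below is illustrative only: the two machine profiles are hypothetical numbers chosen to show the effect of a higher byte/FLOP ratio, not measured figures for any x86 or ThunderX2 system, and the function names are not taken from AREPO-DG.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute bound."""
    return peak_gflops / bandwidth_gb_s


# Hypothetical machine profiles (assumed values, NOT vendor or measured data).
machines = {
    "low byte/FLOP node": {"peak_gflops": 2000.0, "bandwidth_gb_s": 100.0},
    "high byte/FLOP node": {"peak_gflops": 1000.0, "bandwidth_gb_s": 250.0},
}

# Arithmetic intensities quoted in the text: around 5-10 FP ops/byte.
for name, m in machines.items():
    ridge = ridge_point(m["peak_gflops"], m["bandwidth_gb_s"])
    for intensity in (5.0, 10.0):
        perf = attainable_gflops(m["peak_gflops"], m["bandwidth_gb_s"], intensity)
        regime = "memory bound" if intensity < ridge else "compute bound"
        print(f"{name}: AI={intensity:4.1f} FLOP/byte -> {perf:7.1f} GFLOP/s ({regime})")
```

With these assumed profiles, a kernel at 5-10 FLOP/byte sits below the ridge point of the first machine and above that of the second, which is the kind of shift the paragraph above anticipates.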

Any performance improvement obtained over the hackathon will directly benefit upcoming AREPO-DG simulations of MHD turbulence, that are being prepared in conjunction with corresponding DiRAC and STFC proposals. These simulations aim at understanding the amplification of magnetic fields in turbulent flows, a candidate mechanism to explain the ubiquitous magnetic fields observed today in the Universe. In addition, these simulations will help understand the numerical properties, performance characteristics, and implementation techniques of high-order schemes, and will contribute to their wider application to astrophysics, but also to broader domains of computational fluid dynamics where DG schemes are gaining traction.

TEAM: DiRAxion

Led by Asier Lopez-Eiguren: asier.lopezeiguren@helsinki.fi

Our code solves a second order differential equation on a 3D discrete lattice in order to analyse the evolution of a scalar field with N components. The main code, ON, consists of 4000 lines of code and the Latfield2d library of 6000 lines; both are written in C++ and can be downloaded from the links below.


As we described in our DiRAC-2.5y Director’s Discretionary time application, bigger simulations are necessary to close a debate about axion dark matter production from axion strings. We were able to run 4k^3 on DIaL, but bigger simulations will improve the finite-size scaling analysis even further. In order to achieve this goal, we have to improve the performance of our code and we think that the Hackathon will help us.

The improvement of the ON code will help us to analyse axion strings in more detail. Axion strings are an important source of dark matter axions, and their density feeds into the calculation of the axion DM relic density. Moreover, the improvement of the Latfield2d library will help to enlarge field theoretical simulations and will therefore help to solve many problems related to finite-size effects.

TEAM: GRChombo

Led by Dr Kacper Kornet: kk562@damtp.cam.ac.uk

AMR code for general relativity simulations. Built using C++, MPI, HDF5. For best performance GRChombo can also use Intel intrinsics.

Currently, typical GRChombo runs on DiRAC systems use 16-32 KNL nodes with around 90GB of memory per rank.

Check the performance characteristics of the code on the ARM architecture, and especially learn how to vectorize templated C++ code. That would probably require an equivalent of Intel intrinsics. I would also like to learn about the Mellanox BlueField architecture in the context of speeding up MPI communication in the code.


Led by Matthew Bate: M.R.Bate@exeter.ac.uk

Smoothed particle hydrodynamics code, for studying astrophysical fluid dynamics. Not open source (private GIT repository hosted on BitBucket). About a dozen active users. Parallelised using both OpenMP and MPI, usually run in hybrid MPI/OpenMP mode. Involves a lot of memory access. Typically run on ~256-2048 cores. Unsure what performance gains might be. Speed improvements would lead to larger parameter space and/or higher resolution calculations. Attended the ARM hackathon in January.

TEAM: Seven-League Hackers (SLH)

Led by Dr Philipp Edelmann: philipp.edelmann@ncl.ac.uk

Their code, the Seven-League Hydro (SLH) code, is a finite-volume hydrodynamics code, mainly intended for use in stellar astrophysics. Its most distinguishing feature is the fully implicit time discretization, which allows it to run efficiently for problems at low Mach numbers, such as the deep interiors of stars. The resulting large non-linear system is very challenging to solve and needs Newton-Krylov methods to be efficient.

It is designed with flexibility in mind, supporting arbitrary structured curvilinear grids, different equations of state and a selection of other physics modules (nuclear reactions, neutrino losses, …).

More information about the code at: https://slh-code.org/features.html

The code currently consists of 72000 lines of (mostly) Fortran code.

Most of the code is written in modern Fortran 90, with certain select features from F2003 and F2008. Small parts are written in C, mostly code interacting with the operating system. This is a deliberate choice to ensure maximum portability across compilers and HPC platforms.

The code needs any implementation of BLAS and LAPACK. Parallelisation is done via MPI and/or OpenMP.

The code is currently not open source, but our group is open to sharing the code within collaborations.

SLH shows excellent scaling even on very large clusters. The largest tests so far were run on a Blue Gene/Q with 131 072 cores.

The iterative linear solvers are largely memory bandwidth dominated. This prevents us from reaching the performance of more floating-point dominated codes. By making sure SLH runs well on ARM systems, we would have an ideal architecture to efficiently run our simulations.

Our simulations of the stellar interior need to cover a long simulation time in order to extract quantities such as wave spectra and entrainment rates. The improved performance would enable us to extract more detailed results and cover a wider range of parameters in our models.

https://git.slh-code.org (access granted on request)

TEAM: The arm of SWIFT

Led by Dr Matthieu Schaller: schaller@strw.leidenuniv.nl

This is an astrophysics code: gravity + hydrodynamics using SPH. It uses C (GNU 99). Libraries: HDF5 (parallel), FFTW (non-MPI), GSL, MPI (standard 3.1).

Get the vectorized aspects of SWIFT to use the ARM intrinsics. Possibly tweak the I/O, which has until recently been problematic on the system.

Ability to run efficiently on more architectures. The current Catalyst system is ideal for running middle-sized simulations and we would like to exploit it as much as possible. Some small performance boost is necessary before this becomes a real alternative to the systems we currently use.

Mostly ARM, as earlier tests of BlueField showed that SWIFT is not bound by the speed/latency of the global MPI comms. We’d still be interested in discussing options with Mellanox experts.



ARM Hackathon (Leicester) Summary

Josh Borrow (JB), Alexei Borrisov (AB), Matthieu Schaller (MS)
STFC Colleagues
James Clark (JC), Aidan Chalk (AC)


The ARM Hackathon at Leicester was very helpful, and we were able to fully build and test the SWIFT code on the platform during the week. The assistance and knowledge of our colleagues from the STFC Hartree centre was invaluable throughout the four days. Work was started and is now in the testing phase for hand-written NEON intrinsics for the core routines in the code, tailored specifically for the ARM ThunderX2 platform. SWIFT shows good strong-scaling performance on the ARM system.

The whole team had a very positive experience at the ARM Hackathon. It was incredibly useful to have the ARM team there to immediately assist us with issues that arose throughout the few days. We are looking forward to running a production science simulation on the Leicester machine if at all possible.

Test Case

In the following, we consider the EAGLE_25 test case (available in the SWIFT repository) with 376^3 (53M) particles. This includes 53M hydrodynamics particles, with 106M gravity particles. All runs presented below (unless otherwise stated) use hydrodynamics (with the Gadget-2 cross-compatibility scheme), cosmology, and gravity. 1024 steps were chosen as this includes a reasonable (~10) number of long steps, and many short steps. These could be completed in approximately 15 minutes on a single node of the ARM system.

Initial Performance

The plot below shows the initial performance on one single node with the non-MPI version of SWIFT.

The cumulative time spent is shown (i.e. lower is better) for three combinations. ARM clang (v19.0.2; henceforth we drop the version number) is shown in two runs: the first without the amath library and the second (once JC had profiled the code to reveal that 60% of the runtime was spent in expf) with the ARM performance library. We saw a significant improvement in time-to-solution with this library. GCC 8.2.0 gave slow performance, so from this point on we used only the ARM clang compiler. These initial runs were completed without cosmology as the GSL was not yet built for the system.

Compilers and Allocators

Once some initial performance testing had been performed, we moved on to test out allocators. On Intel systems, the parallel allocators and pinning threads give us a significant performance increase. The GSL had now been built by us on the ARM system and so the runs below are with cosmology.
All runs were completed with ARM clang. These results also have the amath library included.

Just as with the Intel systems, tbbmalloc gives us an improved time-to-solution over the standard code and over tcmalloc. Pinning also gives a significant improvement, especially with tbbmalloc. Comparing these with the Skylake system (COSMA-7; Durham MI DiRAC), and the Sandy Bridge systems (COSMA-5; Durham, ex-DiRAC) below, we see that the ARM system in Leicester sits somewhere between the two. Again, it is worth noting that the runs below were completed without MPI using the non-MPI binary of SWIFT.

It is worth noting that both of the Intel machines shown above do use our hand-written intrinsics for vectorising the hydrodynamics calculations. Work is underway (by AC and AB) to port these intrinsics over to the NEON instruction set.

Performance over MPI

SWIFT benefits from being run in MPI mode using one rank per socket on some machines. Below we show comparisons to the two Durham systems (using the Intel compiler and Intel MPI, from 2018), and the Catalyst system (ARM clang, OpenMPI 4.0.0). The faint dotted lines below show the performance when using the non-MPI version of SWIFT.

We see that the ARM system gains a significant benefit from moving to two processors per node, and scales over two nodes similarly to the Skylake system. We also see that two nodes of the ARM system (for this configuration) are required to recover the performance of a single Skylake node.

SMT Modes

To determine the best SMT mode to run in, JC ran a number of tests on various nodes that were booted by the system administrators. All runs below use ARM clang, and two ranks per node (i.e. one per socket) with OpenMPI.

These results revealed the sweet spot (as did the runs of other codes on the system) of 2 SMT threads per processor. They also highlight the overhead of using SMT2 or SMT4 booted nodes in a single-threaded mode, which other users should be aware of. The small gain that we see from the additional threads per processor may not be worth it; the system will be more usable when booted in SMT1 mode for e.g. MPI-only codes.

Scaling Results

Once preliminary performance testing was performed, we ran a small (single-node) scaling study on the system. Two ranks per node were used for both systems.

The 56-core result for the Skylake system used two 28-core nodes. Here we see that the ARM system scales exceptionally well, retaining a parallel efficiency of 0.9 at 8 threads. It is worth noting that a production simulation would be run at a load per thread similar to that at 8 threads here. The Skylake result also includes the hand-written intrinsics for the Intel platform, which is one possible reason why its time-to-solution is so strong.
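For readers unfamiliar with the metric, parallel efficiency is the speed-up relative to a reference run divided by the increase in resources used. A minimal Python sketch follows; the timings are hypothetical values chosen so that the result is 0.9, and are not SWIFT measurements.

```python
def speedup(t_ref, t_n):
    """Speed-up of a run taking t_n seconds relative to a reference run of t_ref seconds."""
    return t_ref / t_n


def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Strong-scaling efficiency: speed-up divided by the increase in thread count.

    t_ref, n_ref: wall-clock time and thread count of the reference run
    t_n, n:       wall-clock time and thread count of the scaled run
    """
    return speedup(t_ref, t_n) / (n / n_ref)


# Hypothetical timings in seconds (assumed for illustration only).
t_1_thread = 900.0
t_8_threads = 125.0

print(f"speed-up:   {speedup(t_1_thread, t_8_threads):.2f}")
print(f"efficiency: {parallel_efficiency(t_1_thread, 1, t_8_threads, 8):.2f}")
```

An efficiency of 1.0 would mean perfect strong scaling; values below 1.0 quantify the time lost to communication, load imbalance and serial sections.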

Porting Vector Code

Significant progress was made in porting the hand-written vector intrinsics (see vector.h in SWIFT) to work on ARM with NEON. We are currently in the process of writing unit tests for these.


Build flags that were used over the week by AC:

$ ../configure --disable-hand-vec --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl"

$ ../configure --disable-hand-vec --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install CC=armclang

$ ../configure --disable-hand-vec --enable-debug --with-arm-fftw=/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_arm-hpc-compiler_19.0_aarch64-linux/lib CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl" --with-gsl=/home/dc-chalk/gsl-install --with-tbbmalloc=/home/dc-clark/tbb-2019_U3/build/

$ ../configure --disable-hand-vec --with-tcmalloc --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl"

build-clang-tcmalloc-phdf5 #Not yet working
$ ../configure --with-hdf5=/home/dc-chalk/hdf5-install/bin/h5pcc --disable-hand-vec --enable-debug --with-arm-fftw=/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_arm-hpc-compiler_19.0_aarch64-linux/lib CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl" --with-gsl=/home/dc-chalk/gsl-install --with-tbbmalloc=/home/dc-clark/tbb-2019_U3/build/linux_aarch64_gcc_cc4.8_libc2.22_kernel4.4.156_release CC=armclang

$ ../configure --disable-hand-vec --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install LDFLAGS="-mcpu=native -L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm" CFLAGS="-mcpu=native -I/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/include -L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm"

$ ../configure --disable-hand-vec --with-tcmalloc --enable-armv8-cntvct-el0 --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install LDFLAGS="-L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm" CFLAGS="-I/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/include -L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm"

To run on the system with 2 MPI ranks per node:

mpirun -np 4 -npernode 2 --report-bindings --bind-to socket --map-by socket --mca btl ^openib


DiRAC Day 2019

Our team, working on the Seven-League Hydro (SLH) code, is normally based at Newcastle University and the Heidelberg Institute for Theoretical Studies (HITS) in Germany. We are both active DiRAC users through various collaborations. As is usual these days, we do almost all of our day-to-day simulations on x86-based clusters. Yet our previous experience with IBM Power systems led us to build our codes in a very portable manner, so we jumped at the possibility to try it on an ARM cluster during the hackathon. Knowing ARM CPUs just from mobile phones and the Raspberry Pi so far, this was a great opportunity to test out this architecture’s potential in HPC and provide direct feedback to the people managing the DiRAC centres.

SLH is a finite-volume, astrophysical hydrodynamics code with specific focus on low Mach number flows. One of its distinguishing features is that it can do fully implicit time-stepping, which involves solving a large non-linear system using a Newton method, which in turn makes use of iterative linear solvers. These large systems involve a lot of collective MPI communication, but we found them to scale quite well even to large machines of more than 100 000 cores. Previous measurements revealed that large parts of the code are limited by the memory bandwidth, which made us curious whether the improved memory bandwidth per core on the Marvell ThunderX2 ARM chips would give us benefits. SLH is mostly written in Fortran 95, with a few select and portable features from Fortran 2003. Additionally, there are some small parts written in C, mainly for I/O related routines.

After a few file system hiccups on the first day, due to the hardware only being available to the system administrators a few days before, we could get started getting the code to compile and run basic tests. Because the ARM cluster is a standard GNU/Linux system, this was not really harder than getting everything to run on a standard x86 cluster. It mainly boiled down to finding the location of the libraries matching the compiler and MPI implementation. Issues with missing libraries could be resolved quickly with the administrators sitting at the next table. After that the code was running fine using GCC/GFortran and we could start the first tests.

ARM provides its own compiler based on LLVM. The front ends for C and Fortran are called clang and flang, respectively. We wanted to see how the ARM compiler would perform compared to the GNU compiler, with which we have much experience. It was quite uplifting to see that our strict adherence to the language standards meant that no source level changes were needed, and porting just meant finding the right equivalents of various compiler flags and integrating those into our build system. This meant we also had the ARM compiler working well on the second day.

The other groups reported a slight performance increase by using the ARM compiler instead of GCC. We saw the opposite trend with SLH, which led us to investigate. SLH being the only Fortran code in the hackathon, we decided to use the ARM profiling tool Forge to find out which sections of the code were not being optimised properly. With the help of the ARM software engineers we managed to get the tool working and see detailed timings of different code sections, without changing anything in the source. In the end the decrease in performance was caused by using pointer arrays instead of allocatable arrays inside a derived type. Thus a simple change caused a speed-up of about 40%. We never observed this with GFortran or the Intel compiler on other machines, and one of the ARM engineers is reporting this back to the flang developers.

The other technology to be tried out was Mellanox BlueField, which allows the user to directly access the computing power inside the InfiniBand network infrastructure. The current software interface was too low-level for us to directly use in SLH, but we had some good discussions with the Mellanox engineers on how this could be useful for us in the future, namely moving part of the reduction operations in our linear solvers to the network.

We did some direct speed comparisons of Intel and ARM CPUs, running the same problem. We found that the individual ARM cores were slower than their Intel counterparts, but the increased number of cores per node outweighs this. In the end the run time per node was basically identical. Thus we conclude that the ARM architecture is definitely competitive for our kind of simulations and we are glad we had the time and support to make sure SLH will run on ARM systems that might become available in the future.

We are grateful to the organisers for giving us this great opportunity to test new technology. The event at Leicester College Court also allowed ample time to network with both DiRAC staff and other users.

Intel Optane Testimonial

The Intel Hackathon to explore the Intel Optane memory was organized at Durham University in June 2019. The workshop consisted of a great balance of talks as well as hands-on experiences. A series of talks by experts from Intel introduced the new Optane memory system. The hands-on sessions were very intensive with a lot of help available from Intel and Durham staff.

Our team from CCFE (Culham Centre for Fusion Energy) consisted of three members – S.J. Pamela, L. Anton and D. Samaddar. Our laboratory specializes in fusion research and, with a goal of having fusion energy on a commercial grid, we work with a very wide range of complex simulation codes. Our codes differ from one another in terms of algorithms, data volume and data structures as well as the physics they solve. Our team therefore used a number of codes as test-beds at the hackathon. JOREK is a 3-D nonlinear MHD code solving very large sparse matrices. OpenMC is a Monte-Carlo code that can be very memory intensive when sampling a large number of particles. BOUT++ on the other hand is a code package that studies turbulence involving strongly coupled non-linearities and multiscale physics. GS2 is a widely used code solving gyrokinetic equations in fusion plasma.

The Hackathon provided the right setting with great networking opportunities to generate the initial tests and lead the way for further explorations. For example, the Optane memory mode was found to be very beneficial for JOREK when the matrix size was increased. All the applications from CCFE used at the Hackathon are representative of other codes used in fusion research – so the results should benefit a wider range of HPC users within the community.

  1. A low resolution 2D FEM grid for the JOREK code, with the third dimension represented by Fourier harmonics.
  2. A snapshot of plasma instabilities simulated by JOREK.


United Kingdom Atomic Energy Authority

Intel Optane Hackathon

Intel agreed to sponsor a 3-day Optane hackathon at Durham. Its aim was to learn how Intel® Optane™ memory and intelligent system acceleration work to deliver higher performance. This event provided the DiRAC community with an opportunity to explore the potential of Intel’s new Optane memory in supporting their science.

The Hackathon was open to all DiRAC HPC users. There were 5 teams of between 2 and 3 people each. Over the three days, several major DiRAC science codes were ported to make efficient use of this and standard memory, and the teams who attended gained the skills to assist other DiRAC researchers to port additional codes in the future.



Led by Dr Kacper Kornet from the University of Cambridge

GRChombo is a new open-source code for numerical general relativity simulations. It is developed and maintained by a collaboration of numerical relativists with a wide range of research interests, from early universe cosmology to astrophysics and mathematical general relativity, and has been used in many papers since its first release in 2015. GRChombo is written entirely in C++14, using hybrid MPI/OpenMP parallelism and vector intrinsics to achieve good performance on the latest architectures. Furthermore, it makes use of the Chombo library for adaptive mesh refinement to allow automatic increasing and decreasing of the grid resolution in regions of arbitrary shape and topology.


Led by Sergey Yurchenko from University College London

TROVE is a variational method with an associated Fortran 2003 program to construct and solve the ro-vibrational Schrödinger equation for a general polyatomic molecule of arbitrary structure. The energies and eigenfunctions obtained via the variational approach can be used to model absorption/emission intensities (molecular line lists) for a given temperature, as well as to compute temperature-independent line strengths and Einstein coefficients. A typical TROVE pipeline requires the construction and diagonalisation of about 200-300 double-precision real, dense symmetric matrices of sizes varying from 10,000×10,000 to 500,000×500,000. For line list production it is important to compute almost all eigenvalues (or at least 50-80%) together with their eigenvectors. The diagonalisations are done efficiently using external libraries, for example the DSYEV family for smaller matrices (N < 200,000) or the PDSYEV family for large matrices (N > 200,000). The TROVE program is highly optimized for the mass production of line lists for medium-size molecules at high temperatures. TROVE has been used extensively to produce line lists for a number of key polyatomic molecules, including NH3, CH4, H2CO, H2CS, PH3, SbH3, HOOH, HSOH, CH3Cl, BiH3, SO3, SiH4, CH3, C2H4 and CH3F (about 80 papers in peer-reviewed journals). TROVE is a well-recognized method in modern theoretical spectroscopy, and the TROVE paper is highly cited.
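
To give a sense of scale for the matrix sizes quoted above, the short Python sketch below estimates the storage for a single dense matrix of 8-byte reals held in full (not packed) form. The function names and the use of decimal gigabytes are illustrative assumptions and are not part of TROVE. The text does not say which solver family is used at exactly N = 200,000, so the sketch arbitrarily assigns that case to the DSYEV family.

```python
# Rough storage estimate for one dense N x N matrix of 8-byte reals.
# The 200,000 threshold is the one quoted in the text; the function names,
# the full (unpacked) storage and the decimal GB units are assumptions.

BYTES_PER_DOUBLE = 8
DISTRIBUTED_THRESHOLD = 200_000  # matrix dimension quoted in the text


def dense_matrix_gb(n: int) -> float:
    """Storage in decimal gigabytes for a full n x n double-precision matrix."""
    return n * n * BYTES_PER_DOUBLE / 1e9


def solver_family(n: int) -> str:
    """Return the solver family named in the text for dimension n.

    N == 200,000 is not covered by the text; it is assigned to DSYEV here.
    """
    return "PDSYEV" if n > DISTRIBUTED_THRESHOLD else "DSYEV"


for n in (10_000, 200_000, 500_000):
    print(f"N = {n:>7,}: {dense_matrix_gb(n):8.1f} GB, {solver_family(n)} family")
```

Under these assumptions a 10,000×10,000 matrix needs about 0.8 GB, while a 500,000×500,000 matrix needs about 2,000 GB for the matrix alone, which is consistent with switching to a distributed-memory solver for the largest cases.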


Led by Prof Richard Bower from Durham University

This code takes a fine-grained task-parallel approach to cosmological simulations.
It solves the gravity and hydrodynamics equations, with additional sub-grid physics source terms that encapsulate star formation, black holes, chemical enrichment, cooling etc.

The novelty of the code lies in its approach to parallelising this challenging problem. It delivers a factor of 10 speed improvement over the current state-of-the-art code, Gadget.

The code is primarily used by the EAGLE simulations team and at JPL for planetary science. It is a new code and we expect widespread adoption as the results are published.


Led by Arjen Tamerus from the University of Cambridge

MODAL_LSS is an astrophysics code used for estimation of the bispectrum of simulation data or observational data, using the MODAL methodology. It is currently in active development and not (yet) open source.


Led by Adrianne Slye from Oxford University

This team was using the open source code RAMSES, written by Romain Teyssier.
While mainly designed to study structure formation in an expanding Universe, the code has been applied to a range of problems dealing with self-gravitating MHD fluids (turbulence, planet formation) and is also equipped with a module to solve the radiative transfer equations. There also exist versions of the code with a modified Poisson solver to study alternatives to Einstein’s gravity.
The RAMSES community consists of about 150 users around the world, with about 15 in the UK.

SLH (Seven-League Hydro) Code

Led by Phillipp Edelmann from Newcastle University

SLH is a finite-volume hydrodynamics code, mainly intended for use in stellar astrophysics. Its most distinguishing feature is the fully implicit time
discretization, which allows it to run efficiently for problems at low Mach
numbers, such as the deep interiors of stars. The resulting large non-linear
system is very challenging to solve and needs Newton-Krylov methods to be

It is designed with flexibility in mind, supporting arbitrary structured curvilinear grids, different equations of state and a selection of other physics modules (nuclear reactions, neutrino losses, …).
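
The benefit of an implicit time discretization for stiff problems, as mentioned above, can be illustrated with a toy example. The Python sketch below is not taken from SLH: it integrates the scalar model equation du/dt = -lam * u**3 (a hypothetical stand-in for a stiff system) with a time step far larger than an explicit scheme tolerates, and solves the non-linear backward-Euler equation with plain Newton iteration. Because the problem is scalar, no Krylov solver is involved; all names and parameter values are assumptions chosen for illustration.

```python
# Toy comparison of explicit and implicit Euler on a stiff scalar ODE,
#   du/dt = -lam * u**3
# This is an illustrative model problem, not the equations solved by SLH.


def forward_euler_step(u_old, dt, lam):
    """Explicit update: uses only the old state."""
    return u_old - dt * lam * u_old**3


def backward_euler_step(u_old, dt, lam, tol=1e-12, max_newton=50):
    """Implicit update: solve g(v) = v - u_old + dt*lam*v**3 = 0 by Newton."""
    v = u_old  # initial guess: the old state
    for _ in range(max_newton):
        g = v - u_old + dt * lam * v**3
        dg = 1.0 + 3.0 * dt * lam * v**2  # derivative of g, always positive
        step = g / dg
        v -= step
        if abs(step) < tol:
            break
    return v


lam, dt, steps = 1000.0, 0.01, 3  # dt is deliberately too large for the explicit scheme
u_explicit = u_implicit = 1.0
for _ in range(steps):
    u_explicit = forward_euler_step(u_explicit, dt, lam)
    u_implicit = backward_euler_step(u_implicit, dt, lam)

print("explicit:", u_explicit)
print("implicit:", u_implicit)
```

With these values the explicit update overshoots and grows rapidly in magnitude after a few steps, while the implicit update decays smoothly towards zero. In a real hydrodynamics code the unknown is a very large vector, so each Newton step requires the solution of a large linear system.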

More information about the code at: https://slh-code.org/features.html
The code is currently not open source, but our group is open to sharing the code in collaborations.


Led by Rohini Joshi of the University of Manchester

They were using WSCLEAN (https://sourceforge.net/p/wsclean/wiki/Home/), a C++ code designed to perform Fast Fourier Transforms (FFTs) of large datasets produced by connected arrays of radio telescopes (interferometers), and then perform iterative beam-deconvolution (following the CLEAN algorithm; Högbom, 1974) to produce sensitive, high-resolution, wide-field images for astrophysical interpretation. eMERLIN, the UK National Facility for high-resolution radio astronomy, has proved a productive instrument for smaller PI-led studies (a few nights, yielding ~100GB datasets) imaging narrow fields of view (4 mega-pixels). However, the flagship “legacy” programmes, which address key STFC science goals, entail several thousand hours of on-sky data acquisition and have produced datasets of the order >>1-10TB over fields of view orders of magnitude larger (~2 giga-pixels). Producing images from these legacy programmes has thus presented a unique “big data” challenge.
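
The iterative deconvolution step described above can be sketched in a few lines. The Python example below is a minimal 1-D version of the Högbom CLEAN loop and is not WSCLEAN's implementation, which works on 2-D images. The beam shape, loop gain, stopping threshold and synthetic two-source sky are all assumptions chosen for illustration.

```python
# Minimal 1-D Hogbom-style CLEAN sketch (illustrative only; not WSCLEAN code).


def convolve_with_beam(sky, beam):
    """Build a 'dirty' image by placing a scaled beam at every sky pixel."""
    half = len(beam) // 2
    dirty = [0.0] * len(sky)
    for i, flux in enumerate(sky):
        for k, b in enumerate(beam):
            j = i + k - half
            if 0 <= j < len(sky):
                dirty[j] += flux * b
    return dirty


def hogbom_clean(dirty, beam, gain=0.1, threshold=1e-3, max_iter=1000):
    """Return (components, residual) for a 1-D dirty image.

    `beam` must have odd length with its peak (value 1.0) at the centre.
    """
    residual = list(dirty)
    components = [0.0] * len(dirty)
    half = len(beam) // 2
    for _ in range(max_iter):
        # Find the brightest residual pixel in absolute value.
        peak = max(range(len(residual)), key=lambda i: abs(residual[i]))
        if abs(residual[peak]) < threshold:
            break
        # Record a fraction of the peak as a point-source component ...
        flux = gain * residual[peak]
        components[peak] += flux
        # ... and subtract the correspondingly scaled beam from the residual.
        for k, b in enumerate(beam):
            j = peak + k - half
            if 0 <= j < len(residual):
                residual[j] -= flux * b
    return components, residual


# Synthetic test case: two point sources seen through a beam with sidelobes.
beam = [-0.1, 0.3, 1.0, 0.3, -0.1]
sky = [0.0] * 32
sky[10], sky[20] = 2.0, 1.0
dirty = convolve_with_beam(sky, beam)

components, residual = hogbom_clean(dirty, beam)
print("recovered fluxes:", round(components[10], 3), round(components[20], 3))
```

In this idealised, noise-free case the recovered component fluxes approach the input source fluxes as the residual falls below the threshold. Real interferometric data add noise, overlapping sidelobes and very large image sizes, which is where the memory demands discussed in this section arise.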

In recent years an experienced corps of radio astronomers have pushed the
boundaries of radio imaging techniques to deliver eMERLIN legacy science
using machines with ~512GB–1.5TB RAM, however to circumvent bottlenecks in our existing infrastructure and fully deliver on the potential of these datasets in the coming years we envisage even higher memory

These same challenges have been encountered by other modern radio telescopes (e.g. LOFAR, ASKAP, MeerKAT) and will be critical to science delivery from the Square Kilometre Array in the next decade. WSCLEAN is a modern imaging package designed to meet the needs of current and forthcoming facilities. It is rapidly becoming the standard wide-field imaging package, replacing older packages like AIPS and CASA. The source code is publicly available.

We have built a Docker image containing WSCLEAN (version 2.6) and all required dependencies: https://hub.docker.com/r/lofaruser/imaging-pipeline/


NVidia Hackathon
9th, 10th & 11th September
Swansea University

Call for Team Applications

We are pleased to announce that Nvidia have generously agreed to sponsor a 3-day GPU hackathon in Swansea prior to DiRAC Day 2018. This team event will provide the DiRAC community with the opportunity to explore the potential for GPUs in supporting their science.

The Hackathon is open to all DiRAC HPC users and we expect to be able to offer places to 5 or 6 teams of between 3 and 5 people each. Over the three days, we hope that several major DiRAC science codes will be ported to GPUs and that the teams who attend will gain the skills to assist other DiRAC researchers to port additional codes in the future. This is part of our on-going work to ensure that DiRAC provides the most appropriate hardware for your science and the hackathon will help provide input to discussions on the design of future DiRAC systems.

No prior experience of GPU programming is required – there will be online training material in advance of the hackathon itself to provide an introduction. Teams of 3-5 people can apply with 1 or 2 codes to be worked on. It’s important that all those who attend are familiar with the code that their team will be working on.

Download the Application Form and return it to the DiRAC Project Office by the 23rd July 2018. We will contact all applicants with the results in early August.

Accommodation booking is available through Swansea University’s DiRAC Day website. Some funding may be available to help a small number of students with accommodation. If you or your team members would like to apply for this, please indicate numbers on your Application Form at Q11.