This was our first joint DiRAC-ARM Mellanox Hackathon, held in September 2019 at the University of Leicester (prior to the DiRAC Day activities). It was open to all users but targeted at the DiRAC groups with the greatest readiness to investigate the possibilities of Mellanox’s BlueField technology. The 3-day event was structured as a mixture of expert presentations and user code development time, and provided an introduction to BlueField and the ARM development environment. The event ended with the teams presenting their results at the DiRAC Day conference.
In total, six teams from the DiRAC community attended. It was a great success, with all teams reporting that they would participate in any future hackathon. Details about the teams and their results are below.
Led by Rosie Talbot: Rt421@cam.ac.uk
AREPO is a massively parallel cosmological magnetohydrodynamics simulation code. It includes an N-body solver for gravity, with hydrodynamics solved on a moving mesh. The code can simulate a wide range of astrophysical systems, from modelling representative volumes of the Universe down to understanding planet formation.
AREPO is written in C and uses standardised MPI for communication. Additionally, it uses the open-source libraries GSL, GMP, FFTW and HDF5.
AREPO has been heavily optimized, and tests of the scaling capabilities and parallelization of the code show excellent performance. Benchmarking by the DiRAC facility shows that, compared to other cosmological codes, it is one of the best performers available in terms of speed, and its highly parallel I/O accounts for less than 0.5% of the total run time. The code has highly optimized domain decomposition and work balance algorithms and has been shown to scale well up to thousands of cores.
We are open to exploring various possibilities to improve code performance through Mellanox technologies. The code has a large number of MPI calls/functions that could instantly benefit from Mellanox SHARP, while it may also be possible to offload subgrid physics modules or a subset of functions to the BlueField chips.
Speeding up the code will have an impact across all of our research areas, which span from understanding star formation in dwarf galaxies to modelling the cosmic evolution of large-scale structure.
We are open to using both. The code is closed source, but information is available at: http://www.mpa-garching.mpg.de/~volker/arepo/
Led by Thomas Guillet: T.A.Guillet@exeter.ac.uk
The AREPO code simulates general self-gravitating fluid dynamics for astrophysics, and is used to study a wide range of problems, including the formation and evolution of galaxies, turbulence in the interstellar medium, the formation of the first stars or the interaction of black holes with their surrounding galactic environment. AREPO-DG is a specific experimental solver of the AREPO code implementing a high-order discontinuous Galerkin (DG) method on adaptive mesh refinement (AMR) grids, applied to ideal compressible magnetohydrodynamics (MHD) flows (http://adsabs.harvard.edu/abs/2019MNRAS.485.4209G).
The AREPO code is used by a number of research groups in Germany (Munich, Heidelberg, Potsdam), the UK (Cardiff, Durham), and the US (Harvard, MIT, Chicago). AREPO-DG is experimental at this stage, and is primarily used and developed by the primary contact.
AREPO-DG is compute-intensive due to its high-order scheme. Most operations are small dense linear algebra operations, which benefit from vector units. However, memory bandwidth is also important: at intermediate scheme orders (the most useful in practice in astrophysics) the arithmetic intensity of the scheme is around 5-10 FP ops/byte, which still stresses the DRAM bandwidth on most x86 architectures. Some numerical ingredients in the code are also more sensitive to DRAM bandwidth.
I would expect the ThunderX2 architecture to help bring the code closer to compute bound, thanks to its higher architectural byte/FLOP ratio, after some cache optimizations in the code. I am also interested in exploring how the Mellanox BlueField technology can improve extreme scaling of the code.
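To make the bandwidth argument concrete, the following minimal roofline sketch estimates attainable performance as the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The peak and bandwidth figures are hypothetical round numbers chosen for illustration; they are not measurements of ThunderX2 or of any x86 system, and the function names are illustrative as well.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Roofline estimate: performance is capped either by the peak
    floating-point rate or by how fast memory can feed the kernel."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being
    memory bound and becomes compute bound."""
    return peak_gflops / bandwidth_gb_s


# Hypothetical machine parameters, for illustration only.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_S)
    for intensity in (5.0, 10.0, 20.0):
        perf = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
        regime = "memory bound" if intensity < ridge else "compute bound"
        print(f"{intensity:5.1f} FLOP/byte -> {perf:7.1f} GFLOP/s ({regime})")
```

In this toy setting a kernel at 5 FP ops/byte is limited by memory traffic. Doubling the bandwidth at the same peak (a higher byte/FLOP ratio) halves the ridge point, so the same kernel reaches the compute roof.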
Any performance improvement obtained over the hackathon will directly benefit upcoming AREPO-DG simulations of MHD turbulence, which are being prepared in conjunction with corresponding DiRAC and STFC proposals. These simulations aim at understanding the amplification of magnetic fields in turbulent flows, a candidate mechanism to explain the ubiquitous magnetic fields observed today in the Universe. In addition, these simulations will help us understand the numerical properties, performance characteristics, and implementation techniques of high-order schemes, and will contribute to their wider application in astrophysics and in broader domains of computational fluid dynamics where DG schemes are gaining traction.
Led by Asier Lopez-Eiguren: firstname.lastname@example.org
Our code solves a second-order differential equation on a 3D discrete lattice in order to analyse the evolution of a scalar field with N components. The main code, ON, consists of 4000 lines of code and the Latfield2d library consists of 6000 lines. Both are written in C++ and can be downloaded from the links below.
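As an illustration of this kind of problem, the sketch below advances an N-component scalar field on a small periodic 3D lattice with a kick-drift (leapfrog-type) update. The equation of motion, the quartic potential with parameters `lam` and `eta`, the 7-point Laplacian and all function names are assumptions made for this example. They are not taken from ON or Latfield2d, and a production code would distribute the lattice across MPI ranks rather than loop in pure Python.

```python
import itertools


def site_index(i, j, k, n):
    """Flat index of lattice site (i, j, k) with periodic wrapping."""
    return ((i % n) * n + (j % n)) * n + (k % n)


def laplacian(component, n, dx):
    """7-point finite-difference Laplacian of one field component,
    stored as a flat list of n**3 values on a periodic lattice."""
    out = [0.0] * (n ** 3)
    for i, j, k in itertools.product(range(n), repeat=3):
        neighbours = (
            component[site_index(i + 1, j, k, n)]
            + component[site_index(i - 1, j, k, n)]
            + component[site_index(i, j + 1, k, n)]
            + component[site_index(i, j - 1, k, n)]
            + component[site_index(i, j, k + 1, n)]
            + component[site_index(i, j, k - 1, n)]
        )
        centre = component[site_index(i, j, k, n)]
        out[site_index(i, j, k, n)] = (neighbours - 6.0 * centre) / dx ** 2
    return out


def kick_drift_step(phi, pi, n, dx, dt, lam=1.0, eta=1.0):
    """Advance the field phi and its time derivative pi in place by dt.

    Assumed equation of motion for each component a:
        d2(phi_a)/dt2 = laplacian(phi_a) - lam * (|phi|^2 - eta^2) * phi_a
    phi and pi are lists of N flat lists, one per field component.
    """
    sites = n ** 3
    lap = [laplacian(component, n, dx) for component in phi]
    # Kick: update the time derivative using the current field.
    for s in range(sites):
        norm2 = sum(component[s] ** 2 for component in phi)
        for a in range(len(phi)):
            force = lap[a][s] - lam * (norm2 - eta ** 2) * phi[a][s]
            pi[a][s] += dt * force
    # Drift: update the field using the new time derivative.
    for a in range(len(phi)):
        for s in range(sites):
            phi[a][s] += dt * pi[a][s]
    return phi, pi
```

The cost per step is proportional to the number of lattice sites, so doubling the side length of the lattice multiplies both the memory footprint and the work per step by eight.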
As we described in our DiRAC-2.5y Director’s Discretionary time application, bigger simulations are necessary to close a debate about axion dark matter production from axion strings. We were able to run 4k³ simulations on DIaL, but bigger simulations will further improve the finite-size scaling analysis. In order to achieve this goal, we have to improve the performance of our code, and we think that the Hackathon will help us.
The improvement of the ON code will help us to analyse axion strings in more detail. Axion strings are an important source of dark matter axions, and their density feeds into the calculation of the axion DM relic density. Moreover, the improvement of the Latfield2d library will enable larger field-theoretical simulations and will therefore help to solve many problems related to finite-size effects.
Led by Dr Kacper Kornet: email@example.com
GRChombo is an AMR code for general relativity simulations, built using C++, MPI and HDF5. For best performance, GRChombo can also use Intel intrinsics.
Currently, typical GRChombo runs on DiRAC systems use 16-32 KNL nodes with around 90 GB of memory per rank.
The aim is to check the performance characteristics of the code on the ARM architecture and, in particular, to learn how to vectorize templated C++ code, which would probably require an equivalent of Intel intrinsics. I would also like to learn about the Mellanox BlueField architecture in the context of speeding up MPI communication in the code.
Led by Matthew Bate: M.R.Bate@exeter.ac.uk
Smoothed particle hydrodynamics code for studying astrophysical fluid dynamics. Not open source (private Git repository hosted on Bitbucket). About a dozen active users. Parallelised using both OpenMP and MPI, and usually run in hybrid MPI/OpenMP mode. Involves a lot of memory access. Typically run on ~256-2048 cores. Unsure what the performance gains might be. Speed improvements would allow a larger parameter space and/or higher-resolution calculations. Attended the ARM hackathon in January.
TEAM: Seven-League Hackers (SLH)
Led by Dr Philipp Edelmann: firstname.lastname@example.org
Their code, the Seven-League Hydro (SLH) code, is a finite-volume hydrodynamics code, mainly intended for use in stellar astrophysics. Its most distinguishing feature is the fully implicit time discretization, which allows it to run efficiently for problems at low Mach numbers, such as the deep interiors of stars. The resulting large non-linear system is very challenging to solve and needs Newton-Krylov methods to be solved efficiently.
It is designed with flexibility in mind, supporting arbitrary structured curvilinear grids, different equations of state and a selection of other physics modules (nuclear reactions, neutrino losses, …).
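The benefit of implicit time stepping can be illustrated on a single stiff equation. The sketch below takes backward Euler steps and solves each resulting nonlinear equation with Newton's method. The test equation, its stiffness parameter `K` and all function names are assumptions chosen for illustration and are unrelated to SLH's actual discretization. In a Newton-Krylov solver, the scalar division below is replaced by an iterative (Krylov) solve of a large linear system.

```python
def backward_euler_step(y_old, dt, f, dfdy, tol=1e-12, max_iter=50):
    """One backward Euler step for dy/dt = f(y).

    Solves g(y) = y - y_old - dt * f(y) = 0 with Newton's method,
    starting from the previous value as the initial guess.
    """
    y = y_old
    for _ in range(max_iter):
        residual = y - y_old - dt * f(y)
        jacobian = 1.0 - dt * dfdy(y)
        delta = residual / jacobian  # a Krylov solve in the large-system case
        y -= delta
        if abs(delta) < tol * max(1.0, abs(y)):
            return y
    raise RuntimeError("Newton iteration did not converge")


K = 1000.0  # hypothetical stiffness parameter


def rhs(y):
    """Stiff relaxation towards the equilibrium y = 1."""
    return -K * (y ** 3 - 1.0)


def rhs_derivative(y):
    """Derivative of rhs with respect to y, needed by Newton's method."""
    return -3.0 * K * y ** 2


def integrate_implicit(y0, dt, steps):
    """Take a fixed number of backward Euler steps from y0."""
    y = y0
    for _ in range(steps):
        y = backward_euler_step(y, dt, rhs, rhs_derivative)
    return y


def explicit_euler_step(y, dt):
    """One forward Euler step, for comparison."""
    return y + dt * rhs(y)
```

With `dt = 0.1`, far above the explicit stability limit for this equation, `integrate_implicit(2.0, 0.1, 5)` relaxes to the equilibrium `y = 1`, whereas a single `explicit_euler_step(2.0, 0.1)` overshoots it by hundreds.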
More information about the code at: https://slh-code.org/features.html
The code currently consists of 72000 lines of (mostly) Fortran code.
Most of the code is written in modern Fortran 90, with certain select features from F2003 and F2008. Small parts are written in C, mostly code interacting with the operating system. This is a deliberate choice to ensure maximum portability across compilers and HPC platforms.
The code requires an implementation of BLAS and LAPACK. Parallelisation is done via MPI and/or OpenMP.
The code is currently not open source, but our group is open to sharing the code within collaborations.
SLH shows excellent scaling even on very large clusters. The largest tests so far were run on a Blue Gene/Q using 131 072 cores.
The iterative linear solvers are largely memory bandwidth dominated. This prevents us from reaching the performance of more floating-point dominated codes. By making sure SLH runs well on ARM systems, we would have an ideal architecture to efficiently run our simulations.
Our simulations of the stellar interior need to cover a long simulation time in order to extract quantities such as wave spectra and entrainment rates. The improved performance would enable us to extract more detailed results and cover a wider range of parameters in our models.
https://git.slh-code.org (access granted on request)
TEAM: The arm of SWIFT
Led by Dr Matthieu Schaller: email@example.com
This is an astrophysics code: gravity plus hydrodynamics using SPH. It is written in C (GNU 99). Libraries: HDF5 (parallel), FFTW (non-MPI), GSL, MPI (standard 3.1).
Get the vectorized aspects of SWIFT to use the ARM intrinsics. Possibly tweak the I/O, which has until recently been problematic on the system.
Ability to run efficiently on more architectures. The current catalyst system is ideal for running middle-sized simulations and we would like to exploit it as much as possible. Some small performance boost is necessary before this becomes a real alternative to the systems we currently use.
Mostly ARM, as earlier tests of BlueField showed that SWIFT is not bound by the speed/latency of the global MPI comms. We’d still be interested in discussing options with Mellanox experts.