This was our first joint DiRAC-ARM Mellanox Hackathon, held in September 2019 at the University of Leicester (prior to the DiRAC Day activities). It was open to all users but targeted at the DiRAC groups with the greatest readiness to investigate the possibilities of Mellanox’s BlueField technology. The 3-day event was structured as a mixture of expert presentations and user code development time, and provided an introduction to BlueField and the ARM development environment. It ended with the teams presenting their results at the DiRAC Day Conference.
In total, six teams from the DiRAC community attended. It was a great success, with all teams reporting that they would participate in any future hackathon. Details about the teams and their results are below.
Led by Rosie Talbot, email: Rt421@cam.ac.uk
AREPO is a massively parallel cosmological magnetohydrodynamics simulation code, including an N-body solver for gravity, with hydrodynamics solved on a moving mesh. The code can simulate a wide range of astrophysical systems, from modelling representative volumes of the Universe down to understanding planet formation.
AREPO is written in C and includes communication instructions using standardised MPI. Additionally, it uses open source libraries: GSL, GMP, FFTW and HDF5.
AREPO has been heavily optimized: tests of the scaling capabilities and parallelization of the code show excellent performance. According to benchmarking by the DiRAC facility, it is one of the fastest cosmological codes available, and its highly parallel I/O accounts for < 0.5% of the total run time. The code has highly optimized domain decomposition and work balance algorithms and has been shown to scale well up to thousands of cores.
We are open to exploring various possibilities for improving code performance through Mellanox technologies. The code has a large number of MPI calls/functions that could instantly benefit from Mellanox SHARP, while it may also be possible to offload subgrid physics modules or a subset of functions to the BlueField chips.
Speeding up the code will have an impact across all of our research areas, which span from understanding star formation in dwarf galaxies to modelling the cosmic evolution of large-scale structure.
We are open to using both. Closed source, but info available: http://www.mpa-garching.mpg.de/~volker/arepo/
Led by Thomas Guillet: T.A.Guillet@exeter.ac.uk
The AREPO code simulates general self-gravitating fluid dynamics for astrophysics, and is used to study a wide range of problems, including the formation and evolution of galaxies, turbulence in the interstellar medium, the formation of the first stars or the interaction of black holes with their surrounding galactic environment. AREPO-DG is a specific experimental solver of the AREPO code implementing a high-order discontinuous Galerkin (DG) method on adaptive mesh refinement (AMR) grids, applied to ideal compressible magnetohydrodynamics (MHD) flows (http://adsabs.harvard.edu/abs/2019MNRAS.485.4209G).
The AREPO code is used by a number of research groups in Germany (Munich, Heidelberg, Potsdam), the UK (Cardiff, Durham), and the US (Harvard, MIT, Chicago). AREPO-DG is experimental at this stage, and primarily used and developed by the primary contact.
AREPO-DG is compute-intensive due to its high-order scheme. Most operations are small dense linear algebra operations, which benefit from vector units. However, memory bandwidth is also important, because at intermediate scheme orders (the most useful in practice in astrophysics) the arithmetic intensity of the scheme is around 5-10 FP ops/byte, which still stresses the DRAM bandwidth on most x86 architectures. Some numerical ingredients in the code are also more sensitive to DRAM bandwidth.
I would expect the ThunderX2 architecture to help bring the code closer to being compute bound, thanks to its increased architectural byte/FLOP, after some cache optimizations in the code. I am also interested in exploring how the Mellanox BlueField technology can improve extreme scaling of the code.
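The bandwidth argument above can be made concrete with a simple roofline estimate, in which attainable performance is the smaller of the peak compute rate and the arithmetic intensity multiplied by the memory bandwidth. The following is a minimal sketch only: the peak and bandwidth figures are hypothetical placeholders chosen for illustration, not measurements of any x86 or ThunderX2 system.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline estimate: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def is_memory_bound(intensity, peak_gflops, bandwidth_gbs):
    """A kernel is memory bound if its intensity is below the machine balance."""
    machine_balance = peak_gflops / bandwidth_gbs  # FLOP per byte
    return intensity < machine_balance


# Hypothetical node, for illustration only (not a real machine).
PEAK_GFLOPS = 3000.0    # assumed peak floating-point rate
BANDWIDTH_GBS = 250.0   # assumed DRAM bandwidth

for intensity in (5.0, 10.0, 20.0):  # FP ops per byte
    print(
        intensity,
        attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS),
        is_memory_bound(intensity, PEAK_GFLOPS, BANDWIDTH_GBS),
    )
```

With these assumed figures the machine balance is 12 FLOP/byte, so a scheme at 5-10 FP ops/byte sits on the bandwidth-limited part of the roofline. Raising the bandwidth relative to the peak (a larger byte/FLOP ratio) lowers the machine balance and moves such a scheme towards the compute-bound regime.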
Any performance improvement obtained over the hackathon will directly benefit upcoming AREPO-DG simulations of MHD turbulence that are being prepared in conjunction with corresponding DiRAC and STFC proposals. These simulations aim at understanding the amplification of magnetic fields in turbulent flows, a candidate mechanism to explain the ubiquitous magnetic fields observed today in the Universe. In addition, these simulations will help us understand the numerical properties, performance characteristics, and implementation techniques of high-order schemes, and will contribute to their wider application to astrophysics, but also to broader domains of computational fluid dynamics where DG schemes are gaining traction.
Led by Asier Lopez-Eiguren: email@example.com
Our code solves a second-order differential equation on a 3D discrete lattice in order to analyse the evolution of a scalar field with N components. The main code, ON, consists of 4000 lines of code and the Latfield2d library consists of 6000 lines; both are written in C++ and can be downloaded from the links below.
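To illustrate the kind of computation described above, here is a minimal sketch of one leapfrog time step for an N-component scalar field on a small periodic 3D lattice. It is not taken from the ON code or the Latfield2d library. The wave-like equation of motion, the quartic potential with parameters `lam` and `eta`, the 7-point Laplacian, and all function names are assumptions made for illustration.

```python
import itertools


def site(i, j, k, n):
    """Flat index of lattice site (i, j, k) on a periodic n^3 lattice."""
    return ((i % n) * n + (j % n)) * n + (k % n)


def laplacian(comp, n, dx):
    """7-point Laplacian of one field component with periodic boundaries."""
    out = [0.0] * (n ** 3)
    for i, j, k in itertools.product(range(n), repeat=3):
        neighbours = (
            comp[site(i + 1, j, k, n)] + comp[site(i - 1, j, k, n)]
            + comp[site(i, j + 1, k, n)] + comp[site(i, j - 1, k, n)]
            + comp[site(i, j, k + 1, n)] + comp[site(i, j, k - 1, n)]
        )
        out[site(i, j, k, n)] = (neighbours - 6.0 * comp[site(i, j, k, n)]) / dx ** 2
    return out


def acceleration(phi, n, dx, lam=1.0, eta=1.0):
    """Assumed equation: phi_a'' = lap(phi_a) - lam * (|phi|^2 - eta^2) * phi_a."""
    laps = [laplacian(comp, n, dx) for comp in phi]
    norm2 = [sum(comp[s] ** 2 for comp in phi) for s in range(n ** 3)]
    return [
        [lap[s] - lam * (norm2[s] - eta ** 2) * comp[s] for s in range(n ** 3)]
        for comp, lap in zip(phi, laps)
    ]


def leapfrog_step(phi, pi, n, dx, dt):
    """Advance the field phi and its momentum pi by one kick-drift-kick step."""
    acc = acceleration(phi, n, dx)
    pi = [[p + 0.5 * dt * a for p, a in zip(pc, ac)] for pc, ac in zip(pi, acc)]
    phi = [[f + dt * p for f, p in zip(fc, pc)] for fc, pc in zip(phi, pi)]
    acc = acceleration(phi, n, dx)
    pi = [[p + 0.5 * dt * a for p, a in zip(pc, ac)] for pc, ac in zip(pi, acc)]
    return phi, pi


# Tiny demonstration: a uniform two-component field displaced from the vacuum.
n = 4
phi = [[0.5] * n ** 3, [0.0] * n ** 3]
pi = [[0.0] * n ** 3, [0.0] * n ** 3]
phi, pi = leapfrog_step(phi, pi, n, dx=1.0, dt=0.1)
print(phi[0][0], pi[0][0])
```

Each site update only needs its six nearest neighbours, so on a production lattice the cost per step scales with the number of sites, which is why larger lattices translate directly into a need for better performance.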
As we described in our DiRAC-2.5y Director’s Discretionary time application, bigger simulations are necessary to close a debate about axion dark matter production from axion strings. We were able to run 4k^3 lattices on DIaL, but bigger simulations will improve the finite-size scaling analysis even further. In order to achieve this goal, we have to improve the performance of our code, and we think that the Hackathon will help us.
Improving the ON code will help us to analyse axion strings in more detail. Axion strings are an important source of dark matter axions, and their density feeds into the calculation of the axion DM relic density. Moreover, improving the Latfield2d library will help to enlarge field-theoretical simulations and will therefore help to solve many problems related to finite-size effects.
Led by Dr Kacper Kornet: firstname.lastname@example.org
AMR code for general relativity simulations. Built using C++, MPI and HDF5. For best performance GRChombo can also use Intel intrinsics.
Currently, typical GRChombo runs on DiRAC systems use 16-32 KNL nodes with around 90GB of memory per rank.
Check the performance characteristics of the code on the ARM architecture. In particular, learn how to vectorize templated C++ code, which would probably require an equivalent of Intel intrinsics. I would also like to learn about the Mellanox BlueField architecture in the context of speeding up MPI communication in the code.
Led by Matthew Bate: M.R.Bate@exeter.ac.uk
Smoothed particle hydrodynamics code, for studying astrophysical fluid dynamics. Not open source (private Git repository hosted on BitBucket). About a dozen active users. Parallelised using both OpenMP and MPI, usually run in hybrid MPI/OpenMP mode. Involves a lot of memory access. Typically run on ~256-2048 cores. Unsure what the performance gains might be. Speed improvements would lead to larger parameter space and/or higher resolution calculations. Attended the ARM hackathon in January.
TEAM: Seven-League Hackers (SLH)
Led by Dr Philipp Edelmann: email@example.com
Their code, the Seven-League Hydro (SLH) code, is a finite-volume hydrodynamics code, mainly intended for use in stellar astrophysics. Its most distinguishing feature is the fully implicit time discretization, which allows it to run efficiently for problems at low Mach numbers, such as the deep interiors of stars. The resulting large non-linear system is very challenging to solve and needs Newton-Krylov methods to be efficient.
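The structure of an implicit step solved with a Newton-Krylov method can be sketched on a toy problem. The example below is not SLH's scheme: it assumes a backward Euler step for a small 1D model equation u' = -(A u + u^3), with A a discrete Laplacian, and uses conjugate gradients as the Krylov solver, which is valid here only because this toy Jacobian is symmetric positive definite. All function names are hypothetical.

```python
def apply_laplacian(u):
    """Apply A = -d^2/dx^2 (unit spacing, zero Dirichlet boundaries)."""
    n = len(u)
    return [
        2.0 * u[i] - (u[i - 1] if i > 0 else 0.0) - (u[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def residual(u_new, u_old, dt):
    """Backward Euler residual G(u) = u - u_old + dt * (A u + u^3)."""
    au = apply_laplacian(u_new)
    return [un - uo + dt * (a + un ** 3) for un, uo, a in zip(u_new, u_old, au)]


def jacobian_times(u, v, dt):
    """Matrix-free product J v with J = I + dt * (A + 3 diag(u^2))."""
    av = apply_laplacian(v)
    return [vi + dt * (a + 3.0 * ui ** 2 * vi) for ui, vi, a in zip(u, v, av)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(matvec, b, tol=1e-12, max_iter=200):
    """Krylov solver for a symmetric positive definite system matvec(x) = b."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    bb = rr
    for _ in range(max_iter):
        if rr <= tol ** 2 * bb:
            break
        ap = matvec(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def backward_euler_step(u_old, dt, newton_tol=1e-10, max_newton=20):
    """One implicit step: Newton iteration with a Krylov solve per iteration."""
    u = list(u_old)
    for _ in range(max_newton):
        g = residual(u, u_old, dt)
        if dot(g, g) ** 0.5 < newton_tol:
            break
        delta = conjugate_gradient(lambda v: jacobian_times(u, v, dt), [-gi for gi in g])
        u = [ui + di for ui, di in zip(u, delta)]
    return u


# A time step far larger than an explicit scheme would tolerate for this problem.
u1 = backward_euler_step([1.0] * 8, dt=10.0)
print(u1)
```

The Krylov solver only ever needs Jacobian-vector products and dot products, which is why the Jacobian is never assembled. A production code would typically use a solver that does not require symmetry, together with a preconditioner.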
It is designed with flexibility in mind, supporting arbitrary structured curvilinear grids, different equations of state and a selection of other physics modules (nuclear reactions, neutrino losses, …).
More information about the code at: https://slh-code.org/features.html
The code currently consists of 72000 lines of (mostly) Fortran code.
Most of the code is written in modern Fortran 90, with certain select features from F2003 and F2008. Small parts are written in C, mostly code interacting with the operating system. This is a deliberate choice to ensure maximum portability across compilers and HPC platforms.
The code needs any implementation of BLAS and LAPACK. Parallelisation is done via MPI and/or OpenMP.
The code is currently not open source, but our group is open to sharing the code within collaborations.
SLH shows excellent scaling even on very large clusters. The largest tests so far were run on 131 072 cores of a Blue Gene/Q system.
The iterative linear solvers are largely memory bandwidth dominated. This prevents us from reaching the performance of more floating-point dominated codes. By making sure SLH runs well on ARM systems, we would have an ideal architecture on which to run our simulations efficiently.
Our simulations of the stellar interior need to cover a long simulation time in order to extract quantities such as wave spectra and entrainment rates. The improved performance would enable us to extract more detailed results and cover a wider range of parameters in our models.
https://git.slh-code.org (access granted on request)
TEAM: The arm of SWIFT
Led by Dr Matthieu Schaller: firstname.lastname@example.org
This is an astrophysics code: gravity + hydrodynamics using SPH. It is written in C (GNU 99). Libraries: HDF5 (parallel), FFTW (non-MPI), GSL, MPI (standard 3.1).
Get the vectorized parts of SWIFT to use the ARM intrinsics. Possibly tweak the I/O, which has until recently been problematic on the system.
Ability to run efficiently on more architectures. The current catalyst system is ideal for running middle-sized simulations and we would like to exploit it as much as possible. Some small performance boost is necessary before this becomes a real alternative to the systems we currently use.
Mostly ARM, as earlier tests of BlueField showed that SWIFT is not bound by the speed/latency of the global MPI comms. We’d still be interested in discussing options with Mellanox experts.
ARM/Mellanox Hackathon Testimonial
Our team, working on the Seven-League Hydro (SLH) code, is normally based at Newcastle University and the Heidelberg Institute for Theoretical Studies (HITS) in Germany. We are both active DiRAC users through various collaborations. As is usual these days, we do almost all of our day-to-day simulations on x86-based clusters. Yet our previous experience with IBM Power systems led us to build our codes in a very portable manner, so we jumped at the possibility of trying SLH on an ARM cluster during the hackathon. Knowing ARM CPUs just from mobile phones and the Raspberry Pi so far, we saw this as a great opportunity to test out this architecture’s potential in HPC and provide direct feedback to the people managing the DiRAC centres.
SLH is a finite-volume, astrophysical hydrodynamics code with a specific focus on low Mach number flows. One of its distinguishing features is that it can do fully implicit time-stepping, which involves solving a large non-linear system using a Newton method, which in turn makes use of iterative linear solvers. These large systems involve a lot of collective MPI communication, but we found them to scale quite well even to large machines of more than 100 000 cores. Previous measurements revealed that large parts of the code are limited by the memory bandwidth, which made us curious whether the improved memory bandwidth per core on the Marvell ThunderX2 ARM chips would give us benefits. SLH is mostly written in Fortran 95, with a few select and portable features from Fortran 2003. Additionally, there are some small parts written in C, mainly for I/O related routines.
After a few file system hiccups on the first day, due to the hardware only being available to the system administrators a few days before, we could get started getting the code to compile and run basic tests. Because the ARM cluster is a standard GNU/Linux system, this was not really harder than getting everything to run on a standard x86 cluster. It mainly boiled down to finding the location of the libraries matching the compiler and MPI implementation. Issues with missing libraries could be resolved quickly with the administrators sitting at the next table. After that the code was running fine using GCC/GFortran and we could start the first tests.
ARM provides its own compiler based on LLVM. The front ends for C and Fortran are called clang and flang, respectively. We wanted to see how the ARM compiler would perform compared to the GNU compiler, with which we have much experience. It was quite uplifting to see that our strict adherence to the language standards meant that no source-level changes were needed, and porting just meant finding the right equivalents of various compiler flags and integrating those into our build system. This meant we also had the ARM compiler working well on the second day.
The other groups reported a slight performance increase from using the ARM compiler instead of GCC. We saw the opposite trend with SLH, which led us to investigate. SLH being the only Fortran code in the hackathon, we decided to use the ARM profiling tool Forge to find out which sections of the code were not being optimised properly. With the help of the ARM software engineers we managed to get the tool working and see detailed timings of different code sections, without changing anything in the source. In the end, the decrease in performance was caused by using pointer arrays instead of allocatable arrays inside a derived type. Thus a simple change caused a speed-up of about 40%. We never observed this with GFortran or the Intel compiler on other machines, and one of the ARM engineers is reporting this back to the flang developers.
The other technology to be tried out was Mellanox BlueField, which allows the user to directly access the computing power inside the InfiniBand network infrastructure. The current software interface was too low-level for us to use directly in SLH, but we had some good discussions with the Mellanox engineers on how this could be useful for us in the future, namely by moving part of the reduction operations in our linear solvers to the network.
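For readers unfamiliar with the reduction operations mentioned above, the sketch below shows the pattern in a distributed dot product, a building block of iterative linear solvers. It only simulates ranks with Python lists and does not use MPI or any Mellanox interface. The function `allreduce_sum` is a hypothetical stand-in for a global sum such as an MPI_Allreduce call, and the other names are likewise illustrative.

```python
def split(vec, n_ranks):
    """Distribute a vector over n_ranks contiguous chunks (simulated ranks)."""
    size = (len(vec) + n_ranks - 1) // n_ranks
    return [vec[i:i + size] for i in range(0, len(vec), size)]


def local_dot(x_chunk, y_chunk):
    """Work done independently on each rank using only its own data."""
    return sum(a * b for a, b in zip(x_chunk, y_chunk))


def allreduce_sum(partials):
    """Stand-in for a global reduction: every rank receives the total."""
    total = sum(partials)
    return [total] * len(partials)


def distributed_dot(x_chunks, y_chunks):
    """Local partial sums followed by one global reduction."""
    partials = [local_dot(xc, yc) for xc, yc in zip(x_chunks, y_chunks)]
    return allreduce_sum(partials)


x = [float(i) for i in range(1, 9)]
y = [1.0] * 8
print(distributed_dot(split(x, 4), split(y, 4)))
```

The local part scales with the data held by each rank, while the global sum involves every rank at once. It is that collective step which would be the candidate for moving into the network.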
We did some direct speed comparisons of Intel and ARM CPUs, running the same problem. We found that the individual ARM cores were slower than their Intel counterparts, but the increased number of cores per node outweighs this. In the end the run time per node was basically identical. Thus we conclude that the ARM architecture is definitely competitive for our kind of simulations and we are glad we had the time and support to make sure SLH will run on ARM systems that might become available in the future.
We are grateful to the organisers for giving us this great opportunity to test new technology. The event at Leicester College Court also allowed ample time to network with both DiRAC staff and other users.