DiRAC Day 2019
Our team, working on the Seven-League Hydro (SLH) code, is normally based at Newcastle University and the Heidelberg Institute for Theoretical Studies (HITS) in Germany. We are both active DiRAC users through various collaborations. As is usual these days, we do almost all of our day-to-day simulations on x86-based clusters. Yet our previous experience with IBM Power systems led us to build our code in a very portable manner, so we jumped at the chance to try it on an ARM cluster during the hackathon. Having known ARM CPUs only from mobile phones and the Raspberry Pi so far, we saw this as a great opportunity to test the architecture’s potential in HPC and to provide direct feedback to the people managing the DiRAC centres.
SLH is a finite-volume, astrophysical hydrodynamics code with a specific focus on low Mach number flows. One of its distinguishing features is fully implicit time-stepping, which involves solving a large non-linear system using a Newton method, which in turn makes use of iterative linear solvers. These large systems involve a lot of collective MPI communication, but we have found them to scale quite well even on large machines with more than 100 000 cores. Previous measurements revealed that large parts of the code are limited by memory bandwidth, which made us curious whether the higher memory bandwidth per core of the Marvell ThunderX2 ARM chips would benefit us. SLH is mostly written in Fortran 95, with a few select and portable features from Fortran 2003. Additionally, there are some small parts written in C, mainly for I/O-related routines.
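To give a rough idea of this structure (the equations, names and solver below are purely illustrative and far simpler than the actual SLH implementation), an implicit step nests an iterative linear solver inside an outer Newton iteration:

    ! Illustrative sketch only, not SLH code: Newton iteration for a small
    ! non-linear system F(u) = A*u + u**3 - 1 = 0, where each Newton update
    ! du is obtained from an iterative (here: Jacobi) linear solver.
    program newton_sketch
      implicit none
      integer, parameter :: n = 8
      real(8) :: u(n), f(n), du(n)
      integer :: it

      u = 0.0d0
      do it = 1, 20
         call residual(u, f)                            ! evaluate F(u)
         if (sqrt(dot_product(f, f)) < 1.0d-10) exit    ! Newton converged
         call jacobi_solve(u, -f, du)                   ! solve J(u)*du = -F(u)
         u = u + du                                     ! Newton update
      end do
      print *, 'final residual norm:', sqrt(dot_product(f, f))

    contains

      ! F(u) with A the diagonally dominant tridiagonal matrix tridiag(-1, 4, -1)
      subroutine residual(u, f)
        real(8), intent(in)  :: u(:)
        real(8), intent(out) :: f(:)
        integer :: i
        do i = 1, size(u)
           f(i) = 4.0d0*u(i) + u(i)**3 - 1.0d0
           if (i > 1)       f(i) = f(i) - u(i-1)
           if (i < size(u)) f(i) = f(i) - u(i+1)
        end do
      end subroutine residual

      ! Jacobi iteration for J*du = rhs with Jacobian J = A + 3*diag(u**2)
      subroutine jacobi_solve(u, rhs, du)
        real(8), intent(in)  :: u(:), rhs(:)
        real(8), intent(out) :: du(:)
        real(8) :: du_new(size(u))
        integer :: k, i
        du = 0.0d0
        do k = 1, 200
           do i = 1, size(u)
              du_new(i) = rhs(i)
              if (i > 1)       du_new(i) = du_new(i) + du(i-1)
              if (i < size(u)) du_new(i) = du_new(i) + du(i+1)
              du_new(i) = du_new(i) / (4.0d0 + 3.0d0*u(i)**2)
           end do
           du = du_new
        end do
      end subroutine jacobi_solve

    end program newton_sketch

In SLH itself the linear systems are of course distributed across many MPI ranks, which is where the collective MPI communication mentioned above comes in.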
After a few file-system hiccups on the first day, caused by the hardware having been available to the system administrators for only a few days, we got started compiling the code and running basic tests. Because the ARM cluster is a standard GNU/Linux system, this was not really harder than getting everything to run on a standard x86 cluster. It mainly boiled down to finding the locations of the libraries matching the compiler and MPI implementation. Issues with missing libraries could be resolved quickly with the administrators sitting at the next table. After that the code ran fine using GCC/GFortran and we could start the first tests.
ARM provides its own compiler based on LLVM; the front ends for C and Fortran are called clang and flang, respectively. We wanted to see how the ARM compiler would perform compared to the GNU compiler, with which we have much more experience. It was quite uplifting to see that our strict adherence to the language standards meant that no source-level changes were needed; porting just meant finding the right equivalents of the various compiler flags and integrating them into our build system. As a result, we also had the ARM compiler working well by the second day.
The other groups reported a slight performance increase when using the ARM compiler instead of GCC. We saw the opposite trend with SLH, which led us to investigate. SLH being the only Fortran code in the hackathon, we decided to use the ARM profiling tool Forge to find out which sections of the code were not being optimised properly. With the help of the ARM software engineers we managed to get the tool working and to see detailed timings of different code sections, without changing anything in the source. In the end the decrease in performance was caused by using pointer arrays instead of allocatable arrays inside a derived type. Thus a simple change gave a speed-up of about 40%. We had never observed this with GFortran or the Intel compiler on other machines, and one of the ARM engineers is reporting it back to the flang developers.
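For readers not familiar with the distinction, the change amounted to something like the following sketch (with made-up names, not the actual SLH data structures); in Fortran a pointer component may alias other data and need not be contiguous, so compilers tend to generate more conservative code for loops over it, while an allocatable component carries no such baggage:

    ! Minimal sketch, not SLH's actual data structures: replacing a pointer
    ! array component by an allocatable one inside a derived type.
    module grid_mod
      implicit none
      type :: grid_t
         ! real(8), pointer     :: rho(:)   ! old version: noticeably slower with flang
         real(8), allocatable :: rho(:)     ! new version: ~40% faster overall run time
      end type grid_t
    end module grid_mod

Since allocate statements work on both kinds of components, such a change can often stay confined to the type definitions.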
The other technology to be tried out was Mellanox BlueField, which allows the user to directly access the computing power inside the InfiniBand network infrastructure. The current software interface was too low-level for us to use directly in SLH, but we had some good discussions with the Mellanox engineers on how it could be useful for us in the future, namely for moving part of the reduction operations in our linear solvers into the network.
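To make the idea concrete: the reduction operations in an iterative linear solver are mainly global dot products, which at the MPI level boil down to calls like the one in this illustrative fragment (not actual SLH code); it is exactly this kind of small collective that in-network processing could take over:

    ! Illustrative fragment, not SLH code: the global dot product inside an
    ! iterative linear solver, i.e. the reduction that could in principle be
    ! offloaded to the network.
    subroutine global_dot(x, y, result, comm)
      use mpi
      implicit none
      real(8), intent(in)  :: x(:), y(:)
      real(8), intent(out) :: result
      integer, intent(in)  :: comm
      real(8) :: local
      integer :: ierr
      local = dot_product(x, y)                    ! node-local part of the dot product
      call MPI_Allreduce(local, result, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, comm, ierr)      ! global sum over all MPI ranks
    end subroutine global_dot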
We also did some direct speed comparisons of Intel and ARM CPUs running the same problem. We found that the individual ARM cores were slower than their Intel counterparts, but the larger number of cores per node outweighed this; in the end the run time per node was basically identical. Thus we conclude that the ARM architecture is definitely competitive for our kind of simulations, and we are glad we had the time and support to make sure SLH will run on ARM systems that might become available in the future.
We are grateful to the organisers for giving us this great opportunity to test new technology. The event at Leicester College Court also allowed ample time to network with both DiRAC staff and other users.