SWIFT Team
Josh Borrow (JB), Alexei Borrisov (AB), Matthieu Schaller (MS)
STFC Colleagues
James Clark (JC), Aidan Chalk (AC)

Summary

The ARM Hackathon at Leicester was very helpful, and we were able to fully build and test the SWIFT code on the platform during the week. The assistance and knowledge of our colleagues from the STFC Hartree Centre was invaluable throughout the four days. Work on hand-written NEON intrinsics for the core routines in the code, tailored specifically for the ARM ThunderX2 platform, has started and is now in the testing phase. SWIFT shows good strong-scaling performance on the ARM system.

The whole team had a very positive experience at the ARM Hackathon. It was incredibly useful to have the ARM team there to immediately assist us with issues that arose throughout the few days. We are looking forward to running a production science simulation on the Leicester machine if at all possible.

Test Case

In the following, we consider the EAGLE_25 test case (available in the SWIFT repository) with 376^3 (53M) particles. This includes 53M hydrodynamics particles, with 106M gravity particles. All runs presented below (unless otherwise stated) use hydrodynamics (with the Gadget-2 cross-compatibility scheme), cosmology, and gravity. We chose to run 1024 steps, as this includes a reasonable number (~10) of long steps and many short steps. These steps completed in approximately 15 minutes on a single node of the ARM system.
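As a quick check on the particle counts quoted above, the arithmetic can be reproduced in the shell. This is only an illustrative sketch: it assumes the gravity particle count is exactly twice the hydrodynamics count (which is consistent with the 53M and 106M figures), and the variable names are ours rather than anything from SWIFT.

```shell
# Particle counts for the EAGLE_25 test case (illustrative arithmetic only).
side=376                           # particles per side of the cube
n_hydro=$((side * side * side))    # 376^3 hydrodynamics particles
n_gravity=$((2 * n_hydro))         # assumption: twice the hydro count
echo "hydro:   $n_hydro"           # 53157376, i.e. ~53M
echo "gravity: $n_gravity"         # 106314752, i.e. ~106M
```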

Initial Performance

The plot below shows the initial performance on one single node with the non-MPI version of SWIFT.


The cumulative time spent is shown (i.e. lower is better) for three combinations. ARM clang (v19.0.2; henceforth we drop the version number) is shown in two runs: the first without the amath library and the second with the ARM performance library, added once JC had profiled the code and found that 60% of the runtime was spent in expf. We saw a significant improvement in time-to-solution with this library. GCC 8.2.0 gave slower performance, so from this point onward we used only the ARM clang compiler. These initial runs were completed without cosmology as the GSL was not yet built for the system.

Compilers and Allocators

Once some initial performance testing had been performed, we moved on to testing allocators. On Intel systems, parallel allocators and thread pinning give us a significant performance increase. By this point we had built the GSL on the ARM system, so the runs below include cosmology. All runs were completed with ARM clang and with the amath library included.


Just as with the Intel systems, tbbmalloc gives us an improved time-to-solution over the standard code and over tcmalloc. Pinning also gives a significant improvement, especially with tbbmalloc. Comparing these with the Skylake system (COSMA-7; Durham MI DiRAC), and the Sandy Bridge systems (COSMA-5; Durham, ex-DiRAC) below, we see that the ARM system in Leicester sits somewhere between the two. Again, it is worth noting that the runs below were completed without MPI using the non-MPI binary of SWIFT.


It is worth noting that both of the Intel machines shown above do use our hand-written intrinsics for vectorising the hydrodynamics calculations. Work is underway (by AC and AB) to port these intrinsics over to the NEON instruction set.

Performance over MPI

SWIFT benefits from being run in MPI mode using one rank per socket on some machines. Below we show comparisons to the two Durham systems (using the Intel compiler and Intel MPI, from 2018), and the Catalyst system (ARM clang, OpenMPI 4.0.0). The faint dotted lines below show the performance when using the non-MPI version of SWIFT.

The ARM system sees a significant benefit from moving to two MPI ranks per node (one per socket), and scales over two nodes similarly to the Skylake system. For this configuration, two nodes of the ARM system are required to recover the performance of a single Skylake node.

SMT Modes

To determine the best SMT mode to run in, JC ran a number of tests on various nodes that were booted in different SMT modes by the system administrators. All runs below use ARM clang, and two ranks per node (i.e. one per socket) with OpenMPI.

These results revealed a sweet spot at 2 SMT threads per core, as did the runs of other codes on the system. They also highlight the overhead of running in single-threaded mode on nodes booted in SMT2 or SMT4, which other users should be aware of. The small gain that we see from the extra threads may not be worth it; the system will be more usable when booted in SMT1 mode for e.g. MPI-only codes.
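Since the SMT mode a node was booted in affects single-threaded runs, it may be worth checking it before launching a job. The sketch below shows one way to do this, assuming the standard Linux `lscpu` utility is available on the node. The helper name `threads_per_core` is hypothetical and is not part of SWIFT or of the system tooling.

```shell
# Hypothetical helper: extract the "Thread(s) per core" value from lscpu
# output supplied on standard input (1 = SMT1, 2 = SMT2, 4 = SMT4).
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage on a compute node (requires lscpu from util-linux):
#   lscpu | threads_per_core
```

Reading from standard input rather than calling `lscpu` inside the function keeps the helper usable on saved `lscpu` output from other nodes.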

Scaling Results

Once preliminary performance testing was complete, we ran a small (single-node) scaling study on the system. Two ranks per node were used for both systems.

The 56-core result for the Skylake system used two 28-core nodes. Here we see that the ARM system scales exceptionally well, retaining a parallel efficiency of 0.9 at 8 threads. It is worth noting that a production simulation would be run at a load per thread similar to that of the 8-thread case here. The Skylake result also includes the hand-written intrinsics for the Intel platform, which is one possible reason why its time-to-solution is so strong.
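For reference, the parallel efficiency quoted above is taken here to mean the usual strong-scaling definition, E(N) = T(1) / (N x T(N)), where T(N) is the time-to-solution on N threads; this assumes the single-thread run is the baseline. A minimal sketch of the calculation follows. The helper name and the timings in the usage line are made up for illustration and are not measurements from these runs.

```shell
# Hypothetical helper: strong-scaling parallel efficiency.
#   $1 = time on one thread, $2 = time on N threads, $3 = N
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.2f\n", t1 / (n * tn) }'
}

# Made-up timings: 800 s on 1 thread and 111 s on 8 threads.
parallel_efficiency 800 111 8    # prints 0.90
```

An efficiency of 1.00 would correspond to perfect scaling, i.e. the 8-thread run taking exactly one eighth of the single-thread time.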

Porting Vector Code

Significant progress was made in porting the hand-written vector intrinsics (see vector.h in SWIFT) to work on ARM with NEON. We are currently in the process of writing unit tests for these.

Documentation

Build flags that were used over the week by AC:

build-clang
$ ../configure --disable-hand-vec --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl"

build-clang-nothing
$ ../configure --disable-hand-vec --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install CC=armclang

build-clang-tbbmalloc
$ ../configure --disable-hand-vec --enable-debug --with-arm-fftw=/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_arm-hpc-compiler_19.0_aarch64-linux/lib CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl" --with-gsl=/home/dc-chalk/gsl-install --with-tbbmalloc=/home/dc-clark/tbb-2019_U3/build/linux_aarch64_gcc_cc4.8_libc2.22_kernel4.4.156_release

build-clang-tcmalloc
$ ../configure --disable-hand-vec --with-tcmalloc --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl"

build-clang-tcmalloc-phdf5 #Not yet working
$ ../configure --with-hdf5=/home/dc-chalk/hdf5-install/bin/h5pcc --disable-hand-vec --enable-debug --with-arm-fftw=/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_arm-hpc-compiler_19.0_aarch64-linux/lib CC=armclang LDFLAGS="-mcpu=native -armpl" CFLAGS="-mcpu=native -armpl" --with-gsl=/home/dc-chalk/gsl-install --with-tbbmalloc=/home/dc-clark/tbb-2019_U3/build/linux_aarch64_gcc_cc4.8_libc2.22_kernel4.4.156_release CC=armclang

build-gcc
$ ../configure --disable-hand-vec --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install LDFLAGS="-mcpu=native -L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm" CFLAGS="-mcpu=native -I/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/include -L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm"

build-gcc-tcmalloc
$ ../configure --disable-hand-vec --with-tcmalloc --enable-armv8-cntvct-el0 --with-arm-fftw --enable-debug --with-gsl=/home/dc-chalk/gsl-install LDFLAGS="-L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm" CFLAGS="-I/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/include -L/opt/arm/armpl-19.0.0_ThunderX2CN99_SUSE-12_gcc_8.2.0_aarch64-linux/lib -lamath -lm"

To run on the system with 2 MPI ranks per node:

mpirun -np 4 -npernode 2 --report-bindings --bind-to socket --map-by socket --mca btl ^openib