Background and Motivations
AREPO is a cosmological hydrodynamics solver used widely in the astrophysics community for numerically simulating galaxy formation and evolution.
The part of the code that we wanted to accelerate with GPU modifications was a module that creates visualisations of AREPO simulation data, either on-the-fly or in post-processing. This module works by ray-tracing through the Voronoi cells to create projected maps of the quantities of interest, e.g. gas column density or metallicity.
The AREPO code, including this visualisation module, is written in C and is parallelised to run on distributed CPU hardware using MPI. The visualisation module is just under 5000 lines of code, and makes use of the base AREPO libraries that handle mesh calculations and domain decomposition.
Since this projection process is inherently massively parallel, we envisioned modifying the code to perform the ray-tracing step on GPUs. Preparation of the mesh would be done on CPUs, after which it could be passed on to the GPUs.
Since each ray projection is independent of the others, and the projection calculation is essentially just addition across the cells, we believed this process had great scope for GPU acceleration.
The original visualisation software scales poorly with the number of CPU processes; partly this is because the ray-tracing calculation inherits the domain decomposition from the main codebase (which is optimised for the hydrodynamics solver, not for ray-tracing).
The AREPO code is used extensively throughout the astrophysics community. For example, it is the workhorse for the successful Illustris simulation suite and, more recently, its successor IllustrisTNG. Although currently closed source, an open source version is under construction. In addition to the large community of primary code users, there also exists a larger community of astronomers who use the Illustris simulation snapshots for a variety of purposes, e.g., creating mock catalogues for observations. We believed that both of these communities would benefit from a GPU-accelerated visualisation module. Furthermore, although we entered the hackathon to work on the AREPO visualisation module, we believe that this GPU acceleration is conceptually applicable to many other hydrodynamics codes (e.g. GIZMO; another Voronoi mesh solver).
All of us work in the numerical simulations group at the IoA, making extensive use of DiRAC HPC facilities. Prior to the hackathon we had general experience in CPU programming, including on distributed and shared memory applications using MPI and OpenMP, but limited experience with GPU programming. Alongside this experience, Sophie and Nick had used AREPO throughout their PhD work to simulate galaxy and galaxy cluster formation, and therefore had a good understanding of the wider codebase and how to utilise it effectively. Our team hoped to acquire both a strong conceptual and practical understanding of GPU programming at the hackathon. This training could then allow us to further modify the other codes that we use for our research to improve performance with GPUs.
The main algorithm of the visualisation module in AREPO is a ray-tracing algorithm. Firstly, the simulation volume is divided into cells of varying sizes that make up the Voronoi mesh. Then, a domain decomposition is performed, distributing these cells across CPU tasks according to spatial position, following a Peano-Hilbert curve approach. ‘Rays’ are then projected through the simulation volume, with each ray summing up the relevant physical quantity, e.g. gas density, as it travels across the cells.
At the start of the hackathon we hoped that with an appropriate change of domain decomposition we could utilise GPU hardware to accelerate this additive projection, whilst minimising the need for message-passing (which currently creates sufficient overhead that the scaling of the code is sub-optimal). However given the time constraints of the hackathon, we decided to focus just on converting the ray-tracing step to run on the GPU, and bypassed the domain decomposition problem by running AREPO on a single CPU core. This limits the applicability of our accelerated code to simulations that are small enough to fit into the memory of a single CPU node, however we believe that in future this domain decomposition issue should be straightforward to solve.
This original projection algorithm, which we wanted to run serially on one CPU core, can be written in the following pseudocode:
During the hackathon we focused our efforts on parallelising this ray-tracing procedure using OpenACC, so that each GPU thread carries a ray, as in the following pseudocode:
OpenACC is a “user-driven directive-based performance-portable parallel programming model designed for scientists and engineers interested in porting their codes to a wide-variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model.” Our implementation of the GPU acceleration in AREPO involved indicating to the compiler which parts of the calculation could be performed on the GPU (e.g.
# pragma acc loop), and ensuring that the appropriate data was available on the GPU (e.g.
# pragma acc copy(data)).
The GPU-accelerated projection routine achieved a ~10x speedup compared with the original CPU code run on one core.
Note: the multi-core CPU run takes longer than the single-core because the simulation domain decomposition is optimised for hydrodynamics and not for ray-tracing.