Led by Prof. Peter Boyle

This is an established DiRAC project: a Lattice QCD code comprising a C++14 data-parallel template engine layer and physics code. The library is the current Plan of Record for the USQCD DOE Exascale Computing Project for cross-platform performance portability at the exascale. It is around 100k lines of code.

For more information see:



Under the hood, it provides an optimised cross-platform SIMD class interface that maps to intrinsics on multicore CPUs (x86: SSE4, AVX, AVX2, AVX512 + Zen AVXFMA. Power: QPX. ARM: Neon, SVE). The initial target was MPI + OpenMP (threads) + abstracted SIMD intrinsics. It implements Domain Wall Fermions, Wilson/Clover Fermions and Staggered Fermions.
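The idea of an abstracted SIMD class can be sketched in a few lines of C++. This is only an illustration under stated assumptions: SimdSketch, splat, axpy and reduce are hypothetical names invented here, not Grid's actual classes, and the lane loops stand in for the intrinsics that a real backend would select at compile time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical illustration only: SimdSketch is NOT Grid's actual class.
// A real backend would replace the lane loops below with intrinsics
// (for example AVX2, AVX512 or NEON) chosen at compile time.
template <typename T, std::size_t N>
struct SimdSketch {
  std::array<T, N> lane{};

  // Broadcast one scalar into every lane.
  static SimdSketch splat(T x) {
    SimdSketch r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = x;
    return r;
  }

  // Bounds-checked lane access.
  T& operator[](std::size_t i) { assert(i < N); return lane[i]; }
  const T& operator[](std::size_t i) const { assert(i < N); return lane[i]; }

  // Lane-wise addition.
  friend SimdSketch operator+(const SimdSketch& a, const SimdSketch& b) {
    SimdSketch r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
  }

  // Lane-wise multiplication.
  friend SimdSketch operator*(const SimdSketch& a, const SimdSketch& b) {
    SimdSketch r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
  }
};

// Higher-level code is written once against the abstract vector type.
// Returns a*x + y, lane by lane.
template <typename V>
V axpy(const V& a, const V& x, const V& y) {
  return a * x + y;
}

// Horizontal sum over all lanes of one vector.
template <typename T, std::size_t N>
T reduce(const SimdSketch<T, N>& v) {
  T s = T(0);
  for (std::size_t i = 0; i < N; ++i) s = s + v[i];
  return s;
}
```

With a four-lane double vector, axpy(V::splat(2.0), x, y) doubles each lane of x and adds y. Only the struct would need a per-architecture specialisation, while axpy and any code built on top of it stay unchanged.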

Substantial work has been performed to use device lambdas to offload data-parallel expressions and high-performance sparse matrix code to GPU devices. As of July 2018 it obtains around 1.4 – 1.9 TF/s on a single Volta board on Summit, around 90% of the performance of Nvidia’s QUDA library, with very little GPU-specific code. It uses UVM to avoid explicit data management.

We need to port this code to use NVLink between GPUs in a single system and GPUDirect RDMA between nodes. Communication will be key for this code.

It is licensed under GPL v2.0. The user community spans the US, UK, Japan and Germany, with perhaps as many as 100 – 200 scientists and growing. It has strong connections to the ECP and Pathforward projects in the US.

The Team

Led by Prof. Peter Boyle, principal developer of Grid and technical director of DiRAC. He is very familiar with the code, having written about 80% of it. He designed the BlueGene/Q memory prefetch engine (i.e. the VHDL source) for IBM Research. He worked on the QCDOC system-on-a-chip supercomputer design with Columbia University and IBM Research (pre-silicon verification, writing the O/S and high performance code, and performing all hardware debug). He is presently engaged in the DOE/Intel Pathforward programme.

Jonas Glesaaen, a Swansea PhD student and interested potential user; Michael Marshal, an Edinburgh PhD student who will be developing Grid code in Edinburgh; and one other UK scientist will join the team.


During the hackathon, the team was expected to port the shared memory regions of the code, presently used to communicate between MPI ranks on the same node, to use NVLink between GPUs in a single system and GPUDirect RDMA between nodes. Communication will be key for this code.

The Process

Structured grid PDE solvers. Cartesian distributed arrays and finite difference operators. Iterative Krylov inverters, Conjugate Gradients. Multigrid. Markov Chain Monte Carlo.
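To make the combination of a finite difference operator and a Krylov inverter concrete, here is a minimal sketch of conjugate gradients applied to a matrix-free operator on a periodic one-dimensional grid. It is a textbook illustration, not Grid's solver: the names Field, apply_op, dot and conjugate_gradient, and the mass-like shift m2, are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Field = std::vector<double>;

// Matrix-free finite difference operator on a periodic 1D grid:
// y[i] = (2 + m2) * x[i] - x[i-1] - x[i+1].
// It is symmetric positive definite when m2 > 0.
void apply_op(const Field& x, Field& y, double m2) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    const double left = x[(i + n - 1) % n];
    const double right = x[(i + 1) % n];
    y[i] = (2.0 + m2) * x[i] - left - right;
  }
}

// Inner product of two fields of equal length.
double dot(const Field& a, const Field& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Conjugate gradients for A x = b, with A given by apply_op.
// x holds the initial guess on entry and the solution on exit.
// Stops when |r| <= tol * |b|; returns the number of iterations used.
int conjugate_gradient(const Field& b, Field& x, double m2,
                       double tol, int max_iter) {
  const std::size_t n = b.size();
  assert(x.size() == n);
  Field r(n), p(n), Ap(n);

  apply_op(x, Ap, m2);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
  p = r;

  double rr = dot(r, r);
  const double stop = tol * tol * dot(b, b);

  for (int k = 0; k < max_iter; ++k) {
    if (rr <= stop) return k;
    apply_op(p, Ap, m2);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += alpha * p[i];   // advance the solution along p
      r[i] -= alpha * Ap[i];  // update the residual recursively
    }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
  return max_iter;
}
```

The solver only ever touches the operator through apply_op, which is why the same iteration carries over to distributed Cartesian arrays: only the stencil and the inner product need to know about the data layout.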


The team successfully implemented summation across GPU threads (formerly host only) and looked at Nvidia Thrust reductions, assessing whether these were reproducible. They implemented the first cut in lib/lattice/Lattice_reduction.h
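Reproducibility is a concern because floating point addition is not associative, so a sum whose order depends on the number of threads can change in the last bits from run to run. The sketch below shows one generic way to avoid this, by fixing the chunk boundaries and the combination tree independently of the worker count. It is not the code in Lattice_reduction.h, whose approach is not described here; pairwise_sum and reproducible_sum are hypothetical names, and the workers are simulated serially on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum v[lo, hi) by always splitting at the midpoint, so the order of
// additions depends only on the range, never on scheduling.
double pairwise_sum(const std::vector<double>& v,
                    std::size_t lo, std::size_t hi) {
  if (hi == lo) return 0.0;
  if (hi - lo == 1) return v[lo];
  const std::size_t mid = lo + (hi - lo) / 2;
  return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

// Hypothetical sketch: chunk boundaries are fixed by chunk_size alone.
// num_workers only decides which worker computes each chunk; the workers
// are simulated serially here instead of running as GPU or CPU threads.
double reproducible_sum(const std::vector<double>& data,
                        std::size_t chunk_size, std::size_t num_workers) {
  assert(chunk_size > 0 && num_workers > 0);
  const std::size_t n = data.size();
  const std::size_t num_chunks = (n + chunk_size - 1) / chunk_size;
  std::vector<double> partial(num_chunks, 0.0);

  for (std::size_t w = 0; w < num_workers; ++w) {
    // Worker w handles chunks w, w + num_workers, w + 2 * num_workers, ...
    for (std::size_t c = w; c < num_chunks; c += num_workers) {
      const std::size_t lo = c * chunk_size;
      const std::size_t hi = std::min(n, lo + chunk_size);
      partial[c] = pairwise_sum(data, lo, hi);
    }
  }

  // Combine the per-chunk results in a fixed tree order.
  return pairwise_sum(partial, 0, num_chunks);
}
```

Because every partial sum is stored by chunk index and combined in the same tree, calling reproducible_sum with 1, 7 or 64 workers returns bitwise identical results for the same data and chunk size. The cost is a temporary array of partial sums and a chunk size that must be held constant across runs.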