More cores, more science
Based in Edinburgh and locally named ‘Tursa’, this system is used predominantly by the Grid team. The service aims to support computationally intensive codes with a relatively small data footprint per core but with high data-transfer demands.
In the past the Extreme Scaling service has always looked for innovative ways to satisfy its users’ need for large numbers of CPUs. This included the BlueGene system (2012) and later the adoption of relatively low-cost Skylake CPUs arranged in a mesh configuration (2018).
Since the first DiRAC Hackathon in 2018, the Grid team has been investigating the possibility of using GPUs with the Grid code. This has resulted in impressive improvements in performance and was a major factor in the Extreme Scaling system moving to a GPU-based service.
Code performance was not the only factor. In the present climate, building strong partnerships is key to success now and in the future. Added value such as internships for students and postdocs and the supervision of joint PhDs was also considered, as were the environmental credentials of any future system, such as the Smart Energy Management Suite from ATOS.
The winning bid from ATOS consisted of AMD-based servers, each with 4 × NVIDIA A100 GPU cards, connected by a Mellanox HDR network. This is backed by a 4PB DDN EXAScaler file system and an 8PB tape system.
The ES system specification is:
CPU: 14,592 AMD CPU cores running at 2.6/3.3GHz.
MEMORY: 114TB of system memory.
INTERCONNECT: 200Gbps HDR InfiniBand non-blocking interconnect.
DATA: 4PB DDN EXAScaler file system with an 8PB tape system.
To support the required workloads, each of the 112 nodes has the following (a short sketch of the per-node arithmetic follows the list):
CPU: 2 × AMD EPYC Rome 7H12 CPUs, each with 64 cores, giving 128 cores per node running at 2.6/3.3GHz.
MEMORY: 1TB of system memory, giving 7.8GB per CPU core.
INTERCONNECT: the GPU cards are connected via NVLink, giving a transfer speed between cards of 4,800Gbps.
GPU: 4 × NVIDIA A100 GPU cards, each with 6,912 FP32 CUDA cores, 40GB of on-board memory, and 432 tensor cores running at 765/1410MHz, giving 27,648 CUDA cores and 160GB of GPU memory per node.
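The headline per-node figures above follow directly from the component counts. The short Python sketch below is purely illustrative: every input is taken from the list above (with the NVLink figure expressed as 600 GB/s, the byte equivalent of the quoted 4,800Gbps), and nothing here is a measured value.

# Illustrative arithmetic for the per-node Tursa figures quoted above.
# All inputs come from the hardware list; none are measurements.

cpus_per_node = 2             # AMD EPYC Rome 7H12
cores_per_cpu = 64
mem_per_node_gb = 1000        # 1TB of system memory (decimal GB)

gpus_per_node = 4             # NVIDIA A100
cuda_cores_per_gpu = 6912     # FP32 CUDA cores per card
gpu_mem_gb = 40               # on-board memory per card

nvlink_gbytes_per_s = 600     # card-to-card NVLink bandwidth (GB/s)

cores_per_node = cpus_per_node * cores_per_cpu             # 128 cores
mem_per_core_gb = mem_per_node_gb / cores_per_node          # ~7.8 GB per core
cuda_cores_per_node = gpus_per_node * cuda_cores_per_gpu    # 27,648 CUDA cores
gpu_mem_per_node_gb = gpus_per_node * gpu_mem_gb            # 160 GB of GPU memory
nvlink_gbps = nvlink_gbytes_per_s * 8                       # 4,800 Gbps

print(cores_per_node, round(mem_per_core_gb, 1),
      cuda_cores_per_node, gpu_mem_per_node_gb, nvlink_gbps)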