Lots of projects need lots of cores
The DIaL benchmarking team consisted of all stakeholders, including scientists, research software engineers, system architects and vendors. Both strong and weak scaling were considered during the benchmarking process. As Trove, SphNG and Ramses are the dominant codes on the DIaL system, they were given prominence in the benchmarking process.
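As a reminder of what the two scaling measures mean, the following is a minimal sketch using the usual textbook definitions: strong scaling keeps the total problem size fixed as cores are added, while weak scaling keeps the work per core fixed. The function names and the timings are illustrative assumptions and are not DIaL benchmark results.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Strong scaling: total problem size is fixed.

    Ideally the runtime falls in proportion to the core count, so
    efficiency is the measured speedup divided by the ideal speedup.
    """
    speedup = t_base / t_n
    ideal_speedup = n / n_base
    return speedup / ideal_speedup


def weak_scaling_efficiency(t_base, t_n):
    """Weak scaling: work per core is fixed.

    Ideally the runtime stays constant as cores (and total work) grow.
    """
    return t_base / t_n


# Made-up timings in seconds, for illustration only (NOT measured on DIaL).
strong = strong_scaling_efficiency(t_base=1000.0, n_base=128, t_n=150.0, n=1024)
weak = weak_scaling_efficiency(t_base=100.0, t_n=125.0)
print(f"strong scaling efficiency: {strong:.2f}")
print(f"weak scaling efficiency:   {weak:.2f}")
```

An efficiency close to 1.0 indicates that the code is making good use of the additional cores.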
Great emphasis was placed on the maths kernel libraries, as they figure heavily in Trove and other research codes. Alternative maths libraries such as AMD’s AOCL and HPE’s CSML were compared with Intel’s MKL, which has been dominant in this area for a number of years. During benchmarking AOCL showed encouraging results. CSML did likewise, but only at large core counts. Users are encouraged to investigate these libraries with their own codes.
Overall, AMD Rome gave the best performance for the Ramses and Trove benchmarks. The results for the SphNG benchmark showed little difference between AMD Rome and Intel Ice Lake. The benchmarking made clear that code performance was very close in a core-for-core comparison, but at node level AMD offered more cores with no additional infrastructure.
Performance/cost was a major factor for all the DiRAC systems, and this was particularly evident with DIaL. Because benchmarking highlighted some memory-bound applications, which gain little from a higher clock speed, it was decided that there was insufficient justification for the slightly faster CPU, so the cheaper 2.25GHz 7742 was selected.
A design compromise was agreed: a reduction in interconnect bandwidth enabled more compute resources. DIaL has a 3:1 blocking interconnect. Users running codes with heavy interprocess communication are advised to limit their jobs to 30 nodes (3,840 cores) and to submit them to nodes on the same switch, which mitigates the effect of the blocking.
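The 30-node guideline can be checked with simple arithmetic before submitting a job. The sketch below uses the figures given in the text (128 cores per node, 30 nodes per switch); the function names are hypothetical, and how a job is actually pinned to a single switch depends on the scheduler configuration, which is not covered here.

```python
import math

CORES_PER_NODE = 128       # from the node spec: 2 CPUs with 64 cores each
MAX_NODES_PER_SWITCH = 30  # from the text: 30 nodes (3,840 cores)


def nodes_needed(cores):
    """Number of whole nodes required to supply the requested cores."""
    return math.ceil(cores / CORES_PER_NODE)


def fits_on_one_switch(cores):
    """True if the job stays within the 30-node single-switch guideline."""
    return nodes_needed(cores) <= MAX_NODES_PER_SWITCH


for requested in (1024, 3840, 3841):
    print(requested, nodes_needed(requested), fits_on_one_switch(requested))
```

A job that needs even one core beyond 3,840 spills onto a 31st node and so crosses the blocking part of the network.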
The DIaL system spec has:
25,600 AMD cores running at 2.25/3.4GHz (base/boost)
102TB of system memory
200Gbps HDR IB 3:1 blocking interconnect
4TB file space.
Each of the 200 nodes has:
2 × AMD EPYC Rome 7742 CPUs, each with 64 cores, giving 128 cores per node running at 2.25/3.4GHz
512GB of system memory, giving 3.9GB per CPU core.
200Gbps HDR IB interconnect
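The system totals follow directly from the per-node figures. The short sketch below reproduces that arithmetic, assuming decimal units (1TB = 1,000GB); the variable names are illustrative. The nominal memory per core comes out at 4.0GB, slightly above the 3.9GB quoted above, which presumably allows for memory that is not available to jobs (an assumption, as the source does not say).

```python
NODES = 200
CPUS_PER_NODE = 2
CORES_PER_CPU = 64
MEM_PER_NODE_GB = 512

cores_per_node = CPUS_PER_NODE * CORES_PER_CPU   # cores in one node
total_cores = NODES * cores_per_node             # cores in the whole system
total_mem_tb = NODES * MEM_PER_NODE_GB / 1000    # assumes 1TB = 1,000GB
nominal_mem_per_core_gb = MEM_PER_NODE_GB / cores_per_node  # before any overhead

print(f"cores per node:  {cores_per_node}")
print(f"total cores:     {total_cores:,}")
print(f"total memory:    {total_mem_tb:.1f}TB")
print(f"memory per core: {nominal_mem_per_core_gb:.1f}GB (nominal)")
```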