DiRAC has procured three new systems across our four sites: the new Extreme Scaling system at Edinburgh, Data Intensive at Cambridge and Leicester, and Memory Intensive at Durham.
Data Intensive System (DIaC, DIaL)
This service is a general-purpose system for codes that are neither CPU-centric nor memory-centric. It is spread across two sites, Cambridge (DIaC) and Leicester (DIaL). Because each system has a different mix of workloads, the two systems went through separate procurement processes.
The newest contribution to Cambridge’s growing CSD3 is DiRAC’s DIaC system. GRCHOMBO, MILC and AREPO are the dominant codes for DIaC, and the system was required to run all three. These codes were given equal weighting within the procurement process.
The DIaL benchmarking team consisted of all stakeholders, including scientists, research software engineers, system architects and vendors. Both strong and weak scaling were considered during benchmarking. As Trove, SphNG and Ramses are the dominant codes on the DIaL system, they were given prominence in the benchmarking process.
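As a reminder of what the two scaling measures mean, here is a minimal sketch of how strong- and weak-scaling efficiency are commonly computed. The function names and all timings are hypothetical and invented purely for illustration; they are not DIaL benchmark results.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Fixed total problem size: ideally runtime falls in proportion to n."""
    speedup = t_base / t_n
    ideal_speedup = n / n_base
    return speedup / ideal_speedup


def weak_scaling_efficiency(t_base, t_n):
    """Problem size grows with n: ideally runtime stays constant."""
    return t_base / t_n


# Hypothetical runtimes in seconds (illustrative only, not measured data)
strong = strong_scaling_efficiency(t_base=100.0, n_base=1, t_n=30.0, n=4)
weak = weak_scaling_efficiency(t_base=100.0, t_n=110.0)

print(f"strong-scaling efficiency on 4 nodes: {strong:.2f}")
print(f"weak-scaling efficiency on 4 nodes:   {weak:.2f}")
```

An efficiency of 1.0 corresponds to ideal scaling in either case; values below 1.0 show the cost of communication and other overheads as the node count grows.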
Great emphasis was put on the maths kernel libraries, as they figure largely in Trove and other research codes. Alternative maths libraries such as AMD’s AOCL and HPE’s CSML were compared with Intel’s MKL, which has been dominant in this area for a number of years. During benchmarking AOCL showed encouraging results, as did CSML, though only at large core counts. Users are encouraged to investigate these libraries with their own codes.
Overall, AMD Rome gave the best performance for the Ramses and Trove benchmarks. The results for the SphNG benchmark showed little difference between AMD Rome and Intel Ice Lake. It was clear from the benchmarking that code performance was very close in a core-for-core comparison, but at node level AMD offered more cores with no additional infrastructure.
Performance per cost was a major factor for all the DiRAC systems, and was particularly evident with DIaL. Because benchmarking highlighted some memory-bound applications, it was decided that the slightly faster CPU could not be justified, so the cheaper 2.25GHz 7742 was selected.
A design compromise was agreed: reducing the interconnect freed budget for more compute resources. DIaL therefore has a 3:1 blocking interconnect. For codes with high interprocess transfers, users can mitigate this by limiting their jobs to 30 nodes (3,840 cores) and submitting them to nodes on the same switch.
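The 30-node guidance can be turned into a quick pre-submission check. The sketch below assumes only the figures given in this article (128 cores per node and a 30-node limit); the helper names are hypothetical and are not part of any DIaL tooling.

```python
import math

CORES_PER_NODE = 128          # 2 x 64-core CPUs per DIaL node
MAX_SWITCH_LOCAL_NODES = 30   # recommended limit for communication-heavy codes


def nodes_needed(cores):
    """Whole nodes required to supply the requested number of cores."""
    return math.ceil(cores / CORES_PER_NODE)


def within_recommended_limit(cores):
    """True if the job stays within the recommended 30-node limit."""
    return nodes_needed(cores) <= MAX_SWITCH_LOCAL_NODES


for cores in (1024, 3840, 4096):
    print(cores, nodes_needed(cores), within_recommended_limit(cores))
```

A job of exactly 3,840 cores fills 30 nodes and stays within the limit; anything larger spills onto a 31st node and may cross the blocking part of the fabric.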
The DIaL system spec has:
- 25,600 AMD cores running at 2.25/3.4GHz
- 102TB of system memory
- 4TB file space
- 200Gbps HDR IB 3:1 blocking interconnect
Each of the 200 nodes has:
- 2 × AMD EPYC Rome 7742 CPUs, each with 64 cores, giving 128 cores per node running at 2.25/3.4GHz
- 512GB of system memory, giving 3.9GB per CPU core
- 200Gbps HDR IB interconnect
- Running CentOS7
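The system-wide totals follow directly from the per-node figures. The short sketch below reproduces that arithmetic using only the numbers listed above; the variable names are illustrative. The nominal memory per core comes out at 4.0GB, slightly above the 3.9GB quoted, which may reflect memory that is not available to user jobs (that reading is an assumption).

```python
NODES = 200
CPUS_PER_NODE = 2
CORES_PER_CPU = 64
MEM_PER_NODE_GB = 512

cores_per_node = CPUS_PER_NODE * CORES_PER_CPU     # cores in one node
total_cores = NODES * cores_per_node               # cores in the whole system
total_mem_tb = NODES * MEM_PER_NODE_GB / 1000      # decimal TB
mem_per_core_gb = MEM_PER_NODE_GB / cores_per_node # nominal, before any overhead

print(f"cores per node:  {cores_per_node}")
print(f"total cores:     {total_cores}")
print(f"total memory:    {total_mem_tb:.1f} TB")
print(f"memory per core: {mem_per_core_gb:.1f} GB (nominal)")
```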
Detailed information about compilation and job submission can be found here.
Extreme Scaling System (ES)
Based in Edinburgh and locally named ‘Tursa’, this system is used predominantly by the GRID team. The service is aimed at CPU-intensive codes with a relatively small data footprint per core but high data transfer.
In the past, the Extreme Scaling service has always looked for innovative ways to satisfy its need for large numbers of CPUs. These included the BlueGene system (2012) and then relatively low-cost Skylake CPUs arranged in a mesh configuration (2018).