DATA INTENSIVE SERVICE
(CAMBRIDGE)

DATA INTENSIVE AT CAMBRIDGE (DIaC)

The Data Intensive service provides a set of hardware options for projects from across all DiRAC research domains, driving scientific discovery by delivering a step-change in our capability to handle large datasets, both to perform and analyse precision theoretical simulations and then to confront them with the next generation of observational and experimental data. Such diverse workflows are best supported by a heterogeneous mix of architectures delivered across two DiRAC sites: Cambridge and Leicester. 

System name: DIaC, part of CSD3 (Cambridge Service for Data-Driven Discovery)

Many DiRAC projects explore high-dimensional parameter spaces with statistical techniques, generating large numbers of computationally intensive models. GPU acceleration is also increasingly used, either to post-process simulation data or to run AI-driven models. The Cambridge service supports these workflows with a mix of CPU and GPU nodes sharing a common parallel file system, so that a single workflow can move between the two architectures.
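As an illustration of this mixed-architecture pattern, the sketch below chains a CPU simulation job to a GPU post-processing job through Slurm job dependencies, with both jobs working in the same directory on the shared parallel file system. It is a minimal sketch only: the partition names ("icelake", "ampere"), node counts and the run_simulation.sh / postprocess_gpu.sh scripts are illustrative placeholders rather than CSD3 defaults, and real jobs would also need a project account and time limit.

#!/usr/bin/env python3
"""Minimal sketch: chain a CPU simulation job and a GPU post-processing job
with Slurm dependencies, sharing one directory on the parallel file system.
Partition names, node counts and script names are illustrative placeholders."""
import subprocess

WORKDIR = "/path/to/shared/lustre/project"  # visible to both CPU and GPU nodes


def submit(*sbatch_args: str) -> str:
    """Submit a job with 'sbatch --parsable' and return its job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable", *sbatch_args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip().split(";")[0]


# Step 1: the CPU simulation writes its snapshots to the shared file system.
sim_id = submit(
    "--partition=icelake", "--nodes=4", "--ntasks-per-node=76",
    f"--chdir={WORKDIR}", "run_simulation.sh",
)

# Step 2: GPU post-processing starts only once the CPU job has finished
# successfully, reading the snapshots from the same Lustre directory.
post_id = submit(
    "--partition=ampere", "--nodes=1", "--gres=gpu:4",
    f"--dependency=afterok:{sim_id}", f"--chdir={WORKDIR}",
    "postprocess_gpu.sh",
)

print(f"submitted simulation {sim_id}, GPU post-processing {post_id}")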

The Cambridge system uses OpenStack for the deployment and presentation of services. Ongoing work with the UK-based SME StackHPC will enable DiRAC users to explore the potential benefits to their workflows, with the long-term goal of supporting workflows that need access to more than one DiRAC service to complete efficiently.

The benchmark codes used for the design and testing of the CSD3 system were: 

MILC – a particle physics QCD code, providing key results informing ongoing experiments at the precision frontier. The expensive step is calculating propagators for light quarks on gluon field backgrounds defined on large, fine space-time lattices. The output is stored for subsequent re-analysis, making I/O performance a key requirement for this work. 

Arepo – a code used for cosmological zoom simulations to explore the inner regions of galaxies at high resolutions. Some outputs from running Arepo on CPUs are later processed using GPUs to add the effects of radiation.  

GRChombo – an adaptive mesh refinement (AMR) numerical relativity code with applications ranging from early universe cosmology to black hole mergers producing observable gravitational wave signatures.
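A common thread in these benchmarks is that large simulation outputs are written once to the shared file system and read back later for analysis, which is what makes I/O performance a design requirement. The sketch below illustrates that store-then-reanalyse pattern in its simplest form; it is not the I/O layer of any of the codes above, and the array shape, dataset name and file path are placeholders. It assumes numpy and h5py are available.

"""Illustrative store-then-reanalyse pattern: write a large simulation output
to the shared Lustre file system once, then read back only the slices needed
for later analysis. Shapes, names and paths are placeholders."""
import numpy as np
import h5py

OUTPUT = "/path/to/shared/lustre/project/snapshot_0001.h5"  # hypothetical path

# Write phase: stream the field out slice by slice so each write is a large,
# contiguous request -- the access pattern parallel file systems handle best.
field = np.random.default_rng(0).standard_normal((64, 64, 64, 128))
with h5py.File(OUTPUT, "w") as f:
    dset = f.create_dataset(
        "field", shape=field.shape, dtype="f8",
        chunks=(64, 64, 64, 1),  # one slice per chunk
    )
    for t in range(field.shape[-1]):
        dset[..., t] = field[..., t]

# Re-analysis phase, possibly on a different node type: the common file system
# means CPU and GPU nodes both see the same snapshot.
with h5py.File(OUTPUT, "r") as f:
    first_slice_mean = float(f["field"][..., 0].mean())
print(f"mean of first slice: {first_slice_mean:.6f}")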

WILKES-3

DIRAC HAS ACCESS TO A SHARE OF 100 NODES, EACH WITH 4X A100 GPUs, DUAL 64-CORE AMD MILAN PROCESSORS & 1TB RAM

THE HPC INTERCONNECT:
INTEL OMNIPATH, 2:1 BLOCKING (SKYLAKE)
MELLANOX HDR INFINIBAND, 3:1 BLOCKING (CASCADE LAKE, ICE LAKE AND WILKES-3)
STORAGE
STORAGE CONSISTS OF 23 PiB OF DISK CONFIGURED AS MULTIPLE LUSTRE PARALLEL FILESYSTEMS, OF WHICH DIRAC HAS ACCESS TO 4.8 PiB
OPERATING SYSTEM
RESOURCE MANAGEMENT IS PERFORMED BY SLURM
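Since scheduling is handled by Slurm, a job sized to one Wilkes-3 node asks for all four A100s and the matching share of CPU cores. The sketch below writes and submits such a batch script from Python; the partition and account names are placeholders (the site-specific user guide below gives the correct values for each project), and ./my_gpu_application stands in for a real executable.

"""Minimal sketch of a Slurm batch script sized to one Wilkes-3 node
(4x A100, 128 CPU cores), written and submitted from Python. Partition,
account and executable names are placeholders, not CSD3 defaults."""
import pathlib
import subprocess

job_script = """\
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=ampere          # placeholder GPU partition name
#SBATCH --account=MYPROJECT-GPU     # placeholder project account
#SBATCH --nodes=1
#SBATCH --gres=gpu:4                # all four A100s on the node
#SBATCH --cpus-per-gpu=32           # 128 cores / 4 GPUs
#SBATCH --time=01:00:00

srun ./my_gpu_application           # placeholder executable
"""

path = pathlib.Path("gpu_job.sh")
path.write_text(job_script)
subprocess.run(["sbatch", str(path)], check=True)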

CUMULUS

544 ICE LAKE CPU NODES EACH WITH 2 x INTEL XEON PLATINUM 8368Q PROCESSORS, 2.60GHz 38-CORE (76 CORES PER NODE): 

428 NODES WITH 256 GiB MEMORY
116 NODES WITH 512 GiB MEMORY
DIRAC HAS A SHARE OF 267 NODES (20,292 CORES)
NETWORK FABRIC OF HDR 3:1 BLOCKING

672 CASCADE LAKE CPU NODES EACH WITH 2 x INTEL XEON PLATINUM 8276 PROCESSORS, 2.6GHz 28-CORE (56 CORES PER NODE):

616 NODES WITH 192 GiB MEMORY
56 NODES WITH 384 GiB MEMORY
DIRAC HAS A SHARE OF 119 NODES (6,664 CORES)
NETWORK FABRIC OF HDR100 3:1 BLOCKING

100 AMPERE GPU NODES (WILKES-3) EACH WITH 4x NVIDIA A100-SXM-80GB GPUs, 2x AMD EPYC 7763 PROCESSORS, 1.8GHz 64-CORE (128 CORES PER NODE), 1TiB RAM
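For reference, the DiRAC core shares quoted above follow directly from the per-node figures (node share multiplied by cores per node); the short snippet below reproduces them from the numbers in the list.

"""Check the DiRAC CPU-core shares quoted above against the per-node figures."""
shares = {
    # partition: (DiRAC node share, cores per node)
    "Ice Lake (Xeon 8368Q)": (267, 76),
    "Cascade Lake (Xeon 8276)": (119, 56),
}

for name, (nodes, cores) in shares.items():
    print(f"{name}: {nodes} nodes x {cores} cores/node = {nodes * cores:,} cores")
# Ice Lake (Xeon 8368Q): 267 nodes x 76 cores/node = 20,292 cores
# Cascade Lake (Xeon 8276): 119 nodes x 56 cores/node = 6,664 cores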

SITE SPECIFIC USER GUIDE

Our site-specific user guide, hosted by the University of Cambridge, provides full documentation for the CSD3 system as well as a list of the applications available on it.
