Overview

The Meridian cluster began as a 180-node CPU system, commissioned in 2001 when the Calder Mesa Campus opened. Following the 2024 expansion it comprises 2,400 GPU nodes, making it one of the largest non-governmental scientific compute facilities in the region. It serves all five of Veyra's research divisions and the Software & AI Services team, and provides external allocation capacity to academic and industry users.

The cluster runs on SLURM 23.11, with partitions segmented by workload type (interactive, batch, high-memory, GPU-only, and infiniband-required). The interconnect fabric uses HDR InfiniBand at 200 Gb/s. The parallel filesystem (BeeGFS, 2.4 PB usable, 480 GB/s aggregate read bandwidth) is shared across all nodes and provides the primary working storage for running jobs.

The facility is managed by a team of five HPC systems engineers under Dr. Calanthe Ordóñez, Head of Research Computing. User support is provided via a ticketing system (hpc-support@veyra.example) with a first-response SLA of four working hours.

Cluster specifications

Meridian HPC cluster specifications as of January 2025. Node counts reflect the configuration following the 2024 expansion.
Component                 Specification
Total GPU nodes           2,400 (NVIDIA H100 SXM5, 80 GB HBM3 per GPU; 4 GPUs per node)
Total GPU count           9,600 NVIDIA H100 SXM5
CPU per node              2× AMD EPYC 9654 (96 cores per socket; 192 cores per node)
RAM per node              768 GB DDR5-4800 ECC
Interconnect fabric       HDR InfiniBand 200 Gb/s, fat-tree topology, non-blocking
Network bandwidth         200 Gb/s per node (HDR InfiniBand)
Parallel filesystem       BeeGFS 7.4; 2.4 PB usable; 480 GB/s aggregate read bandwidth
Archive storage           18 PB nearline tape (Spectra Logic TFinity); accessible via HSM
Login nodes               8 dedicated login nodes (AMD EPYC 9474F, 512 GB RAM each)
Scheduler                 SLURM 23.11.3; fair-share scheduling with priority decay
OS                        Rocky Linux 9.3 (all nodes); kernel 6.1 LTS
Peak FP64 performance     ~118 PFLOPS (theoretical; all GPU nodes)
Power draw (full load)    ~6.2 MW; PUE 1.28; water-cooled (rear-door heat exchangers)
Network uplink            100 GbE to campus backbone; 10 GbE to external internet

Software stack

All software is available via module files (run module avail on a login node to list them). The following packages are pre-installed and maintained by the HPC team.

Compilers & runtimes

  • GCC 13.2, 12.3 (default)
  • Intel oneAPI 2024.1 (icx, icpx, ifx, ifort)
  • NVCC (CUDA 12.3)
  • ROCm 6.0 (AMD GPU support)
  • Python 3.11, 3.10 (conda-managed)
  • Julia 1.10

MPI & parallel libraries

  • OpenMPI 5.0 (InfiniBand-optimized)
  • MPICH 4.1
  • Intel MPI 2021.11
  • UCX 1.16 (RDMA transport)
  • NCCL 2.19 (GPU collective comms)

Scientific applications

  • GROMACS 2024.1 (GPU-accelerated)
  • LAMMPS 2024.08 (Kokkos GPU backend)
  • NAMD 3.0b6
  • VASP 6.4.2 (licensed; apply via HPC team)
  • Quantum ESPRESSO 7.3
  • OpenFOAM v2312
  • FEniCSx 0.8

ML & data science

  • PyTorch 2.2 (CUDA 12.3)
  • TensorFlow 2.16
  • JAX 0.4.25
  • Hugging Face Transformers 4.40
  • veyra-atlas 2.1 (Veyra Atlas package)
  • RAPIDS 24.04 (GPU-accelerated data science)
  • Dask, Ray (distributed compute)

Cryo-EM & imaging

  • RELION 5.0
  • cryoSPARC 4.5
  • CTFFIND4
  • MotionCor3
  • Phenix 1.21
  • UCSF ChimeraX 1.7

Workflow & MLOps

  • Snakemake 8.5
  • Nextflow 24.04
  • MLflow 2.13
  • DVC 3.50
  • Singularity/Apptainer 1.3
  • Podman (rootless containers)

Allocation policy

Internal allocation

All Veyra Institute researchers and graduate students receive a baseline allocation of 2,000 node-hours per quarter, renewed automatically. Additional allocation is available through a competitive application process reviewed quarterly by the Research Computing Committee. Applications require a brief technical justification (one page) and an estimate of expected outputs.

Research groups with active external grants may apply for project-specific allocations that run for the duration of the grant. These are not drawn from the baseline pool and are not subject to quarterly caps.

External allocation

External users book compute time at 44 cr/hr per node through the Facilities Portal (booking code HPC-MER). A minimum booking of 4 node-hours applies; there is no maximum per booking, but allocations above 512 nodes require prior notification to the HPC team. External bookings are billed in arrears against the registered account.

Academic users from accredited institutions receive the standard 15% external user discount. Frame agreements are available for external users who expect to use more than 50,000 node-hours per year.
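As a rough illustration of the external pricing rules above, the sketch below estimates the cost of a single booking. The function name and its parameters are hypothetical, and two points are assumptions rather than stated policy: that the academic discount applies to the whole booking, and that a booking below the minimum is rejected rather than rounded up.

```python
from decimal import Decimal

RATE_CR_PER_NODE_HOUR = Decimal("44")   # external rate from the text
ACADEMIC_DISCOUNT = Decimal("0.15")     # standard external academic discount
MIN_NODE_HOURS = Decimal("4")           # minimum booking size
NOTIFY_ABOVE_NODES = 512                # larger bookings need prior notification


def estimate_external_cost(nodes, hours, academic=False):
    """Return (cost_in_credits, needs_notification) for one external booking.

    Assumption: bookings below the minimum are rejected, not rounded up.
    """
    node_hours = Decimal(nodes) * Decimal(str(hours))
    if node_hours < MIN_NODE_HOURS:
        raise ValueError(f"minimum booking is {MIN_NODE_HOURS} node-hours")
    cost = node_hours * RATE_CR_PER_NODE_HOUR
    if academic:
        # Assumption: the discount applies to the full booking cost.
        cost *= Decimal("1") - ACADEMIC_DISCOUNT
    return cost, nodes > NOTIFY_ABOVE_NODES


# 16 nodes for 12 hours at the academic rate: 192 node-hours.
cost, notify = estimate_external_cost(nodes=16, hours=12, academic=True)
print(cost, notify)
```

Decimal arithmetic is used here so that credit amounts do not pick up binary floating-point rounding; the actual invoice from the Facilities Portal remains the authoritative figure.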

Fair-share scheduling

SLURM uses fair-share scheduling with a 7-day decay half-life. Users who have consumed less than their fair share receive priority; users who have over-consumed are deprioritised. The system converges to fair usage within approximately 48 hours of an imbalance.
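To make the half-life concrete, the sketch below computes how much a past unit of usage still counts after a given age under a pure exponential decay with a 7-day half-life. It is a simplified model, not the scheduler's actual priority calculation: the function names are hypothetical, and decay is only one input to job priority, so these weights alone do not determine how quickly usage rebalances.

```python
HALF_LIFE_HOURS = 7 * 24  # 7-day decay half-life from the text


def decay_weight(age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Fraction of past usage that still counts after age_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def effective_usage(records, half_life_hours=HALF_LIFE_HOURS):
    """Sum (age_hours, node_hours) records, discounting older usage."""
    return sum(nh * decay_weight(age, half_life_hours) for age, nh in records)


# Weight of usage that is 0 hours, 2 days, 7 days, and 14 days old.
for age in (0, 48, 168, 336):
    print(f"{age:>4} h old -> weight {decay_weight(age):.3f}")

# 100 node-hours used today plus 100 node-hours used a week ago.
print(effective_usage([(0, 100), (168, 100)]))
```

Under this model, usage from a week ago counts for half as much as usage from today, and usage from two weeks ago for a quarter.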

A dedicated partition (priority) is available for time-sensitive jobs. Priority partition time is limited to 20% of each user's quarterly allocation and incurs a 1.5× billing multiplier for internal users (external users pay the standard rate regardless of partition).
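The priority-partition rules for internal users can be sketched as a small calculation. The function name is hypothetical, and the text does not say whether the 20% cap is measured in raw or billed node-hours; this sketch assumes the cap counts raw node-hours and the 1.5× multiplier applies only to what is debited from the allocation.

```python
BASELINE_NODE_HOURS = 2000     # baseline quarterly allocation from the text
PRIORITY_SHARE_PERCENT = 20    # priority time is limited to 20% of the allocation
INTERNAL_MULTIPLIER = 1.5      # billing multiplier for internal users


def internal_priority_debit(requested, already_used=0.0,
                            allocation=BASELINE_NODE_HOURS):
    """Return node-hours debited for an internal job on the priority partition.

    Assumption: the cap counts raw node-hours, and the multiplier applies
    only to the amount debited from the allocation.
    """
    cap = allocation * PRIORITY_SHARE_PERCENT / 100
    if already_used + requested > cap:
        raise ValueError(
            f"request exceeds priority cap of {cap:g} node-hours "
            f"({already_used:g} already used)"
        )
    return requested * INTERNAL_MULTIPLIER


print(internal_priority_debit(100))                   # 100 raw node-hours
print(internal_priority_debit(50, already_used=350))  # exactly reaches the cap
```

With the baseline allocation, the cap under this reading is 400 raw node-hours per quarter.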

Job limits

Default per-user limits: 512 concurrent nodes, 7-day walltime maximum. Jobs that need more than 512 nodes or a longer walltime are accommodated via a reservation request, submitted to the HPC team with at least 72 hours' advance notice. Maintenance windows (one per quarter, typically Sunday 00:00–08:00) are published on the HPC status page six weeks in advance.
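A quick pre-submission check against the default limits might look like the sketch below. The helper names are hypothetical, and the check covers a single job only: because the 512-node limit is per user and concurrent, several smaller jobs running together could also reach it.

```python
from datetime import datetime, timedelta

MAX_NODES = 512                    # default per-user concurrent node limit
MAX_WALLTIME = timedelta(days=7)   # default walltime maximum
MIN_NOTICE = timedelta(hours=72)   # advance notice for reservation requests


def needs_reservation(nodes, walltime):
    """True if a single job exceeds either default limit."""
    return nodes > MAX_NODES or walltime > MAX_WALLTIME


def latest_request_time(job_start):
    """Latest moment a reservation request can be filed for job_start."""
    return job_start - MIN_NOTICE


start = datetime(2025, 3, 10, 9, 0)  # illustrative start time
print(needs_reservation(640, timedelta(days=2)))
print(latest_request_time(start))
```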

Data management

Home directories (100 GB quota) are backed up nightly. Scratch space (/scratch) is purged after 30 days and is not backed up. Project directories (allocation-based quota) are retained for the duration of the allocation plus 90 days. Long-term archiving to the tape library is available via an HSM policy on request.
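Because scratch is purged and not backed up, it can help to list files that are approaching the 30-day window. The sketch below is one way to do that, under a loudly stated assumption: the text does not say whether the purge is based on modification time, access time, or creation time, so this example uses modification time. The function name, the 5-day warning margin, and the per-user directory layout under /scratch are all hypothetical.

```python
import os
import time
from pathlib import Path

PURGE_AGE_DAYS = 30  # purge window from the text


def files_at_risk(root, warn_days=PURGE_AGE_DAYS - 5, now=None):
    """Yield (path, age_in_days) for files not modified in warn_days or more.

    Assumption: the purge is judged by modification time (st_mtime).
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                age_days = (now - path.lstat().st_mtime) / 86400
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if age_days >= warn_days:
                yield path, age_days


if __name__ == "__main__":
    # Hypothetical layout: one directory per user under /scratch.
    my_scratch = Path("/scratch") / os.environ.get("USER", "unknown")
    for path, age in files_at_risk(my_scratch):
        print(f"{age:5.1f} d  {path}")
```

Anything the script reports should be copied to a project directory or archived before the purge window closes.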

2,400 GPU nodes (NVIDIA H100 SXM5, 80 GB HBM3)
118 PF peak FP64 (theoretical; all GPU nodes)
2.4 PB parallel storage (BeeGFS, 480 GB/s read bandwidth)
44 cr external rate per node-hour (via HPC-MER booking code)