Meridian HPC — Veyra

Overview

The Meridian cluster began as a 180-node CPU system, commissioned in 2001 when the Calder Mesa Campus opened. Following the 2024 expansion it comprises 2,400 GPU nodes, making it one of the largest non-governmental scientific compute facilities in the region. It serves all five of Veyra's research divisions and the Software & AI Services team, and provides external allocation capacity to academic and industry users.

The cluster runs on SLURM 23.11, with partitions segmented by workload type (interactive, batch, high-memory, GPU-only, and InfiniBand-required). The interconnect fabric uses HDR InfiniBand at 200 Gb/s. The parallel filesystem (BeeGFS, 2.4 PB usable, 480 GB/s aggregate read bandwidth) is shared across all nodes and provides the primary working storage for running jobs.

The facility is managed by a team of five HPC systems engineers under Dr. Calanthe Ordóñez, Head of Research Computing. User support is provided via a ticketing system (hpc-support@veyra.example) with a first-response SLA of four working hours.

Cluster specifications

Meridian HPC cluster specifications as of January 2025. Node counts reflect the configuration following the 2024 expansion.
Component	Specification
Total GPU nodes	2,400 (NVIDIA H100 SXM5, 80 GB HBM3 per GPU; 4 GPUs per node)
Total GPU count	9,600 NVIDIA H100 SXM5
CPU per node	2× AMD EPYC 9654 (96 cores per socket; 192 cores per node)
RAM per node	768 GB DDR5-4800 ECC
Interconnect fabric	HDR InfiniBand 200 Gb/s, fat-tree topology, non-blocking
Network bandwidth	200 Gb/s per node (HDR InfiniBand)
Parallel filesystem	BeeGFS 7.4; 2.4 PB usable; 480 GB/s aggregate read bandwidth
Archive storage	18 PB nearline tape (Spectra Logic TFinity); accessible via HSM
Login nodes	8 dedicated login nodes (AMD EPYC 9474F, 512 GB RAM each)
Scheduler	SLURM 23.11.3; fair-share scheduling with priority decay
OS	Rocky Linux 9.3 (all nodes); kernel 6.1 LTS
Peak FP64 performance	~118 PFLOPS (theoretical; all GPU nodes)
Power draw (full load)	~6.2 MW; PUE 1.28; water-cooled (rear-door heat exchangers)
Network uplink	100 GbE to campus backbone; 10 GbE to external internet
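
Several aggregate figures follow directly from the per-node values in the table. The sketch below recomputes them as a sanity check; it uses only numbers stated above and decimal units (1 TB = 1,000 GB). The per-node filesystem share is a simple even split and is an illustrative assumption, not a measured or guaranteed rate.

```python
# Figures copied from the specification table above.
GPU_NODES = 2_400
GPUS_PER_NODE = 4
HBM_PER_GPU_GB = 80
RAM_PER_NODE_GB = 768
FS_READ_GB_PER_S = 480        # aggregate BeeGFS read bandwidth, GB/s
NODE_LINK_GBIT_PER_S = 200    # HDR InfiniBand per node, Gb/s


def derived_figures() -> dict:
    """Return aggregate values implied by the per-node specifications."""
    total_gpus = GPU_NODES * GPUS_PER_NODE
    return {
        "total_gpus": total_gpus,
        "total_hbm_tb": total_gpus * HBM_PER_GPU_GB / 1000,
        "total_ram_tb": GPU_NODES * RAM_PER_NODE_GB / 1000,
        # 8 bits per byte: link speed expressed in GB/s.
        "node_link_gb_per_s": NODE_LINK_GBIT_PER_S / 8,
        # Assumption: every node reads at once and shares bandwidth evenly.
        "fs_read_share_per_node_gb_per_s": FS_READ_GB_PER_S / GPU_NODES,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value}")
```

The total GPU count of 9,600 matches the table. The even-split figure of 0.2 GB/s per node shows that the filesystem, not the 25 GB/s node link, would be the limit if every node read simultaneously.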

Software stack

All software is provided through module files; run module avail after logging in to list what is installed. The following packages are pre-installed and maintained by the HPC team.

Compilers & runtimes

GCC 13.2, 12.3 (default)
Intel oneAPI 2024.1 (icx, icpx, ifort)
NVCC (CUDA 12.3)
ROCm 6.0 (AMD GPU support)
Python 3.11, 3.10 (conda-managed)
Julia 1.10

MPI & parallel libraries

OpenMPI 5.0 (InfiniBand-optimized)
MPICH 4.1
Intel MPI 2021.11
UCX 1.16 (RDMA transport)
NCCL 2.19 (GPU collective comms)

Scientific applications

GROMACS 2024.1 (GPU-accelerated)
LAMMPS 2024.08 (Kokkos GPU backend)
NAMD 3.0b6
VASP 6.4.2 (licensed; apply via HPC team)
Quantum ESPRESSO 7.3
OpenFOAM v2312
FEniCSx 0.8

ML & data science

PyTorch 2.2 (CUDA 12.3)
TensorFlow 2.16
JAX 0.4.25
Hugging Face Transformers 4.40
veyra-atlas 2.1 (Veyra Atlas package)
RAPIDS 24.04 (GPU-accelerated data science)
Dask, Ray (distributed compute)

Cryo-EM & imaging

RELION 5.0
cryoSPARC 4.5
CTFFIND4
MotionCor3
Phenix 1.21
UCSF ChimeraX 1.7

Workflow & MLOps

Snakemake 8.5
Nextflow 24.04
MLflow 2.13
DVC 3.50
Singularity/Apptainer 1.3
Podman (rootless containers)

Allocation policy

Internal allocation

All Veyra Institute researchers and graduate students receive a baseline allocation of 2,000 node-hours per quarter, renewed automatically. Additional allocation is available through a competitive application process reviewed quarterly by the Research Computing Committee. Applications require a brief technical justification (one page) and an estimate of expected outputs.
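
To get a feel for what the baseline allocation covers, the following sketch converts between job shapes and node-hours. It assumes that one node held for one hour of walltime counts as one node-hour regardless of how many of its GPUs are used; the function names are illustrative and not part of any Veyra tooling.

```python
BASELINE_NODE_HOURS = 2_000  # per quarter, from the policy above


def node_hours(nodes: int, wall_hours: float) -> float:
    """Node-hours consumed by a job (assumed: nodes x walltime)."""
    if nodes <= 0 or wall_hours < 0:
        raise ValueError("nodes must be positive and wall_hours non-negative")
    return nodes * wall_hours


def max_wall_hours(nodes: int, budget: float = BASELINE_NODE_HOURS) -> float:
    """Longest walltime a job of this width can run within the budget."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return budget / nodes


if __name__ == "__main__":
    print(node_hours(16, 24))    # a 16-node job running for one day
    print(max_wall_hours(64))    # hours available to a 64-node job
```

Under these assumptions a 16-node, 24-hour job uses 384 node-hours, and a 64-node job exhausts the baseline after 31.25 hours.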

Research groups with active external grants may apply for project-specific allocations that run for the duration of the grant. These are not drawn from the baseline pool and are not subject to quarterly caps.

External allocation

External users book compute time at 44 cr/hr per node through the Facilities Portal (booking code HPC-MER). A minimum booking of 4 node-hours applies; there is no maximum per booking, but allocations above 512 nodes require prior notification to the HPC team. External bookings are billed in arrears against the registered account.

Academic users from accredited institutions receive the standard 15% external user discount. Frame agreements are available for external clients expecting to use more than 50,000 node-hours per year.
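
The external pricing rules can be combined into a small estimator. This is a minimal sketch under stated assumptions: the discount applies to the whole booking, bookings below the 4 node-hour minimum are rejected rather than rounded up, and the Quote type and quote function are hypothetical names, not part of the Facilities Portal.

```python
from dataclasses import dataclass

RATE_CR_PER_NODE_HOUR = 44.0
MIN_NODE_HOURS = 4
ACADEMIC_DISCOUNT = 0.15
NOTIFY_ABOVE_NODES = 512


@dataclass(frozen=True)
class Quote:
    node_hours: float
    cost_cr: float
    needs_notification: bool  # True when the HPC team must be told in advance


def quote(nodes: int, hours: float, academic: bool = False) -> Quote:
    """Estimate the cost of an external booking (illustrative only)."""
    if nodes <= 0 or hours <= 0:
        raise ValueError("nodes and hours must be positive")
    node_hours = nodes * hours
    if node_hours < MIN_NODE_HOURS:
        # Assumption: undersized bookings are refused, not rounded up.
        raise ValueError(f"minimum booking is {MIN_NODE_HOURS} node-hours")
    discount = ACADEMIC_DISCOUNT if academic else 0.0
    cost = round(node_hours * RATE_CR_PER_NODE_HOUR * (1.0 - discount), 2)
    return Quote(node_hours, cost, nodes > NOTIFY_ABOVE_NODES)


if __name__ == "__main__":
    print(quote(8, 10))
    print(quote(8, 10, academic=True))
```

For example, 8 nodes for 10 hours is 80 node-hours, which comes to 3,520 cr at the standard rate and 2,992 cr with the academic discount.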

Fair-share scheduling

SLURM uses fair-share scheduling with a 7-day decay half-life. Users who have consumed less than their fair share receive priority; users who have over-consumed are deprioritised. The system converges to fair usage within approximately 48 hours of an imbalance.
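
A half-life of 7 days means that usage recorded 7 days ago counts half as much as usage recorded today. The sketch below shows that generic exponential-decay weighting. It is an illustration of the half-life idea only, not SLURM's actual priority calculation, which combines fair-share with other factors; the record format is an assumption.

```python
from typing import Iterable, Tuple

HALF_LIFE_DAYS = 7.0  # decay half-life stated in the policy above


def decay_weight(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Weight applied to usage that is age_days old."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / half_life_days)


def effective_usage(records: Iterable[Tuple[float, float]]) -> float:
    """Sum of decayed usage; records are (age_days, node_hours) pairs."""
    return sum(hours * decay_weight(age) for age, hours in records)


if __name__ == "__main__":
    history = [(0, 100.0), (7, 100.0), (14, 100.0)]
    print(effective_usage(history))
```

With three 100 node-hour jobs run today, 7 days ago, and 14 days ago, the decayed total is 100 + 50 + 25 = 175 node-hours.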

A dedicated partition (priority) is available for time-sensitive jobs. Priority partition time is limited to 20% of each user's quarterly allocation and incurs a 1.5× billing multiplier for internal users (external users pay the standard rate regardless of partition).
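
The priority-partition rules reduce to two small calculations, sketched below. Two interpretations are assumed here because the policy text does not spell them out: the 20% cap is measured in raw node-hours before the multiplier, and the 1.5× multiplier is applied to the node-hours charged against an internal user's allocation. Function names are illustrative.

```python
PRIORITY_SHARE = 0.20        # fraction of the quarterly allocation
PRIORITY_MULTIPLIER = 1.5    # internal users only


def priority_cap(quarterly_allocation: float) -> float:
    """Node-hours a user may spend on the priority partition per quarter."""
    return quarterly_allocation * PRIORITY_SHARE


def priority_charge(node_hours: float, internal: bool = True) -> float:
    """Node-hours charged for priority-partition use (assumed interpretation)."""
    multiplier = PRIORITY_MULTIPLIER if internal else 1.0
    return node_hours * multiplier


if __name__ == "__main__":
    print(priority_cap(2_000))     # cap for the baseline allocation
    print(priority_charge(100))    # internal user, 100 node-hours
```

For the 2,000 node-hour baseline, the cap works out to 400 node-hours, and 100 node-hours of priority use would be charged as 150 under the assumed reading.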

Job limits

Default per-user limits: 512 concurrent nodes, 7-day walltime maximum. Jobs requiring more than 512 nodes or longer walltimes are accommodated via a reservation request submitted to the HPC team with at least 72 hours' advance notice. Maintenance windows (one per quarter, typically Sunday 00:00–08:00) are published on the HPC status page six weeks in advance.
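
A quick pre-submission check against the default limits might look like the sketch below. It considers a single job in isolation, so it does not account for nodes already held by the same user's other running jobs, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_NODES = 512
MAX_WALLTIME_HOURS = 7 * 24          # 7-day walltime maximum
RESERVATION_NOTICE_HOURS = 72


def needs_reservation(nodes: int, walltime_hours: float) -> bool:
    """True when a job exceeds the default limits and needs a reservation."""
    return nodes > MAX_NODES or walltime_hours > MAX_WALLTIME_HOURS


def latest_request_time(job_start: datetime) -> datetime:
    """Latest moment a reservation request can be filed for this start time."""
    return job_start - timedelta(hours=RESERVATION_NOTICE_HOURS)


if __name__ == "__main__":
    print(needs_reservation(512, 168))   # exactly at both limits
    print(needs_reservation(600, 24))    # too wide
    print(latest_request_time(datetime(2025, 3, 10, 9, 0)))
```

A job exactly at 512 nodes and 168 hours stays within the defaults; anything wider or longer would go through the reservation route.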

Data management

Home directories (100 GB quota) are backed up nightly. Scratch space (/scratch) is purged after 30 days and is not backed up. Project directories (allocation-based quota) are retained for the duration of the allocation plus 90 days. Long-term archiving to the tape library is available via an HSM policy on request.
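
Because scratch is purged and not backed up, it can help to list files approaching the 30-day limit before they disappear. The sketch below walks a directory and reports files older than a threshold. It assumes the purge is based on modification time, which the policy does not state; check with the HPC team before relying on it.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

PURGE_AGE_DAYS = 30  # scratch retention stated in the policy above


def purge_candidates(root: str, max_age_days: float = PURGE_AGE_DAYS,
                     now: Optional[float] = None) -> List[Path]:
    """Return files under root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86_400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat: do not follow symlinks out of the tree.
                if path.lstat().st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(stale)


if __name__ == "__main__":
    for path in purge_candidates("/scratch", max_age_days=25):
        print(path)
```

Running it with a threshold a few days below 30, as in the example call, gives some margin to copy results into a project directory or request archiving.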

2,400 GPU nodes: NVIDIA H100 SXM5, 80 GB HBM3
118 PF peak FP64: theoretical (all GPU nodes)
2.4 PB parallel storage: BeeGFS, 480 GB/s read bandwidth
44 cr external rate/hr/node: via HPC-MER booking code

Related resources

Software & AI Services
Custom ML, MLOps, and Veyra Atlas, all running on Meridian. Managed compute engagements available.

Computational Modeling-as-a-Service
MD, DFT, and FEA simulations run on Meridian by Veyra scientists as a managed service.

Equipment Rental
Self-service HPC node rental at 44 cr/hr (external). Booking via Facilities Portal.