Continuous NCCL fabric benchmarking as a service¶
Scope: treating inter-node collective bandwidth as a standing, monitored signal rather than a one-time bring-up check. A long-lived service periodically benchmarks NCCL collectives across every node/GPU pair in a heterogeneous, multi-provider, or geo-distributed fleet, exposes the busbw/latency matrix as metrics, alerts on staleness and regression, and gates scheduling on a measured bandwidth floor. This is the ongoing counterpart to the one-shot fabric validation recipe and fabric bring-up benchmarking; the collective algorithms themselves are on NCCL collectives & algorithms.
What it is¶
In a uniform single-datacenter InfiniBand/RoCE fabric, inter-node bandwidth is roughly constant and you validate it once at commissioning. In a heterogeneous fleet (nodes across providers and regions, some stitched by a WireGuard overlay) the achievable bandwidth between any two nodes is variable and drifts over time: a spot node is replaced, a path congests, a mesh edge falls back from a direct peer to a hub relay. A continuous benchmarking service keeps a live map of "what bandwidth can these two nodes actually achieve right now." Two components:
- A matrix runner: a scheduled job (nightly across the fleet, and as a per-deployment preflight) that runs NCCL collective benchmarks (all_reduce, all_gather, …) between node/GPU pairs and reports the result. The standard tool is nccl-tests, which reports algorithm bandwidth (algbw) and bus bandwidth (busbw) per message size. ([nccl-tests])
- A collector service: a long-lived aggregator that ingests each BenchSample over HTTP, keeps the latest per slice in memory with a retention window, and exposes a /metrics endpoint for Prometheus. Samples are labeled by the dimensions that vary: source/destination provider, region, GPU model, the collective, the message size, and crucially the mesh type (direct vs hub-relay), so the same node pair can carry both series.
flowchart LR
RUN["Matrix runner (nightly + preflight)<br/>nccl-tests per node pair"] -->|"POST BenchSample<br/>(busbw, latency, labels)"| COL["Collector service<br/>(in-memory, retention window)"]
COL -->|"/metrics"| PROM["Prometheus"]
PROM --> DASH["Heatmap: busbw per pair × size"]
PROM --> ALERT["Alerts: staleness, regression"]
PROM --> GATE["Preflight SLO gate:<br/>direct ≥ N× hub-relay?"]
GATE -->|"pass"| SCHED["Schedule tightly-coupled job on pair"]
GATE -->|"fail"| REJECT["Place elsewhere / warn"]
busbw vs algbw: measure the right number¶
nccl-tests reports two bandwidths. algbw is data_size / time. busbw multiplies algbw by a collective-specific factor (e.g. 2·(n−1)/n for all-reduce) so the result reflects the actual hardware bus traffic and is comparable across collectives and sizes. Compare paths and detect regressions on busbw; it is the apples-to-apples number. ([nccl-tests])
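The relationship between the two numbers can be sketched as below. The factors follow the per-collective table in the nccl-tests performance notes as I understand them; the function and dictionary names are hypothetical, and the factors should be confirmed against the nccl-tests version you run.

```python
# Bus-bandwidth correction factors per collective (n = number of ranks).
# Assumption: values as documented by nccl-tests; verify for your version.
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(collective: str, size_bytes: int, time_us: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s: algbw scaled by the collective-specific factor."""
    return algbw_gbps(size_bytes, time_us) * BUSBW_FACTOR[collective](n_ranks)
```

With two ranks the all-reduce factor is exactly 1, so algbw and busbw coincide; the gap only appears as the rank count grows, which is why raw algbw is misleading when pairs are benchmarked with different GPU counts.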
Why it matters¶
You cannot place or trust a tightly-coupled distributed-training job on a fabric whose real bandwidth you do not know. In a heterogeneous fleet, three problems are invisible to one-shot validation:
- Placement. A job that all-reduces every step belongs on node pairs with high measured busbw, not on a pair that silently fell back to a hub relay at a fraction of the bandwidth. The live matrix is the input to that placement decision and to topology-aware scheduling.
- Regression detection. A path that degraded last week (congestion, a provider change, a mesh fallback) shows up as a busbw drop on the 30-day series long before a user files a "training is slow" ticket. This is goodput protection.
- SLO gating. A preflight gate can refuse to schedule a bandwidth-sensitive job on a pair that fails a floor (e.g. direct-mesh busbw must be ≥ N× the hub-relay path before a tightly-coupled run is allowed), turning fabric quality into an admission decision rather than a post-mortem.
A one-shot nccl-tests run at bring-up cannot catch drift, fallback, or per-pair variation, which is exactly what dominates a multi-provider fleet.
When to use it (and when not)¶
Use a continuous benchmarking service when:
- The fleet is heterogeneous, multi-provider, or geo-distributed, so inter-node bandwidth varies across pairs and drifts over time.
- You make placement or admission decisions that depend on real fabric quality (which pairs can host a tightly-coupled job).
- You need regression detection on the fabric as a standing SLI, not a one-time number.
Do not build it when:
- The fabric is a uniform, stable single-DC IB/RoCE fabric, where bandwidth is constant; a one-shot validation plus standard fabric counters (fabric bring-up, observability) is sufficient and a continuous matrix is overkill.
- The cluster is small. An n² pair matrix is cheap at small n but grows fast; for a handful of nodes, periodic spot checks beat a full service.
- You have no consumer for the data. Metrics that nobody uses for placement or alerting are toil.
How to build and operate it¶
Reference shapes, unexecuted. Confirm nccl-tests flags, the busbw factor for each collective, and Prometheus exposition against current docs; calibrate floors against your own measured baselines.
1. Run the benchmark matrix on a schedule¶
Drive nccl-tests between pairs from a scheduled job: nightly across the fleet, plus a preflight before a bandwidth-sensitive deployment. Keep runs short and bounded.
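# Per-pair all-reduce sweep; parse busbw per size from the output (illustrative).
# As written this exercises the GPUs of one node only; an inter-node pair needs a
# multi-process launch (e.g. an MPI-enabled nccl-tests build started across both hosts).
all_reduce_perf -b 8 -e 512M -f 2 -g "${GPUS_PER_NODE}"
# Emit one BenchSample per (collective, size): busbw, algbw, latency, + topology labels.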
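A minimal sketch of the parsing step, assuming the usual nccl-tests table layout in which each data row starts with the message size and ends with two groups of four columns (time, algbw, busbw, #wrong) for out-of-place and in-place runs. The BenchSample fields and the parse_nccl_tests name are hypothetical; check the column layout against the output of your nccl-tests build before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSample:
    collective: str
    size_bytes: int
    time_us: float
    algbw_gbps: float
    busbw_gbps: float


def parse_nccl_tests(stdout: str, collective: str) -> list[BenchSample]:
    """Extract out-of-place results, one sample per message size."""
    samples = []
    for line in stdout.splitlines():
        fields = line.split()
        # Data rows begin with the size in bytes; header/summary rows begin with '#'.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        try:
            # Assumed layout: the last 8 columns are out-of-place then in-place
            # (time, algbw, busbw, #wrong); take the out-of-place triple.
            time_us, algbw, busbw = (float(x) for x in fields[-8:-5])
        except ValueError:
            continue
        samples.append(BenchSample(collective, int(fields[0]), time_us, algbw, busbw))
    return samples
```

The runner would attach topology labels (provider, region, GPU model, mesh type) to each sample before posting it; those come from inventory, not from the benchmark output.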
2. Collect into a labeled, time-windowed service¶
A small HTTP service ingests samples and exposes Prometheus metrics. Keep the latest sample per slice in memory, drop samples older than the retention window, and re-converge on restart (it is a cache, not a database).
# Exposition shape (illustrative) — the labels are what make a heterogeneous fleet legible.
fabric_nccl_busbw_gbps{collective="all_reduce",size_bytes="16777216",
src_provider="A",src_region="us",dst_provider="B",dst_region="eu",
src_gpu_model="H100",dst_gpu_model="H100",mesh="direct"} 142.3
fabric_nccl_latency_us{...same labels...,mesh="hub-relay"} 910
fabric_nccl_bench_last_success_timestamp_seconds 1.75e9
fabric_nccl_bench_run_total{outcome="success|timeout|error|skipped"} ...
3. Track freshness explicitly¶
A measurement's value decays. Tier each slice (e.g. Fresh (<24 h), Stale (24 h–7 d), Uncharacterized (>retention)) and surface the tier so a consumer never treats a month-old number as current. Export last_success_timestamp so you can alert when the pipeline itself has stopped producing data.
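The tiering can be sketched as a small pure function. The tier names and the 24 h boundary come from the example above; treating 7 days as the retention window, and the function name itself, are assumptions to adjust to your own cadence.

```python
from typing import Optional

FRESH_S = 24 * 3600          # under 24 h: fresh
RETENTION_S = 7 * 24 * 3600  # assumed retention window: 7 d


def freshness_tier(last_success_ts: Optional[float], now: float) -> str:
    """Classify a slice by the age of its most recent successful measurement."""
    if last_success_ts is None:
        return "uncharacterized"  # never measured
    age = now - last_success_ts
    if age < FRESH_S:
        return "fresh"
    if age <= RETENTION_S:
        return "stale"
    return "uncharacterized"  # aged out of the retention window
```

Exporting the tier as a label or a separate gauge lets a dashboard grey out stale cells instead of rendering them with the same confidence as fresh ones.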
4. Dashboard and alert¶
- Heatmap of busbw per pair × message size, the researcher-facing view of fabric quality.
- Staleness alert on time() - fabric_nccl_bench_last_success_timestamp_seconds > threshold: the bench pipeline is broken and the heatmap is aging out.
- Failure-rate alert on increase(fabric_nccl_bench_run_total{outcome=~"timeout|error"}[<window>]), distinguishing a real failure from a deliberate skipped outcome (e.g. a pair under GPU pressure that should not be benchmarked mid-run; do not count it as a failure).
- Regression alert on a sustained busbw drop on a path over the long window.
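One way to make "sustained drop" concrete is to compare the median of the most recent runs with the median of the earlier history for the same slice, so a single noisy run does not fire the alert. This is a sketch of that idea in plain Python, not a Prometheus rule; the function name, the window of 3 runs, and the 20% drop threshold are assumptions to calibrate against your own baselines.

```python
from statistics import median
from typing import Sequence


def busbw_regressed(history: Sequence[float], recent_n: int = 3,
                    drop_fraction: float = 0.2) -> bool:
    """history: busbw values for one slice, ordered oldest to newest."""
    if len(history) < 2 * recent_n:
        return False  # not enough data to separate baseline from recent runs
    baseline = median(history[:-recent_n])
    recent = median(history[-recent_n:])
    # Regressed when the recent median sits below the baseline by the given fraction.
    return baseline > 0 and recent < (1 - drop_fraction) * baseline
```

Run it per slice (same pair, collective, size, and mesh type); mixing direct and hub-relay samples in one history would look like a regression every time the path flips.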
5. Gate scheduling on the measured floor¶
Feed the matrix into placement: refuse, or warn, when a pair's measured busbw is below the floor a job needs, and prefer high-busbw direct-mesh pairs for tightly-coupled work. This is where the service pays for itself: fabric quality becomes an admission input (SLO/SLI catalog, topology-aware scheduling).
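A minimal sketch of the admission check, assuming the scheduler can look up the latest direct and hub-relay busbw for a pair. GateDecision, preflight_gate, and the choice to reject unmeasured pairs are hypothetical; the floor and the ratio N are job-specific inputs, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reason: str


def preflight_gate(direct_busbw: Optional[float], relay_busbw: Optional[float],
                   min_ratio: float, floor_gbps: float) -> GateDecision:
    """Admit a tightly-coupled job on a pair only if the direct path is good enough."""
    if direct_busbw is None:
        return GateDecision(False, "pair is uncharacterized: no direct-mesh measurement")
    if direct_busbw < floor_gbps:
        return GateDecision(False, "direct busbw is below the job's floor")
    # With no relay measurement there is nothing to compare against; the floor decides.
    if relay_busbw is not None and direct_busbw < min_ratio * relay_busbw:
        return GateDecision(False, "direct path is not min_ratio times the hub-relay path")
    return GateDecision(True, "ok")
```

Returning a reason alongside the boolean lets the same function back either a hard refusal or a warning, depending on how strict the deployment wants the gate to be.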
Failure modes¶
- Stale matrix treated as current. A broken runner leaves a confident-looking heatmap that is weeks old. Export and alert on last_success_timestamp; tier by freshness.
- Benchmarking under load counted as failure. A pair busy with a real job is correctly skipped, not failed; folding skips into the failure rate creates false alarms.
- Single global threshold on a heterogeneous fleet. One latency/bandwidth cutoff mislabels good and bad pairs alike; thresholds must be per-class (GPU model, mesh type).
- Confusing algbw and busbw. Comparing raw algbw across collectives/sizes is apples-to-oranges; normalize on busbw.
- Direct vs hub-relay conflation. Dropping the mesh label hides that a pair silently fell back to a slow relay. Keep both series.
- Matrix cost blowup. With n² pairs at large n, or too-frequent runs, you burn GPU time. Sample, stagger, and bound run duration.
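One simple way to stagger the matrix is to split the pair list deterministically across several runner cycles, so every pair is measured once per sweep without benchmarking all of them in one night. This sketch treats pairs as unordered (n·(n−1)/2 of them), which assumes the path is symmetric enough to measure once; the function name and the round-robin split are illustrative choices, not a recommended sampling strategy.

```python
from itertools import combinations
from typing import Sequence


def pairs_for_cycle(nodes: Sequence[str], cycle: int,
                    cycles_per_sweep: int) -> list[tuple[str, str]]:
    """Return the node pairs to benchmark in this cycle of a multi-cycle sweep."""
    # Sort first so every runner derives the same pair order from the same inventory.
    all_pairs = list(combinations(sorted(nodes), 2))
    slot = cycle % cycles_per_sweep
    return [pair for i, pair in enumerate(all_pairs) if i % cycles_per_sweep == slot]
```

With a sweep spread over k cycles, each slice is refreshed every k cycles, so the freshness thresholds need to be at least that wide or the matrix will read as permanently stale.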
Open questions & validation¶
- The right benchmark cadence and pair-sampling strategy so the matrix is fresh without an n² GPU-time bill.
- Per-class busbw floors (GPU model × collective × size) calibrated from measured baselines, not datasheet peak.
- Whether the preflight gate's floor (e.g. direct ≥ N× hub-relay) actually predicts training throughput on that pair; validate against a real run.
- Collector behaviour on restart and under burst ingest (it is a cache; confirm re-convergence within one matrix cycle).
- Integration with placement/scheduling: does the scheduler consume the matrix, or is it only observability?
References¶
- nccl-tests (NVIDIA; all_reduce_perf etc.; algbw vs busbw definition and the per-collective bus factor): https://github.com/NVIDIA/nccl-tests
- NCCL collectives and tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
- NCCL environment variables (transport selection, NCCL_SOCKET_IFNAME): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- Prometheus, metric types and exposition format: https://prometheus.io/docs/concepts/metric_types/
Related: Recipe: Fabric Validation (nccl-tests) · Fabric Bring-Up, Validation & Benchmarking · NCCL Collectives & Algorithms · Overlay & Mesh Networking · Topology-Aware K8s Scheduling · Goodput · Observability & Monitoring · Glossary