
Continuous NCCL fabric benchmarking as a service

Scope: treating inter-node collective bandwidth as a standing, monitored signal rather than a one-time bring-up check. A long-lived service periodically benchmarks NCCL collectives across every node/GPU pair in a heterogeneous, multi-provider, or geo-distributed fleet, exposes the busbw/latency matrix as metrics, alerts on staleness and regression, and gates scheduling on a measured bandwidth floor. This is the ongoing counterpart to the one-shot fabric validation recipe and fabric bring-up benchmarking; the collective algorithms themselves are covered in NCCL collectives & algorithms.

What it is

In a uniform single-datacenter InfiniBand/RoCE fabric, inter-node bandwidth is roughly constant and you validate it once at commissioning. In a heterogeneous fleet (nodes across providers and regions, some stitched by a WireGuard overlay) the achievable bandwidth between any two nodes is variable and drifts over time: a spot node is replaced, a path congests, a mesh edge falls back from a direct peer to a hub relay. A continuous benchmarking service keeps a live map of "what bandwidth can these two nodes actually achieve right now." Two components:

A matrix runner: a scheduled job (nightly across the fleet, and as a per-deployment preflight) that runs NCCL collective benchmarks (all_reduce, all_gather, …) between node/GPU pairs and reports the result. The standard tool is nccl-tests, which reports algorithm bandwidth (algbw) and bus bandwidth (busbw) per message size. ([nccl-tests])
A collector service: a long-lived aggregator that ingests each BenchSample over HTTP, keeps the latest per slice in memory with a retention window, and exposes a /metrics endpoint for Prometheus. Samples are labeled by the dimensions that vary: source/destination provider, region, GPU model, the collective, the message size, and crucially the mesh type (direct vs hub-relay), so the same node pair can carry both series.
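As a rough sketch of what one BenchSample could carry: the text names the record and its label dimensions but not its schema, so every field name below is an assumption, as is the `slice_key` helper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchSample:
    """One measurement for one slice. Field names are hypothetical."""
    src_provider: str
    src_region: str
    dst_provider: str
    dst_region: str
    src_gpu_model: str
    dst_gpu_model: str
    collective: str     # e.g. "all_reduce"
    size_bytes: int     # message size of this measurement
    mesh: str           # "direct" or "hub-relay"
    busbw_gbps: float
    algbw_gbps: float
    latency_us: float
    measured_at: float  # Unix seconds

    def slice_key(self) -> tuple:
        """Labels that identify a series. mesh is part of the key, so direct
        and hub-relay results for the same pair never overwrite each other."""
        return (self.src_provider, self.src_region,
                self.dst_provider, self.dst_region,
                self.src_gpu_model, self.dst_gpu_model,
                self.collective, self.size_bytes, self.mesh)

    def to_json(self) -> str:
        """Serialize for the POST to the collector."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the measured values out of `slice_key` means a newer sample replaces the older one for the same slice instead of creating a new series.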

flowchart LR
  RUN["Matrix runner (nightly + preflight)<br/>nccl-tests per node pair"] -->|"POST BenchSample<br/>(busbw, latency, labels)"| COL["Collector service<br/>(in-memory, retention window)"]
  COL -->|"/metrics"| PROM["Prometheus"]
  PROM --> DASH["Heatmap: busbw per pair × size"]
  PROM --> ALERT["Alerts: staleness, regression"]
  PROM --> GATE["Preflight SLO gate:<br/>direct ≥ N× hub-relay?"]
  GATE -->|"pass"| SCHED["Schedule tightly-coupled job on pair"]
  GATE -->|"fail"| REJECT["Place elsewhere / warn"]

busbw vs algbw: measure the right number

nccl-tests reports two bandwidths. algbw is data_size / time. busbw multiplies algbw by a collective-specific factor (e.g. 2·(n−1)/n for all-reduce) so the result reflects the actual hardware bus traffic and is comparable across collectives and sizes. Compare paths and detect regressions on busbw; it is the apples-to-apples number. ([nccl-tests])
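The relationship can be written out in a few lines. This is a sketch only: the function names are illustrative, and the factors other than all-reduce follow the nccl-tests performance notes as best understood here, so confirm them against the version you run.

```python
def algbw_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    return size_bytes / time_us / 1e3  # bytes per microsecond -> GB/s


def busbw_factor(collective: str, n_ranks: int) -> float:
    """Collective-specific multiplier that turns algbw into busbw."""
    if n_ranks < 2:
        raise ValueError("need at least two ranks")
    n = n_ranks
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "reduce_scatter": (n - 1) / n,
        "broadcast": 1.0,
        "reduce": 1.0,
    }
    return factors[collective]


def busbw_gbs(collective: str, algbw: float, n_ranks: int) -> float:
    """Bus bandwidth: comparable across collectives and rank counts."""
    return algbw * busbw_factor(collective, n_ranks)
```

For example, an all-reduce across 8 ranks with an algbw of 100 GB/s corresponds to a busbw of 175 GB/s, since 2·(8−1)/8 = 1.75.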

Why it matters

You cannot place or trust a tightly-coupled distributed-training job on a fabric whose real bandwidth you do not know. In a heterogeneous fleet, three problems are invisible to one-shot validation:

Placement. A job that all-reduces every step belongs on node pairs with high measured busbw, not on a pair that silently fell back to a hub relay at a fraction of the bandwidth. The live matrix is the input to that placement decision and to topology-aware scheduling.
Regression detection. A path that degraded last week (congestion, a provider change, a mesh fallback) shows up as a busbw drop on the 30-day series long before a user files a "training is slow" ticket. This is goodput protection.
SLO gating. A preflight gate can refuse to schedule a bandwidth-sensitive job on a pair that fails a floor (e.g. direct-mesh busbw must be ≥ N× the hub-relay path before a tightly-coupled run is allowed), turning fabric quality into an admission decision rather than a post-mortem.

A one-shot nccl-tests run at bring-up cannot catch drift, fallback, or per-pair variation, which is exactly what dominates a multi-provider fleet.

When to use it (and when not)

Use a continuous benchmarking service when:

The fleet is heterogeneous, multi-provider, or geo-distributed, so inter-node bandwidth varies across pairs and drifts over time.
You make placement or admission decisions that depend on real fabric quality (which pairs can host a tightly-coupled job).
You need regression detection on the fabric as a standing SLI, not a one-time number.

Do not build it when:

The fabric is a uniform, stable single-DC IB/RoCE fabric, where bandwidth is constant; a one-shot validation plus standard fabric counters (fabric bring-up, observability) is sufficient and a continuous matrix is overkill.
The cluster is small. The pair matrix grows as n², but with a handful of nodes it is small enough that periodic spot checks cover it without a standing service.
You have no consumer for the data. Metrics nobody places or alerts on are toil.

How to build and operate it

Reference shapes, unexecuted. Confirm nccl-tests flags, the busbw factor for each collective, and Prometheus exposition against current docs; calibrate floors against your own measured baselines.

1. Run the benchmark matrix on a schedule

Drive nccl-tests between pairs from a scheduled job: nightly across the fleet, plus a preflight before a bandwidth-sensitive deployment. Keep runs short and bounded.

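# Per-pair all-reduce sweep; parse busbw per size from the output (illustrative).
# -b/-e: min/max message size, -f: size multiplication factor, -g: GPUs per thread.
# A cross-node run needs one process per node (e.g. an MPI launch of an MPI-enabled build); launcher omitted here.
all_reduce_perf -b 8 -e 512M -f 2 -g "${GPUS_PER_NODE}"
# Emit one BenchSample per (collective, size): busbw, algbw, latency, + topology labels.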
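Parsing the sweep output might look like the sketch below. It assumes the 13-column table that recent nccl-tests builds print for all_reduce_perf (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and again for in-place). The layout can differ between versions and collectives, so verify it against your build. The sample text uses made-up numbers, and `SizeResult` and `parse_nccl_tests` are illustrative names.

```python
from typing import Iterator, NamedTuple


class SizeResult(NamedTuple):
    size_bytes: int
    time_us: float
    algbw_gbs: float
    busbw_gbs: float


def parse_nccl_tests(output: str) -> Iterator[SizeResult]:
    """Yield the out-of-place result for each message size in the table."""
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with the message size; header and summary rows start with '#'.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        # Assumed layout: fields 5..7 are out-of-place time (us), algbw, busbw.
        yield SizeResult(int(fields[0]), float(fields[5]),
                         float(fields[6]), float(fields[7]))


# Synthetic output in the assumed layout (numbers are invented for illustration).
SAMPLE = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1    102.4   10.24   17.92      0    101.9   10.29   18.01      0
    16777216       4194304     float     sum      -1    812.3   20.65   36.14      0    810.0   20.71   36.24      0
# Avg bus bandwidth    : 27.0775
"""

results = list(parse_nccl_tests(SAMPLE))
```

Rows that do not match the expected column count are skipped rather than guessed at, so a format change yields an empty result, which the runner can report as an error outcome instead of emitting wrong numbers.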

2. Collect into a labeled, time-windowed service

A small HTTP service ingests samples and exposes Prometheus metrics. Keep the latest sample per slice in memory, drop samples older than the retention window, and re-converge on restart (it is a cache, not a database).

# Exposition shape (illustrative) — the labels are what make a heterogeneous fleet legible.
fabric_nccl_busbw_gbps{collective="all_reduce",size_bytes="16777216",
  src_provider="A",src_region="us",dst_provider="B",dst_region="eu",
  src_gpu_model="H100",dst_gpu_model="H100",mesh="direct"}  142.3
fabric_nccl_latency_us{...same labels...,mesh="hub-relay"}   910
fabric_nccl_bench_last_success_timestamp_seconds            1.75e9
fabric_nccl_bench_run_total{outcome="success|timeout|error|skipped"}  ...
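The in-memory core of such a collector could be sketched as follows. HTTP ingestion is left out, the class and method names are hypothetical, and label values are assumed to contain no characters that need escaping in the exposition format.

```python
import time
from typing import Dict, Optional, Tuple

Labels = Tuple[Tuple[str, str], ...]  # sorted (name, value) pairs identifying a slice


class Collector:
    """Latest-sample-per-slice cache with a retention window (not a database)."""

    def __init__(self, retention_s: float) -> None:
        self.retention_s = retention_s
        self._latest: Dict[Labels, Tuple[float, float]] = {}  # slice -> (measured_at, busbw)

    def ingest(self, labels: Dict[str, str], busbw: float, measured_at: float) -> None:
        key = tuple(sorted(labels.items()))
        current = self._latest.get(key)
        # Keep only the newest sample per slice; late-arriving older samples are ignored.
        if current is None or measured_at >= current[0]:
            self._latest[key] = (measured_at, busbw)

    def evict(self, now: float) -> None:
        """Drop slices whose latest sample is older than the retention window."""
        cutoff = now - self.retention_s
        self._latest = {k: v for k, v in self._latest.items() if v[0] >= cutoff}

    def render(self, now: Optional[float] = None) -> str:
        """Render the busbw gauge in Prometheus text exposition form."""
        self.evict(time.time() if now is None else now)
        lines = ["# TYPE fabric_nccl_busbw_gbps gauge"]
        for key, (_, busbw) in sorted(self._latest.items()):
            label_str = ",".join(f'{name}="{value}"' for name, value in key)
            lines.append(f"fabric_nccl_busbw_gbps{{{label_str}}} {busbw}")
        return "\n".join(lines) + "\n"
```

Because state lives only in the dictionary, a restart starts empty and refills as the runner posts the next cycle, which matches the cache-not-database stance above.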

3. Track freshness explicitly

A measurement's value decays. Tier each slice by age, e.g. Fresh (<24 h), Stale (24 h–7 d), Uncharacterized (older than the retention window, or never measured), and surface the tier so a consumer never treats a month-old number as current. Export last_success_timestamp so you can alert when the pipeline itself has stopped producing data.
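A minimal tiering function, assuming the retention window coincides with the 7-day upper bound of the Stale tier (the text leaves the retention length open) and using hypothetical names throughout:

```python
from enum import Enum
from typing import Optional


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"
    UNCHARACTERIZED = "uncharacterized"


FRESH_MAX_S = 24 * 3600      # example boundary from the text: 24 hours
RETENTION_S = 7 * 24 * 3600  # assumption: retention equals the 7-day Stale bound


def freshness(measured_at: Optional[float], now: float) -> Freshness:
    """Classify a slice by the age of its latest sample (Unix seconds)."""
    if measured_at is None:
        return Freshness.UNCHARACTERIZED  # never measured
    age = now - measured_at
    if age < FRESH_MAX_S:
        return Freshness.FRESH
    if age <= RETENTION_S:
        return Freshness.STALE
    return Freshness.UNCHARACTERIZED
```

A consumer such as a scheduler can then refuse to act on anything that is not FRESH, instead of reasoning about raw timestamps.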

4. Dashboard and alert

Heatmap of busbw per pair × message size, the researcher-facing view of fabric quality.
Staleness alert on time() - fabric_nccl_bench_last_success_timestamp_seconds > threshold: the bench pipeline is broken and the heatmap is aging out.
Failure-rate alert on increase(fabric_nccl_bench_run_total{outcome=~"timeout|error"}[1d]) (window illustrative), distinguishing a real failure from a deliberate skip (outcome="skipped", e.g. a pair under GPU pressure that should not be benchmarked mid-run); do not count skips as failures.
Regression alert on a sustained busbw drop on a path over the long window.
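One way to make "sustained drop" concrete is to compare the most recent samples for a slice against the median of its older history. This is a sketch under assumed parameters: `recent_n` and `max_drop` are hypothetical knobs, not values from the source, and they should be calibrated per class of pair.

```python
from statistics import median
from typing import Sequence


def sustained_regression(series: Sequence[float],
                         recent_n: int = 3,
                         max_drop: float = 0.2) -> bool:
    """Return True if the last recent_n busbw samples are all more than
    max_drop below the median of the earlier samples.

    series holds busbw values for one slice, oldest first.
    """
    if len(series) < 2 * recent_n:
        return False  # not enough history to judge
    baseline = median(series[:-recent_n])
    floor = baseline * (1.0 - max_drop)
    # Requiring every recent sample to be low filters out one-off dips.
    return all(value < floor for value in series[-recent_n:])
```

Using the median as the baseline keeps a single bad night in the history from dragging the reference down.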

5. Gate scheduling on the measured floor

Feed the matrix into placement: refuse, or warn, when a pair's measured busbw is below the floor a job needs, and prefer high-busbw direct-mesh pairs for tightly-coupled work. This is where the service pays for itself: fabric quality becomes an admission input (SLO/SLI catalog, topology-aware scheduling).
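An admission check along these lines could be sketched as below. The function name, the `min_ratio` parameter (the N in "direct ≥ N× hub-relay"), and the choice to skip the ratio test when no hub-relay measurement exists are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def admit_pair(direct_busbw: Optional[float],
               relay_busbw: Optional[float],
               min_ratio: float,
               floor_busbw: float) -> Tuple[bool, str]:
    """Decide whether a pair may host a tightly-coupled job.

    Returns (admitted, reason). Fails closed when the direct path is unmeasured.
    """
    if direct_busbw is None:
        return False, "no fresh direct-mesh measurement"
    if direct_busbw < floor_busbw:
        return False, f"direct busbw {direct_busbw:.1f} below floor {floor_busbw:.1f}"
    # Ratio check only applies when a hub-relay measurement exists for the pair.
    if relay_busbw is not None and direct_busbw < min_ratio * relay_busbw:
        return False, (f"direct busbw {direct_busbw:.1f} is under "
                       f"{min_ratio:g}x the hub-relay path ({relay_busbw:.1f})")
    return True, "ok"
```

Returning a reason string alongside the decision lets the scheduler surface a warning instead of a hard reject where that is the preferred policy.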

Failure modes

Stale matrix treated as current. A broken runner leaves a confident-looking heatmap that is weeks old. Export and alert on last_success_timestamp; tier by freshness.
Benchmarking under load counted as failure. A pair busy with a real job is correctly skipped, not failed; folding skips into the failure rate creates false alarms.
Single global threshold on a heterogeneous fleet. One latency/bandwidth cutoff mislabels good and bad pairs alike; thresholds must be per-class (GPU model, mesh type).
Confusing algbw and busbw. Comparing raw algbw across collectives/sizes is apples-to-oranges; normalize on busbw.
Direct vs hub-relay conflation. Dropping the mesh label hides that a pair silently fell back to a slow relay. Keep both series.
Matrix cost blowup. With n² pairs at large n, or too-frequent runs, you burn GPU time. Sample, stagger, and bound run duration.
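For the cost blowup in particular, one staggering approach is to split the pair list into buckets and benchmark one bucket per cycle. The sketch below is only one possible strategy, with hypothetical names; note that bucket assignment shifts whenever the node set changes.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pairs_for_cycle(nodes: Sequence[str], cycle: int,
                    n_buckets: int) -> List[Tuple[str, str]]:
    """Return the unordered node pairs to benchmark in this cycle.

    Each pair lands in one bucket by its position in the sorted pair list,
    so every pair is covered once every n_buckets cycles.
    """
    if n_buckets < 1:
        raise ValueError("n_buckets must be at least 1")
    all_pairs = list(combinations(sorted(nodes), 2))  # n*(n-1)/2 pairs
    bucket = cycle % n_buckets
    return [pair for index, pair in enumerate(all_pairs)
            if index % n_buckets == bucket]
```

With four nodes and three buckets, each cycle runs two of the six pairs, and the full matrix refreshes every three cycles. The freshness tiers need to be at least as long as that refresh period, or pairs will read as stale by design.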

Open questions & validation

The right benchmark cadence and pair-sampling strategy so the matrix is fresh without an n² GPU-time bill.
Per-class busbw floors (GPU model × collective × size) calibrated from measured baselines, not datasheet peak.
Whether the preflight gate's floor (e.g. direct ≥ N× hub-relay) actually predicts training throughput on that pair; validate against a real run.
Collector behaviour on restart and under burst ingest (it is a cache; confirm re-convergence within one matrix cycle).
Integration with placement/scheduling: does the scheduler consume the matrix, or is it only observability?

References

nccl-tests (NVIDIA — all_reduce_perf etc.; algbw vs busbw definition and the per-collective bus factor): https://github.com/NVIDIA/nccl-tests
NCCL collectives and tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
NCCL environment variables (transport selection, NCCL_SOCKET_IFNAME): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Prometheus — metric types and exposition format: https://prometheus.io/docs/concepts/metric_types/