
Runbook: topology-unaware scheduling starvation

Scope: a tightly-coupled training job runs but crawls because its ranks landed scattered across the spine instead of rail-local on the fewest leaf switches, so every collective pays extra hops. Re-place the job under topology constraints and prove the bus bandwidth recovers.

Reference templates on real APIs. Nothing here was executed on hardware; pin versions to your Slurm / Kueue / NCCL release, substitute real node and switch names, and validate on one job before fleet use.

This is the placement-quality counterpart to the Slurm topology placement reference (how topology.conf, --switches, and the tree/block plugins model the fabric) and the fabric bring-up / benchmarking procedure (how to read busbw and confirm the transport). It is distinct from a fabric fault: here the fabric is healthy and the job is slow purely because the scheduler spread the ranks. If a collective stalls completely (no progress, step time unbounded), use the NCCL-hang runbook instead. Scheduler internals live on the Slurm, Kubernetes, and k3s pages; link to them rather than re-deriving them here.

The failure mode: synchronous data/model-parallel training is bounded by its slowest link, and the all-reduce is rail-sensitive (networking fabric). Ranks on the same leaf switch exchange gradients in one hop; ranks split across leaves traverse leaf -> spine -> leaf on every step. A topology-unaware allocation does not fail; it taxes every iteration, so it surfaces as an MFU regression with comms as the dominant phase, not as an outage.

Trigger

Step time / tokens-per-sec-per-GPU below baseline on the same model + parallelism + node count, with the NCCL/comms phase dominant in the profile (MFU regression, observability, SLO/SLI catalog). Compute is fine; the all-reduce is slow.
nccl-tests busbw across the job's actual nodes is well below the expected inter-node figure for the fabric, even though point-to-point links are healthy (fabric bring-up / benchmarking).
The job spans more leaf switches than it needs. An 8-node job that could sit on one or two leaves is instead smeared across several leaves and the spine, visible in scontrol show topology / scontrol -d show job (Slurm) or in the nodes' topology labels (Kubernetes).
Regression appears after a scheduling change: a new partition, a removed or stale topology.conf, a job submitted without --switches, or a K8s workload submitted without topology-aware scheduling (Slurm topology placement).

Pre-checks

Establish that this is a placement problem and not a fault, before changing any constraint.

A baseline must exist. Without a recorded step-time / busbw baseline for this exact model + parallelism + node count, there is nothing to regress against, so establish one first (MFU regression, SLO/SLI catalog).
Rule out a fabric fault first. A degraded link or a down subnet manager mimics topology starvation. Confirm links are up and the SM is converged before blaming placement (fabric bring-up / benchmarking); on NVSwitch nodes confirm Fabric Manager is active (Fabric Manager failure runbook). If a collective is fully wedged, divert to the NCCL-hang runbook.
Read the actual placement (Slurm). scontrol show topology displays the configured switch/block layout; adding node=<name> reports the units (switches or blocks) that node connects to. scontrol -d show job adds per-node allocation detail to the job record.³²
```
JOBID=123456
scontrol -d show job "$JOBID"                       # NodeList, BatchHost, per-node alloc
# \b keeps the match off ReqNodeList= / ExcNodeList=, which would otherwise add "(null)" lines:
scontrol show hostnames "$(scontrol -o show job "$JOBID" | grep -oP '\bNodeList=\K\S+')"
scontrol show topology node=gpu001                  # switches/blocks this node hangs off
```
Map each allocated node to its leaf switch from topology.conf; if the job's nodes fan out across many leaves and a spine, that is the starvation signature (Slurm topology placement).
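The mapping step can be scripted once a node-to-leaf table exists. A minimal sketch, assuming a hypothetical two-column file (`node leaf`, one pair per line) that you derive yourself from topology.conf; the function name `leaf_spread` and the file name `node-leaf.map` are illustrative, not Slurm tooling:

```shell
# leaf_spread MAP NODE...
# MAP is a hypothetical "node leaf" table derived from topology.conf.
# Prints "<node-count> <leaf>" for every leaf the listed nodes touch,
# sorted by leaf name. Nodes missing from MAP are counted under UNKNOWN.
leaf_spread() {
  _map=$1
  shift
  printf '%s\n' "$@" | awk '
    NR == FNR { leaf[$1] = $2; next }   # first input: load the node->leaf table
    { l = ($1 in leaf) ? leaf[$1] : "UNKNOWN"; count[l]++ }
    END { for (l in count) print count[l], l }
  ' "$_map" - | LC_ALL=C sort -k2,2
}

# Usage, with NODELIST set to the job's NodeList value:
#   leaf_spread node-leaf.map $(scontrol show hostnames "$NODELIST")
```

More output lines than the job needs leaves is the spread signature; a single line means the allocation is already leaf-local.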
Read the path NCCL actually built. Dump the detected topology graph and confirm the inter-node transport is GDR-capable IB/RoCE, not a TCP fallback. NCCL_TOPO_DUMP_FILE writes the detected XML topology after detection, and NCCL_DEBUG_SUBSYS filters NCCL_DEBUG=INFO to the GRAPH/INIT/NET subsystems.⁴⁵
```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,NET \
  NCCL_TOPO_DUMP_FILE=/tmp/nccl-topo.xml <launch cmd> 2>&1 | grep -E "NET/IB|GDRDMA|NET/Socket"
```
NET/IB/.../GDRDMA on the inter-node hops is the intended path; a NET/Socket line where IB was expected is a transport misconfiguration to fix first (NCCL-hang runbook, performance tuning). That is a different bug from topology spread, and masks it.
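The same three markers can be turned into a single verdict for a saved log. This is a sketch only: `classify_transport` is a hypothetical helper, and it assumes the log contains the `NET/IB`, `GDRDMA`, and `NET/Socket` substrings used in the grep above, whose surrounding line format varies by NCCL release:

```shell
# classify_transport: read NCCL_DEBUG=INFO output on stdin and print one of
#   socket   - at least one NET/Socket line was seen (fallback present)
#   gdr      - no NET/Socket line, and at least one NET/IB line with GDRDMA
#   ib-nogdr - NET/IB lines present, none of them marked GDRDMA
#   none     - no matching transport lines at all
classify_transport() {
  awk '
    /NET\/Socket/     { sock++ }
    /NET\/IB/         { ib++ }
    /NET\/IB.*GDRDMA/ { gdr++ }
    END {
      if (sock)     print "socket"
      else if (gdr) print "gdr"
      else if (ib)  print "ib-nogdr"
      else          print "none"
    }'
}

# Usage: classify_transport < nccl-debug.log
```

Anything other than `gdr` is a transport problem to resolve before judging placement.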
Confirm the constraint surface exists. Slurm: topology/tree (or topology/block) is set in slurm.conf and topology.conf is present and current. Kubernetes: a Topology object and the TopologyAwareScheduling feature gate are in place (Slurm topology placement). If the fabric model is missing, no constraint can help until it is declared.

Procedure

Re-placement means evicting and re-launching the job under topology constraints. Cordon/requeue before mutating; never edit a running allocation in place. Constrain on the smallest domain that still fits the job (one leaf if it fits, else the fewest leaves under one spine).

```
JOBID=123456
PART=blackwell
NODES=8
```

Slurm

Hold and requeue the running job so the scheduler can re-place it instead of leaving it pinned to the scattered nodes. requeuehold returns the job to pending in held state (priority zero); requeue returns it to pending to run when resources fit.³
```
scontrol requeuehold "$JOBID"      # back to pending, held (priority 0)
```
Apply the topology constraint to the pending job. --switches=count[@max-time] caps the number of leaf switches the allocation may span; the tree plugin then best-fits onto the fewest leaves, and the job stays pending until that fits or the optional max-time elapses.¹² On an already-pending job, set it via scontrol update: attributes such as Features, NumNodes, and Switches are modifiable; hardware configuration is not.³
```
# cap at one leaf switch, wait up to 30 min for that optimal placement:
scontrol update JobId="$JOBID" Switches=1@30:00
scontrol release "$JOBID"
```
Prefer to fix it at submission so the next run starts correct; the constraint belongs in the batch header (Slurm topology placement):
```
#SBATCH --partition=blackwell
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --switches=1@00:30:00        # at most 1 leaf switch; wait <=30m for it
```
If the job is too large for one leaf, constrain to the fewest leaves under one spine rather than letting it scatter. Raise the count to the minimum that fits (e.g. two leaves of a 16-leaf-per-spine fabric) so the all-reduce still avoids cross-spine hops where possible (Slurm topology placement):
```
scontrol update JobId="$JOBID" Switches=2@30:00
```
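The minimum count that fits is a ceiling division of the job's node count by the nodes per leaf. A small sketch, assuming every leaf has the same number of nodes and enough of them are free; `min_leaves` and `NODES_PER_LEAF` are hypothetical names, and the per-leaf figure must come from your own topology.conf:

```shell
# min_leaves NODES NODES_PER_LEAF -> ceil(NODES / NODES_PER_LEAF)
min_leaves() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

NODES=8
NODES_PER_LEAF=4   # assumption: replace with the real nodes-per-leaf figure
echo "Switches=$(min_leaves "$NODES" "$NODES_PER_LEAF")@30:00"   # prints Switches=2@30:00
```

This is a lower bound: on a fragmented partition the scheduler may need a larger count before the job can start.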
On a block-topology fabric the analogous lever is topology/block with BlockSizes in topology.conf; jobs pack into the smallest block that fits.²

Re-launch and confirm gang placement. Once the job dispatches, re-derive MASTER_ADDR and start the ranks; the canonical multi-node torchrun launch and gres.conf GPU binding live on the Slurm page. The block below is a minimal reminder, not the full reference.

```
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
srun --gpu-bind=closest torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:23456" train.py --fsdp --bf16
```

Kubernetes

For tightly-coupled training on Kubernetes, co-locate the pods in one topology domain with Kueue Topology Aware Scheduling. A cluster-scoped Topology object models the fabric as ordered levels keyed on node labels (coarse to fine), and a PodSet annotation requests a domain.⁸⁹ The scheduler deep-dive is on Kubernetes; this is only the placement lever.

Cordon/evict is implicit: delete and resubmit the workload with the annotation (a running pod's topology request is fixed at admission). First confirm the Topology exists and names the levels of your fabric:⁹

```
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: gpu-fabric
spec:
  levels:
  - nodeLabel: "cloud.provider.com/topology-block"
  - nodeLabel: "cloud.provider.com/topology-rack"
  - nodeLabel: "kubernetes.io/hostname"
```

Annotate the PodSet to require same-domain placement. kueue.x-k8s.io/podset-required-topology forces all pods onto nodes within one domain at the named level; kueue.x-k8s.io/podset-preferred-topology makes it a preference with fallback to the next-broader level if it does not fit.⁸ For a job that must stay rail-local, use required at the rack (leaf) level:
```
spec:
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
```
Requires the TopologyAwareScheduling feature gate (beta, enabled by default since Kueue v0.14).⁸ Re-apply the manifest to admit the job under the constraint.

Verification

Do not call it fixed on placement alone; require a measured recovery. The proof is the same busbw and step-time baseline the job regressed against.

Ranks are leaf-local. Re-read the allocation and confirm the nodes now sit on the minimum leaf switches (Slurm), or the pods all carry the same rack/domain label (Kubernetes):

```
scontrol -d show job "$JOBID"
scontrol show topology node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
# K8s: kubectl get pods -l job-name=<job> -o wide   # all on one rack's nodes
```
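On Kubernetes the "same rack" check reduces to counting distinct label values across the nodes that host the job's pods. A sketch under assumptions: `count_domains` is a hypothetical helper that expects `node domain` pairs on stdin, and the label key is the example one from the Topology object above:

```shell
# count_domains: read "node domain" pairs on stdin and print the number of
# distinct domains. 1 means every listed node sits in the same rack/domain.
count_domains() {
  awk 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (d in seen) n++; print n }'
}

# One way to produce the pairs (filter to the job's nodes before counting):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NODE:.metadata.name,RACK:.metadata.labels.cloud\.provider\.com/topology-rack'
```

A result above 1 for a workload annotated with podset-required-topology at the rack level means the constraint was not applied as intended.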

nccl-tests busbw recovers across the job's nodes. Build and run all_reduce_perf from NVIDIA/nccl-tests over the re-placed nodes and read busbw, not algbw. Bus bandwidth applies a per-collective correction (AllReduce: 2*(n-1)/n) to the algorithm bandwidth "to reflect the speed of the inter-GPU communication", so it can be "compare[d] with the hardware peak bandwidth, independently of the number of ranks".⁷ A rail-local job should approach the inter-node fabric figure; a scattered one sits well below it.
```
# multi-node (nccl-tests built with MPI=1), one rank per GPU:
mpirun -np $((NODES*8)) -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```
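all_reduce_perf prints busbw itself, so no conversion is needed in practice. As a sanity check on the relationship, the AllReduce correction can be applied by hand; the helper name `busbw` and the input figures are illustrative, not measurements:

```shell
# busbw ALGBW NRANKS: apply the AllReduce factor 2*(n-1)/n to an algorithm
# bandwidth (any unit; the result is in the same unit).
busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

busbw 100 8   # prints 175.00
```

The factor approaches 2 as the rank count grows, which is why busbw, not algbw, is the figure to compare against the fabric's per-link peak.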
Confirm NCCL_DEBUG=INFO shows the IB/RoCE GDR transport (NET/IB/.../GDRDMA) on the inter-node hops, not a NET/Socket fallback.⁵⁶
Step time / tokens-per-sec-per-GPU returns to baseline on the re-launched job, with the comms phase no longer dominant in the profile (MFU regression, observability). Record the before/after busbw and step time and the single variable changed (the --switches count or the topology annotation) so the win is auditable (SRE and MLOps practices).

Rollback

This runbook adds a placement constraint; the rollback is to relax or remove it, not to leave the job wedged pending.

The constraint cannot be satisfied (job stuck pending). A too-tight --switches count on a fragmented partition can leave the job pending past its max-time with no allocation. Relax to more leaves, or drop the constraint to let it run scattered while you defragment the partition. A slow job beats a job that never starts (Slurm topology placement):
```
scontrol update JobId="$JOBID" Switches=4@15:00     # widen the cap
# or remove the cap entirely (revert to unconstrained placement):
scontrol update JobId="$JOBID" Switches=0
scontrol release "$JOBID"
```
On Kubernetes, downgrade podset-required-topology to podset-preferred-topology (or a broader level) so the workload admits with best-effort co-location instead of blocking.⁸
The regression was not placement after all. If busbw stays low with ranks confirmed leaf-local and the GDR transport up, the cause is not topology. Revert the constraint and divert: a degraded link or SM goes to fabric bring-up / benchmarking; a transport fallback or partial stall goes to the NCCL-hang runbook; a config/kernel regression goes to the MFU regression runbook.
Bake the fix in. Once a constraint demonstrably recovers busbw, move it from the ad-hoc scontrol update into the submission template / job manifest in git, so the next run starts topology-aware and never regresses (SRE and MLOps practices).

Slurm topology placement: the reference for topology.conf, the tree/block plugins, and --switches (the model this runbook acts on).
fabric bring-up / benchmarking: proving links, transport, and busbw (the verification used here, and the fault-vs-placement discriminator).
NCCL-hang runbook: full collective stall (this runbook is for slow placement, not a wedge).
MFU regression runbook: when comms-bound slowness is config/kernel, not placement.
Fabric Manager failure runbook: NVLink domain down on NVSwitch nodes (rule out before blaming placement).
operational runbooks: runbook index.

References

Slurm sbatch — --switches=<count>[@max-time]: "the maximum count of leaf switches desired for the job allocation and optionally the maximum time to wait for that number of switches": https://slurm.schedmd.com/sbatch.html
Slurm topology guide — TopologyPlugin (topology/tree, topology/block), best-fit onto lowest-level / leaf switches, topology.conf SwitchName/Nodes/Switches, block BlockSizes: https://slurm.schedmd.com/topology.html
Slurm topology.conf reference: https://slurm.schedmd.com/topology.conf.html
Slurm scontrol — show topology, -d/--details job detail, requeue / requeuehold, hold / release, update JobId=, show hostnames: https://slurm.schedmd.com/scontrol.html
NCCL environment variables — NCCL_TOPO_DUMP_FILE ("Path to a file to dump the XML topology to after detection"), NCCL_DEBUG, NCCL_DEBUG_SUBSYS (INIT/GRAPH/NET): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NCCL networking troubleshooting (transport selection, GDR, fallback): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
NVIDIA/nccl-tests — all_reduce_perf, -b/-e/-f/-g, mpirun -np <ranks> -N <gpus_per_node> ... -g 1, busbw vs algbw and the AllReduce 2*(n-1)/n correction: https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
Kueue Topology Aware Scheduling — kueue.x-k8s.io/podset-required-topology / podset-preferred-topology, TopologyAwareScheduling feature gate (beta, default since v0.14): https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/
Kueue Topology object (apiVersion: kueue.x-k8s.io/v1beta2) — spec.levels[].nodeLabel: https://kueue.sigs.k8s.io/docs/concepts/topology/

1. Slurm sbatch — --switches=<count>[@max-time]: "When a tree topology is used, this defines the maximum count of leaf switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If Slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with desired switch count or the time limit expires." https://slurm.schedmd.com/sbatch.html
2. Slurm topology guide — TopologyPlugin options including topology/tree and topology/block; tree plugin "identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm"; --switches=count[@time] user constraint; topology.conf leaf switches use SwitchName + Nodes, aggregation switches use SwitchName + Switches; block topology uses BlockSizes. https://slurm.schedmd.com/topology.html
3. Slurm scontrol — show job displays NodeList and BatchHost; -d/--details adds per-node CPU/NUMA allocation; show hostnames expands a hostlist (defaulting to SLURM_JOB_NODELIST); show topology [unit=NAME] [node=NAME] displays the topology layout and the units/parent switches connected to a node; requeue returns a running/suspended/finished batch job to pending; requeuehold does the same and holds it at priority zero; hold sets priority 0 on a pending job, release clears it; update JobId=<id> modifies attributes such as Partition, Features, NumNodes, and Switches=<count>[@<max-time-to-wait>] ("the maximum count of switches desired for the job allocation ... the job remain pending until it either finds an allocation with desired switch count or the time limit expires"), but not hardware config. https://slurm.schedmd.com/scontrol.html
4. NCCL environment variables — NCCL_TOPO_DUMP_FILE: "Path to a file to dump the XML topology to after detection." NCCL_TOPO_FILE loads an XML topology before detection. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
5. NCCL environment variables — NCCL_DEBUG=INFO "Prints debug information"; NCCL_DEBUG_SUBSYS filters that output by subsystem, supported subsystems include INIT, GRAPH, and NET; prefix with ^ to disable a subsystem. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
6. NCCL networking troubleshooting — reading the chosen transport and confirming GPUDirect RDMA vs a sockets fallback. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
7. NVIDIA/nccl-tests PERFORMANCE.md — algorithm bandwidth is size (S) / time (t); bus bandwidth is "obtained applying a formula to the algorithm bandwidth to reflect the speed of the inter-GPU communication" so that "we can compare it with the hardware peak bandwidth, independently of the number of ranks used", via a per-collective correction factor (AllReduce: 2*(n-1)/n). https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
8. Kueue Topology Aware Scheduling — kueue.x-k8s.io/podset-required-topology "requires scheduling all pods on nodes within the same topology domain corresponding to the topology level indicated by the annotation value"; kueue.x-k8s.io/podset-preferred-topology makes same-domain placement "a preference rather than requirement"; requires the TopologyAwareScheduling feature gate, a beta feature enabled by default since Kueue v0.14. https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/
9. Kueue Topology — a cluster-scoped Topology object (served under apiVersion: kueue.x-k8s.io/v1beta2) defines the hierarchy of nodes via spec.levels[].nodeLabel (coarse to fine, e.g. cloud.provider.com/topology-block, cloud.provider.com/topology-rack, kubernetes.io/hostname). https://kueue.sigs.k8s.io/docs/concepts/topology/