Runbook: topology-unaware scheduling starvation
Scope: a tightly-coupled training job runs but crawls because its ranks landed scattered across the spine instead of rail-local on the fewest leaf switches, so every collective pays extra hops. Re-place the job under topology constraints and prove the bus bandwidth recovers.
Reference templates on real APIs. Nothing here was executed on hardware; pin versions to your Slurm / Kueue / NCCL release, substitute real node and switch names, and validate on one job before fleet use.
This is the placement-quality counterpart to the Slurm topology placement reference (how topology.conf, --switches, and the tree/block plugins model the fabric) and the fabric bring-up / benchmarking procedure (how to read busbw and confirm the transport). It is distinct from a fabric fault: here the fabric is healthy and the job is slow purely because the scheduler spread the ranks. If a collective fully stalls (step time goes to infinity, no progress), that is the NCCL-hang runbook, not this one. The scheduler internals themselves live on the Slurm, Kubernetes, and k3s pages; link to them, do not re-derive them here.
The failure mode: synchronous data/model-parallel training is bounded by its slowest link, and all-reduce is rail-sensitive (networking fabric). Ranks on the same leaf switch exchange gradients in one hop; ranks split across leaves traverse leaf -> spine -> leaf on every step. A topology-unaware allocation does not fail, it just taxes every iteration, so it surfaces as an MFU regression with comms as the dominant phase, not as an outage.
Trigger
- Step time / tokens-per-sec-per-GPU below baseline on the same model + parallelism + node count, with the NCCL/comms phase dominant in the profile (MFU regression, observability, SLO/SLI catalog). Compute is fine; the all-reduce is slow.
- nccl-tests busbw across the job's actual nodes is well below the expected inter-node figure for the fabric, even though point-to-point links are healthy (fabric bring-up / benchmarking).
- The job spans more leaf switches than it needs. An 8-node job that could sit on one or two leaves is instead smeared across several leaves and the spine, visible in scontrol show topology / scontrol -d show job (Slurm) or in the nodes' topology labels (Kubernetes).
- Regression appears after a scheduling change: a new partition, a removed or stale topology.conf, a job submitted without --switches, or a K8s workload submitted without topology-aware scheduling (Slurm topology placement).
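The leaf-spread trigger can be checked mechanically. The sketch below assumes one "node leaf" pair per line has already been produced for the job's hosts (for example from scontrol show topology node=<name>; that extraction depends on your Slurm version's output and is not shown). The helper name leaf_spread and the sample node and leaf names are hypothetical.

```shell
#!/usr/bin/env bash
# leaf_spread: read "node leaf" pairs on stdin, print one "leaf count" line per
# leaf (sorted by leaf name), then a final "leaves=N" line.
# Hypothetical helper; producing the pairs is site-specific.
leaf_spread() {
  awk 'NF >= 2 { print $2 }' \
    | sort \
    | uniq -c \
    | awk '{ printf "%s %d\n", $2, $1; n++ } END { printf "leaves=%d\n", n }'
}

# Illustrative input: an 8-node job smeared across four leaves.
leaf_spread <<'EOF'
gpu001 leaf01
gpu002 leaf01
gpu017 leaf02
gpu018 leaf02
gpu033 leaf03
gpu034 leaf03
gpu049 leaf04
gpu050 leaf04
EOF
```

If the final leaves= figure is larger than the job needs, the placement matches the trigger; the per-leaf counts show how unevenly the ranks are split.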
Pre-checks
Establish that this is a placement problem and not a fault, before changing any constraint.
- A baseline must exist. Without a recorded step-time / busbw baseline for this exact model + parallelism + node count, there is nothing to regress against, so establish one first (MFU regression, SLO/SLI catalog).
- Rule out a fabric fault first. A degraded link or a down subnet manager mimics topology starvation. Confirm links are up and the SM is converged before blaming placement (fabric bring-up / benchmarking); on NVSwitch nodes confirm Fabric Manager is active (Fabric Manager failure runbook). If a collective is fully wedged, divert to the NCCL-hang runbook.
- Read the actual placement (Slurm). scontrol show topology displays the configured switch/block layout; node=<name> reports the units a node connects to. -d adds per-node allocation detail to the job.[3][2]

      JOBID=123456
      scontrol -d show job "$JOBID"    # NodeList, BatchHost, per-node alloc
      # expand the allocated hostlist; \b keeps ReqNodeList/ExcNodeList from matching
      scontrol show hostnames "$(scontrol -o show job "$JOBID" | grep -oP '\bNodeList=\K\S+')"
      scontrol show topology node=gpu001    # switches/blocks this node hangs off

  Map each allocated node to its leaf switch from topology.conf; if the job's nodes fan out across many leaves and a spine, that is the starvation signature (Slurm topology placement).
- Read the path NCCL actually built. Dump the detected topology graph and confirm the inter-node transport is GDR-capable IB/RoCE, not a TCP fallback. NCCL_TOPO_DUMP_FILE writes the detected XML topology after detection, and NCCL_DEBUG_SUBSYS filters NCCL_DEBUG=INFO to the GRAPH/INIT/NET subsystems.[4][5]

      NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,NET \
      NCCL_TOPO_DUMP_FILE=/tmp/nccl-topo.xml <launch cmd> 2>&1 | grep -E "NET/IB|GDRDMA|NET/Socket"

  NET/IB/.../GDRDMA on the inter-node hops is the intended path; a NET/Socket line where IB was expected is a transport misconfiguration to fix first (NCCL-hang runbook, performance tuning). That is a different bug from topology spread, and it masks it.
- Confirm the constraint surface exists. Slurm: topology/tree (or topology/block) is set in slurm.conf and topology.conf is present and current. Kubernetes: a Topology object and the TopologyAwareScheduling feature gate are in place (Slurm topology placement). If the fabric model is missing, no constraint can help until it is declared.
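For long logs, the transport pre-check is easier to read as counts than as raw grep output. This is a minimal sketch, assuming the per-channel lines in NCCL_DEBUG=INFO output end in "via NET/IB/<n>/GDRDMA", "via NET/IB/<n>", or "via NET/Socket/<n>"; the exact log shape varies across NCCL versions, so check it against your own output first. The helper name nccl_transport_summary and the sample log lines are illustrative.

```shell
#!/usr/bin/env bash
# nccl_transport_summary: read NCCL_DEBUG=INFO output on stdin and count the
# channel lines per inter-node transport. Hypothetical helper name; the
# "via NET/..." line shape is an assumption about the log format.
nccl_transport_summary() {
  awk '
    /via NET\/IB.*GDRDMA/ { gdr++; next }   # IB/RoCE with GPUDirect RDMA
    /via NET\/IB/         { ib++; next }    # IB/RoCE without GDR
    /via NET\/Socket/     { sock++ }        # TCP sockets fallback
    END { printf "ib_gdr=%d ib_plain=%d socket=%d\n", gdr, ib, sock }
  '
}

# Illustrative log excerpt (not captured from a real run).
nccl_transport_summary <<'EOF'
node1:101:201 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
node1:101:201 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/1/GDRDMA
node2:102:202 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0
EOF
```

Any non-zero socket count where IB was expected points at the transport misconfiguration, which should be fixed before placement is judged.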
Procedure
Re-placement means evicting and re-launching the job under topology constraints. Cordon/requeue before mutating; never edit a running allocation in place. Constrain on the smallest domain that still fits the job (one leaf if it fits, else the fewest leaves under one spine).
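The "smallest domain that still fits" rule reduces to a ceiling division when leaves are uniform. The sketch below assumes every leaf switch serves the same number of nodes and that whole leaves are free, which a fragmented partition will not satisfy; min_leaves is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# min_leaves NODES NODES_PER_LEAF: smallest leaf-switch count that can hold the
# job, i.e. ceil(NODES / NODES_PER_LEAF). Assumes uniform leaves.
min_leaves() {
  local nodes=$1 per_leaf=$2
  if (( nodes < 1 || per_leaf < 1 )); then
    echo "usage: min_leaves NODES NODES_PER_LEAF (both >= 1)" >&2
    return 1
  fi
  echo $(( (nodes + per_leaf - 1) / per_leaf ))
}

min_leaves 8 16    # an 8-node job on 16-node leaves
min_leaves 40 16   # a 40-node job on 16-node leaves
```

The result is the lowest count worth passing as the switch cap; anything smaller can never be satisfied, and anything larger permits avoidable spread.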
Slurm
1. Hold and requeue the running job so the scheduler can re-place it instead of leaving it pinned to the scattered nodes. requeuehold returns the job to pending in held state (priority zero); requeue returns it to pending to run when resources fit.[3]

2. Apply the topology constraint to the pending job. --switches=count[@max-time] caps the leaf switches the allocation may span; the tree plugin then best-fits onto the fewest leaves and the job stays pending until that fits or the optional max-time elapses.[1][2] Prefer to fix it at submission so the next run starts correct; the constraint belongs in the batch header (Slurm topology placement). On an already-pending job, set it via scontrol update; Features, NumNodes, and similar are modifiable, hardware config is not:[3]

       # cap at one leaf switch, wait up to 30 min for that optimal placement:
       scontrol update JobId="$JOBID" Switches=1@30:00
       scontrol release "$JOBID"

3. If the job is too large for one leaf, constrain to the fewest leaves under one spine rather than letting it scatter. Raise the count to the minimum that fits (e.g. two leaves of a 16-leaf-per-spine fabric) so the all-reduce still avoids cross-spine hops where possible (Slurm topology placement). On a block-topology fabric the analogous lever is topology/block with BlockSizes in topology.conf; jobs pack into the smallest block that fits.[2]

4. Re-launch and confirm gang placement. Once it dispatches, re-derive MASTER_ADDR and start the ranks; the canonical multi-node torchrun launch and gres.conf GPU binding live on the Slurm page. The block below is a minimal reminder, not the full reference.
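The hold, constrain, and release sequence above can be rehearsed as a dry run before it touches a real job. This is a sketch under assumptions: replace_job is a hypothetical wrapper, DRY_RUN is a variable invented here, and the only Slurm calls are the scontrol subcommands this section already names. With DRY_RUN=1 (the default) nothing is executed; the commands are only printed.

```shell
#!/usr/bin/env bash
# replace_job JOBID SWITCHES WAIT: requeue-and-hold a job, cap its leaf-switch
# count, then release it. Hypothetical wrapper around scontrol.
# DRY_RUN=1 (default) prints the commands instead of running them.
replace_job() {
  local jobid=$1 switches=$2 wait=$3
  local run=()
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    run=(echo)
  fi
  # Stop at the first failing step so a job is never released unconstrained.
  "${run[@]}" scontrol requeuehold "$jobid" || return 1
  "${run[@]}" scontrol update "JobId=$jobid" "Switches=${switches}@${wait}" || return 1
  "${run[@]}" scontrol release "$jobid" || return 1
}

# Rehearse: one leaf, wait up to 30 minutes (job id is a placeholder).
replace_job 123456 1 30:00
```

Setting DRY_RUN=0 runs the same three commands for real; review the printed sequence against one job first.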
Kubernetes
For tightly-coupled training on Kubernetes, co-locate the pods in one topology domain with Kueue Topology Aware Scheduling. A cluster-scoped Topology object models the fabric as ordered levels keyed on node labels (coarse to fine), and a PodSet annotation requests a domain.[8][9] The scheduler deep-dive is on the Kubernetes page; this is only the placement lever.
1. Cordon/evict is implicit: delete and resubmit the workload with the annotation (a running pod's topology request is fixed at admission). First confirm the Topology exists and names the levels of your fabric.[9]

2. Annotate the PodSet to require same-domain placement. kueue.x-k8s.io/podset-required-topology forces all pods onto nodes within one domain at the named level; kueue.x-k8s.io/podset-preferred-topology makes it a preference with fallback to the next-broader level if it does not fit.[8] For a job that must stay rail-local, use required at the rack (leaf) level:

       spec:
         template:
           metadata:
             annotations:
               kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"

   Requires the TopologyAwareScheduling feature gate (beta, enabled by default since Kueue v0.14).[8] Re-apply the manifest to admit the job under the constraint.
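For reference, the Topology object the annotation resolves against could look like the manifest printed below. This is a sketch, not a verified manifest: the apiVersion and the three node labels are the ones quoted in the Kueue Topology reference cited here, the object name "default" is a placeholder, and topology_manifest is a hypothetical helper. Check the served API version on your cluster before applying anything.

```shell
#!/usr/bin/env bash
# topology_manifest NAME: print a Kueue Topology manifest on stdout.
# Levels run coarse to fine (block, rack, host); label keys are the example
# keys from the Kueue docs and must match the labels on your nodes.
topology_manifest() {
  local name=${1:-default}
  cat <<EOF
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: ${name}
spec:
  levels:
    - nodeLabel: cloud.provider.com/topology-block
    - nodeLabel: cloud.provider.com/topology-rack
    - nodeLabel: kubernetes.io/hostname
EOF
}

topology_manifest default
# To compare with the live object (resource name assumed, not verified here):
#   kubectl get topologies.kueue.x-k8s.io default -o yaml
```

The annotation value in the PodSet must name one of these levels exactly; a rack-level requirement only works if the rack label appears in spec.levels.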
Verification
Do not call it fixed on placement alone; require a measured recovery. The proof is the same busbw and step-time baseline the job regressed against.
1. Ranks are leaf-local. Re-read the allocation and confirm the nodes now sit on the minimum leaf switches (Slurm), or the pods all carry the same rack/domain label (Kubernetes).

2. nccl-tests busbw recovers across the job's nodes. Build and run all_reduce_perf from NVIDIA/nccl-tests over the re-placed nodes and read busbw, not algbw. Bus bandwidth applies a per-collective correction (AllReduce: 2*(n-1)/n) to the algorithm bandwidth "to reflect the speed of the inter-GPU communication", so it can be "compare[d] with the hardware peak bandwidth, independently of the number of ranks".[7] A rail-local job should approach the inter-node fabric figure; a scattered one sits well below it.

       # multi-node (nccl-tests built with MPI=1), one rank per GPU:
       mpirun -np $((NODES*8)) -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

   Confirm NCCL_DEBUG=INFO shows the IB/RoCE GDR transport (NET/IB/.../GDRDMA) on the inter-node hops, not a NET/Socket fallback.[5][6]

3. Step time / tokens-per-sec-per-GPU returns to baseline on the re-launched job, with the comms phase no longer dominant in the profile (MFU regression, observability). Record the before/after busbw and step time and the single variable changed (the --switches count or the topology annotation) so the win is auditable (SRE and MLOps practices).
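The AllReduce correction and the "back to baseline" judgement can both be made explicit. The sketch below applies the 2*(n-1)/n factor quoted above and then checks a measured figure against a baseline with a tolerance. The helper names, the 5% tolerance, and all numbers are illustrative assumptions, not measurements.

```shell
#!/usr/bin/env bash
# allreduce_busbw ALGBW NRANKS: busbw = algbw * 2*(n-1)/n, in the unit of ALGBW.
allreduce_busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (n < 2) { print "need at least 2 ranks" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", a * 2 * (n - 1) / n
  }'
}

# recovered MEASURED BASELINE TOLERANCE_PCT: succeed if MEASURED is no more
# than TOLERANCE_PCT percent below BASELINE.
recovered() {
  awk -v m="$1" -v b="$2" -v t="$3" 'BEGIN { exit !(m >= b * (1 - t / 100)) }'
}

allreduce_busbw 100 16    # illustrative: algbw of 100 on 16 ranks
if recovered 180 187.5 5; then
  echo "busbw within 5% of baseline"
fi
```

nccl-tests already prints busbw, so the first helper is mainly useful for cross-checking a figure reported as algbw; the second turns the before/after record into a pass or fail.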
Rollback
This runbook adds a placement constraint; the rollback is to relax or remove it, not to leave the job wedged pending.
1. The constraint cannot be satisfied (job stuck pending). A too-tight --switches count on a fragmented partition can hold the job pending for its full max-time with no allocation. Relax to more leaves, or drop the constraint to let it run scattered while you defragment the partition. A slow job beats a job that never starts (Slurm topology placement):

       scontrol update JobId="$JOBID" Switches=4@15:00   # widen the cap
       # or remove the cap entirely (revert to unconstrained placement):
       scontrol update JobId="$JOBID" Switches=0
       scontrol release "$JOBID"

   On Kubernetes, downgrade podset-required-topology to podset-preferred-topology (or a broader level) so the workload admits with best-effort co-location instead of blocking.[8]

2. The regression was not placement after all. If busbw stays low with ranks confirmed leaf-local and the GDR transport up, the cause is not topology. Revert the constraint and divert: a degraded link or SM goes to fabric bring-up / benchmarking; a transport fallback or partial stall goes to the NCCL-hang runbook; a config/kernel regression goes to the MFU regression runbook.

3. Bake the fix in. Once a constraint demonstrably recovers busbw, move it from the ad-hoc scontrol update into the submission template / job manifest in git, so the next run starts topology-aware and never regresses (SRE and MLOps practices).
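One way to keep the constraint in git is to generate the batch script from a small template. This is a sketch under stated assumptions: 8 GPUs per node, rendezvous port 29500, and an entry point named train.py are all placeholders, and render_batch is a hypothetical helper. The canonical launch lives in the Slurm reference; this only shows where the switch cap sits.

```shell
#!/usr/bin/env bash
# render_batch NODES SWITCHES WAIT: print a topology-aware sbatch script.
# Placeholders (assumptions): 8 GPUs per node, port 29500, train.py.
render_batch() {
  local nodes=$1 switches=$2 wait=$3
  cat <<EOF
#!/bin/bash
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --switches=${switches}@${wait}

# first host of the allocation becomes the rendezvous endpoint
MASTER_ADDR=\$(scontrol show hostnames "\$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun --nnodes=${nodes} --nproc_per_node=8 \\
  --rdzv_id="\$SLURM_JOB_ID" --rdzv_backend=c10d \\
  --rdzv_endpoint="\${MASTER_ADDR}:29500" train.py
EOF
}

# Render an 8-node job capped at one leaf, waiting up to 30 minutes.
render_batch 8 1 30:00
```

Committing the rendered script (or the template plus its three arguments) records the placement constraint alongside the job, so a later regression can be traced to a change in that file.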
Related runbooks
- Slurm topology placement: the reference for topology.conf, the tree/block plugins, and --switches (the model this runbook acts on).
- fabric bring-up / benchmarking: proving links, transport, and busbw (the verification used here, and the fault-vs-placement discriminator).
- NCCL-hang runbook: full collective stall (this runbook is for slow placement, not a wedge).
- MFU regression runbook: when comms-bound slowness is config/kernel, not placement.
- Fabric Manager failure runbook: NVLink domain down on NVSwitch nodes (rule out before blaming placement).
- operational runbooks: runbook index.
References
- Slurm sbatch, --switches=<count>[@max-time]: "the maximum count of leaf switches desired for the job allocation and optionally the maximum time to wait for that number of switches": https://slurm.schedmd.com/sbatch.html
- Slurm topology guide: TopologyPlugin (topology/tree, topology/block), best-fit onto lowest-level / leaf switches, topology.conf SwitchName/Nodes/Switches, block BlockSizes: https://slurm.schedmd.com/topology.html
- Slurm topology.conf reference: https://slurm.schedmd.com/topology.conf.html
- Slurm scontrol: show topology, -d/--details job detail, requeue/requeuehold, hold/release, update JobId=, show hostnames: https://slurm.schedmd.com/scontrol.html
- NCCL environment variables: NCCL_TOPO_DUMP_FILE ("Path to a file to dump the XML topology to after detection"), NCCL_DEBUG, NCCL_DEBUG_SUBSYS (INIT/GRAPH/NET): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- NCCL networking troubleshooting (transport selection, GDR, fallback): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
- NVIDIA/nccl-tests: all_reduce_perf, -b/-e/-f/-g, mpirun -np <ranks> -N <gpus_per_node> ... -g 1, busbw vs algbw and the AllReduce 2*(n-1)/n correction: https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
- Kueue Topology Aware Scheduling: kueue.x-k8s.io/podset-required-topology / podset-preferred-topology, TopologyAwareScheduling feature gate (beta, default since v0.14): https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/
- Kueue Topology object (apiVersion: kueue.x-k8s.io/v1beta2): spec.levels[].nodeLabel: https://kueue.sigs.k8s.io/docs/concepts/topology/
1. Slurm sbatch, --switches=<count>[@max-time]: "When a tree topology is used, this defines the maximum count of leaf switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If Slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with desired switch count or the time limit expires." https://slurm.schedmd.com/sbatch.html
2. Slurm topology guide: TopologyPlugin options including topology/tree and topology/block; tree plugin "identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm"; --switches=count[@time] user constraint; topology.conf leaf switches use SwitchName + Nodes, aggregation switches use SwitchName + Switches; block topology uses BlockSizes. https://slurm.schedmd.com/topology.html
3. Slurm scontrol: show job displays NodeList and BatchHost; -d/--details adds per-node CPU/NUMA allocation; show hostnames expands a hostlist (defaulting to SLURM_JOB_NODELIST); show topology [unit=NAME] [node=NAME] displays the topology layout and the units/parent switches connected to a node; requeue returns a running/suspended/finished batch job to pending; requeuehold does the same and holds it at priority zero; hold sets priority 0 on a pending job, release clears it; update JobId=<id> modifies attributes such as Partition, Features, NumNodes, and Switches=<count>[@<max-time-to-wait>] ("the maximum count of switches desired for the job allocation ... the job remain pending until it either finds an allocation with desired switch count or the time limit expires"), but not hardware config. https://slurm.schedmd.com/scontrol.html
4. NCCL environment variables: NCCL_TOPO_DUMP_FILE, "Path to a file to dump the XML topology to after detection." NCCL_TOPO_FILE loads an XML topology before detection. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
5. NCCL environment variables: NCCL_DEBUG=INFO "Prints debug information"; NCCL_DEBUG_SUBSYS filters that output by subsystem, supported subsystems include INIT, GRAPH, and NET; prefix with ^ to disable a subsystem. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
6. NCCL networking troubleshooting: reading the chosen transport and confirming GPUDirect RDMA vs a sockets fallback. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
7. NVIDIA/nccl-tests PERFORMANCE.md: algorithm bandwidth is size (S) / time (t); bus bandwidth is "obtained applying a formula to the algorithm bandwidth to reflect the speed of the inter-GPU communication" so that "we can compare it with the hardware peak bandwidth, independently of the number of ranks used", via a per-collective correction factor (AllReduce: 2*(n-1)/n). https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
8. Kueue Topology Aware Scheduling: kueue.x-k8s.io/podset-required-topology "requires scheduling all pods on nodes within the same topology domain corresponding to the topology level indicated by the annotation value"; kueue.x-k8s.io/podset-preferred-topology makes same-domain placement "a preference rather than requirement"; requires the TopologyAwareScheduling feature gate, a beta feature enabled by default since Kueue v0.14. https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/
9. Kueue Topology: a cluster-scoped Topology object (served under apiVersion: kueue.x-k8s.io/v1beta2) defines the hierarchy of nodes via spec.levels[].nodeLabel (coarse to fine, e.g. cloud.provider.com/topology-block, cloud.provider.com/topology-rack, kubernetes.io/hostname). https://kueue.sigs.k8s.io/docs/concepts/topology/