Slurm topology-aware placement

Scope: making Slurm pack a tightly-coupled job onto the fewest, closest network leaves so its collectives stay rail-local. Covers topology.conf with the topology/tree and topology/block plugins, --switches=<count>@<time>, and why topology-unaware placement starves all-reduce. Slurm basics (partitions, GRES, the srun torchrun launch) live in Slurm for GPU Clusters and are not repeated here.

The config and commands below are reference templates built on real Slurm options; they are not hardware-tested. Pin versions and validate before production use.

What it is

Without a topology plugin, "Slurm's native mode of resource selection considers nodes as a one-dimensional array" ¹: it fills the partition by node index and ignores the fabric. Topology-aware placement teaches the scheduler the shape of the network so it can allocate the nodes of a job under as few switches (or within as few blocks) as possible, minimising the cross-spine hops every collective pays for.

Two plugins matter for GPU clusters, selected by TopologyPlugin in slurm.conf:

topology/tree models a hierarchical (fat-tree) network and allocates "resources to jobs on a hierarchical network to minimize network contention" using a best-fit on leaf switches ¹. The fabric is declared in topology.conf as leaf switches (a SwitchName plus its Nodes) and higher-level switches (a SwitchName plus its child Switches) ².
topology/block allocates "resources to jobs within a strictly enforced, hierarchical block structure" and "prioritizes the placement of jobs to minimize fragmentation across the cluster, as opposed to the tree topology, which focuses on fitting jobs on the first available resources" ¹. It starts from contiguous base blocks (bblocks) that aggregate into larger blocks at the sizes named by BlockSizes ². On NVLink-domain hardware a block is typically one rack: CoreWeave maps "each GB200 or GB300 NVL72-based system" to "a Block containing 18 Nodes in the same NVLink domain" ⁴.

The block hierarchy is the natural fit for rail/NVLink-domain clusters because its enforced boundaries match physical scale-up domains; the tree plugin is the long-standing general fat-tree model. Pick one per cluster (or per partition, see below); they are configured with different topology.conf parameters.

Why it's needed (and when)

A synchronous data/model-parallel step is a barrier: the all-reduce finishes only when the slowest rank-to-rank path finishes. Scatter an eight-node job across two leaf switches and every iteration's gradient sync crosses the spine instead of staying under a single leaf, so step time is set by the worst-placed path, not the best. Topology-unaware placement does exactly this: it packs by node index, so a job that could fit under one leaf is spread across the fabric whenever index-adjacent nodes happen to sit on different switches. The collective is then bottlenecked on oversubscribed inter-switch links, GPUs sit allocated but idle waiting on the barrier, and MFU drops: the fabric, not compute, gates the job (HPC Networking Fabric, Performance Optimization and Tuning).
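To make the barrier argument concrete, the usual bandwidth-only cost model for a ring all-reduce is sketched below. It is an approximation, not a measurement: $N$ is the number of ranks, $S$ the bytes reduced per rank, and $B_{\min}$ the bandwidth of the slowest link on the ring. Latency is ignored, and the numbers (8 ranks, 10 GB, 50 GB/s links, a 4:1 oversubscribed uplink) are assumptions chosen only for illustration.

```latex
\begin{align*}
T_{\text{all-reduce}} &\approx \frac{2(N-1)}{N}\cdot\frac{S}{B_{\min}} \\
\text{single leaf: } T &\approx \frac{2\cdot 7}{8}\cdot\frac{10\ \text{GB}}{50\ \text{GB/s}} = 0.35\ \text{s} \\
\text{across a 4:1 uplink: } T &\approx \frac{2\cdot 7}{8}\cdot\frac{10\ \text{GB}}{12.5\ \text{GB/s}} = 1.4\ \text{s}
\end{align*}
```

Under these assumptions a single slow hop on the ring quadruples the collective time for every rank, which is why the worst placement, not the average one, sets the step time.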

Use topology-aware placement when:

The job is tightly coupled: synchronous training, large all-reduce/all-gather, anything where one slow link stalls every rank in the job (Distributed Training Platform).
The fabric is blocking or rail-aligned above the leaf/rack, i.e. inter-switch bandwidth is less than the sum of node bandwidth, so spanning switches actually costs throughput.

It matters less for embarrassingly parallel work (independent single-node array jobs, batch inference) where ranks do not communicate; there, packing by switch buys adjacency but no throughput, because no rank waits on another. The scheduler trade-off, fewer switches versus a faster start, is exposed to the user by --switches (below). For the Slurm-vs-Kubernetes placement comparison see Slurm vs Kubernetes for GPUs; the scheduler internals stay in Slurm for GPU Clusters / Kubernetes for GPU Clusters.

How it's set up & managed

Tree plugin

Set the plugin in slurm.conf, then describe the fabric in topology.conf. Leaf stanzas list Nodes; aggregation stanzas list child Switches ². Slurm's hostlist parser means names need not be consecutive, e.g. Nodes=tux[0-3,12,18-20] and Switches=s[0-2,4-8,12] are valid ².

# slurm.conf
TopologyPlugin=topology/tree

# topology.conf  (topology/tree)
# Three leaf switches, each feeding one rack of GPU nodes, joined by a spine.
SwitchName=leaf0 Nodes=gpu[001-016]
SwitchName=leaf1 Nodes=gpu[017-032]
SwitchName=leaf2 Nodes=gpu[033-048]
SwitchName=spine Switches=leaf[0-2]

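With this map, a --nodes=16 job that fits under one leaf is placed there; the best-fit crosses the spine only when no single leaf can satisfy the request ¹. topology.conf also accepts an optional LinkSpeed field describing link performance ²; the man page has described it as informational and not used for allocation, so check your version before relying on it to weight placement.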
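The best-fit idea can be illustrated with a deliberately simplified sketch. This is not Slurm's implementation: it assumes the only input is a count of free nodes per leaf, and `best_fit_leaf` is a hypothetical helper name. It picks the leaf with the fewest free nodes that still fits the request, which leaves the larger holes available for larger jobs.

```python
from typing import Optional


def best_fit_leaf(free_nodes: dict[str, int], requested: int) -> Optional[str]:
    """Return the leaf with the fewest free nodes that still fits the request.

    Simplified illustration of best-fit on leaf switches; returns None when
    no single leaf fits, i.e. the job would have to span leaves.
    """
    candidates = {leaf: free for leaf, free in free_nodes.items() if free >= requested}
    if not candidates:
        return None
    # Sort key is (free count, name) so ties break deterministically by name.
    return min(candidates, key=lambda leaf: (candidates[leaf], leaf))


# Hypothetical free-node counts for the three leaves in the map above.
free = {"leaf0": 16, "leaf1": 9, "leaf2": 12}
print(best_fit_leaf(free, 8))   # leaf1
print(best_fit_leaf(free, 16))  # leaf0
print(best_fit_leaf(free, 20))  # None
```

The `None` case is where a real scheduler has to combine leaves under a common parent switch, which is the point at which cross-spine traffic appears.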

Block plugin

The block plugin enforces boundaries instead of merely preferring them. Declare the base blocks, then list the enforced sizes in BlockSizes: the first value is the planning base-block size, every block must contain at least that many nodes, and "successive BlockSizes must be a power of two larger than the prior values" ².

# slurm.conf
TopologyPlugin=topology/block

# topology.conf  (topology/block)
# 4 base blocks of 32 nodes; planning base block = 32, next enforced level = 128.
BlockName=block1 Nodes=gpu[001-032]
BlockName=block2 Nodes=gpu[033-064]
BlockName=block3 Nodes=gpu[065-096]
BlockName=block4 Nodes=gpu[097-128]
BlockSizes=32,128

Allocation is: find the smallest BlockSizes level that satisfies the request, pick a subset of lower-level blocks inside that aggregating block, then best-fit onto the underlying base blocks ¹. Size the planning base block to the physical scale-up domain (one rack / NVLink domain) so "Nodes from the same job [land] within the same rack whenever possible, which maximizes NVLink fabric performance" ⁴.
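The first step of that procedure, picking the smallest level that fits, can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the helper names are hypothetical, and returning `None` for a request larger than the top level is a choice made here, not documented Slurm behaviour.

```python
from typing import Optional


def smallest_level_that_fits(block_sizes: list[int], requested: int) -> Optional[int]:
    """Return the smallest BlockSizes value that can hold the request."""
    for size in sorted(block_sizes):
        if size >= requested:
            return size
    return None  # larger than the largest enforced level


def base_blocks_needed(base_block_size: int, requested: int) -> int:
    """Number of base blocks a request touches at minimum (ceiling division)."""
    return -(-requested // base_block_size)


block_sizes = [32, 128]  # matches BlockSizes=32,128 above
for nodes in (16, 40, 200):
    level = smallest_level_that_fits(block_sizes, nodes)
    print(nodes, level, base_blocks_needed(block_sizes[0], nodes))
# 16 32 1
# 40 128 2
# 200 None 7
```

A 40-node request cannot fit in one 32-node base block, so it is planned at the 128-node level and touches at least two base blocks; that is the case where cross-block collective cost reappears.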

Per-partition topology

A cluster can run more than one topology: "Each partition can be configured to use a specific topology by specifying the Topology in its partition configuration line" ¹. This lets a rail-aligned training partition use topology/block while a general partition uses topology/tree (verify the partition-line Topology= syntax against slurm.conf for your version).

Applying changes

topology.conf is read at controller start and on reconfigure. After editing, reload and confirm Slurm parsed the fabric (this reflects config, not a hardware probe; keep the file in sync with the real cabling):

scontrol reconfigure
scontrol show topology            # tree: switch tree;  block: configured blocks
scontrol show topology leaf0      # inspect one switch/block

Validated usage & tests

Request fewer switches and bound the wait. --switches=count[@max-time-to-wait] "defines the maximum count of leaf switches desired for the job allocation"; if Slurm "finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with desired switch count or the time limit expires" ³. The job does not start early on a wider allocation: it waits for the requested count, then starts on whatever is available once the timer lapses. Accepted time formats include minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds ³. Administrators cap the wait with the max_switch_wait option of SchedulerParameters ³¹; slurm.conf documents a default of 300 seconds, so a longer @time only takes effect if the site has raised that cap (check your version).

# Confine an 8-node job to a single leaf switch; wait up to 30 minutes for it,
# otherwise start on whatever the scheduler can give.
sbatch --nodes=8 --switches=1@30:00 train.sbatch

Expected: while a one-switch allocation is unavailable, squeue shows the job PENDING; scontrol show job <id> reports a reason consistent with waiting on the optimal switch count. Once a single-leaf set frees up (or 30 minutes elapse) the job starts. Do not assume a specific reason string or timing; read what your version prints.
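To sanity-check what a given @time value means before submitting, the documented formats can be converted to seconds with a small helper. This is a sketch that mirrors the format list above, not Slurm's own parser; `switch_wait_seconds` and `effective_wait` are hypothetical names, and the cap passed to `effective_wait` is whatever max_switch_wait your site has configured.

```python
def switch_wait_seconds(spec: str) -> int:
    """Convert a --switches @max-time-to-wait value to seconds.

    Handles: minutes, minutes:seconds, hours:minutes:seconds,
    days-hours, days-hours:minutes, days-hours:minutes:seconds.
    """
    days = 0
    if "-" in spec:
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(p) for p in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unrecognised time value: {spec!r}")
        fields += [0] * (3 - len(fields))  # pad missing minutes/seconds
        hours, minutes, seconds = fields
    else:
        fields = [int(p) for p in spec.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unrecognised time value: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def effective_wait(spec: str, max_switch_wait: int) -> int:
    """Requested wait, limited by the administrator's cap (both in seconds)."""
    return min(switch_wait_seconds(spec), max_switch_wait)


print(switch_wait_seconds("30:00"))   # 1800
print(switch_wait_seconds("1-12"))    # 129600
print(effective_wait("30:00", 300))   # 300
```

The last line shows why a long @time can be misleading: with a 300-second cap the job stops waiting for the single-switch allocation after five minutes, not thirty.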

Confirm where a running job actually landed and that its node set sits under one leaf / in one block:

scontrol show job <id> | grep -E "NodeList|SwitchCount|Switches"
scontrol show hostnames "$SLURM_JOB_NODELIST"      # allocated nodes; run inside the job, where the variable is set
scontrol show topology | grep -A2 leaf0            # cross-check against the map
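
The same cross-check can be scripted offline against the topology.conf text. The sketch below is an assumption-laden helper, not a Slurm tool: `expand_hostlist` handles only the simple `prefix[a-b,c]` and comma-separated forms (no suffixes, no nested brackets), keys are matched case-sensitively, and a node missing from the map raises `KeyError`.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a simple hostlist such as 'gpu[001-003,007]' or 'gpu001,gpu002'."""
    hosts: list[str] = []
    # Each item is a name optionally followed by one bracket group.
    for item in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        match = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", item)
        if not match:
            hosts.append(item)
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo
            width = len(lo)  # preserve zero padding, e.g. 001
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
    return hosts


def leaf_map(topology_conf: str) -> dict[str, str]:
    """Map node name to leaf switch from 'SwitchName=... Nodes=...' lines."""
    node_to_leaf: dict[str, str] = {}
    for line in topology_conf.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if "SwitchName" in fields and "Nodes" in fields:
            for node in expand_hostlist(fields["Nodes"]):
                node_to_leaf[node] = fields["SwitchName"]
    return node_to_leaf


def leaves_spanned(topology_conf: str, nodelist: str) -> set[str]:
    """Set of leaf switches that the given node list touches."""
    mapping = leaf_map(topology_conf)
    return {mapping[node] for node in expand_hostlist(nodelist)}


TOPO = """
SwitchName=leaf0 Nodes=gpu[001-016]
SwitchName=leaf1 Nodes=gpu[017-032]
SwitchName=leaf2 Nodes=gpu[033-048]
SwitchName=spine Switches=leaf[0-2]
"""
print(sorted(leaves_spanned(TOPO, "gpu[001-008]")))  # ['leaf0']
print(sorted(leaves_spanned(TOPO, "gpu[014-019]")))  # ['leaf0', 'leaf1']
```

Feed it the NodeList reported by scontrol show job; a result with more than one leaf means the job crosses the spine. Like scontrol show topology, this reflects the config file, not the cabling.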

On the block plugin, request a block-local shape explicitly. --segment=N sets the segment size, the number of nodes that must land together inside one block ("Segments cannot span Blocks" ⁴). CoreWeave's guidance is to "default to --segment=1" so "your job [is] more flexible because it can run on any available Nodes across all Blocks"; to keep a job block-local, set the segment size to the job's node count instead. --exclusive=topo makes "only the job being submitted [run] in a Block" ⁴ (both are block-plugin features; verify availability for your Slurm version).

# 16-node job as a single 16-node segment, so all nodes land in one block;
# --exclusive=topo keeps other jobs out of that block.
srun --nodes=16 --segment=16 --exclusive=topo --gres=gpu:8 ./train_mpi

The application-level proof is a collective benchmark: run nccl-tests all_reduce_perf across the job's nodes and compare bus bandwidth for a single-leaf/single-block placement against a deliberately spread one. The single-domain placement should show higher, more stable busbw; the spread placement should show lower busbw from cross-spine hops. Record your own numbers; do not assume a figure (HPC Networking Fabric, Slurm for GPU Clusters).
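Comparing the two runs can be scripted. The sketch below assumes the summary line containing "Avg bus bandwidth" that nccl-tests prints at the end of a run; confirm the exact wording against your build's output. The helper names are hypothetical, and the two sample strings are synthetic placeholders used only to exercise the parser, not measurements.

```python
import re


def avg_busbw(output: str) -> float:
    """Extract the average bus bandwidth (GB/s) from captured nccl-tests output."""
    match = re.search(r"Avg bus bandwidth\s*:\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found")
    return float(match.group(1))


def relative_drop(packed_gbps: float, spread_gbps: float) -> float:
    """Fraction of bus bandwidth lost by the spread placement."""
    return (packed_gbps - spread_gbps) / packed_gbps


# Synthetic placeholder text, not real results.
packed_run = "# Avg bus bandwidth    : 100.0\n"
spread_run = "# Avg bus bandwidth    : 50.0\n"

drop = relative_drop(avg_busbw(packed_run), avg_busbw(spread_run))
print(f"{drop:.0%}")  # 50%
```

Capture the output of each placement to a file, run both through `avg_busbw`, and record the drop alongside the node lists so the result can be tied back to the placement that produced it.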

Failure modes

No topology.conf / wrong plugin: placement is index-based; tightly-coupled jobs scatter across spines and every all-reduce eats extra hops. Symptom: high step time and low MFU with healthy GPUs. Triage in Topology-Unaware Scheduling Starvation.
topology.conf does not match the cabling: the map is config, not a probe; a stale file makes the scheduler "pack" onto switches that are not actually adjacent, so the optimisation is silently wrong. Re-derive the map from the live fabric and scontrol reconfigure (HPC Networking Fabric).
--switches too strict: an unsatisfiable count with a long @time (or none) parks the job PENDING, waiting for an allocation that never opens up; loosen the count, shorten the wait, or rely on max_switch_wait to cap it ³¹.
Job larger than one block/leaf: on the block plugin a job that exceeds a block is split into equal segments across blocks ("Segments cannot span Blocks"), which can strand capacity and reintroduce cross-block collective cost ⁴; size jobs to the block, or accept the split knowingly.
BlockSizes not power-of-two-stacked / base block mis-sized: the controller rejects or mis-plans the hierarchy; keep successive sizes a power of two larger and the base block equal to the physical scale-up domain ²⁴.
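
The BlockSizes rule in the last item can be linted before the file reaches the controller. The check below is a sketch of one reading of the quoted rule, namely that each size must be the previous size multiplied by a power of two greater than one; that interpretation and the function names are assumptions, so confirm against topology.conf for your version.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def validate_block_sizes(block_sizes: list[int]) -> bool:
    """True when every size is the previous one times a power of two (> 1)."""
    if not block_sizes or block_sizes[0] <= 0:
        return False
    for prev, cur in zip(block_sizes, block_sizes[1:]):
        if cur <= prev or cur % prev != 0 or not is_power_of_two(cur // prev):
            return False
    return True


print(validate_block_sizes([32, 128]))     # True  (128 = 32 * 4)
print(validate_block_sizes([18, 36, 72]))  # True  (each step doubles)
print(validate_block_sizes([32, 96]))      # False (96 = 32 * 3)
```

Running this in CI next to the topology.conf template catches a mis-stacked hierarchy before scontrol reconfigure does.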

A persistent placement stall or a confirmed topology mis-map escalates to Topology-Unaware Scheduling Starvation. Pair topology gating with health gating so a job is never placed onto a marked-bad node (GPU Health Gating).

References

Slurm Topology Guide (plugins, best-fit, --switches, max_switch_wait, per-partition topology): https://slurm.schedmd.com/topology.html
topology.conf(5) (SwitchName/Switches/Nodes/LinkSpeed, BlockName/BlockSizes, hostlist ranges): https://slurm.schedmd.com/topology.conf.html
sbatch(1) (--switches=count[@max-time-to-wait] semantics and time formats): https://slurm.schedmd.com/sbatch.html
slurm.conf(5) (TopologyPlugin, TopologyParam, partition Topology=, SchedulerParameters): https://slurm.schedmd.com/slurm.conf.html
CoreWeave, Topology/Block scheduling on GPU clusters (NVLink-domain blocks, --segment, --exclusive=topo): https://docs.coreweave.com/products/sunk/optimize_workloads/topology-scheduling

¹ Slurm Topology Guide: https://slurm.schedmd.com/topology.html
² Slurm topology.conf(5): https://slurm.schedmd.com/topology.conf.html
³ Slurm sbatch(1): https://slurm.schedmd.com/sbatch.html
⁴ CoreWeave, Topology/Block Scheduling in Slurm: https://docs.coreweave.com/products/sunk/optimize_workloads/topology-scheduling