Runbook: scheduler: GPU job stuck pending
Scope: diagnose a Kubernetes (Pending) or Slurm (PD) GPU job that never starts (insufficient allocatable GPUs, taints/affinity, MIG/profile mismatch, quota, or gang-scheduling deadlock) and get it scheduled.
Run this when a GPU workload sits in `Pending` (k8s) or `PD` (Slurm) and never transitions to running: the pod/job is admitted but no node satisfies its request, or GPUs look idle but the scheduler refuses to place the work. Severity: workload-blocked, not crashed. There is no container/process to read, only a scheduler verdict to decode.
Reference templates on real APIs; pin versions and validate before production use. Not hardware-tested.
A pending job is a placement failure, not a runtime fault. The scheduler has a reason; read it first, do not guess. Common roots: the allocatable GPU pool is exhausted, the request names a resource the node does not advertise (MIG profile mismatch; see MIG partitioning), a taint/affinity/quota gate blocks the only eligible nodes, or a gang/co-scheduling constraint cannot assemble the full set at once. Node-level GPU readiness (driver, Fabric Manager, health gating) is upstream: if nodes are NotReady or cordoned by health gating, fix that first via GPU health gating, the kernel/GPU-missing runbook, and the Fabric Manager runbook.
Trigger
- k8s: `kubectl get pod` shows `Pending`; `kubectl describe pod` Events show `FailedScheduling` with `0/N nodes are available`.
- Slurm: `squeue` shows state `PD` with a reason in parentheses (`Resources`, `Priority`, `ReqNodeNotAvail`, `AssocGrp*Limit`, `QOSMax*`).
- GPUs appear idle (`nvidia-smi` on nodes shows free memory) yet the scheduler will not place the job.
Pre-checks
- Confirm nodes are actually schedulable and GPU-ready. A cordoned, `NotReady`, or health-gated node will not accept work even with free GPUs. If allocatable GPUs are zero cluster-wide, this is a node/device-plugin problem, not a scheduling one (GPU health gating, the kernel/GPU-missing runbook):

      kubectl get nodes -o wide                                   # Ready / SchedulingDisabled?
      kubectl describe node <gpu-node> | grep -A8 "Allocatable"   # nvidia.com/gpu count (listed after cpu, storage, hugepages, memory)

  Slurm equivalent: confirm nodes are not `drain`/`down`, for example with `sinfo -p <partition> -o "%N %t %C %G"`.
- Confirm the GPU device plugin / GPU Operator is healthy. If it is crashlooping, nodes stop advertising `nvidia.com/gpu` and every GPU pod pends (the GPU software stack, the Fabric Manager runbook). Check its pods for restarts in whichever namespace it is deployed in.
- Read the request the job actually made. Quantity, resource name (plain GPU vs a MIG profile), nodeSelector/affinity, tolerations. A request for `nvidia.com/mig-3g.40gb` will never match a node advertising only `nvidia.com/gpu` (MIG partitioning).
- Note whether the job needs a gang (all-or-nothing N pods, e.g. a multi-node training launch). Partial placement that never completes is a deadlock, not slow scheduling.
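The cluster-wide "zero allocatable GPUs" check from the first pre-check can be scripted. The following is a minimal sketch, assuming the device plugin advertises `nvidia.com/gpu`; the function name `sum_allocatable_gpus` is hypothetical, and the `custom-columns` expression in the usage comment is an assumed invocation that should be validated against your kubectl version.

```shell
# Sum the GPU column of "<node> <count>" lines read from stdin.
# Nodes that advertise no GPUs print "<none>" and are skipped.
sum_allocatable_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage against a live cluster (assumed invocation, not run here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | sum_allocatable_gpus
```

A result of 0 points at the node/device-plugin path rather than at the scheduler; a non-zero result means the pool exists and the Procedure below applies.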
Flow
flowchart TB
A["Job stuck Pending / PD"] --> B["Read scheduler reason"]
B -->|"k8s: kubectl describe pod"| C["Parse FailedScheduling event"]
B -->|"Slurm: squeue %r, scontrol show job"| D["Parse PD reason"]
C --> E{"Reason class?"}
D --> E
E -->|"Insufficient nvidia.com/gpu / Resources"| F["Pool exhausted or wrong resource name"]
E -->|"untolerated taint / node affinity"| G["Taint/affinity/nodeSelector gate"]
E -->|"MIG profile not advertised"| H["MIG/profile mismatch"]
E -->|"QOS/AssocGrp/quota limit"| I["Quota or ResourceQuota"]
E -->|"gang/co-scheduling incomplete"| J["Gang deadlock"]
F --> K["Free capacity or fix resource name"]
G --> L["Add toleration / fix selector"]
H --> M["Match profile to node MIG geometry"]
I --> N["Raise quota or reduce request"]
J --> O["Cordon/drain to free a clean gang set"]
K --> P["Verify: job transitions Running"]
L --> P
M --> P
N --> P
O --> P
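The "Reason class?" decision in the flow can be sketched as a small lookup. This is an illustration only: the function name `classify_reason` and the class labels are hypothetical, the match strings are the message fragments quoted in this runbook, and the `Insufficient nvidia.com/mig-` pattern assumes the scheduler reports an unadvertised MIG resource the same way it reports plain GPUs. A gang deadlock has no single message, so it falls into `unknown`.

```shell
# Map a scheduler message (k8s FailedScheduling text or a Slurm PD reason)
# to a reason class. The first matching pattern wins, so a k8s message that
# lists several gates is classified by the earliest pattern below.
classify_reason() {
  case "$1" in
    *"Insufficient nvidia.com/mig-"*)      echo "mig-mismatch" ;;
    *"Insufficient nvidia.com/gpu"*)       echo "pool-or-resource-name" ;;
    *"untolerated taint"*)                 echo "taint-affinity" ;;
    *"didn't match Pod's node affinity"*)  echo "taint-affinity" ;;
    *"exceeded quota"*)                    echo "quota" ;;
    AssocGrp*|QOSMax*)                     echo "quota" ;;
    ReqNodeNotAvail*)                      echo "node-unavailable" ;;
    Resources)                             echo "pool-or-resource-name" ;;
    Priority)                              echo "wait-priority" ;;
    *)                                     echo "unknown" ;;
  esac
}

# Usage: classify_reason "<text copied from the Events block or squeue>"
```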
Procedure
Cordon/drain before mutating node state. Never reconfigure MIG geometry or evict the device plugin on a node that is still running other jobs without draining it first; partial MIG reconfiguration corrupts the allocatable pool.
Kubernetes
1. Read the scheduler's exact reason. This is the single most important step; the Events block of `kubectl describe pod <pod>` names every gate the job failed. Map the message: `Insufficient nvidia.com/gpu` (pool exhausted or wrong resource name) → step 2; `untolerated taint` / `node(s) didn't match Pod's node affinity/selector` → step 3; a MIG resource name the node never advertises → step 4; `exceeded quota` → step 5.
2. Check the allocatable vs. used GPU pool. Confirm whether GPUs are genuinely free or already fully claimed by other pods by comparing the `Allocatable` and `Allocated resources` sections of `kubectl describe node <gpu-node>`. If allocatable is non-zero but allocated equals allocatable, the pool is full, so free capacity (let jobs drain, or scale the node pool). If the node advertises a different resource name than the request, fix the request (step 4).
3. Inspect taints, affinity, and tolerations. GPU node pools are routinely tainted (`nvidia.com/gpu=present:NoSchedule`); the pod must tolerate it and its `nodeSelector`/affinity must match real node labels:

       kubectl get node <gpu-node> -o jsonpath='{.spec.taints}{"\n"}{.metadata.labels}{"\n"}'
       kubectl get pod <pod> -o jsonpath='{.spec.tolerations}{"\n"}{.spec.nodeSelector}{"\n"}'

   Add the missing toleration or correct the selector to a label that exists. GPUs are requested in `resources.limits` only (limit is used as the request; if both are set they must be equal).
4. Reconcile the MIG profile. With the device plugin in `mixed` MIG strategy, nodes advertise profile-named resources (`nvidia.com/mig-1g.5gb`, `nvidia.com/mig-3g.20gb`, etc.); in `single` strategy the MIG slices are advertised as plain `nvidia.com/gpu`. A request must name a resource the node's MIG geometry actually exposes (MIG partitioning). Confirm what the node offers, then either fix the request or, after cordon + drain, reconfigure MIG geometry to match:

       kubectl describe node <gpu-node> | grep "nvidia.com/mig-"   # advertised profiles
       # to change geometry (drain the node first; see runbook-mig-state-stale):
       kubectl cordon <gpu-node>
       kubectl drain <gpu-node> --ignore-daemonsets --delete-emptydir-data
       # ... apply MIG config / GPU Operator mig.config label ...
       kubectl uncordon <gpu-node>

   Stale or half-applied MIG state is its own failure mode → the MIG-state-stale runbook.
5. Check ResourceQuota / namespace limits. A namespace quota on `requests.nvidia.com/gpu` silently blocks admission with `exceeded quota`; the rejection typically surfaces in the events of the owning controller (Job, ReplicaSet) rather than on a pod. Compare used against hard limits with `kubectl describe resourcequota -n <namespace>`, then raise the quota or reduce the request.

   For an all-or-nothing multi-pod job (gang), if a co-scheduler (e.g. Kueue/Volcano-style queueing) holds the job because the full set cannot be placed at once, free a contiguous block: cordon nodes running lower-priority work, drain them, and let the gang admit as a unit.
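The resource-name checks in steps 2 and 4 come down to one question: does the node's Allocatable section list the exact resource the pod requests, with a non-zero count? The following is a minimal sketch, assuming the usual `kubectl describe node` layout (an unindented `Allocatable:` header followed by indented `name: count` lines); the function name `node_advertises` is hypothetical.

```shell
# Succeeds (exit 0) if the Allocatable section of `kubectl describe node`
# output on stdin lists the resource named in $1 with a count above zero.
node_advertises() {
  awk -v want="$1:" '
    /^Allocatable:/ { in_alloc = 1; next }   # enter the Allocatable section
    /^[^ \t]/       { in_alloc = 0 }         # any other top-level header leaves it
    in_alloc && $1 == want && $2 + 0 > 0 { found = 1 }
    END { exit !found }
  '
}

# Usage against a live cluster (not run here):
#   kubectl describe node <gpu-node> | node_advertises nvidia.com/mig-3g.20gb \
#     && echo "advertised" || echo "not advertised: fix the request or the MIG geometry"
```

The comparison is an exact string match on purpose, so `nvidia.com/mig-3g.20gb` and `nvidia.com/mig-3g.40gb` are never confused.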
Slurm
1. Read the pending reason. The reason in parentheses is the verdict; decode it, do not resubmit blindly:

       squeue -j <jobid> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
       scontrol show job <jobid>   # full request: TRES, GRES, Partition, NodeList, Reason

   `Resources` = waiting for capacity to free; `Priority` = higher-priority jobs ahead; `ReqNodeNotAvail` = a specifically required node is down/drained; `AssocGrp*Limit` / `QOSMax*` = a quota cap on GPUs.
2. Confirm GPU GRES availability on the target partition. The job may request a `gres/gpu` count or a type (`gpu:a100:8`) no idle node can satisfy:

       sinfo -p <partition> -o "%N %t %C %G"   # node states + GRES
       scontrol show node <node> | grep -E "Gres|State|CfgTRES|AllocTRES"

   If nodes are `drain`/`down`, that is the block; return them via the node-fault path (the GPU-fault/RMA runbook) rather than tuning the job.
3. Reconcile the GRES request. A wrong GPU type, count, or `--gpus-per-node` that no node provides keeps the job in `Resources`/`ReqNodeNotAvail` forever. Match the request to advertised `Gres` from step 2, or target a partition that has the type.
4. Check association/QOS GPU limits when the reason is `AssocGrp*Limit` or `QOSMax*`. Inspect the caps with `sacctmgr show assoc` and `sacctmgr show qos`, then either wait for the user's running GPUs to free below the cap, or have an admin raise the limit. Do not raise limits to mask a runaway submitter.
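The `CfgTRES` / `AllocTRES` comparison from step 2 can be reduced to a single free-GPU number per node. The following is a minimal sketch, assuming the TRES strings carry a `gres/gpu=<n>` entry; the function name `free_gpus_from_tres` is hypothetical, and typed entries such as `gres/gpu:a100=<n>` are deliberately ignored here.

```shell
# Print configured minus allocated GPUs from `scontrol show node` output
# read on stdin. A missing gres/gpu entry (e.g. empty AllocTRES) counts as 0.
free_gpus_from_tres() {
  awk '
    function gpu_count(line,    s) {
      if (match(line, /gres\/gpu=[0-9]+/)) {
        s = substr(line, RSTART, RLENGTH)
        sub(/^gres\/gpu=/, "", s)
        return s + 0
      }
      return 0
    }
    /CfgTRES=/   { cfg = gpu_count($0) }
    /AllocTRES=/ { alloc = gpu_count($0) }
    END { print cfg - alloc }
  '
}

# Usage against a live cluster (not run here):
#   scontrol show node <node> | free_gpus_from_tres
```

A result smaller than the job's per-node GPU request, on every node of the partition, is consistent with a `Resources` reason; a result of 0 on a node that `nvidia-smi` shows as idle suggests GPUs are allocated to jobs that are not using them.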
Verification
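- k8s: the pod transitions out of `Pending` and binds to a node (`kubectl get pod <pod> -o wide` shows the node), and no new `FailedScheduling` events appear in `kubectl describe pod <pod>`.
- Slurm: the job leaves `PD` for `R` and is allocated the requested GPUs (check with `squeue -j <jobid>` and `scontrol show job <jobid>`).
- The job runs real work: `nvidia-smi` on the bound node shows the job's process holding the GPU(s), not idle.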
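The first two verification checks can be turned into pass/fail predicates for use in a wait loop or a post-fix script. The following is a minimal sketch; the function names `pod_is_placed` and `job_is_running` are hypothetical, and the `jsonpath` and `squeue` format strings in the usage comments are assumed invocations to validate on your versions.

```shell
# Succeeds when "<phase> <node>" read from stdin shows a pod that is bound
# to a node and no longer in the Pending phase.
pod_is_placed() {
  read -r phase node || true
  [ -n "$node" ] && [ "$phase" != "Pending" ]
}

# Succeeds when the compact Slurm job state read from stdin is R (running).
job_is_running() {
  read -r state || true
  [ "$state" = "R" ]
}

# Usage against a live cluster (not run here):
#   kubectl get pod <pod> -o jsonpath='{.status.phase} {.spec.nodeName}' | pod_is_placed
#   squeue -j <jobid> -h -o '%t' | job_is_running
```

Neither predicate proves the job is doing useful work; the `nvidia-smi` check on the bound node still has to be done by hand.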
Rollback
A scheduling fix is mostly request/quota changes, not a node mutation, so revert by undoing only what you changed:
- Restore node schedulability if you cordoned/drained: `kubectl uncordon <node>` (k8s) or `scontrol update nodename=<node> state=resume` (Slurm). Never leave a node cordoned after the gang admits.
- Revert quota or QOS changes once the immediate job is placed; transient over-grants become permanent capacity leaks.
- Revert MIG geometry changes only via the MIG path with a drain (see the MIG-state-stale runbook); do not flip MIG mode on a node holding live jobs.
- If the root cause was a node that was down/health-gated (not a request bug), the real fix is on the node. Divert to the kernel/GPU-missing runbook, the Fabric Manager runbook, or the GPU-fault/RMA runbook, and let the scheduler place the job once capacity returns.
Related runbooks
- the MIG-state-stale runbook: stale/half-applied MIG geometry (the most common MIG profile-mismatch root cause).
- the kernel/GPU-missing runbook: node advertises zero allocatable GPUs (device plugin / driver).
- the Fabric Manager runbook: FM down keeps NVSwitch nodes out of the schedulable pool.
- the GPU-fault/RMA runbook: a drained/down node is a hardware fault, not a scheduling bug.
- the inference-SLO-breach runbook: when pending capacity pressure shows up as serving latency.
- operational runbooks: operational runbooks index.
- troubleshooting runbook: general triage index.
References
- Kubernetes: Schedule GPUs (`nvidia.com/gpu`, `resources.limits`): https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
- Kubernetes: Safely Drain a Node (cordon/drain/uncordon, PodDisruptionBudget): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
- Kubernetes: `kubectl drain` reference (`--ignore-daemonsets`): https://kubernetes.io/docs/reference/kubectl/generated/kubectl_drain/
- Kubernetes: Taints and Tolerations: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
- Kubernetes: Resource Quotas: https://kubernetes.io/docs/concepts/policy/resource-quotas/
- NVIDIA k8s-device-plugin: MIG strategies (`MIG_STRATEGY`, `nvidia.com/mig-*` resource names): https://github.com/NVIDIA/k8s-device-plugin
- NVIDIA GPU Operator: MIG configuration: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
- Slurm: `squeue` (job state reason codes, `%r`/`%R` format): https://slurm.schedmd.com/squeue.html
- Slurm: Generic Resource (GRES) scheduling (`gres/gpu`, `--gpus`): https://slurm.schedmd.com/gres.html
- Slurm: `scontrol` (show job/node, resume state): https://slurm.schedmd.com/scontrol.html
- Slurm: `sacctmgr` (QOS / association GRES limits): https://slurm.schedmd.com/sacctmgr.html
Related: MIG Partitioning · GPU Health Gating · GPU Software Stack · MIG State Stale · Kernel/GPU Missing · Fabric Manager Failure · Operational Runbooks · Glossary