
Runbook: scheduler: GPU job stuck pending

Scope: diagnose a Kubernetes (Pending) or Slurm (PD) GPU job that never starts (insufficient allocatable GPUs, taints/affinity, MIG/profile mismatch, quota, or gang-scheduling deadlock) and get it scheduled.

Run this when a GPU workload sits in Pending (k8s) or PD (Slurm) and never transitions to running: the pod/job is admitted but no node satisfies its request, or GPUs look idle yet the scheduler refuses to place the work. Severity: workload-blocked, not crashed. There is no container or process log to read, only a scheduler verdict to decode.

Reference templates on real APIs; pin versions and validate before production use. Not hardware-tested.

A pending job is a placement failure, not a runtime fault. The scheduler has a reason; read it first, do not guess. Common roots: the allocatable GPU pool is exhausted, the request names a resource the node does not advertise (MIG profile mismatch; see MIG partitioning), a taint/affinity/quota gate blocks the only eligible nodes, or a gang/co-scheduling constraint cannot assemble the full set at once. Node-level GPU readiness (driver, Fabric Manager, health gating) is upstream: if nodes are NotReady or cordoned by health gating, fix that first via GPU health gating, the kernel/GPU-missing runbook, and the Fabric Manager runbook.

Trigger

k8s: kubectl get pod shows Pending; kubectl describe pod Events show FailedScheduling with 0/N nodes are available.
Slurm: squeue shows state PD with a reason in parentheses (Resources, Priority, ReqNodeNotAvail, AssocGrp*Limit, QOSMax*).
GPUs appear idle (nvidia-smi on nodes shows free memory) yet the scheduler will not place the job.
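When many jobs are pending at once, tallying their reasons shows which gate dominates before you dig into a single job. A minimal sketch, assuming reasons arrive one per line; `tally_reasons` is a hypothetical helper name, not part of Slurm:

```shell
#!/bin/sh
# tally_reasons: count identical lines on stdin, most frequent first.
# Hypothetical helper; feed it one pending reason per line.
tally_reasons() {
  awk '{ n[$0]++ } END { for (r in n) print n[r], r }' | sort -rn
}

# Usage against a live cluster (not executed here):
#   squeue -h -t PD -o "%r" | tally_reasons
```

A single dominant reason (for example Resources) points at capacity; a mix of reasons usually means several independent request problems.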

Pre-checks

Confirm nodes are actually schedulable and GPU-ready. A cordoned, NotReady, or health-gated node will not accept work even with free GPUs. If allocatable GPUs are zero cluster-wide, this is a node/device-plugin problem, not a scheduling one (GPU health gating, the kernel/GPU-missing runbook):
```
kubectl get nodes -o wide                                   # Ready / SchedulingDisabled?
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'   # allocatable nvidia.com/gpu count (empty = none advertised)
```
Slurm equivalent: confirm nodes are not drained or down:
```
sinfo -N -o "%N %t %G"        # state + GRES (gpu:...) per node
```
Confirm the GPU device plugin / GPU Operator is healthy. If it is crashlooping, nodes stop advertising nvidia.com/gpu and every GPU pod pends (the GPU software stack, the Fabric Manager runbook):
```
kubectl get pods -n gpu-operator                            # plugin / validator / DCGM Ready?
```
Read the request the job actually made: quantity, resource name (plain GPU vs a MIG profile), nodeSelector/affinity, and tolerations. A request for nvidia.com/mig-3g.40gb will never match a node advertising only nvidia.com/gpu (MIG partitioning).
Note whether the job needs a gang (all-or-nothing N pods, e.g. a multi-node training launch). Partial placement that never completes is a deadlock, not slow scheduling.
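The "zero allocatable GPUs cluster-wide" check above can be reduced to one number. A minimal sketch, assuming input rows of the form `NAME GPU` such as kubectl's custom-columns output; `sum_allocatable_gpus` is a hypothetical helper name:

```shell
#!/bin/sh
# sum_allocatable_gpus: add up the second column of "NAME GPU" rows on stdin.
# Rows whose GPU column is not a plain number (kubectl prints <none> for
# nodes without the resource) are ignored.
sum_allocatable_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | sum_allocatable_gpus
```

A result of 0 means the problem is upstream of the scheduler (device plugin or driver), so the rest of this runbook does not apply yet.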

Flow

```mermaid
flowchart TB
    A["Job stuck Pending / PD"] --> B["Read scheduler reason"]
    B -->|"k8s: kubectl describe pod"| C["Parse FailedScheduling event"]
    B -->|"Slurm: squeue %r, scontrol show job"| D["Parse PD reason"]
    C --> E{"Reason class?"}
    D --> E
    E -->|"Insufficient nvidia.com/gpu / Resources"| F["Pool exhausted or wrong resource name"]
    E -->|"untolerated taint / node affinity"| G["Taint/affinity/nodeSelector gate"]
    E -->|"MIG profile not advertised"| H["MIG/profile mismatch"]
    E -->|"QOS/AssocGrp/quota limit"| I["Quota or ResourceQuota"]
    E -->|"gang/co-scheduling incomplete"| J["Gang deadlock"]
    F --> K["Free capacity or fix resource name"]
    G --> L["Add toleration / fix selector"]
    H --> M["Match profile to node MIG geometry"]
    I --> N["Raise quota or reduce request"]
    J --> O["Cordon/drain to free a clean gang set"]
    K --> P["Verify: job transitions Running"]
    L --> P
    M --> P
    N --> P
    O --> P
```
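
The "Reason class?" decision in the flow can be sketched as a lookup on the scheduler's message. This is an illustration only: the substrings matched are assumptions about typical wording and can vary by Kubernetes or Slurm version, the class names are hypothetical labels, and a gang deadlock has no single message, so it falls through to `unclassified`:

```shell
#!/bin/sh
# classify_reason MESSAGE: map a FailedScheduling message (k8s) or a PD
# reason (Slurm) to one of the reason classes in the flow.
# Class names are hypothetical labels, not scheduler output.
classify_reason() {
  case "$1" in
    *"Insufficient nvidia.com/mig-"*)            echo "mig-profile-mismatch" ;;
    *"Insufficient nvidia.com/gpu"*|Resources)   echo "pool-or-resource-name" ;;
    *"untolerated taint"*|*"node affinity"*)     echo "taint-affinity" ;;
    *"exceeded quota"*|AssocGrp*|QOSMax*)        echo "quota" ;;
    Priority)                                    echo "priority-wait" ;;
    ReqNodeNotAvail*)                            echo "node-unavailable" ;;
    *)                                           echo "unclassified" ;;
  esac
}
```

For example, `classify_reason "$(squeue -h -j <jobid> -o "%r")"` would label a Slurm job's reason; treat `unclassified` as a prompt to read the full event text rather than as a verdict.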

Procedure

Cordon/drain before mutating node state. Never reconfigure MIG geometry or evict the device plugin on a node that is still running other jobs without draining it first; partial MIG reconfiguration corrupts the allocatable pool.

Kubernetes

1. Read the scheduler's exact reason. This is the single most important step; the Events block names every gate the job failed:
```
kubectl describe pod <pod> | sed -n '/Events:/,$p'
```
Map the message: Insufficient nvidia.com/gpu (pool exhausted or wrong resource name) → step 2; untolerated taint / node(s) didn't match Pod's node affinity/selector → step 3; a MIG resource name the node never advertises → step 4; exceeded quota → step 5.
2. Check the allocatable vs. used GPU pool. Confirm whether GPUs are genuinely free or already fully claimed by other pods:
```
kubectl describe node <gpu-node> | grep -E "nvidia.com/(gpu|mig)" 
# compare Allocatable vs Allocated resources blocks
```
If allocatable is non-zero but allocated equals allocatable, the pool is full: free capacity by letting jobs drain or by scaling the node pool. If the node advertises a different resource name than the one the request names, fix the request (step 4).
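The allocatable-versus-allocated comparison is plain arithmetic once the numbers are extracted. A minimal sketch, assuming integer GPU counts; `gpu_headroom` is a hypothetical helper, and the usage lines assume pods on the node carry GPU limits in `resources.limits` (pods in any phase are counted, so completed pods may inflate the total):

```shell
#!/bin/sh
# gpu_headroom ALLOCATABLE [USED...]: print allocatable minus the sum of the
# per-container GPU limits passed as the remaining arguments.
# The function body runs in a subshell so its variables stay local.
gpu_headroom() (
  allocatable="$1"; shift
  used=0
  for n in "$@"; do
    used=$((used + n))
  done
  echo $((allocatable - used))
)

# Usage against a live cluster (not executed here; NODE is a placeholder):
#   alloc=$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
#   limits=$(kubectl get pods -A --field-selector "spec.nodeName=$NODE" \
#     -o jsonpath='{.items[*].spec.containers[*].resources.limits.nvidia\.com/gpu}')
#   gpu_headroom "$alloc" $limits   # unquoted on purpose: one argument per limit
```

A headroom of 0 with GPUs that look idle in nvidia-smi means they are claimed by pods that are not currently using them, not that the scheduler is wrong.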
3. Inspect taints, affinity, and tolerations. GPU node pools are routinely tainted (nvidia.com/gpu=present:NoSchedule); the pod must tolerate it and its nodeSelector/affinity must match real node labels:
```
kubectl get node <gpu-node> -o jsonpath='{.spec.taints}{"\n"}{.metadata.labels}{"\n"}'
kubectl get pod <pod> -o jsonpath='{.spec.tolerations}{"\n"}{.spec.nodeSelector}{"\n"}'
```
Add the missing toleration or correct the selector to a label that exists. GPUs are specified under resources.limits: the limit is used as the request, a GPU request without a limit is not allowed, and if both are set they must be equal.
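For reference, a pod that satisfies both points above (tolerates the GPU taint and requests the GPU under limits) might look like the manifest below. This is a sketch under assumptions: the pod name and container name are hypothetical, the image is a placeholder you must replace, and the toleration assumes the nvidia.com/gpu taint key mentioned in step 3:

```shell
#!/bin/sh
# gpu_pod_manifest: print a minimal GPU pod manifest on stdout.
# Names are hypothetical; <cuda-image> is a placeholder, not a real image.
gpu_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: <cuda-image>
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage (not executed here): validate against the API server without creating it.
#   gpu_pod_manifest | kubectl apply --dry-run=server -f -
```

The `operator: Exists` toleration matches the taint key regardless of its value, which avoids a mismatch when node pools use different taint values.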
4. Reconcile the MIG profile. With the device plugin's mixed MIG strategy, nodes advertise profile-named resources (nvidia.com/mig-1g.5gb, nvidia.com/mig-3g.20gb, etc.); with the single strategy, MIG slices are advertised as plain nvidia.com/gpu and the profile appears only in node labels such as nvidia.com/gpu.product. A request must name a resource the node's MIG geometry actually exposes (MIG partitioning). Confirm what the node offers, then either fix the request or, after cordon + drain, reconfigure MIG geometry to match:
```
kubectl describe node <gpu-node> | grep "nvidia.com/mig-"   # advertised profiles
# to change geometry (drain the node first; see runbook-mig-state-stale):
kubectl cordon <gpu-node>
kubectl drain <gpu-node> --ignore-daemonsets --delete-emptydir-data
# ... apply MIG config / GPU Operator mig.config label ...
kubectl uncordon <gpu-node>
```
Stale or half-applied MIG state is its own failure mode → the MIG-state-stale runbook.
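Checking a requested profile against what a node advertises can be scripted as a strict name match. A minimal sketch, assuming the node output contains resource names of the form nvidia.com/mig-Ng.Mgb; `node_offers_profile` is a hypothetical helper, and it only checks that the name is advertised, not that its count is above zero:

```shell
#!/bin/sh
# node_offers_profile RESOURCE: read `kubectl describe node` text on stdin and
# succeed only if the exact resource name RESOURCE appears in it.
node_offers_profile() {
  grep -oE 'nvidia\.com/mig-[0-9]+g\.[0-9]+gb' | sort -u | grep -qxF "$1"
}

# Usage against a live cluster (not executed here):
#   kubectl describe node <gpu-node> | node_offers_profile nvidia.com/mig-3g.20gb \
#     && echo "advertised" || echo "not advertised: fix the request or the geometry"
```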
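5. Check ResourceQuota / namespace limits. A namespace quota on requests.nvidia.com/gpu rejects pod creation at admission with an exceeded quota error, so the error surfaces in the events of the owning Job or ReplicaSet rather than on a Pending pod. Compare used against hard: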
```
kubectl describe resourcequota -n <namespace>
```
Raise the quota or reduce the request.
Break a gang-scheduling deadlock. For an all-or-nothing multi-pod job (gang), a co-scheduler (e.g. Kueue/Volcano-style queueing) may hold the job because the full set cannot be placed at once. Free enough whole nodes for the entire set: cordon nodes running lower-priority work, drain them, and let the gang admit as a unit.
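Why a gang can deadlock while GPUs look free comes down to per-node fit: free GPUs scattered across nodes do not add up to whole pod slots. A minimal sketch of that arithmetic, assuming each gang member needs the same positive number of GPUs on a single node; `gang_fits` is a hypothetical helper, not a co-scheduler API:

```shell
#!/bin/sh
# gang_fits PER_POD GANG_SIZE FREE...: succeed only if the per-node free GPU
# counts in FREE... provide at least GANG_SIZE slots of PER_POD GPUs each.
# PER_POD must be a positive integer.
gang_fits() (
  per_pod="$1"; gang_size="$2"; shift 2
  slots=0
  for free in "$@"; do
    slots=$((slots + free / per_pod))   # integer division: whole pods per node
  done
  [ "$slots" -ge "$gang_size" ]
)

# 30 GPUs are free in total, but only three nodes can host an 8-GPU pod,
# so a gang of four cannot be placed.
gang_fits 8 4 8 8 8 6 || echo "gang cannot be placed yet"
```

Draining the node with 6 free GPUs until it has 8 is what turns the fourth slot into a real one, which is the purpose of the cordon and drain described above.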

Slurm

1. Read the pending reason. The reason in parentheses is the verdict; decode it, do not resubmit blindly:
```
squeue -j <jobid> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
scontrol show job <jobid>    # full request: TRES, GRES, Partition, NodeList, Reason
```
Resources = waiting for capacity to free; Priority = higher-priority jobs ahead; ReqNodeNotAvail = a specifically required node is down/drained; AssocGrp*Limit / QOSMax* = a quota cap on GPUs.
2. Confirm GPU GRES availability on the target partition. The job may request a gres/gpu count or a type (gpu:a100:8) no idle node can satisfy:
```
sinfo -p <partition> -o "%N %t %C %G"     # node states + GRES
scontrol show node <node> | grep -E "Gres|State|CfgTRES|AllocTRES"
```
If nodes are drain/down, that is the block; return them via the node-fault path (the GPU-fault/RMA runbook) rather than tuning the job.
3. Reconcile the GRES request. A wrong GPU type, count, or --gpus-per-node that no node provides keeps the job in Resources/ReqNodeNotAvail forever. Match the request to advertised Gres from step 2, or target a partition that has the type.
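Matching a typed request against advertised GRES can be done by filtering the node list. A minimal sketch, assuming rows of the form `NODE GRES` where GRES looks like gpu:a100:8 or gpu:a100:8(S:0-1); `nodes_with_gres` is a hypothetical helper, and GRES strings with commas inside the socket suffix are not handled:

```shell
#!/bin/sh
# nodes_with_gres TYPE MIN: read "NODE GRES" rows on stdin and print the nodes
# that advertise at least MIN GPUs of the given TYPE.
nodes_with_gres() {
  awk -v want="$1" -v min="$2" '
    {
      n = split($2, parts, ",")
      for (i = 1; i <= n; i++) {
        split(parts[i], f, ":")      # f[1]=gpu  f[2]=type  f[3]=count[(S...]
        count = f[3] + 0             # numeric prefix: "8(S" becomes 8
        if (f[1] == "gpu" && f[2] == want && count >= min) { print $1; break }
      }
    }'
}

# Usage against a live cluster (not executed here):
#   sinfo -h -N -p <partition> -o "%N %G" | nodes_with_gres a100 8
```

An empty result means no node in the partition can ever satisfy the request, so waiting will not help; change the request or the partition.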
4. Check association/QOS GPU limits when the reason is AssocGrp*Limit or QOSMax*:
```
sacctmgr show qos format=Name,MaxTRESPU,GrpTRES
sacctmgr show assoc user=<user> format=Account,User,GrpTRES,MaxTRES
```
Either wait for the user's running GPUs to free below the cap, or have an admin raise the limit. Do not raise limits to mask a runaway submitter.
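The limits above are TRES strings such as cpu=64,gres/gpu=8, and the GPU cap is one field inside them. A minimal sketch for pulling it out, assuming an untyped gres/gpu entry (a typed entry such as gres/gpu:a100 is not matched); `tres_gpu_limit` is a hypothetical helper:

```shell
#!/bin/sh
# tres_gpu_limit TRES: print the gres/gpu value from a comma-separated TRES
# string, or "unlimited" when the string carries no gres/gpu entry.
tres_gpu_limit() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -F= '$1 == "gres/gpu" { print $2; found = 1 }
             END { if (!found) print "unlimited" }'
}

# Usage against a live cluster (not executed here):
#   tres_gpu_limit "$(sacctmgr -nP show qos where name=<qos> format=GrpTRES)"
```

Here "unlimited" only means this particular limit field sets no GPU cap; another level (association, partition QOS) may still impose one.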

Verification

k8s: the pod transitions out of Pending and binds to a node; the prior FailedScheduling reason no longer appears:

```
kubectl get pod <pod> -w        # Pending -> ContainerCreating -> Running
kubectl get pod <pod> -o jsonpath='{.spec.nodeName}{"\n"}'   # bound to a real node
```

Slurm: the job leaves PD for R and is allocated the requested GPUs:

```
squeue -j <jobid> -o "%.18i %.2t %R"          # state R, reason cleared
scontrol show job <jobid> | grep -E "JobState|AllocTRES|NodeList"   # AllocTRES shows gres/gpu=N
```

The job runs real work: nvidia-smi on the bound node shows the job's process holding the GPU(s), not idle.
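
Instead of watching by hand, the check can be polled with a bounded retry. A minimal sketch, assuming a command that succeeds once the job is placed; `wait_until` and `pod_bound` are hypothetical helpers, and on Kubernetes `kubectl wait --for=condition=PodScheduled pod/<pod> --timeout=120s` is a built-in alternative:

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...: rerun CMD until it succeeds or TRIES attempts
# have failed. The body runs in a subshell, so `exit` only ends the function.
wait_until() (
  tries="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    "$@" && exit 0
    sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)

# Usage against a live cluster (not executed here):
#   pod_bound() { [ -n "$(kubectl get pod "$1" -o jsonpath='{.spec.nodeName}')" ]; }
#   wait_until 30 10 pod_bound <pod> || echo "still unscheduled after 5 minutes"
```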

Rollback

A scheduling fix is mostly request/quota changes, not a node mutation, so revert by undoing only what you changed:

Restore node schedulability if you cordoned/drained: kubectl uncordon <node> (k8s) or scontrol update nodename=<node> state=resume (Slurm). Never leave a node cordoned after the gang admits.
Revert quota or QOS changes once the immediate job is placed; transient over-grants become permanent capacity leaks.
Revert MIG geometry changes only via the MIG path with a drain (see the MIG-state-stale runbook); do not flip MIG mode on a node holding live jobs.
If the root cause was a node that was down/health-gated (not a request bug), the real fix is on the node. Divert to the kernel/GPU-missing runbook, the Fabric Manager runbook, or the GPU-fault/RMA runbook, and let the scheduler place the job once capacity returns.
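
To catch nodes left cordoned after the fix, filter the node list for the SchedulingDisabled status. A minimal sketch, assuming the default `kubectl get nodes` column layout (name first, status second); `cordoned_nodes` is a hypothetical helper:

```shell
#!/bin/sh
# cordoned_nodes: read `kubectl get nodes --no-headers` rows on stdin and
# print the names of nodes whose status includes SchedulingDisabled.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes --no-headers | cordoned_nodes
```

Review the list before uncordoning: a node cordoned by health gating or by another operator should stay that way.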

Related

the MIG-state-stale runbook: stale/half-applied MIG geometry (the most common MIG profile-mismatch root cause).
the kernel/GPU-missing runbook: node advertises zero allocatable GPUs (device plugin / driver).
the Fabric Manager runbook: FM down keeps NVSwitch nodes out of the schedulable pool.
the GPU-fault/RMA runbook: a drained/down node is a hardware fault, not a scheduling bug.
the inference-SLO-breach runbook: when pending capacity pressure shows up as serving latency.
the operational runbooks index: all operational runbooks.
the troubleshooting runbook: general triage index.

References

Kubernetes — Schedule GPUs (nvidia.com/gpu, resources.limits): https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Kubernetes — Safely Drain a Node (cordon/drain/uncordon, PodDisruptionBudget): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
Kubernetes — kubectl drain reference (--ignore-daemonsets): https://kubernetes.io/docs/reference/kubectl/generated/kubectl_drain/
Kubernetes — Taints and Tolerations: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
Kubernetes — Resource Quotas: https://kubernetes.io/docs/concepts/policy/resource-quotas/
NVIDIA k8s-device-plugin — MIG strategies (MIG_STRATEGY, nvidia.com/mig-* resource names): https://github.com/NVIDIA/k8s-device-plugin
NVIDIA GPU Operator — MIG configuration: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
Slurm — squeue (job state reason codes, %r/%R format): https://slurm.schedmd.com/squeue.html
Slurm — Generic Resource (GRES) scheduling (gres/gpu, --gpus): https://slurm.schedmd.com/gres.html
Slurm — scontrol (show job/node, resume state): https://slurm.schedmd.com/scontrol.html
Slurm — sacctmgr (QOS / association GRES limits): https://slurm.schedmd.com/sacctmgr.html