Runbook: training MFU regression¶

Scope: localize and fix a training throughput (MFU) regression that has fallen below baseline.

Run this when model FLOPs utilization (MFU) or step time is materially worse than the recorded baseline, but the job still runs (no hang, no XID). Severity: efficiency / cost regression. Every lost MFU point is GPU-hours and money burnt, not an outage. The job works; it just works slowly, and the cause is usually a config change, not hardware.

Reference templates on real APIs; pin versions and validate before production use.

MFU regression is a profiling problem, not a fault: find which phase of the step lost time (dataloader, comms, kernel, host), fix that phase, and prove the recovery against the same baseline. Optimization techniques are in performance tuning; the training-config surface is in distributed training; storage/dataloader in storage and data.

Trigger¶

MFU below baseline (or step time / tokens-per-second-per-GPU above baseline) by a meaningful margin, sustained, on the same model + parallelism + hardware (observability, the SLO/SLI catalog).
Regression appears after a config or code change (new commit, new image, new batch/parallelism setting) or after a fleet change.

Pre-checks¶

A baseline must exist. Without a recorded MFU / step-time baseline for this exact model + parallelism + GPU, there is nothing to regress against. Establish one first (performance tuning, the SLO/SLI catalog). Record it before changing anything.
Rule out a partial fabric stall masquerading as slowness: if comms time is dominant, it may be the NCCL/transport path (the NCCL-hang runbook).
Pin the change set: identify the last-known-good commit/image so the regression is a single variable (SRE and MLOps practices).

Flow¶

flowchart TB
    A["MFU below baseline"] --> B["Read SM-active / tensor-active, not GPU-util"]
    B --> C["Profile one step with nsys"]
    C --> D{"Dominant bottleneck?"}
    D -->|"input starved"| E["Dataloader / storage"]
    D -->|"comms bound"| F["NCCL / parallelism"]
    D -->|"kernel bound"| G["Kernel / precision"]
    D -->|"host bound"| H["CPU launch / sync"]
    E --> I["Apply fix, re-profile"]
    F --> I
    G --> I
    H --> I
    I --> J["Verify vs baseline"]

Procedure¶

Read the right utilization signal. Ignore nvidia-smi "GPU-Util" (it reports only that a kernel is resident, not that the SMs/tensor cores are busy). Use SM-active and tensor-active from DCGM profiling (observability):
```
dcgmi dmon -e 1002,1004     # DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
```
Low SM-active = the GPU is starved (input/host/comms). High SM-active but low tensor-active = a precision or kernel-shape problem.
Profile a single representative step with Nsight Systems to see where wall time goes (compute vs NCCL vs idle gaps vs host) (performance tuning):
```
nsys profile -t cuda,nvtx,osrt --gpu-metrics-device=all \
  -o step.qdrep --capture-range=cudaProfilerApi <launch cmd>
```
Open the timeline and look for the gaps: idle GPU between steps (input/host), long NCCL regions (comms), or short kernels back-to-back (launch/host bound).
Locate the dominant bottleneck from the trace:
Dataloader / storage: GPU idle waiting at the start of each step; data path is the limiter (storage and data). Raise num_workers, prefetch, or fix the source throughput.
Comms: NCCL regions dominate; revisit parallelism layout (TP/PP/DP split, overlap) and transport (distributed training, the NCCL-hang runbook).
Kernel / precision: high SM-active, low tensor-active; a kernel changed shape or lost FP8/BF16 fast paths.
Host: many small launches, CPU-bound dispatch, or stray device syncs; reduce host overhead / sync points.
Apply the single matching fix for the dominant phase from performance tuning (or the config surface in distributed training): change one thing, not several.
Verify the physical link is not the culprit. A GPU silently negotiated to a lower PCIe Gen / width, or ACS left on, caps host-GPU and P2P bandwidth and shows up as a comms/input stall:
```
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv
ssh "$NODE" 'lspci -vvv | grep -i ACSCtl'   # ACS must be off for P2P/GDR
```
Current Gen/width below the platform max → reseat / investigate before blaming software.

Verification¶

MFU (and step time / tokens-per-sec-per-GPU) returns to the recorded baseline on the same model + parallelism + hardware (the SLO/SLI catalog).
Record before/after in the change ticket: the baseline number, the regressed number, the post-fix number, and the single variable changed, so the win is auditable and the baseline is updated (SRE and MLOps practices).
A re-profile of one step shows the previously-dominant gap is gone (e.g. GPU no longer idles between steps).
DCGM SM-active / tensor-active are back to baseline levels under steady state (observability).

Rollback¶

If the regression was introduced by a config or code change and the fix is not yet found, revert the offending commit/config via GitOps to restore baseline throughput, then investigate offline (SRE and MLOps practices):

git revert <regressing-commit>      # config-as-code; Argo CD / Flux re-syncs the cluster
# or roll the training image tag back to the last-known-good pinned digest

Never leave a known MFU regression running in production. The cost accrues every GPU-hour. Roll back first, fix second.

the NCCL-hang runbook: NCCL hang / collective stall (a comms-bound regression can deepen into a full stall).
the GPU-fault runbook: GPU fault / RMA (if profiling reveals a degraded GPU, not a config issue).
the driver-upgrade runbook: Driver upgrade (a driver/CUDA bump can regress kernel fast paths).
the checkpoint-recovery runbook: Checkpoint recovery (resume cleanly after a rollback).
operational runbooks: Operational runbooks index.

References¶

DCGM field identifiers (DCGM_FI_PROF_SM_ACTIVE 1002, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE 1004): https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
Nsight Systems (step profiling, GPU metrics): https://docs.nvidia.com/nsight-systems/UserGuide/index.html
nvidia-smi manual (PCIe pcie.link.gen/width.current): https://docs.nvidia.com/deploy/nvidia-smi/index.html
"What Shapes Do Matrix Multiplications Like?" / MFU background (Megatron-LM paper, arxiv 1909.08053): https://arxiv.org/abs/1909.08053
PaLM (MFU as an efficiency metric, arxiv 2204.02311): https://arxiv.org/abs/2204.02311
Argo CD (GitOps revert / re-sync): https://argo-cd.readthedocs.io/en/stable/