Goodput: measuring useful AI throughput

Scope: goodput as the north-star efficiency metric for GPU clusters (useful work per unit time, normalized to peak); how to compute it; why it beats raw FLOPS and nvidia-smi utilization; Meta's Effective Training-Time Ratio (ETTR); and NVIDIA's "speed of light" ceiling.

What it is

Goodput is the throughput of useful work (tokens trained, tokens generated, inference requests completed) per unit time, discounting everything that does not directly advance training or inference: stalled communication, idle compute, input-pipeline waits, preemptions, and failed-job restarts. Normalize goodput by the cluster's theoretical peak throughput to get a single efficiency percentage. [Fregly, Ch. 1]

Two related framings:

Goodput (Meta): "the amount of productive work completed in aggregate per unit time," which can be normalized by maximum possible goodput to a 0 to 1 utilization score. [Kokolis et al. 2025]
Effective Training-Time Ratio (ETTR) (Meta): productive runtime divided by available wallclock time of a job run; ranges 0 (no progress) to 1 (no queueing or unproductive runtime). Productive runtime excludes retraining from the last checkpoint after an interruption and post-restart initialization overhead. [Kokolis et al. 2025]
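
As a minimal sketch of the ETTR definition above, the block below computes the ratio from a per-job list of time segments. The segment labels (`queued`, `init`, `replay`, `productive`) and the `Segment` type are hypothetical names chosen for illustration; they are not taken from Kokolis et al., and a real scheduler log would need its own mapping onto these buckets.

```python
from dataclasses import dataclass

# Hypothetical segment label for time that advances training past the last checkpoint.
PRODUCTIVE = "productive"


@dataclass(frozen=True)
class Segment:
    kind: str       # "queued", "init", "replay", or "productive" (illustrative labels)
    seconds: float  # wallclock duration of this segment


def ettr_from_segments(segments: list[Segment]) -> float:
    """Productive runtime divided by total available wallclock time."""
    total = sum(s.seconds for s in segments)
    if total <= 0:
        return 0.0
    productive = sum(s.seconds for s in segments if s.kind == PRODUCTIVE)
    return productive / total


run = [
    Segment("queued", 600.0),       # waiting for resources
    Segment("init", 300.0),         # first launch
    Segment("productive", 7200.0),
    Segment("init", 300.0),         # post-restart initialization after an interruption
    Segment("replay", 600.0),       # retraining from the last checkpoint
    Segment("productive", 9000.0),
]
print(f"ETTR = {ettr_from_segments(run):.3f}")  # 16200 / 18000 -> ETTR = 0.900
```

Queueing, initialization, and replay all count toward the denominator but not the numerator, which is why the ratio reaches 1 only with no queueing and no unproductive runtime.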

NVIDIA terms the theoretical hardware maximum the speed of light (SOL), the ceiling goodput is normalized against. In Nsight Compute, the GPU "Speed of Light Throughput" section reports the achieved percentage of utilization with respect to the theoretical maximum, split into Compute (SM) throughput and Memory throughput. [NVIDIA Nsight Compute Profiling Guide]

Why use it

Headline metrics lie. Raw FLOPS and device utilization (for example nvidia-smi "GPU-Util", which only reports that a kernel was resident in a sampling window, not that it did useful math) read misleadingly high while time is burned on stalled communication, idle compute, or restarts. [Fregly, Ch. 1] Goodput captures how well hardware, software, and algorithms work together toward the goal that matters: training and serving models faster and cheaper. It is the amount of useful work done per dollar and per joule, the metric that survives a cost review.

The scale argument is the whole point of the role. Meta's "Revisiting Reliability in Large-Scale Machine Learning Research Clusters" analyzed 11 months of data from two clusters running at ~80%+ headline utilization and showed that preemptions, hardware faults, network congestion, and failure recovery erode realized throughput even when utilization looks full. The book summarizes the at-scale gap as 70 to 75% of compute lost to overheads (communication delays, suboptimal parallelization, data delays, failure recovery) despite a cluster appearing 100% utilized. [Fregly, Ch. 1; Kokolis et al. 2025]

Nuance: the same Meta paper reports that well-checkpointed large RSC-1 jobs (>1024 GPUs) sustain ETTR > 0.9. [Kokolis et al. 2025] The "70 to 75% lost" figure is the book's characterization of the general headline-utilization-vs-realized-work gap, not a contradiction of the >0.9 ETTR achievable on tightly engineered runs. When the two framings diverge, prefer the paper's measured ETTR and treat 70 to 75% as the upside available before mitigation.

Economics: at ultrascale a small efficiency gain compounds. The book notes a ~20% boost in cluster efficiency can cut hardware cost by millions in large environments. [Fregly, Ch. 1] Closing the goodput gap is the value proposition of the Performance Optimization and Tuning discipline.
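
To make the compounding concrete, here is a rough sketch of how an efficiency gain translates into GPU count for a fixed throughput target. Every number in it (the throughput target, the per-GPU peak, the two efficiency levels, and the unit price) is a hypothetical input chosen for illustration, not a figure from the book or from any vendor.

```python
import math


def gpus_needed(target_tok_s: float, per_gpu_peak_tok_s: float, efficiency: float) -> int:
    """GPUs required to sustain a target goodput at a given efficiency (0..1]."""
    if target_tok_s < 0 or per_gpu_peak_tok_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need target >= 0, peak > 0, and 0 < efficiency <= 1")
    return math.ceil(target_tok_s / (per_gpu_peak_tok_s * efficiency))


# Hypothetical inputs (assumptions for illustration only).
TARGET_TOK_S = 6_000_000      # required cluster goodput
PER_GPU_PEAK_TOK_S = 1_500    # per-GPU peak token rate
UNIT_COST_USD = 30_000        # assumed price per GPU

before = gpus_needed(TARGET_TOK_S, PER_GPU_PEAK_TOK_S, efficiency=0.50)
after = gpus_needed(TARGET_TOK_S, PER_GPU_PEAK_TOK_S, efficiency=0.60)  # 20% relative gain
saved = before - after
print(f"GPUs: {before} -> {after}, saved {saved} (~${saved * UNIT_COST_USD:,} at the assumed price)")
```

Because required hardware scales with 1 / efficiency, a 20% relative gain removes about one sixth of the GPUs under these assumptions.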

When to use it (and when not)

Use goodput / ETTR when:

Operating multi-GPU or multi-node clusters where communication, data loading, and faults (not raw kernel speed) dominate wall time.
Justifying optimization or hardware spend: goodput ties engineering effort to cost-per-token. See Cloud, Neoclouds and Cost/Capacity.
Comparing systems end-to-end. MLPerf cautions that per-GPU numbers are not a primary cross-platform metric; use system-level, end-to-end throughput. [Fregly, Ch. 1]

Do not over-index on goodput when:

Profiling a single kernel for arithmetic efficiency. There the right lens is the roofline model and Nsight Compute's per-kernel SOL, not cluster goodput.
The bottleneck is correctness or convergence, not throughput. High goodput on a diverging run wastes compute efficiently.
A job is small and single-GPU, where queueing and restart effects are negligible and MFU/SOL already capture the picture.

Goodput is the top-level number. Roofline / arithmetic intensity, CUDA occupancy, and SOL are the per-kernel decompositions that explain why it is low.
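
For the per-kernel side, a minimal roofline sketch: a kernel's attainable FLOP/s is capped by the smaller of the compute roof and arithmetic intensity times memory bandwidth. The peak and bandwidth values below are hypothetical round numbers, not datasheet figures for any specific GPU.

```python
def attainable_flops(arithmetic_intensity: float, peak_flops_per_s: float,
                     mem_bw_bytes_per_s: float) -> float:
    """Roofline ceiling: min(compute roof, intensity [FLOP/byte] * bandwidth [byte/s])."""
    if arithmetic_intensity < 0 or peak_flops_per_s <= 0 or mem_bw_bytes_per_s <= 0:
        raise ValueError("intensity must be >= 0; peak and bandwidth must be > 0")
    return min(peak_flops_per_s, arithmetic_intensity * mem_bw_bytes_per_s)


# Hypothetical hardware (assumptions for illustration only).
PEAK_FLOPS = 1e15     # compute roof, FLOP/s
MEM_BW = 2e12         # memory bandwidth, bytes/s
ridge = PEAK_FLOPS / MEM_BW  # intensity at which the two roofs meet (500 FLOP/byte here)

for ai in (10.0, 500.0, 2000.0):
    roof = attainable_flops(ai, PEAK_FLOPS, MEM_BW)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:7.1f} FLOP/B -> ceiling {roof / PEAK_FLOPS:6.1%} of peak ({bound})")
```

A kernel far left of the ridge cannot approach peak FLOP/s however well it is scheduled, which is one way a cluster can look busy while goodput stays low.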

Architecture

Goodput is one number sitting at the top of a decomposition. Peak throughput (the speed-of-light ceiling) is what the silicon could do; wallclock work is what actually happened, and it splits into lost time (comm stalls, data waits, preemptions, restarts) and useful work. Efficiency is useful work over peak. The lower rows are the per-kernel lenses that explain a low top-level number.

flowchart TB
  PEAK["Peak throughput (speed of light): per_gpu_peak x num_gpus"] --> RAW["Wallclock work done"]
  RAW --> LOST["Lost time: comm stalls, data waits, preemptions, restarts"]
  RAW --> USEFUL["Useful work (goodput = useful_units / elapsed_s)"]
  USEFUL --> EFF["Efficiency = goodput / peak (0..1)"]
  USEFUL --> ETTR["ETTR = productive_runtime / available_wallclock"]
  EFF --> MFU["MFU = 6*N*tok_s / peak_flops (FLOPs-side companion)"]
  EFF --> SOL["Per-kernel SOL: Compute (SM) % and Memory %"]
  SOL --> ROOF["Roofline / occupancy / coalescing explain a low SOL"]

The measurement pipeline is: instrument the loop to emit a useful-work rate, compute MFU from that rate against the datasheet peak, then attribute the wallclock with profilers (PyTorch profiler for compute-vs-data-vs-collective, Nsight Systems for the timeline, Nsight Compute for per-kernel SOL) and map each leak to a mitigation.

How to compute it

Goodput is a rate; efficiency is that rate over the peak rate.

goodput         = useful_work_units / elapsed_seconds
peak_throughput = per_gpu_peak * num_gpus
efficiency      = goodput / peak_throughput          # 0..1
ETTR            = productive_runtime / available_wallclock

Worked example (training, from the book): an 8-GPU node processes 100,000 tokens in 10 s, so goodput is 10,000 tokens/s. If each GPU peaks at 1,500 tokens/s, peak is 12,000 tokens/s across 8 GPUs, giving 10,000 / 12,000 = 83.3% efficiency. [Fregly, Ch. 1]

Worked example (bottlenecked): a job that could reach 1,000 samples/s on ideal hardware but achieves only 300 samples/s (because of a poor input pipeline and excessive synchronization) runs at 30% goodput; the remaining 70% is wasted. [Fregly, Ch. 1]

The formulas below reproduce both worked examples exactly and guard the divide-by-zero boundaries. Runnable and asserted with numpy only.

# Core goodput math: reproduces the book's two worked examples plus edge cases.
import numpy as np


def efficiency(useful_work_units: float, elapsed_s: float,
               per_gpu_peak: float, num_gpus: int) -> float:
    """Realized goodput over aggregate peak throughput; in [0, 1] when goodput <= peak (not clamped)."""
    if elapsed_s <= 0 or per_gpu_peak <= 0 or num_gpus <= 0:
        return 0.0
    goodput = useful_work_units / elapsed_s
    peak = per_gpu_peak * num_gpus
    return goodput / peak


def ettr(productive_runtime_s: float, available_wallclock_s: float) -> float:
    """Effective Training-Time Ratio in [0, 1] (Kokolis et al. 2025)."""
    if available_wallclock_s <= 0:
        return 0.0
    return min(productive_runtime_s / available_wallclock_s, 1.0)


# Worked example 1 (book): 8 GPUs, 100000 tokens in 10 s, peak 1500 tok/s/GPU -> 83.3%.
eff1 = efficiency(useful_work_units=100_000, elapsed_s=10.0,
                  per_gpu_peak=1_500, num_gpus=8)
assert abs(eff1 - 10_000 / 12_000) < 1e-12
assert abs(eff1 - 0.8333333333333334) < 1e-9, eff1

# Worked example 2 (book): 300 samples/s achieved vs 1000 ideal -> 30% goodput.
eff2 = efficiency(useful_work_units=300, elapsed_s=1.0,
                  per_gpu_peak=1_000, num_gpus=1)
assert abs(eff2 - 0.30) < 1e-12, eff2

# Edge: at exactly peak throughput the result is 1.0 (not clamped; >1.0 would mean the peak is understated).
eff_full = efficiency(useful_work_units=12_000, elapsed_s=1.0,
                      per_gpu_peak=1_500, num_gpus=8)
assert eff_full == 1.0, eff_full

# Adversarial / boundary: zero elapsed time must not divide-by-zero, returns 0.0.
assert efficiency(useful_work_units=100, elapsed_s=0.0,
                  per_gpu_peak=1_000, num_gpus=1) == 0.0

# ETTR: a job that spent 9 productive hours out of a 10 hour wallclock window.
assert abs(ettr(9 * 3600, 10 * 3600) - 0.9) < 1e-12

# ETTR adversarial: productive time can never exceed wallclock (clamped to 1.0).
assert ettr(11 * 3600, 10 * 3600) == 1.0

# Equivalence to a slow, explicit reference over random inputs.
rng = np.random.default_rng(0)
for _ in range(10_000):
    u = float(rng.uniform(1, 1e6))
    t = float(rng.uniform(0.1, 100))
    p = float(rng.uniform(1, 1e5))
    g = int(rng.integers(1, 1024))
    ref = (u / t) / (p * g)
    assert abs(efficiency(u, t, p, g) - ref) < 1e-9

print("formulas OK:", round(eff1, 4), round(eff2, 4), ettr(9 * 3600, 10 * 3600))

How to use it: measure the achieved rate

Instrument the training loop to emit a useful-work rate (tokens or samples per second), the directly comparable numerator of goodput. Only count tokens from steps that actually committed an optimizer update. Tokens reprocessed after a checkpoint restore are not productive runtime and must be excluded; that exclusion is exactly what separates ETTR from naive utilization. [Kokolis et al. 2025]

The meter below carries the committed flag that enforces the exclusion. The assertions prove that replayed tokens are dropped and that the rate/efficiency helpers never divide by zero. Runnable and asserted with the standard library only.

# GoodputMeter: counts only productive tokens; asserts the ETTR exclusion rule.
import time


class GoodputMeter:
    """Tokens/sec of useful work since construction; create a new meter per logging interval.

    Only committed-step tokens count. Tokens reprocessed after a
    checkpoint restore are not productive runtime (Kokolis et al. 2025).
    """

    def __init__(self) -> None:
        self._tokens = 0
        self._t0 = time.perf_counter()

    def update(self, tokens_in_step: int, committed: bool = True) -> None:
        if committed:
            self._tokens += tokens_in_step

    def rate(self) -> float:
        dt = time.perf_counter() - self._t0
        return self._tokens / dt if dt > 0 else 0.0

    def efficiency(self, per_gpu_peak_tok_s: float, num_gpus: int) -> float:
        peak = per_gpu_peak_tok_s * num_gpus
        return self.rate() / peak if peak > 0 else 0.0


# Happy path: three committed steps of 1000 tokens each are all counted.
m = GoodputMeter()
for _ in range(3):
    m.update(1_000)
assert m._tokens == 3_000

# Edge (the point of ETTR): tokens replayed after a restart must NOT be counted.
m2 = GoodputMeter()
m2.update(1_000, committed=True)   # productive optimizer step
m2.update(5_000, committed=False)  # reprocessed after checkpoint restore
assert m2._tokens == 1_000, m2._tokens  # replayed tokens excluded

# Boundary: efficiency with a zero/absent peak returns 0.0 instead of raising.
assert m.efficiency(per_gpu_peak_tok_s=0.0, num_gpus=8) == 0.0

# Determinism: freeze the clock so rate() and efficiency() share one elapsed window.
peak, gpus = 10.0, 4
frozen = m._t0 + 2.0                      # pretend exactly 2.0 s elapsed
_real, time.perf_counter = time.perf_counter, lambda: frozen
try:
    expected_rate = m._tokens / 2.0
    assert abs(m.rate() - expected_rate) < 1e-9, (m.rate(), expected_rate)
    assert abs(m.efficiency(peak, gpus) - expected_rate / (peak * gpus)) < 1e-9
finally:
    time.perf_counter = _real

print("meter OK: counted", m2._tokens, "productive tokens (5000 replayed excluded)")

How to integrate it: MFU, the FLOPs-side companion

MFU (realized model FLOPs/s divided by the GPU's peak FLOPs/s) is the FLOPs-side companion to goodput and a standard cross-run efficiency number. [Kokolis et al. 2025] For a dense transformer, the common estimate is ~6 * N * D FLOPs for N parameters over D tokens (forward + backward).

Source the peak_flops_per_s from the GPU datasheet at the precision you actually run (BF16/FP8/FP4 differ by large factors); see NVIDIA Hopper Platform and NVIDIA Blackwell Datacenter Platform. Quote dense peak unless your kernels exploit structured sparsity.

The block pins the 6*N*D estimate to a known value (7B params at 10k tok/s on a 1 PFLOP/s GPU is 0.42 MFU) and guards the zero-peak boundary. Runnable and asserted with numpy only.

# MFU: dense-transformer Model FLOPs Utilization, ~6*N FLOPs/token (fwd+bwd).
import numpy as np


def mfu(params: int, tokens_per_sec: float, peak_flops_per_s: float) -> float:
    """Dense-transformer MFU in [0, 1] for well-formed inputs."""
    if peak_flops_per_s <= 0:
        return 0.0
    realized_flops_per_s = 6.0 * params * tokens_per_sec
    return realized_flops_per_s / peak_flops_per_s


# Known value: 7e9 params at 10000 tok/s on a 1 PFLOP/s (1e15) BF16 GPU.
#   realized = 6 * 7e9 * 1e4 = 4.2e14 FLOP/s ; MFU = 4.2e14 / 1e15 = 0.42.
val = mfu(params=7_000_000_000, tokens_per_sec=10_000.0, peak_flops_per_s=1e15)
assert abs(val - 0.42) < 1e-12, val

# Boundary: zero throughput -> zero MFU.
assert mfu(params=7_000_000_000, tokens_per_sec=0.0, peak_flops_per_s=1e15) == 0.0

# Adversarial: a zero/unknown datasheet peak must not divide-by-zero.
assert mfu(params=7_000_000_000, tokens_per_sec=10_000.0, peak_flops_per_s=0.0) == 0.0

# Monotonic: doubling achieved tokens/s exactly doubles MFU (linear in the rate).
assert abs(mfu(1_000_000, 200.0, 1e12) - 2 * mfu(1_000_000, 100.0, 1e12)) < 1e-15

# Equivalence to an explicit reference over random inputs.
rng = np.random.default_rng(1)
for _ in range(10_000):
    n = int(rng.integers(10**6, 10**12))  # integer bounds, matching the documented int arguments
    tps = float(rng.uniform(1, 1e5))
    peak = float(rng.uniform(1e12, 1e16))
    assert abs(mfu(n, tps, peak) - (6.0 * n * tps) / peak) < 1e-9

print("mfu OK: 7B @ 10k tok/s on 1 PFLOP/s BF16 ->", round(val, 3), "MFU")

How to run it in production: decompose with profilers

Find where goodput leaks. Use the PyTorch profiler to attribute wall time to compute vs. data-loading vs. collectives, then Nsight Systems for the timeline and Nsight Compute for per-kernel SOL.

The following block is a reference template (it needs torch, not installed here). Pin exact tool versions and re-check flags against the linked NVIDIA/PyTorch docs before relying on any output.

# Reference template (requires torch; not executed here).
import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb_trace"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()
        if step >= 6:
            break

The core math that profiler output feeds into is a wallclock attribution: sum per-op durations into compute, data-loading, and collective buckets, and read the compute share as the goodput fraction. This assumes the buckets do not overlap in time; where compute overlaps communication, attribute only the exposed (non-overlapped) wait to the collective bucket. The numpy-only block below validates that attribution: fractions partition the wall clock, an empty trace does not divide by zero, and it matches a slow explicit reference over random traces. Runnable and asserted.

# Core math the profiler illustrates: attribute wall time into
# compute / data-loading / collective buckets, then read off goodput. numpy-only.
import numpy as np


def attribute_walltime(durations_s: np.ndarray, categories: np.ndarray) -> dict[str, float]:
    """Sum per-op durations by category; fractions must sum to 1.0.

    Goodput fraction is the compute share: only compute advances the model;
    data-loading stalls and collective (comm) waits are overhead.
    """
    total = float(durations_s.sum())
    if total <= 0:
        return {"compute": 0.0, "data": 0.0, "collective": 0.0, "goodput_frac": 0.0}
    out: dict[str, float] = {}
    for cat in ("compute", "data", "collective"):
        out[cat] = float(durations_s[categories == cat].sum()) / total
    out["goodput_frac"] = out["compute"]
    return out


# Timeline: 6 s compute, 2 s data stall, 2 s NCCL wait over one profiled window.
durs = np.array([3.0, 3.0, 1.0, 1.0, 1.0, 1.0])
cats = np.array(["compute", "compute", "data", "data", "collective", "collective"])
res = attribute_walltime(durs, cats)

assert abs(res["compute"] - 0.6) < 1e-12, res
assert abs(res["data"] - 0.2) < 1e-12
assert abs(res["collective"] - 0.2) < 1e-12
# Fractions partition the wall clock exactly.
assert abs(res["compute"] + res["data"] + res["collective"] - 1.0) < 1e-12
# Goodput fraction equals the compute share.
assert abs(res["goodput_frac"] - 0.6) < 1e-12

# Adversarial / corruption: an empty (or all-zero) trace must not divide-by-zero.
empty = attribute_walltime(np.array([]), np.array([]))
assert empty["goodput_frac"] == 0.0

# Boundary: a perfectly compute-bound window is 100% goodput.
allc = attribute_walltime(np.array([5.0, 5.0]), np.array(["compute", "compute"]))
assert allc["goodput_frac"] == 1.0

# Equivalence to a slow, explicit reference over random traces.
rng = np.random.default_rng(2)
labels = np.array(["compute", "data", "collective"])
for _ in range(5_000):
    n = int(rng.integers(1, 50))
    d = rng.uniform(0.0, 10.0, size=n)
    c = labels[rng.integers(0, 3, size=n)]
    tot = d.sum()
    ref = 0.0 if tot <= 0 else d[c == "compute"].sum() / tot
    assert abs(attribute_walltime(d, c)["goodput_frac"] - ref) < 1e-9

print("attribution OK: compute=%.2f data=%.2f collective=%.2f goodput=%.2f"
      % (res["compute"], res["data"], res["collective"], res["goodput_frac"]))

Drive the same profilers from the command line for the full timeline and per-kernel SOL:

# System timeline: spot data-loader stalls, NCCL gaps, idle GPU windows.
nsys profile -t cuda,nvtx,osrt,cudnn,cublas \
  -o goodput_timeline python train.py

# Per-kernel Speed-of-Light: how close each kernel runs to the hardware ceiling.
ncu --set full -k regex:".*" -c 20 -o kernel_sol python train.py

In Nsight Compute's GPU Speed of Light Throughput section, read Compute (SM) % and Memory %: a low SM with high Memory means memory-bound (attack via Memory Coalescing and Vectorized Access, Shared Memory, Bank Conflicts, and Tiling, Kernel Fusion); both low means latency/occupancy-bound (see CUDA Occupancy Tuning). [NVIDIA Nsight Compute Profiling Guide] Workflow detail lives in Profiling GPUs: Nsight Systems and Nsight Compute.
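
The reading rule above can be sketched as a small triage helper. The 60% cutoff for "high" is a hypothetical threshold chosen for illustration (Nsight Compute does not prescribe this value here), and the compute-bound and balanced branches are the natural complements of the two cases the text names rather than something quoted from the guide.

```python
def classify_sol(sm_pct: float, mem_pct: float, high: float = 60.0) -> str:
    """Coarse bottleneck label from the Compute (SM) % and Memory % SOL readings.

    `high` is an assumed cutoff, not an NVIDIA-defined constant; tune it per workload.
    """
    if not (0.0 <= sm_pct <= 100.0 and 0.0 <= mem_pct <= 100.0):
        raise ValueError("SOL percentages must be in [0, 100]")
    sm_high, mem_high = sm_pct >= high, mem_pct >= high
    if mem_high and not sm_high:
        return "memory-bound"             # low SM, high Memory
    if sm_high and not mem_high:
        return "compute-bound"            # high SM, low Memory
    if not sm_high and not mem_high:
        return "latency/occupancy-bound"  # both low
    return "balanced"                     # both near their ceilings


# Illustrative readings, not measurements from a real kernel.
for sm, mem in [(22.0, 88.0), (91.0, 35.0), (18.0, 25.0), (80.0, 78.0)]:
    print(f"SM={sm:5.1f}%  Mem={mem:5.1f}%  -> {classify_sol(sm, mem)}")
```

Treat the label as a pointer to the next profiler section to open, not as a diagnosis.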

How to scale it: close the gap

Map each goodput leak to a known mitigation:

GPUs waiting on data -> cache and async prefetch; pin loader threads with NUMA Affinity and CPU Pinning for GPU Pipelines. [Fregly, Ch. 1]
GPUs idle during gradient sync -> overlap compute with communication (for example gradient bucketing / DistributedDataParallel, FSDP overlap, CUDA Streams and Concurrency). [Fregly, Ch. 1]
Kernel below SOL -> fuse and tile to raise arithmetic intensity (Roofline Model and Arithmetic Intensity, Tensor Cores and Mixed Precision, FlashAttention and Multi-Head Latent Attention).
Launch overhead dominating -> capture the steady-state region with CUDA Graphs: Capture, Replay, and Launch Overhead.
Restarts eating ETTR -> shorten checkpoint interval and restart overhead; at 100k-GPU scale with RSC-2-like failure rates, the paper indicates ~2-minute checkpoint/restart targets to hold ETTR 0.9. [Kokolis et al. 2025] See Distributed Training Platform.
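
To see why the checkpoint interval and restart overhead matter at scale, here is a first-order sketch of expected ETTR. It is an assumption-laden approximation, not the paper's model: failures arrive at a constant rate proportional to node count, each failure costs the restart overhead plus on average half a checkpoint interval of redone work, and checkpoint write time and queueing are ignored. The failure rate and node count below are hypothetical inputs.

```python
def expected_ettr(num_nodes: int, failures_per_node_day: float,
                  restart_overhead_s: float, ckpt_interval_s: float) -> float:
    """First-order ETTR estimate: 1 - failure_rate * (restart + interval / 2), floored at 0."""
    if min(num_nodes, failures_per_node_day, restart_overhead_s, ckpt_interval_s) < 0:
        raise ValueError("all inputs must be non-negative")
    failures_per_s = num_nodes * failures_per_node_day / 86_400.0
    lost_s_per_failure = restart_overhead_s + ckpt_interval_s / 2.0
    return max(0.0, 1.0 - failures_per_s * lost_s_per_failure)


# Hypothetical cluster: 12,500 nodes (100k GPUs at 8 per node), 5 failures per 1000 node-days.
NODES, RATE = 12_500, 0.005
for interval_min in (2, 10, 30, 60):
    est = expected_ettr(NODES, RATE, restart_overhead_s=120.0,
                        ckpt_interval_s=interval_min * 60.0)
    print(f"checkpoint every {interval_min:2d} min -> estimated ETTR {est:.3f}")
```

Under these assumptions the loss grows linearly with both the node count and the checkpoint interval, so intervals that are harmless on a small job become the dominant leak on a very large one.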

How to maintain it

Log goodput, MFU, and ETTR continuously, not just at launch. Wire them into Observability and Monitoring and alert on regression; set up automated performance tests to catch reductions early in the development cycle. [Fregly, Ch. 1] Track the rate alongside DCGM device metrics from GPU Diagnostics and Validation so a goodput drop can be correlated with thermal throttling, ECC events, or a degraded NVLink/NIC.
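
One possible shape for the regression alert, sketched with the standard library only: compare each new goodput sample against a rolling median baseline and flag drops beyond a tolerance. The class name, the window size, and the 10% tolerance are hypothetical choices for illustration; a production setup would more likely express this as a rule in the monitoring stack.

```python
from collections import deque
from statistics import median


class GoodputRegressionAlert:
    """Flags samples that fall more than `max_drop` below a rolling median baseline."""

    def __init__(self, window: int = 20, max_drop: float = 0.10) -> None:
        self._history: deque[float] = deque(maxlen=window)
        self._max_drop = max_drop

    def observe(self, goodput: float) -> bool:
        """Record one sample; return True if it is a regression against the baseline."""
        full = len(self._history) == self._history.maxlen
        baseline = median(self._history) if full else None
        alert = bool(baseline is not None and baseline > 0
                     and goodput < baseline * (1.0 - self._max_drop))
        if not alert:
            # Keep regressed samples out of the baseline so it does not drift down.
            self._history.append(goodput)
        return alert


alert = GoodputRegressionAlert(window=5, max_drop=0.10)
samples = [10_000.0] * 5 + [9_500.0, 8_500.0, 10_100.0]  # illustrative tokens/s readings
for s in samples:
    print(f"{s:8.0f} tok/s -> {'ALERT' if alert.observe(s) else 'ok'}")
```

No alert fires until the window is full, which avoids false positives during warmup; the median keeps a single outlier from moving the baseline.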

Failure modes

The recurring ways a goodput program misleads or breaks:

Trusting nvidia-smi GPU-Util. It only reports that a kernel was resident in the sampling window, not that it did useful math, so it reads high while time burns on stalls and restarts. Cross-check against measured goodput and per-kernel SOL. [Fregly, Ch. 1]
Counting replayed tokens. Including tokens reprocessed after a checkpoint restore inflates the rate and turns ETTR back into naive utilization. Only committed optimizer steps are productive runtime. [Kokolis et al. 2025]
Wrong-precision peak in MFU. Dividing by a BF16 peak while running FP8/FP4 (or by a sparse peak while running dense) skews MFU by large factors. Pin the datasheet peak to the precision and sparsity you actually run. [Kokolis et al. 2025]
Optimizing goodput on a diverging run. High goodput on a run that is not converging wastes compute efficiently; correctness gates throughput.
Reading per-GPU numbers as a cross-platform score. MLPerf cautions that per-GPU throughput is not a primary cross-platform metric; compare system-level, end-to-end. [Fregly, Ch. 1]
Conflating the 70 to 75% lost figure with a hard ceiling. It is the headline-vs-realized gap available before mitigation, not a contradiction of the >0.9 ETTR tightly engineered runs reach. When they diverge, prefer the measured ETTR. [Fregly, Ch. 1; Kokolis et al. 2025]
Divide-by-zero at the boundaries. Zero elapsed time, zero peak, or an empty profiler trace must return 0.0, not raise; the code blocks above assert exactly this.
Checkpoint interval too long at scale. As failure rates rise with GPU count, a slow checkpoint/restart erodes ETTR even when steady-state compute looks healthy; hold ~2-minute targets at 100k-GPU scale. [Kokolis et al. 2025]

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 1, "Measuring 'Goodput' Useful Throughput," goodput definition, worked examples (83.3%, 30%), 70 to 75% lost-compute figure, speed-of-light terminology, and the measure-goodput key takeaway.
A. Kokolis, et al. (Meta), "Revisiting Reliability in Large-Scale Machine Learning Research Clusters," arXiv:2410.21680 (HPCA 2025), ETTR and goodput definitions, MFU, ETTR > 0.9 on large RSC-1 jobs, 100k-GPU checkpoint projection. https://arxiv.org/abs/2410.21680
NVIDIA Nsight Compute Profiling Guide, GPU Speed of Light Throughput section, achieved % of theoretical maximum for Compute (SM) and Memory. https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html
PyTorch Profiler documentation. https://pytorch.org/docs/stable/profiler.html
NVIDIA Nsight Systems user guide. https://docs.nvidia.com/nsight-systems/UserGuide/index.html