Markdown

The roofline model and arithmetic intensity¶

Scope: arithmetic intensity (FLOPs per byte), the roofline envelope of peak-compute and peak-bandwidth ceilings, the ridge point that separates memory-bound from compute-bound kernels, and how to read that placement to pick the optimisation (fuse, tile, raise precision) that will actually move the needle, versus one that wastes effort.

Reference numbers below are illustrative Blackwell-class figures from the cited book chapter, not hardware-measured results on any specific board. Confirm peak FLOPS and peak HBM bandwidth against the NVIDIA datasheet for your exact SKU before relying on a ridge-point value.

flowchart LR
  K["Kernel: AI = FLOPs / bytes from HBM"] --> CMP["Compare AI to ridge point"]
  CMP --> MEM["AI below ridge: memory-bound"]
  CMP --> CMPT["AI above ridge: compute-bound"]
  MEM --> ACT["Raise AI: fuse, tile, lower precision"]
  CMPT --> ACT2["Push toward peak FLOPS: Tensor Cores, ILP"]

What it is¶

The roofline model is a visualisation that bounds achievable floating-point throughput by two hardware ceilings: a horizontal line at the processor's peak FLOPS and a diagonal line whose slope is the peak memory bandwidth. Achievable performance is P = min(peak_FLOPS, AI x peak_bandwidth), where arithmetic intensity (AI) is FLOPs performed per byte transferred between off-chip global memory (HBM) and the GPU. Plotting a kernel's AI on the x-axis and its measured FLOPs/sec on the y-axis shows whether it is limited by compute or by data movement.

The two ceilings meet at the ridge point. Its x-coordinate is the AI at which a kernel transitions from memory-bound (left of the ridge) to compute-bound (right of the ridge); formally ridge = peak_FLOPS / peak_bandwidth (FLOPs/byte) (NERSC Roofline). For a representative Blackwell-class GPU the book uses ~80 TFLOPs FP32 over ~8 TB/s HBM3e, giving a ridge near 10 FLOPs/byte (80 TFLOP/s / 8 TB/s = 10) [book, Ch. 6].

A worked memory-bound case: a kernel that loads two FP32 values (8 bytes), adds them (1 FLOP), and writes one FP32 (4 bytes) moves 12 bytes for 1 FLOP: AI = 1/12 ≈ 0.083 FLOPs/byte, more than 100x below the 10 FLOPs/byte ridge, so it sits hard on the memory-bandwidth diagonal and cannot keep the ALUs busy [book, Ch. 6].

Why use it¶

Roofline tells you which optimisation class can help before you spend time on it. Optimising compute on a memory-bound kernel (or chasing bandwidth on a compute-bound one) wastes effort (Performance Optimization and Tuning). The discipline is: measure, locate the dominant ceiling, fix that one layer, re-measure.

The gap between the ceilings is widening. GPU FLOPS are outpacing memory bandwidth each generation, so more kernels land on the memory-bound side, and raising AI is increasingly the lever that matters [book, Ch. 6 & 9]. This is acute for LLM inference: the decode phase streams hundreds of GB of weights from HBM per step and is typically memory-bound, while prefill/attention compute is closer to compute-bound, so the same model occupies different points on the roofline by phase [book, Ch. 6]. AI is the metric that connects a kernel to that picture and to goodput.

When to use it (and when not)¶

Use roofline analysis when:

A kernel or step is slow and you do not yet know whether compute or memory is the limiter.
You are choosing between fusion (Kernel Fusion), tiling (Shared Memory, Bank Conflicts, and Tiling), or a precision change (Tensor Cores and Mixed Precision) and want to know which raises AI enough to cross the ridge.
You need to set a realistic performance target: the relevant ceiling (not the headline peak) is the achievable bound.

It is the wrong tool when:

Occupancy is already the bottleneck: too few warps to hide latency. Fix parallelism first (CUDA Occupancy Tuning); a kernel that under-occupies the SM will sit below both roofs regardless of AI.
The kernel is latency-bound on dependency chains or launch overhead rather than throughput-bound; roofline assumes steady-state throughput.
The algorithm's AI is intrinsically fixed (e.g. a single elementwise pass with no reuse). No tiling can raise reuse that does not exist; the only lever left is moving fewer bytes (lower precision) [book, Ch. 9].

Note: a kernel can be at 100% occupancy and still be memory-bound; high occupancy hides latency but does not raise AI [book, Ch. 6].

Architecture¶

The roofline chart has three anatomical parts. The memory diagonal rises from the origin with slope equal to peak bandwidth: at low AI, every FLOP waits on bytes, so throughput is AI x peak_bandwidth. The compute roof is the flat ceiling at peak FLOPS: no kernel exceeds it whatever its AI. The two meet at the ridge point AI* = peak_FLOPS / peak_bandwidth. Left of the ridge a kernel is bandwidth-limited (raise AI to climb the diagonal); right of it the kernel is compute-limited (push toward peak FLOPS with Tensor Cores and instruction-level parallelism). The ridge is a property of the hardware, not the kernel: it moves right each generation as FLOPS outpace bandwidth, which is why more kernels fall on the memory-bound side over time [book, Ch. 6 & 9].

flowchart LR
  ORIGIN["AI = 0"] --> DIAG["Memory diagonal: slope = peak bandwidth"]
  DIAG --> RIDGE["Ridge point at peak_FLOPS / peak_bandwidth"]
  RIDGE --> ROOF["Compute roof: flat at peak FLOPS"]
  RIDGE --> MEM["Left of ridge: memory-bound region"]
  RIDGE --> CMP["Right of ridge: compute-bound region"]

How to use it: estimate and read placement¶

Count FLOPs and the bytes that actually cross HBM (loads + stores at the precision you use), then divide. For the fused L2-norm kernel shown later, each element is read once for the sum-of-squares and read again to normalise, plus one write: ~12 bytes/element for ~3 FLOPs, so AI ≈ 3/12 = 0.25 FLOPs/byte [book, Ch. 9]. Place that against your ridge point: 0.25 is still well left of 10, so this kernel stays memory-bound and the win comes from cutting bytes and launches, not from compute.

The block below is the core roofline arithmetic: derive the ridge, compute a kernel's AI, and classify it. It checks the book's headline numbers, the boundary at the ridge, and equivalence to a slow piecewise reference across an AI sweep.

# Roofline core: ridge point, attainable performance, and memory/compute placement.
# Runnable and self-checking. Numbers are illustrative Blackwell-class figures;
# replace peak_flops and peak_bw with your SKU's datasheet values.
def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two ceilings meet."""
    return peak_flops / peak_bw

def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline bound: min(compute roof, memory diagonal)."""
    return min(peak_flops, ai * peak_bw)

def is_memory_bound(ai: float, peak_flops: float, peak_bw: float) -> bool:
    return ai < ridge_point(peak_flops, peak_bw)

peak_flops = 80e12      # FLOP/s, FP32
peak_bw    = 8e12       # byte/s, HBM3e
ridge = ridge_point(peak_flops, peak_bw)
assert ridge == 10.0                                   # 80 TFLOP/s over 8 TB/s -> 10 FLOP/byte

# Worked memory-bound kernel: load 2 FP32 + store 1 FP32 (12 B) for 1 FLOP.
ai_add = 1 / 12
assert round(ai_add, 3) == 0.083
assert is_memory_bound(ai_add, peak_flops, peak_bw)
assert attainable_flops(ai_add, peak_flops, peak_bw) == ai_add * peak_bw   # diagonal binds
assert attainable_flops(ai_add, peak_flops, peak_bw) < peak_flops / 100    # <1% of peak compute

# Boundary and adversarial cases.
assert attainable_flops(0.0, peak_flops, peak_bw) == 0.0                   # no work, no data
assert attainable_flops(ridge, peak_flops, peak_bw) == peak_flops          # at ridge, roof binds
assert is_memory_bound(ridge - 1e-9, peak_flops, peak_bw)                  # strict transition
assert not is_memory_bound(ridge + 1e-9, peak_flops, peak_bw)
# Equivalence to a slow piecewise reference across a sweep.
for ai in [0.0, 0.1, 1.0, 9.99, 10.0, 10.01, 50.0]:
    ref = ai * peak_bw if ai < ridge else peak_flops
    assert attainable_flops(ai, peak_flops, peak_bw) == ref
print("ridge=%.1f FLOP/byte  add-kernel AI=%.3f  memory-bound  OK" % (ridge, ai_add))
# ridge=10.0 FLOP/byte  add-kernel AI=0.083  memory-bound  OK

How to integrate it: the arithmetic-intensity levers¶

Roofline is not a standalone tool; it is the decision layer that tells you which of the codebase's optimisation techniques to reach for. Three levers raise AI, each linking to its own page: fuse to cut round trips, tile to reuse bytes, and lower precision to move fewer bytes per value.

Fuse to remove round trips¶

Fusing chained ops keeps intermediates in registers so they never touch HBM, cutting the denominator. The naive z = sqrt(sin(x)) as two kernels reads and writes y to global memory; the fused form reads x once and writes z once [book, Ch. 9]. The kernel below is a reference template (CUDA, built with nvcc on a GPU, so not runnable in this doc); the numpy block after it validates the L2-normalisation math and its arithmetic intensity.

// Fused L2-normalise one [batch, hidden] row per block.
// Reads x twice (sum-of-squares, then normalise) + one write: ~12 bytes/element.
__global__ void fusedL2Norm(const float* __restrict__ x,
                            float* __restrict__ y,
                            int hidden) {
    extern __shared__ float sdata[];          // reduction buffer
    const int batch = blockIdx.x;             // one block per row
    const int tid   = threadIdx.x;
    const float* batch_ptr = x + size_t(batch) * hidden;

    float local = 0.f;
    for (int i = tid; i < hidden; i += blockDim.x) {
        float v = batch_ptr[i];
        local = fmaf(v, v, local);            // v*v + local
    }
    sdata[tid] = local;
    __syncthreads();

    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1) {
        if (tid < offset) sdata[tid] += sdata[tid + offset];
        __syncthreads();
    }

    float inv = rsqrtf(sdata[0]);             // multiply by inverse vs. divide
    float* out = y + size_t(batch) * hidden;
    for (int i = tid; i < hidden; i += blockDim.x)
        out[i] = batch_ptr[i] * inv;
}

Versus a three-kernel square -> reduce -> divide pipeline (~4 FLOPs for 36 bytes of HBM traffic after intermediate writes), the fused kernel does ~4 FLOPs for ~12 bytes: higher AI, fewer launches, better cache locality [book, Ch. 9]. In PyTorch the same effect comes for free from torch.compile / TorchInductor, which fuses elementwise chains; prefer fused library ops over Python loops of small kernels [book, Ch. 9].

The numpy block validates that the fused (vectorised) computation equals a slow element-wise reference, that outputs are unit vectors, that the fused AI is the book's 0.25 FLOP/byte and strictly above the unfused pipeline, and that a zero row (no direction) surfaces as non-finite rather than silently wrong.

import numpy as np

def l2_normalize_reference(x: np.ndarray) -> np.ndarray:
    """Slow, obviously-correct row-wise L2 normalization: y = x / ||x||_2."""
    out = np.empty_like(x)
    for r in range(x.shape[0]):
        s = 0.0
        for v in x[r]:
            s += float(v) * float(v)               # sum of squares (the kernel's reduction)
        out[r] = x[r] / np.sqrt(s)                 # multiply by 1/sqrt(sumsq)
    return out

def l2_normalize_fused(x: np.ndarray) -> np.ndarray:
    """Vectorized form matching the fused kernel: read, read, write in one pass."""
    inv = 1.0 / np.sqrt((x * x).sum(axis=1, keepdims=True))
    return x * inv

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512)).astype(np.float32)
assert np.allclose(l2_normalize_fused(x), l2_normalize_reference(x), atol=1e-4)     # fused == reference
assert np.allclose(np.linalg.norm(l2_normalize_fused(x), axis=1), 1.0, atol=1e-4)  # unit rows

# Arithmetic intensity of the fused kernel: ~3 FLOP over 12 bytes/element.
fp32 = np.dtype(np.float32).itemsize
bytes_fused = 3 * fp32                              # 2 loads + 1 store = 12 B
ai_fused = 3 / bytes_fused
assert round(ai_fused, 2) == 0.25                   # book, Ch. 9

# Adversarial: an unfused square -> reduce -> divide pipeline round-trips intermediates,
# moving more bytes for the same FLOPs, so its AI must be strictly lower.
ai_unfused = 3 / (6 * fp32)
assert ai_fused > ai_unfused
# Corruption check: a zero row has no direction; normalization must surface as non-finite.
with np.errstate(divide="ignore", invalid="ignore"):
    assert not np.isfinite(l2_normalize_fused(np.zeros((1, 4), np.float32))).all()
print("fused==reference, rows unit-norm, AI=%.2f FLOP/byte (> unfused %.3f)  OK" % (ai_fused, ai_unfused))
# fused==reference, rows unit-norm, AI=0.25 FLOP/byte (> unfused 0.125)  OK

Tile for reuse¶

Loading a tile of A and B into shared memory lets each fetched byte feed many multiply-accumulates at SRAM speed. A 32x32 tile yields 32 multiplies per loaded element, multiplying AI by the reuse factor and moving the kernel right toward the compute roof [book, Ch. 9]. See Shared Memory, Bank Conflicts, and Tiling and Kernel Fusion; coalesce the global loads first (Memory Coalescing and Vectorized Access).

def tiled_ai(base_ai: float, tile: int) -> float:
    """A tile x tile block reuses each loaded element tile times."""
    return base_ai * tile

assert tiled_ai(1.0, 32) == 32.0              # 32x32 tile -> 32 MACs per loaded element
assert tiled_ai(0.25, 32) == 8.0              # scales the base AI by the reuse factor
assert tiled_ai(1.0, 1) == 1.0                # 1x1 "tile" is no reuse: AI unchanged (boundary)
# Adversarial: reuse can only raise (never lower) AI, and grows with tile size.
assert all(tiled_ai(0.25, t) >= 0.25 for t in range(1, 65))
assert tiled_ai(0.25, 64) > tiled_ai(0.25, 32)
print("32x32 tile => 32x reuse; monotone in tile size  OK")
# 32x32 tile => 32x reuse; monotone in tile size  OK

Lower precision for fewer bytes per value¶

Halving bytes-per-value doubles AI for the same FLOPs. FP16 halves FP32 traffic; FP8 (1 byte) and Blackwell FP4 (0.5 byte) cut further. A single 128-byte transaction carries 32 FP32, 64 FP16, 128 FP8, or 256 FP4 values [book, Ch. 6]. In PyTorch, enable TF32 matmuls and BF16 autocast. The snippet below is a reference template (needs a CUDA GPU and PyTorch, so not runnable here); its names and signatures are confirmed against the PyTorch docs, torch.set_float32_matmul_precision and torch.amp.autocast. The numpy block after it validates the byte/AI math and the numerical trade-off it relies on.

import torch

torch.set_float32_matmul_precision("high")   # {'highest'|'high'|'medium'}: TF32 matmuls

with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    output = model(input)                      # FP16/BF16 compute, FP32 accumulation

BF16 keeps FP32's 8-bit exponent range, so it usually needs no loss scaling; FP16's 5-bit exponent can underflow and may need a GradScaler [book, Ch. 9]. See Tensor Cores and Mixed Precision. The numpy block confirms the byte-halving that raises AI, the 128-byte packing counts, and why BF16 is preferred: a small value that survives FP32 underflows FP16 (loss scaling rescues it) and a large one overflows FP16's ceiling.

import numpy as np

# 1. Halving bytes/value doubles AI for identical FLOPs.
assert np.dtype(np.float32).itemsize == 4
assert np.dtype(np.float16).itemsize == 2
assert np.dtype(np.float32).itemsize / np.dtype(np.float16).itemsize == 2.0

# 2. A 128-byte transaction packs more values as precision drops.
itemsize = {"fp32": 4, "fp16": 2, "fp8": 1, "fp4": 0.5}
def values_per_128B(dtype: str) -> float:
    return 128 / itemsize[dtype]
assert values_per_128B("fp32") == 32          # book, Ch. 6
assert values_per_128B("fp16") == 64
assert values_per_128B("fp8")  == 128
assert values_per_128B("fp4")  == 256

# 3. Why BF16 beats FP16 for training: FP16's 5-bit exponent underflows small gradients.
with np.errstate(over="ignore", under="ignore", invalid="ignore"):
    assert np.float16(np.float32(1e-5)) > 0    # representable (subnormal) in FP16
    assert np.float16(np.float32(1e-8)) == 0   # underflows FP16 -> gradient lost
    assert np.float32(1e-8) > 0                # survives FP32 accumulation
    # Loss scaling (what GradScaler does) rescues the underflowing value.
    assert np.float16(np.float32(1e-8) * np.float32(2.0 ** 16)) > 0
    # FP16 also overflows its 65504 ceiling; BF16's 8-bit exponent (FP32 range) would not.
    assert np.isinf(np.float16(70000.0))
    assert np.isfinite(np.float32(70000.0))
print("FP16 halves bytes (AI x2); packs 32/64/128/256 per 128B; underflows 1e-8; scaling rescues  OK")
# FP16 halves bytes (AI x2); packs 32/64/128/256 per 128B; underflows 1e-8; scaling rescues  OK

How to run it in production: profile with Nsight¶

Nsight Compute (ncu) builds the roofline chart from the Speed Of Light section. Section and set names confirmed against the NVIDIA Nsight Compute Profiling Guide (Roofline):

# Collect the roofline section for one kernel.
ncu --set roofline -o roofline_report ./your_program

# Or select the sections explicitly.
ncu --section SpeedOfLight \
    --section SpeedOfLight_RooflineChart \
    -o roofline_report ./your_program

# List what your ncu build exposes.
ncu --list-sets
ncu --list-sections

A memory-bound kernel shows high DRAM-bandwidth utilisation with low ALU utilisation and memory-throttle stalls; after fusing, tiling, or lowering precision, those memory stalls drop and the kernel's dot moves right and up toward the compute roof [book, Ch. 9]. Pair ncu (per-kernel why) with Nsight Systems nsys (timeline when); see Profiling GPUs: Nsight Systems and Nsight Compute and GPU Diagnostics and Validation. Measure DRAM traffic from the profiler rather than trusting a hand count, because caching and re-fetches shift the real byte total.

How to maintain it¶

Re-derive the ridge point per GPU generation: it is peak_FLOPS / peak_bandwidth, and both numbers, along with their ratio, shift each generation (NVIDIA Hopper Platform, NVIDIA Blackwell Datacenter Platform). Use the precision-specific peak (FP32 vs FP8 vs FP4) that matches the kernel.
Re-profile after every change; an optimisation that should raise AI sometimes regresses it via register spilling or lost coalescing. Always measure, do not assume [book, Ch. 6].
Track the placement over time as a regression signal in Observability and Monitoring; a kernel drifting back toward the memory diagonal usually means a precision or fusion path was lost.

How to scale it¶

The compute/bandwidth gap widens each generation, so scaling throughput means scaling arithmetic intensity, not merely buying more FLOPS [book, Ch. 6 & 9]. Two levers scale AI at the model level rather than the single-kernel level:

Precision reduction as models grow. Stepping FP16 -> FP8 -> FP4 halves bytes-per-value at each step, doubling AI for the same FLOPs (validated in the precision block above). This is the primary scaling lever, bounded by numerical stability; see Quantization for Inference and Tensor Cores and Mixed Precision.
Batching the memory-bound decode phase. LLM decode streams the weight matrix from HBM once per step and reuses it across the batch, so AI grows roughly linearly with batch size until decode crosses the ridge into compute-bound. This is why single-stream decode wastes FLOPS and why serving stacks batch aggressively; see Continuous Batching Internals and Goodput.

Across GPUs the same logic generalises: when a kernel's operands live on a peer device, the relevant bandwidth ceiling becomes the interconnect (NVLink or PCIe), not HBM, and the roofline is drawn against that diagonal instead (NVSwitch and NVLink).

The block below derives the decode-batching scaling law: it confirms AI grows near-linearly with batch, that batch-1 GEMV sits an order of magnitude below the ridge, and it locates the exact batch where decode crosses into compute-bound.

import numpy as np

# Decode streams one [d_out, d_in] weight matrix from HBM once and reuses it across
# the batch, so arithmetic intensity grows roughly linearly with batch size.
def decode_ai(batch: int, d_in: int, d_out: int, bytes_per_w: int = 2) -> float:
    flops     = 2 * batch * d_out * d_in           # 2 FLOP per multiply-accumulate
    w_bytes   = d_out * d_in * bytes_per_w         # weights loaded once (dominant term)
    act_bytes = batch * (d_in + d_out) * bytes_per_w
    return flops / (w_bytes + act_bytes)

ridge = 10.0                                       # illustrative ridge point (FLOP/byte)
d_in = d_out = 8192
ai1, ai64 = decode_ai(1, d_in, d_out), decode_ai(64, d_in, d_out)
assert ai64 > ai1                                  # monotone in batch
assert 0.9 < (ai64 / ai1) / 64 < 1.0               # near-linear, slightly sublinear (activation bytes)
assert 0.9 < ai1 < 1.01                            # batch-1 GEMV: ~1 FLOP/byte at BF16 (2 FLOP / 2 B per weight)
assert ai1 < ridge                                 # ... an order of magnitude below the ridge: memory-bound

# Boundary: locate the exact batch where decode crosses the ridge.
crossing = next(b for b in range(1, 4096) if decode_ai(b, d_in, d_out) >= ridge)
assert decode_ai(crossing, d_in, d_out) >= ridge
assert decode_ai(crossing - 1, d_in, d_out) < ridge
print("AI(batch=1)=%.4f  AI(batch=64)=%.2f  crosses ridge=%.0f at batch=%d  OK" % (ai1, ai64, ridge, crossing))
# AI(batch=1)=0.9998  AI(batch=64)=63.02  crosses ridge=10 at batch=11  OK

Failure modes¶

Wrong ridge from a precision-mismatched peak. Comparing an FP32 ALU kernel's AI against the FP16 or FP8 Tensor Core peak inflates the ridge and misclassifies the kernel. Use the precision-matched peak FLOPS [book, Ch. 6].
Illustrative ridge taken as ground truth. The 10 FLOPs/byte here is a book figure for a representative Blackwell-class SKU, not your board. Confirm peak FLOPS and HBM bandwidth on the datasheet before acting on a ridge value.
Byte-count error. AI counts bytes crossing HBM, not L2 or shared-memory reuse. Counting cached re-reads as HBM traffic overstates AI; ignoring uncoalesced re-fetches understates it. Read DRAM traffic from ncu rather than hand-counting (GPU Memory Hierarchy, Memory Coalescing).
Optimisation that regresses AI. Fusion that spills registers to local memory (HBM), or tiling that breaks coalescing, moves the dot left, not right. Always re-profile after a change [book, Ch. 6].
Sub-roofline placement. A kernel below both ceilings is occupancy- or latency-limited, not sitting on the memory or compute bound; roofline alone will not explain it. Fix parallelism first (CUDA Occupancy Tuning).
Precision cut that breaks numerics. FP16 without loss scaling underflows small gradients and overflows past 65504 (see the precision block), so a higher-AI kernel can still produce wrong results. Gate correctness before celebrating the AI win: prefer BF16 or add a GradScaler [book, Ch. 9] (Tensor Cores and Mixed Precision).

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Chapter 6, "GPU Architecture, CUDA Programming, and Maximizing Occupancy" (roofline, ridge point, AI = 0.083 worked example, ~80 TFLOPs / ~8 TB/s => ~10 FLOPs/byte ridge, precision and 128-byte transaction packing); Chapter 9, "Increasing CUDA Kernel Efficiency and Arithmetic Intensity" (tiling reuse factor, kernel fusion, fused L2-norm AI ≈ 0.25, mixed precision).
NERSC, Roofline Performance Model: ridge point I* = peak_FLOP/s / peak_bandwidth, memory-bound vs compute-bound definition: https://docs.nersc.gov/tools/performance/roofline/
NVIDIA Nsight Compute Kernel Profiling Guide: GPU Speed Of Light Roofline chart, --set roofline, SpeedOfLight_RooflineChart section: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html
PyTorch, torch.set_float32_matmul_precision (TF32 matmul precision): https://docs.pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html
PyTorch, Automatic Mixed Precision (torch.amp.autocast): https://docs.pytorch.org/docs/stable/amp.html