Host OS and kernel tuning for GPU nodes
Scope: host kernel/OS knobs that keep GPUs fed, namely vm.swappiness/swapoff, transparent hugepages (training vs inference), the performance CPU governor and disabling deep C-states to kill GPU bubbles, isolcpus/nohz_full plus IRQ affinity (disable irqbalance), vm.dirty_ratio and O_DIRECT/io_uring for checkpoint writes, and jemalloc/tcmalloc allocator tuning. NUMA/CPU pinning itself lives in NUMA Affinity and CPU Pinning for GPU Pipelines; this page covers the rest of the host stack.
flowchart LR
SWAP["vm.swappiness=0 / swapoff -a<br/>(removes swap-in stalls)"] --> GOAL
THP["transparent hugepages<br/>(cut page faults / TLB pressure)"] --> GOAL
GOV["performance governor + no deep C-states<br/>(removes freq scaling / wakeup latency)"] --> GOAL
ISOL["isolcpus / nohz_full + IRQ affinity<br/>(removes scheduler / interrupt jitter)"] --> GOAL
DIRTY["vm.dirty_ratio + O_DIRECT / io_uring<br/>(removes checkpoint write stalls)"] --> GOAL
ALLOC["jemalloc / tcmalloc tuning<br/>(removes allocator lock contention)"] --> GOAL
GOAL["No host-side bubbles"] --> STEADY["GPU utilization near 100%<br/>off sync barriers"]
What it is
A set of OS- and kernel-level settings applied to a GPU node so the host never stalls the accelerator. The CPU prepares the next batch (load, tokenize, transform), dispatches kernels, and coordinates threads while the GPU works on the current batch. If host-side work is slow or the scheduler places it poorly, the GPU sits idle (a bubble) while it waits on the CPU, memory, or I/O. The knobs here remove the common sources of that idle time: memory swapping, page-table overhead, frequency scaling and deep idle states, interrupt jitter, page-cache write stalls during checkpointing, and allocator lock contention.
These are not GPU-driver settings (persistence mode, MPS, MIG live in GPU Software Stack and Node Administration and NVIDIA Container Toolkit and CDI). They are pure host Linux configuration applied via sysctl, kernel command line, cpupower, /proc/irq, and environment variables.
Why it matters
System-level tuning is routinely skipped in favour of model-level work, but it can yield double-digit percentage gains on a well-instrumented node, and at cluster scale that translates into large compute savings [1]. Concrete per-knob expectations from the book and primary sources:
- A single swapped-out page causes a slowdown of several orders of magnitude when the GPU next touches that data [1].
- Hugepages give modest but real gains (roughly 3–5% throughput) by cutting page faults and TLB pressure [1].
- THP background compaction injects unpredictable pauses that damage latency-sensitive inference [1][2].
- Deep C-state wakeups cost microseconds each; aggregated across a data-loader hot path they accumulate into GPU bubbles [1].
The goal is steady state: each GPU's utilization hovers near 100% and dips only at intentional synchronization barriers [1].
When it is needed (and when not)
| Knob | Apply for training (throughput) | Apply for inference (latency) | Skip / caution |
|---|---|---|---|
| vm.swappiness=0 / swapoff -a | Yes | Yes | Need enough RAM; otherwise the OOM killer reaps the job |
| Transparent hugepages | Enable (always/madvise) | Disable (never) or madvise | The never setting removes compaction stalls but forces manual hugepage management |
| performance governor + no deep C-states | Yes | Yes | Trades extra CPU power for responsiveness; fine when GPUs dominate the power budget |
| isolcpus / nohz_full + IRQ pinning | When data-loader jitter is observed | When tail latency matters | Real-time FIFO/RR priority usually unnecessary once threads are pinned; RT can starve other processes |
| Raised vm.dirty_ratio | Large checkpoints | Not applicable | High dirty ratios delay durability; pair with O_DIRECT/io_uring for latency-sensitive writes |
| jemalloc/tcmalloc tuning | Data-prep-heavy pipelines | Yes | Workload-dependent; measure before committing |
On Grace-based superchips (GH200, GB200) the CPU and GPU are coherent over NVLink-C2C, but Linux still models CPU DRAM and GPU HBM as separate pools, so these host knobs still apply [1]. Per-node sysctls should be verified, not assumed: defaults differ by distribution (e.g. vm.swappiness defaults to 60 in the upstream kernel [3]).
How: implement, integrate, maintain
These are reference templates. Validate every value on the target hardware and workload before rolling out. None of the numbers below have been hardware-tested here.
Virtual memory and swapping
vm.swappiness expresses the rough relative I/O cost of swapping versus filesystem paging; lower values make the kernel avoid swap. The kernel range is 0–200 with a default of 60 [3]. Set it low and disable swap devices outright for the job's lifetime:
# Runtime: avoid swapping except under extreme memory pressure
sudo sysctl -w vm.swappiness=0
# Persist the setting across reboots
echo 'vm.swappiness = 0' | sudo tee /etc/sysctl.d/90-gpu-node.conf
# Disable all swap devices/files until next reboot
sudo swapoff -a
# Verify swap stays at zero
free -m
vmstat 1 5
Enforce no-swap and memory limits per workload with cgroups v2 (via Docker/Kubernetes) rather than relying on global settings [1].
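As a pre-flight check, the swap state can also be read programmatically instead of by eye. The sketch below is one possible approach, not a tested procedure: `swap_total_kb` and `swap_is_off` are hypothetical helper names, and the optional file argument exists only so the parsing can be tried against a saved copy of `/proc/meminfo`.

```shell
#!/bin/sh
# Print SwapTotal in kB from a meminfo-format file (default: /proc/meminfo).
swap_total_kb() {
    awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed only when a SwapTotal line is present and its value is zero.
# The body runs in a subshell so the helper variable does not leak.
swap_is_off() (
    total=$(swap_total_kb "$@")
    [ -n "$total" ] && [ "$total" -eq 0 ]
)

# Example use in a job launcher (hypothetical):
# swap_is_off || { echo "swap is still enabled" >&2; exit 1; }
```

A check like this only covers the host view; a container with its own cgroup swap limit still needs the per-workload enforcement described above.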
Transparent hugepages (training vs inference)
The THP enabled setting accepts always, madvise, or never; the active value is shown in brackets in the sysfs file [2]. Enable THP for throughput-bound training; set never (or madvise) for latency-bound inference to avoid khugepaged compaction stalls [1][2].
# Inspect current policy
cat /sys/kernel/mm/transparent_hugepage/enabled # e.g. [always] madvise never
# Runtime: disable for latency-sensitive inference
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Make it boot-persistent on the kernel command line: append transparent_hugepage=never to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the GRUB configuration [2].
For very large preallocated pinned buffers, prefer explicit hugepages via vm.nr_hugepages/hugetlbfs for deterministic behaviour [1]. When using large pinned regions, raise the locked-memory limit (ulimit -l) to a high value or unlimited; otherwise pinning fails and falls back to swappable memory [1].
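Because the sysfs file marks the active THP policy with brackets, a script can extract it and compare it with the profile a node is supposed to run. This is a minimal sketch under that assumption; `thp_active` is a hypothetical helper name, and the file argument is there only to allow testing against a fixture.

```shell
#!/bin/sh
# Print the bracketed (active) THP policy from the sysfs "enabled" file.
# With no argument the live sysfs path is read.
thp_active() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example: warn when an inference node is not running with THP disabled.
# [ "$(thp_active)" = "never" ] || echo "THP policy is $(thp_active)" >&2
```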
CPU governor and C-states
Pin the CPU at maximum frequency and prevent deep idle states so wakeups don't add latency. cpupower frequency-set -g performance sets the governor; when the Intel P-state driver runs in active mode, only performance and powersave are available [5][4].
# Lock all cores to max frequency
sudo cpupower frequency-set -g performance
# Verify
cpupower frequency-info | grep -i "current policy\|governor"
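The governor is a per-CPU setting, so a check across every core can catch a node where only some CPUs were switched. The following is a rough sketch that reads the cpufreq `scaling_governor` files directly; `non_performance_cpus` is a hypothetical helper name, and the root-directory argument is an assumption added so the loop can be exercised without real hardware.

```shell
#!/bin/sh
# List CPUs whose cpufreq governor is not "performance", one per line
# as "<cpu> <governor>". Prints nothing when every CPU is compliant.
non_performance_cpus() (
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # CPU without cpufreq, or no match
        gov=$(cat "$f")
        cpu=${f#"$root"/}                # strip the root prefix
        cpu=${cpu%%/*}                   # keep only the "cpuN" component
        [ "$gov" = "performance" ] || echo "$cpu $gov"
    done
)

# Example: non_performance_cpus   # empty output means all cores comply
```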
Limit C-states at boot (C0 is active; deeper states sleep harder and wake slower). Many server BIOS/UEFI "high-performance" profiles set the governor to performance and disable deep C-states automatically [1]. On the kernel command line:
# /etc/default/grub — cap idle depth; intel_idle.max_cstate=1 on Intel
GRUB_CMDLINE_LINUX="... processor.max_cstate=1 intel_idle.max_cstate=1 idle=poll"
idle=poll keeps idle cores spinning instead of sleeping, which maximizes responsiveness at the cost of power and heat. This is acceptable when GPUs dominate the node's power budget, but measure thermals.
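After a reboot, the idle states the kernel actually exposes can be listed from the cpuidle sysfs tree to confirm that the cap took effect. This is an illustrative sketch only: `list_cstates` is a hypothetical helper name, it inspects a single CPU by default, and the directory argument is an assumption added for testing.

```shell
#!/bin/sh
# Print "<state> <name> <exit latency in microseconds>" for each idle
# state exposed under a cpuidle directory (default: cpu0).
list_cstates() (
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for s in "$dir"/state[0-9]*; do
        [ -r "$s/name" ] && [ -r "$s/latency" ] || continue
        printf '%s %s %s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/latency")"
    done
)

# Example: list_cstates
# Fewer and shallower states than before the change suggest the cap applied.
```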
CPU isolation and interrupt affinity
Reserve cores for data-pipeline threads and keep device interrupts on the local NUMA node. isolcpus removes CPUs from the general scheduler; nohz_full suppresses the timer tick on a core when only one thread is runnable there, and should be paired with rcu_nocbs [6]. These are boot parameters:
# /etc/default/grub — isolate cores 8-15 for the data pipeline
GRUB_CMDLINE_LINUX="... isolcpus=domain,managed_irq,8-15 nohz_full=8-15 rcu_nocbs=8-15"
When isolcpus is combined with managed_irq or nohz, include the domain flag to isolate the cores from SMP balancing and scheduling [6].
Disable irqbalance (or run it with bespoke rules) and pin each GPU/NIC interrupt to a core on the device's own NUMA node, so that the IRQ is never serviced on a remote node, where it would evict that node's cache lines [1]:
# Stop the rebalancer
sudo systemctl disable --now irqbalance
# Pin IRQ 142 to CPU0 (smp_affinity is a hex CPU mask)
echo 1 | sudo tee /proc/irq/142/smp_affinity
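Since smp_affinity takes a hex bitmask and not a CPU list, computing the mask by hand is error-prone. The helper below is a small sketch of that conversion; `cpus_to_mask` is a hypothetical name, it does not validate malformed input, and it is deliberately limited to CPUs 0-31 (a single 32-bit word). For higher CPU numbers, writing the list form to `/proc/irq/<N>/smp_affinity_list` is likely the simpler route.

```shell
#!/bin/sh
# Convert a CPU list such as "0", "2,4" or "8-15" into the hex mask
# written to /proc/irq/<N>/smp_affinity. Only CPUs 0-31 are supported.
cpus_to_mask() (
    mask=0
    IFS=','
    for part in $1; do
        lo=${part%-*}                    # "8-15" -> 8, "4" -> 4
        hi=${part#*-}                    # "8-15" -> 15, "4" -> 4
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            if [ "$cpu" -lt 0 ] || [ "$cpu" -gt 31 ]; then
                echo "cpus_to_mask: CPU $cpu is outside 0-31" >&2
                exit 1
            fi
            mask=$((mask | (1 << cpu)))  # set the bit for this CPU
            cpu=$((cpu + 1))
        done
    done
    printf '%x\n' "$mask"
)

# cpus_to_mask 0      prints 1
# cpus_to_mask 8-15   prints ff00
# Example: cpus_to_mask 8-15 | sudo tee /proc/irq/142/smp_affinity
```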
Prefer cgroup cpuset isolation in production (docker --cpuset-cpus, Kubernetes CPU management) to bind each workload to dedicated cores and memory [1]. Once threads are pinned, real-time FIFO/RR priorities are usually unnecessary and risk starving other processes [1].
Filesystem write-back and checkpoint I/O
Large checkpoint bursts fill the page cache and stall the training loop. vm.dirty_ratio is the percentage of available memory (free plus reclaimable pages) that may hold dirty pages before writing processes are forced to flush; vm.dirty_background_ratio is the percentage at which the background flusher threads start writing [3]. A higher dirty ratio lets the OS batch multi-GB checkpoints before flushing [1]:
# Batch more dirty data in RAM to smooth large checkpoint writes
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=10
For latency-sensitive workflows, bypass the page cache: open checkpoint files with O_DIRECT, or use io_uring for asynchronous I/O. After each checkpoint write, drop the pages with posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) to prevent cache pressure on subsequent iterations [1]. Writing checkpoints from a separate thread, or using PyTorch distributed checkpointing, further hides the cost [1].
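To put the dirty-ratio percentages in concrete terms, it helps to estimate how much dirty data they allow on a given node and compare that with the checkpoint size. The arithmetic below is a back-of-envelope sketch: `dirty_limit_mib` is a hypothetical helper, the 512 GiB figure is purely illustrative, and using MemAvailable from `/proc/meminfo` as the "available memory" input is an approximation of what the kernel actually computes.

```shell
#!/bin/sh
# Estimate a dirty-page threshold in MiB.
#   $1 = available memory in kB, $2 = ratio in percent
dirty_limit_mib() {
    echo $(( $1 * $2 / 100 / 1024 ))
}

# Illustrative node with 512 GiB available (536870912 kB):
# dirty_limit_mib 536870912 40   prints 209715  (writers throttled above ~205 GiB)
# dirty_limit_mib 536870912 10   prints 52428   (background flush starts at ~51 GiB)

# Example with live data (approximation, see the note above):
# avail_kb=$(awk '$1 == "MemAvailable:" { print $2 }' /proc/meminfo)
# dirty_limit_mib "$avail_kb" "$(sysctl -n vm.dirty_ratio)"
```

If the estimated threshold is smaller than one checkpoint, the writer will hit throttling mid-write, which is a hint to raise the ratio or move to direct or asynchronous I/O.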
Host CPU memory allocator
An allocator that fragments memory or contends on locks injects unpredictable pauses into data preparation. Tune jemalloc to shard allocations across multiple arenas (narenas), purge off the hot path (background_thread), and delay returning freed pages to the OS (dirty_decay_ms/muzzy_decay_ms) [1][7]. dirty_decay_ms sets the approximate time from a page becoming unused-dirty to being purged; background_thread shifts purging to dedicated threads and improves tail latency [7].
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
export MALLOC_CONF="narenas:8,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true"
tcmalloc benefits from larger per-thread caches so small allocations avoid global locks and syscalls [1]:
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((512*1024*1024))
export TCMALLOC_RELEASE_RATE=16
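The library paths in the snippets above vary by distribution, and when an LD_PRELOAD entry cannot be loaded the dynamic loader prints a warning and the process carries on with the default allocator. A guard such as the following sketch makes that failure explicit; `preload_allocator` is a hypothetical helper name, not part of jemalloc or tcmalloc.

```shell
#!/bin/sh
# Prepend an allocator library to LD_PRELOAD only if the file is readable.
# Returns non-zero (and leaves LD_PRELOAD untouched) when it is missing.
preload_allocator() {
    if [ -r "$1" ]; then
        LD_PRELOAD="$1${LD_PRELOAD:+:$LD_PRELOAD}"
        export LD_PRELOAD
    else
        echo "preload_allocator: library not found: $1" >&2
        return 1
    fi
}

# Example (path as used above; adjust per distribution):
# preload_allocator /usr/lib/x86_64-linux-gnu/libjemalloc.so.2 || exit 1
```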
Maintain
- Confirm sysctl values survive reboot (/etc/sysctl.d/) and that kernel-cmdline knobs (THP, isolcpus, nohz_full, C-states) appear in /proc/cmdline after boot.
- Watch for regressions: free -m and vmstat (swap at zero), cpupower frequency-info (governor still performance), and GPU utilization (near 100% off barriers).
- Re-validate per workload: THP, allocator, and dirty-ratio settings are workload-dependent; the training and inference profiles diverge.
- In containers, mirror host policy through the security context, cgroups v2, and --ulimit memlock so pinning and no-swap actually take effect inside the pod. See GPU Containerization Performance and Topology-Aware GPU Scheduling in Kubernetes.
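The first check in the list, confirming that boot parameters survived, is easy to script. The sketch below shows one way to do it; `missing_cmdline_params` is a hypothetical helper name, and it matches a parameter either as a bare flag or as a `name=value` pair without checking the value itself.

```shell
#!/bin/sh
# Print each expected kernel parameter that is absent from a cmdline file.
#   $1 = file to inspect (normally /proc/cmdline), remaining args = names
missing_cmdline_params() (
    set -f                               # no glob expansion while splitting
    file=$1
    shift
    params=$(cat "$file")
    for want in "$@"; do
        found=0
        for have in $params; do
            case $have in
                "$want" | "$want"=*) found=1; break ;;
            esac
        done
        [ "$found" -eq 1 ] || echo "$want"
    done
)

# Example:
# missing_cmdline_params /proc/cmdline isolcpus nohz_full rcu_nocbs transparent_hugepage
```

Empty output means every listed parameter was found; any printed name points at a GRUB change that was never regenerated or applied.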
References
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 3, "OS, Docker, and Kubernetes Tuning for GPU-Based Environments": host OS/kernel tuning for GPU nodes (swappiness, THP training-vs-inference, governor/C-states, isolcpus/nohz_full, IRQ affinity, dirty_ratio, O_DIRECT/io_uring, jemalloc/tcmalloc).
- Linux kernel, Transparent Hugepage Support (always/madvise/never, transparent_hugepage= boot param, sysfs control). https://docs.kernel.org/admin-guide/mm/transhuge.html
- Linux kernel, /proc/sys/vm/ documentation (swappiness range 0–200 default 60, dirty_ratio, dirty_background_ratio). https://docs.kernel.org/admin-guide/sysctl/vm.html
- Linux kernel, CPU Performance Scaling / CPUfreq (governors, P-state active mode). https://docs.kernel.org/admin-guide/pm/cpufreq.html
- Red Hat, CPUfreq governors and cpupower frequency-set usage. https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/power_management_guide/cpufreq_governors
- Red Hat, isolcpus overview and flag interaction with managed_irq/nohz/domain. https://access.redhat.com/solutions/480473
- jemalloc, TUNING.md (narenas, dirty_decay_ms/muzzy_decay_ms, background_thread). https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
Related: NUMA Affinity and CPU Pinning for GPU Pipelines · GPU Containerization Performance · Topology-Aware GPU Scheduling in Kubernetes · GPU Power, Clocks, and Thermal Tuning · Performance Optimization and Tuning · GPU Software Stack and Node Administration · Glossary
[1] Fregly, AI Systems Performance Engineering, Ch. 3.
[2] Linux kernel, Transparent Hugepage Support. https://docs.kernel.org/admin-guide/mm/transhuge.html
[3] Linux kernel, /proc/sys/vm/ documentation. https://docs.kernel.org/admin-guide/sysctl/vm.html
[4] Linux kernel, CPU Performance Scaling. https://docs.kernel.org/admin-guide/pm/cpufreq.html
[5] Red Hat, CPUfreq governors. https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/power_management_guide/cpufreq_governors
[6] Red Hat, isolcpus overview. https://access.redhat.com/solutions/480473
[7] jemalloc TUNING.md. https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md