Bare-metal provisioning & PXE
Scope: network boot for fleet-scale OS install, covering the PXE/iPXE chain, the DHCP options that drive it (next-server, filename, client-arch, HTTPClient), TFTP vs HTTP delivery of boot artifacts, and unattended OS install via cloud-init/autoinstall, kickstart, and preseed. This is the metal-level workflow that turns a racked node into a freshly imaged OS; the tooling that orchestrates it at scale is covered in provisioning tools, the golden images it lays down in image management, and the network it rides on in OOB network infra.
All commands, DHCP snippets, and config below are reference templates, not hardware-tested. PXE behaviour is highly firmware- and NIC-dependent; pin exact filenames, options, and installer versions against the cited vendor/project docs and validate on one node before a fleet roll.
flowchart LR
POWER["Node powers on, NIC PXE ROM"] --> DHCP["DHCP: IP, next-server, filename"]
DHCP --> NBP["TFTP or HTTP: fetch NBP"]
NBP --> IPXE["iPXE chainloads, re-requests DHCP"]
IPXE --> KERNEL["HTTP: kernel, initrd, install media"]
KERNEL --> UNATTEND["Unattended install: autoinstall, kickstart, preseed"]
UNATTEND --> FIRSTBOOT["Reboot to disk, config management"]
What it is
PXE (Preboot eXecution Environment) is the firmware-level mechanism by which a diskless node fetches and runs a boot program over the network. The sequence is fixed: the NIC option ROM (or UEFI network stack) broadcasts DHCP; the DHCP reply carries the boot-server address in next-server (the BOOTP siaddr field; DHCP option 66 is its option-form counterpart) and the network boot program (NBP) name in filename (the BOOTP file field; DHCP option 67 is its option-form counterpart);8 the client downloads that NBP (classically over TFTP) and executes it.1 On a GPU fleet this is how every node gets its OS without a technician touching a console.
The PXE ROM itself is minimal (TFTP only, no scripting, no HTTP). The standard pattern is therefore a two-stage chainload: DHCP hands the firmware a small iPXE binary as the NBP, iPXE takes over the NIC, issues its own DHCP request, and from there pulls the real payload over HTTP with scripting, retries, and conditional logic.1 iPXE is open-source boot firmware that replaces the legacy PXE stack; it "provides a full PXE implementation enhanced with additional features such as" booting "from a web server via HTTP", booting "from an iSCSI SAN", and controlling "the boot process with a script".2
Two firmware worlds drive two NBP choices, distinguished by the client's architecture in DHCP option 93 (client-arch):1,8
- Legacy BIOS → undionly.kpxe, an iPXE build that drives the NIC through the UNDI interface exposed by the PXE ROM.1
- UEFI → ipxe.efi (carries iPXE's own NIC drivers) or snponly.efi (uses the firmware's Simple Network Protocol driver, the UEFI analogue of undionly).4,5
Modern UEFI also supports HTTP Boot natively: the firmware can fetch the NBP directly over HTTP (no TFTP), signalled by vendor class HTTPClient in DHCP option 60, with filename set to a full http:// URL.6
The OS install that runs after the kernel boots is driven by a per-distro unattended-install answer file: cloud-init/autoinstall (Ubuntu), kickstart (RHEL family), or preseed (Debian). These are covered under "How it's set up & managed".
Why it's needed (and when)
You cannot hand-install a fleet. Network boot is the only sane way to bring up tens to thousands of identical GPU nodes, and it is the prerequisite for everything downstream: a node has no driver stack, no Fabric Manager, no scheduler agent until it has an OS. The reasons it earns its place:
- Uniformity at scale. One DHCP/boot configuration produces identically configured installs across the fleet; configuration drift is a common root of intermittent, non-reproducible faults (reliability & RAS). The image laid down is curated as a golden artifact in image management.
- Repeatable re-provision. A failed or RMA'd node (GPU fault RMA) is wiped and re-imaged from the same source, not hand-patched back to "probably the same".
- Diskless / immutable options. iPXE can boot a node into a RAM-resident image entirely over HTTP, which suits stateless compute nodes that should never carry local OS drift.
When you touch this path: initial cluster build, capacity adds, node replacement, and any re-image. Note the boundary: network boot lays down a base OS; the GPU driver/CUDA/Fabric-Manager stack and host tuning are a separate post-install step (Ansible bring-up, driver install & lifecycle). Do not try to bake a full GPU stack into the installer; keep the image thin and converge the rest with config management.
PXE depends entirely on the provisioning VLAN: DHCP, TFTP, and the HTTP boot server must be reachable on the node's boot NIC. If that network or DHCP is wrong, the node never starts. See OOB network infra and runbook: OOB unreachable.
How it's set up & managed
Three moving parts: a DHCP server that steers clients to the right NBP, a TFTP and/or HTTP server holding boot artifacts, and a per-distro unattended-install file. The examples below use ISC dhcpd and dnsmasq (both common; pick one DHCP authority per segment).
DHCP: steer by firmware type
The DHCP reply must hand BIOS and UEFI clients different NBPs, keyed on option 93. ISC dhcpd, the canonical iPXE example:1
# /etc/dhcp/dhcpd.conf (reference template, not hardware-tested)
option client-arch code 93 = unsigned integer 16;
next-server 10.0.0.10;            # TFTP/boot server address (BOOTP siaddr)
if option client-arch != 00:00 {
    filename "ipxe.efi";          # UEFI clients
} else {
    filename "undionly.kpxe";     # legacy BIOS clients
}
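next-server is the boot server and filename is the NBP (in dhcpd these set the BOOTP siaddr and file fields; options 66/67 are the option-form equivalents).8 Verify the option-93 values for your hardware mix. Only the BIOS value (0x0000) is universal; x86-64 UEFI firmware most commonly reports 0x0007, though 0x0009 is also seen, and every non-zero code takes the same ipxe.efi here. That is only correct for a single-architecture fleet: arm64 UEFI nodes need an iPXE binary built for arm64.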
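The branch in the dhcpd template can be mirrored in a few lines of Python, which is handy when unit-testing a config generator before it touches a real DHCP server. This is a minimal sketch, not part of any tool named here: the function name `nbp_for_arch` is hypothetical, and it deliberately treats every non-zero architecture code the same way the template does.

```python
BIOS_ARCH = 0x0000  # the one client-arch value that is universal: legacy BIOS PXE


def nbp_for_arch(client_arch: int) -> str:
    """Return the NBP filename for a DHCP option 93 value.

    Mirrors the dhcpd if/else: BIOS gets the UNDI build, anything else is
    assumed to be UEFI of the same CPU architecture as the ipxe.efi on disk.
    """
    if client_arch == BIOS_ARCH:
        return "undionly.kpxe"
    return "ipxe.efi"


# Example: print the decision for the codes discussed in the text.
for code in (0x0000, 0x0007, 0x0009):
    print(f"{code:#06x} -> {nbp_for_arch(code)}")
```

A mixed x86-64/arm64 fleet would need a real lookup table per architecture code instead of the single else branch.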
The same job in dnsmasq (which can also be the TFTP server), using tags set from option 93. dnsmasq matches a tag with --dhcp-match=set:<tag>,option:client-arch,<value> and consumes it in --dhcp-boot=[tag:<tag>,]<filename>,...:7
# /etc/dnsmasq.d/pxe.conf (reference template, not hardware-tested)
enable-tftp
tftp-root=/var/lib/tftpboot
dhcp-range=10.0.0.100,10.0.0.200,12h
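# Tag clients by firmware architecture (DHCP option 93 / client-arch)
dhcp-match=set:bios,option:client-arch,0
# x86-64 UEFI usually reports 7; some firmware reports 9. Both set the same tag.
dhcp-match=set:efi-x64,option:client-arch,7
dhcp-match=set:efi-x64,option:client-arch,9
dhcp-boot=tag:bios,undionly.kpxe
dhcp-boot=tag:efi-x64,ipxe.efi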
enable-tftp turns on the built-in TFTP server and tftp-root sets the directory it serves; tags from dhcp-match are referenced by dhcp-boot via tag:<tag>.7
Break the chainload loop
When iPXE itself boots and re-requests DHCP, it would be handed undionly.kpxe/ipxe.efi again, producing an infinite loop. iPXE advertises itself in DHCP option 77 (user-class) as the string iPXE;3,8 serve the real script only to clients carrying that user-class. In dnsmasq, --dhcp-userclass=set:<tag>,<user-class> matches it (substring):7
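# Detect iPXE's second DHCP request (user-class option 77 == "iPXE")
dhcp-userclass=set:ipxe,iPXE
# First pass (no ipxe tag): hand over the iPXE NBP, as above.
# iPXE's request still carries option 93, so the arch-tagged dhcp-boot lines
# also match it; qualifying them with tag:!ipxe keeps the two passes unambiguous.
# Second pass (ipxe tag set): hand over the boot script instead of the NBP.
dhcp-boot=tag:ipxe,http://10.0.0.10/boot.ipxe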
This is the standard two-stage break: bare firmware gets the iPXE binary; iPXE gets the script. (The iPXE chainloading appnote frames the DHCP-side break generically as configuring the DHCP server "to hand out iPXE only for the first DHCP request", and also documents two alternatives: an autoexec.ipxe script on the TFTP server alongside the binary, or a script embedded in the iPXE build.1)
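The two-pass decision can be sketched as a pure function, which makes the loop condition easy to see and to test. This is an illustration under stated assumptions: `boot_file` is a hypothetical name, the script URL is the one from the template above, and the substring test imitates the matching behaviour described for dhcp-userclass rather than calling dnsmasq.

```python
from typing import Optional

IPXE_USER_CLASS = "iPXE"  # default user-class iPXE sends in option 77
BOOT_SCRIPT_URL = "http://10.0.0.10/boot.ipxe"  # same URL as the dnsmasq template


def boot_file(client_arch: int, user_class: Optional[str]) -> str:
    """Pick the boot file for one DHCP request.

    Second pass: the request carries the iPXE user-class, so return the script.
    First pass: bare firmware, so return the iPXE NBP for its firmware type.
    """
    # Substring match, mirroring how dhcp-userclass is described above.
    if user_class is not None and IPXE_USER_CLASS in user_class:
        return BOOT_SCRIPT_URL
    return "undionly.kpxe" if client_arch == 0 else "ipxe.efi"
```

Checking the user-class before the architecture is the design point: iPXE's request still carries option 93, so testing the architecture first would hand it the NBP again.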
iPXE boot script (HTTP payload)
Once iPXE runs the script, fetch the kernel and initrd over HTTP and pass the installer its answer file. iPXE script syntax, chainable per-distro:
#!ipxe
# boot.ipxe (reference template, not hardware-tested)
# Ubuntu 24.04 live-server netboot; pass autoinstall datasource on the kernel cmdline
kernel http://10.0.0.10/ubuntu/24.04/vmlinuz ip=dhcp \
url=http://10.0.0.10/ubuntu/24.04/ubuntu-24.04-live-server-amd64.iso \
autoinstall ds=nocloud-net;s=http://10.0.0.10/autoinstall/${mac}/
initrd http://10.0.0.10/ubuntu/24.04/initrd
boot
${mac} is an iPXE built-in that expands to the client MAC (colon-separated hex by default; ${mac:hexhyp} gives the hyphen-separated form), letting one script serve per-host answer directories. Confirm the kernel/initrd paths and ISO url= against the image you publish in image management.
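Because the script builds the seed path from the MAC, the directory names on the HTTP server have to match whatever string iPXE produces. The sketch below normalises MACs from an inventory export into lowercase colon-separated hex, which is assumed here to be how ${mac} renders; confirm with `echo ${mac}` at an iPXE shell on your hardware. The names `ipxe_mac` and `seed_url` are hypothetical helpers, not iPXE or cloud-init APIs.

```python
import re


def ipxe_mac(mac: str) -> str:
    """Normalise a MAC address to lowercase colon-separated hex.

    Accepts common inventory spellings such as 52-54-00-AB-CD-EF or 5254.00ab.cdef.
    """
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def seed_url(server: str, mac: str) -> str:
    """Per-host answer directory URL, keeping the trailing slash NoCloud requires."""
    return f"http://{server}/autoinstall/{ipxe_mac(mac)}/"
```

For example, `seed_url("10.0.0.10", "52-54-00-AB-CD-EF")` yields the directory the boot.ipxe template would request for that node.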
UEFI HTTP Boot (skip TFTP)
For UEFI firmware that supports HTTP Boot, you can drop TFTP entirely. The client sends vendor class HTTPClient (option 60); the server echoes HTTPClient and sets filename to a full HTTP URL.6 The iPXE project's ISC dhcpd example:6
# reference template, not hardware-tested (iPXE UEFI HTTP appnote)
if option client-architecture = encode-int ( 16, 16 ) {
    option vendor-class-identifier "HTTPClient";
    filename "http://my.web.server/ipxe.efi";
} else {
    filename "http://my.web.server/script.ipxe";
}
Architecture 16 is the UEFI HTTP Boot client type; echoing the HTTPClient vendor-class is what makes the firmware accept the HTTP filename.6
Boot artifacts on the server
- TFTP root (/var/lib/tftpboot): the iPXE NBPs only: undionly.kpxe, ipxe.efi/snponly.efi. Keep this minimal; TFTP is slow and lossy, so everything past the NBP should move to HTTP.1 Prebuilt binaries come from https://boot.ipxe.org/; pick snponly.efi on modern UEFI with onboard NICs, ipxe.efi for older UEFI needing iPXE's own drivers.4,5
- HTTP root: kernels, initrds, install ISOs/squashfs, iPXE scripts, and the unattended-install answer files. Served by any plain web server.
Unattended OS install
The installer that the kernel launches reads a distro-specific answer file. All three are grounded below; pin the exact schema/keys to the installer version you ship.
Ubuntu autoinstall (cloud-init NoCloud). Autoinstall config is YAML with a mandatory version: 1 under a top-level autoinstall: key; delivered as cloud-init user-data with a #cloud-config header.9 The kernel cmdline autoinstall ds=nocloud-net;s=http://SERVER/PATH/ points the installer at a NoCloud seed; the nocloud-net spelling is the form the Subiquity quickstart uses.10 Cloud-init requires that the seedfrom value "consists of a URI which must contain a trailing /"; it then fetches user-data, meta-data (and optionally vendor-data, network-config) from that directory.11 So the server must expose at minimum user-data and a (often empty) meta-data file in the seed directory.10
#cloud-config
# http://10.0.0.10/autoinstall/<mac>/user-data (reference template, not hardware-tested)
autoinstall:
  version: 1
  identity:
    hostname: gpu-node
    username: ops
    # password: generate with `openssl passwd -6` and pin the hash here
    password: "REPLACE_WITH_CRYPTED_HASH"
  ssh:
    install-server: true
  storage:
    layout:
      name: lvm
  late-commands:
    - curtin in-target -- systemctl enable ssh
Serve an empty companion meta-data file next to user-data (the Subiquity quickstart creates it with touch meta-data) so the NoCloud fetch succeeds.10
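Publishing a seed directory is easy to script. The sketch below writes user-data and creates meta-data if it is missing, and builds the kernel argument with the trailing slash enforced. The function names `write_seed` and `seed_arg` are hypothetical, and the directory layout under the web root is an assumption that must match the path used in the boot script.

```python
from pathlib import Path


def write_seed(seed_dir: Path, user_data: str) -> None:
    """Create a NoCloud seed directory holding user-data and a meta-data file."""
    if not user_data.startswith("#cloud-config"):
        raise ValueError("user-data must start with the #cloud-config header")
    seed_dir.mkdir(parents=True, exist_ok=True)
    (seed_dir / "user-data").write_text(user_data, encoding="utf-8")
    # Same effect as `touch meta-data`: create it if absent, leave it alone otherwise.
    (seed_dir / "meta-data").touch()


def seed_arg(seed_base_url: str) -> str:
    """Build the installer kernel argument, adding the trailing slash if missing."""
    if not seed_base_url.endswith("/"):
        seed_base_url += "/"
    return f"autoinstall ds=nocloud-net;s={seed_base_url}"
```

Rendering the argument through one helper keeps the trailing-slash rule in a single place instead of relying on every boot script author to remember it.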
RHEL-family kickstart. Pass inst.ks=<location> on the kernel cmdline to point anaconda at the kickstart file; HTTP/HTTPS/NFS locations are supported, e.g. inst.ks=http://server/path/ks.cfg.12 In an iPXE script this rides the kernel line exactly like the Ubuntu example:
# kickstart fragment (reference template, not hardware-tested)
kernel http://10.0.0.10/rhel9/vmlinuz inst.ks=http://10.0.0.10/ks/${mac}.cfg \
inst.repo=http://10.0.0.10/rhel9/BaseOS ip=dhcp
initrd http://10.0.0.10/rhel9/initrd.img
The kickstart file itself (%packages, %pre/%post, partitioning) is a separate artifact; build it against the kickstart commands reference for your RHEL major version.
Debian preseed. The debian-installer reads a preseed file given by preseed/url=http://host/preseed.cfg (shortened to url= as a boot parameter); the auto=true priority=critical pair suppresses early prompts so preseeding can take over.13
# preseed fragment (reference template, not hardware-tested)
kernel http://10.0.0.10/debian/linux auto=true priority=critical \
url=http://10.0.0.10/preseed/${mac}.cfg ip=dhcp
initrd http://10.0.0.10/debian/initrd.gz
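The three installers differ only in the kernel arguments that point at the answer file, so a boot-script generator can keep them in one place. A minimal sketch, assuming the URL layout used in the fragments above (/autoinstall/, /ks/, /preseed/); `installer_args` and the distro keys are hypothetical names chosen for illustration.

```python
def installer_args(distro: str, server: str, host_id: str) -> str:
    """Return the answer-file kernel arguments for one host.

    host_id is whatever names the per-host file or directory, e.g. the MAC.
    """
    base = f"http://{server}"
    if distro == "ubuntu":
        # NoCloud seed directory; the trailing slash is required.
        return f"autoinstall ds=nocloud-net;s={base}/autoinstall/{host_id}/"
    if distro == "rhel":
        return f"inst.ks={base}/ks/{host_id}.cfg"
    if distro == "debian":
        return f"auto=true priority=critical url={base}/preseed/{host_id}.cfg"
    raise ValueError(f"unsupported distro: {distro!r}")
```

The returned string is meant to be appended to the kernel line of the matching iPXE fragment; other arguments such as ip=dhcp or inst.repo= stay in the fragment itself.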
Fleet scale
Hand-editing DHCP per host does not scale. At fleet scale this whole pipeline (DHCP reservations, per-MAC boot scripts, image selection, and unattended-install rendering) is driven by a provisioning controller (MAAS, Warewulf, xCAT, OpenStack Ironic, or NVIDIA Base Command / Mission Control for DGX estates). Those tools own the DHCP/TFTP/HTTP machinery described here and add inventory, state, and an API; they are covered in provisioning tools. This page is the protocol layer underneath them.
Validated usage & tests
The checks below describe the expected shape of output; they invent no numbers. Treat this as the bring-up acceptance order for the boot path.
Confirm the DHCP server answers and offers the right NBP. For a dry run, run dnsmasq in the foreground with DHCP logging enabled (its --no-daemon and --log-dhcp options).
Expect, on a node power-cycle, log lines showing DHCPDISCOVER/DHCPOFFER, a tags: line reflecting the matched architecture tag (e.g. bios or efi-x64), and a sent ... file ... line naming the NBP you configured. A missing tag or wrong file means the option-93 match is off.
Confirm the NBP is actually fetchable over TFTP from the server itself, using a manual tftp get of the exact filename that DHCP offers.
A File not found here means the artifact is absent from tftp-root or the path in filename is wrong, the single most common PXE failure.
Confirm HTTP artifacts resolve (kernel, initrd, scripts, answer files):
curl -fsI http://10.0.0.10/boot.ipxe # expect: HTTP/1.1 200
curl -fsI http://10.0.0.10/autoinstall/<mac>/user-data # expect: 200
curl -fsI http://10.0.0.10/autoinstall/<mac>/meta-data # expect: 200 (may be empty body)
Any 404 on user-data/meta-data will make cloud-init NoCloud fail and drop the Ubuntu installer to interactive mode.
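The same check can run as a pre-flight gate across many hosts. The sketch below issues HEAD requests with the standard library and reports anything that is not a 200; the names `head_status` and `failing_artifacts` are hypothetical, and the probe is injectable so the logic can be tested without a live boot server.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def head_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the server is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a missing answer file
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, or timeout


def failing_artifacts(
    urls: Iterable[str], probe: Callable[[str], int] = head_status
) -> List[Tuple[str, int]]:
    """Return (url, status) for every artifact that did not answer 200."""
    failures = []
    for url in urls:
        status = probe(url)
        if status != 200:
            failures.append((url, status))
    return failures
```

An empty result from `failing_artifacts([...])` over boot.ipxe, user-data, and meta-data for a host is the condition to gate a power cycle on.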
On the node, watch the firmware/iPXE banner over the serial console (Redfish/IPMI SOL). Expect: a DHCP lease line, the iPXE version banner after chainload, then HTTP GET lines for the kernel and initrd. iPXE prints a clear error string (e.g. Connection timed out, No such file or directory) at the exact stage that fails. Read it; it names the cause.
For a scripted end-to-end gate, drive a real power cycle out-of-band and assert the node reaches the installer:
# reference template, not hardware-tested -- requires OOB credentials
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis bootdev pxe
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> power cycle
# then observe DHCP/TFTP/HTTP logs and the SOL console as above
bootdev pxe forces the next boot to network; power cycle restarts the node (IPMI protocol). Successful provisioning ends with the node rebooting to disk and a config-management agent checking in. After the OS is up, GPU readiness is a separate gate: driver/Fabric-Manager validation and dcgmi diag (install & lifecycle, diagnostics tools), and health gating before the scheduler admits work (GPU health gating).
Failure modes
Brief; each links its runbook.
- Node never PXE-boots, no DHCP/next-server/filename on the boot NIC. Wrong VLAN, no DHCP relay, or boot order not set to network. The node sits at "no boot device" or loops the firmware. → runbook: OOB unreachable, OOB network infra.
- TFTP: File not found after DHCP. filename does not match an artifact in tftp-root, or the wrong NBP was offered for the firmware type (BIOS vs UEFI / option 93). Verify with a manual tftp get.
- iPXE chainload loop. The iPXE user-class (option 77 iPXE) break is missing, so iPXE is re-handed the iPXE NBP forever. Add the dhcp-userclass=set:ipxe,iPXE gate.7
- Installer drops to interactive prompts. Ubuntu: user-data/meta-data 404 or a malformed ds=nocloud-net;s=... (missing trailing /).11 RHEL: bad inst.ks= URL. Debian: missing auto=true priority=critical. Check the HTTP logs for the answer-file fetch.
- Fleet image drift / non-reproducible installs. Hand-tweaks instead of re-imaging, or an answer file that pulls "latest" packages. Pin the image and package set; re-image rather than patch. → runbook: image drift, image management.
References
- iPXE, Chainloading (undionly.kpxe / ipxe.efi, next-server/filename, option-93 ISC dhcpd example; loop break by handing out iPXE only for the first DHCP request, or an autoexec.ipxe/embedded script): https://ipxe.org/howto/chainloading
- iPXE, Homepage (iPXE "provides a full PXE implementation enhanced with additional features such as" HTTP / iSCSI / scripting): https://ipxe.org/
- iPXE, user-class setting (iPXE sends user class iPXE; DHCP option 77, used to identify iPXE clients): https://ipxe.org/cfg/user-class
- IANA, BOOTP/DHCP parameters registry (option 66 TFTP Server Name, 67 Bootfile Name, 77 User Class Information, 93 Client System Architecture; RFC 2132/3004/4578): https://www.iana.org/assignments/bootp-dhcp-parameters/bootp-dhcp-parameters.xhtml
- iPXE, UEFI HTTP boot appnote (HTTPClient vendor class option 60, client-architecture = encode-int(16,16), HTTP filename URL): https://ipxe.org/appnote/uefihttp
- iPXE, Build targets (undionly.kpxe, ipxe.efi, snponly.efi and what each driver stack uses): https://ipxe.org/appnote/buildtargets
- iPXE, Command reference (script syntax, kernel/initrd/boot, ${mac}): https://ipxe.org/cmd
- dnsmasq, man page (--dhcp-boot, --dhcp-match, --dhcp-userclass, --pxe-service, --enable-tftp, --tftp-root, --dhcp-range syntax and tag flow): https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
- Ubuntu / Subiquity, Autoinstall reference (version: 1, top-level autoinstall: key, schema): https://canonical-subiquity.readthedocs-hosted.com/en/latest/reference/autoinstall-reference.html
- Ubuntu / Subiquity, Autoinstall quick start (autoinstall ds=nocloud-net;s=http://..., user-data + empty meta-data): https://canonical-subiquity.readthedocs-hosted.com/en/latest/howto/autoinstall-quickstart.html
- cloud-init, NoCloud datasource (ds=nocloud; seedfrom URI "must contain a trailing /"; fetches user-data/meta-data/vendor-data/network-config; meta-data required to contain an instance-id): https://docs.cloud-init.io/en/latest/reference/datasources/nocloud.html
- Red Hat, Kickstart: starting installations (inst.ks= boot option, HTTP/NFS locations): https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/automatically_installing_rhel/starting-kickstart-installations_rhel-installer
- Red Hat, Kickstart commands and options reference (RHEL 9): https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/automatically_installing_rhel/kickstart-commands-and-options-reference_rhel-installer
- Debian, Installer preseeding (preseed/url= / url=, auto, priority=critical): https://www.debian.org/releases/stable/amd64/apb.en.html
- Debian Wiki, DebianInstaller/Preseed: https://wiki.debian.org/DebianInstaller/Preseed
Related: Provisioning Tools · Image Management · OOB Network Infra · Install & Lifecycle · Glossary
1. iPXE, "Chainloading": DHCP next-server/filename deliver the NBP; BIOS uses undionly.kpxe, UEFI uses ipxe.efi; ISC dhcpd example keys on option client-arch code 93 (!= 00:00 → ipxe.efi, else undionly.kpxe); the boot loop is broken by configuring DHCP "to hand out iPXE only for the first DHCP request", or alternatively by an autoexec.ipxe script on the TFTP server or a script embedded in the iPXE build. https://ipxe.org/howto/chainloading
2. iPXE homepage: iPXE "provides a full PXE implementation enhanced with additional features such as" booting "from a web server via HTTP", "from an iSCSI SAN", and controlling "the boot process with a script". https://ipxe.org/
3. iPXE, "user-class" setting: "If no user class has been explicitly specified, iPXE will send the user class iPXE"; DHCP option number 77, used to identify iPXE clients (check option 77 for the value iPXE). https://ipxe.org/cfg/user-class
4. iPXE, "Build targets": undionly.kpxe (BIOS, UNDI), ipxe.efi (UEFI, iPXE's own NIC drivers), snponly.efi (UEFI, firmware SNP driver). Prebuilt binaries at https://boot.ipxe.org/. https://ipxe.org/appnote/buildtargets
5. snponly.efi is the UEFI analogue of undionly.kpxe: it uses the firmware's built-in NIC driver (SNP/NII) rather than iPXE's drivers; recommended on modern UEFI with onboard NICs, with ipxe.efi reserved for older UEFI needing iPXE's own drivers. iPXE build-targets appnote. https://ipxe.org/appnote/buildtargets
6. iPXE, "UEFI HTTP boot" appnote: UEFI HTTP Boot clients send vendor class HTTPClient (option 60); the DHCP server must echo option vendor-class-identifier "HTTPClient" and set filename to a full http:// URL. ISC dhcpd example: if option client-architecture = encode-int ( 16, 16 ) { option vendor-class-identifier "HTTPClient"; filename "http://my.web.server/ipxe.efi"; }. https://ipxe.org/appnote/uefihttp
7. dnsmasq man page: --dhcp-boot=[tag:<tag>,]<filename>,[<servername>...]; --dhcp-match=set:<tag>,option:<name>[,<value>] sets a tag from a client-sent option; --dhcp-userclass=set:<tag>,<user-class> sets a tag by substring match on the user-class (option 77, e.g. iPXE); --enable-tftp and --tftp-root=<dir> run the built-in TFTP server; tags from match/userclass are consumed by dhcp-boot via tag:<tag>. https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
8. DHCP option-number assignments per the IANA BOOTP/DHCP parameters registry and the defining RFCs: option 66 = TFTP Server Name (RFC 2132), option 67 = Bootfile Name (RFC 2132), option 77 = User Class Information (RFC 3004), option 93 = Client System Architecture Type (RFC 4578). https://www.iana.org/assignments/bootp-dhcp-parameters/bootp-dhcp-parameters.xhtml
9. Ubuntu / Subiquity autoinstall reference: "At the top level is a single key, autoinstall"; version is mandatory and currently 1; config is delivered as cloud-init user-data under a #cloud-config header (or as autoinstall.yaml on the media without the top-level key). https://canonical-subiquity.readthedocs-hosted.com/en/latest/reference/autoinstall-reference.html
10. Ubuntu / Subiquity autoinstall quick start: boot the installer with kernel args autoinstall ds=nocloud-net;s=http://<server>:<port>/; the server exposes a user-data file (with the #cloud-config + autoinstall: payload) and an (often empty) meta-data file in that directory. https://canonical-subiquity.readthedocs-hosted.com/en/latest/howto/autoinstall-quickstart.html
11. cloud-init NoCloud datasource: kernel cmdline ds=nocloud[;s=<seedfrom>]; "A valid seedfrom value consists of a URI which must contain a trailing /"; cloud-init then fetches user-data, meta-data, vendor-data, and network-config from that URL. The page states meta-data "is required to contain an instance-id" (it is not documented here as optionally empty; the touch meta-data empty-file pattern comes from the Subiquity quickstart 10). https://docs.cloud-init.io/en/latest/reference/datasources/nocloud.html
12. Red Hat Enterprise Linux 9, "Starting Kickstart installations": pass inst.ks=<location> as a boot option; network locations use http://, https://, or nfs: (e.g. inst.ks=http://myserver.example.com/rhel9-install/my-ks.cfg); for PXE the option is appended to the boot entry's append/kernel line. https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/automatically_installing_rhel/starting-kickstart-installations_rhel-installer
13. Debian Installation Guide (Appendix B) and Debian Wiki Preseed: the debian-installer takes preseed/url=http://host/path/preseed.cfg (abbreviated url= as a boot parameter), with auto=true (sets auto-install/enable, delaying locale/keyboard) and priority=critical to suppress lower-priority prompts. https://www.debian.org/releases/stable/amd64/apb.en.html · https://wiki.debian.org/DebianInstaller/Preseed