
Security, isolation & multi-tenancy

Scope: securing GPU infrastructure and isolating tenants. Covers hardware isolation (MIG, vGPU), Blackwell confidential computing, the out-of-band/firmware attack surface, fabric segmentation, and secrets. This is the dimension that turns a cluster from a lab into something that can hold someone else's data or weights.

flowchart LR
  TENANT["Tenant model"] --> GPUISO["GPU isolation"]
  TENANT --> FABRIC["Fabric segmentation"]
  TENANT --> DATA["Data and secrets controls"]
  OOB["OOB and firmware plane"] --> HARDEN["Hardening and attestation"]
  GPUISO --> PLATFORM["Multi-tenant platform"]
  FABRIC --> PLATFORM
  DATA --> PLATFORM
  HARDEN --> PLATFORM

Overview

A GPU cluster has attack surfaces a normal server fleet does not: accelerators shared across tenants (memory and side-channel leakage), a root-equivalent out-of-band plane (see provisioning and scheduling), a firmware stack from VBIOS to GSP to BMC, a high-speed fabric that enables lateral movement, and the highest-value asset of all, model weights and data in use. The skill is matching the isolation mechanism to the tenancy model and locking down the planes most people forget: OOB and firmware.

Core knowledge

Tenant isolation tiers (weakest to strongest)

Time-slicing: no memory or fault isolation between workloads; never use it for multi-tenant sharing (see Kubernetes for GPUs).
MPS: concurrent execution from multiple processes, but no fault or security isolation (see the GPU software stack).
MIG: hardware isolation with dedicated SM/L2/memory partitions and fault containment. The correct answer for hard multi-tenant sharing of a single GPU (see the GPU software stack).
vGPU (NVIDIA vGPU/virtualization): hypervisor-mediated partitioning for VM tenants.
Full pass-through / dedicated node: strongest and simplest, one tenant, whole GPU(s).
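The ordering above can be turned into a small policy check. The following is a minimal sketch, not a definitive policy: the `Isolation` enum, the `minimum_tier` rule, and its three boolean inputs are all hypothetical names introduced for illustration, and the rule for trusted workloads (software sharing is acceptable inside one trust domain) is an assumption rather than something this page states.

```python
from enum import IntEnum


class Isolation(IntEnum):
    """Isolation tiers from the list above, ordered weakest to strongest."""
    TIME_SLICING = 0
    MPS = 1
    MIG = 2
    VGPU = 3
    DEDICATED = 4


def minimum_tier(untrusted_tenants: bool, vm_tenants: bool, share_gpu: bool) -> Isolation:
    """Return the weakest acceptable tier for a tenancy model (hypothetical policy)."""
    if not share_gpu:
        # One tenant owns the whole GPU or node.
        return Isolation.DEDICATED
    if not untrusted_tenants:
        # Assumption: cooperating jobs in one trust domain may use software sharing.
        return Isolation.TIME_SLICING
    # Untrusted tenants on a shared GPU need hardware or hypervisor partitioning.
    return Isolation.VGPU if vm_tenants else Isolation.MIG


def is_acceptable(chosen: Isolation, required: Isolation) -> bool:
    """A chosen mechanism is acceptable if it is at least as strong as required."""
    return chosen >= required
```

A scheduler admission hook could call `is_acceptable(Isolation.MPS, minimum_tier(True, False, True))` and reject the placement when it returns `False`.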

Confidential computing on Blackwell

The deployable end-to-end flow (CVM pairing, the SPDM/attestation handshake, key release gated on a passing verdict, and how a renter verifies a remote GPU) lives in GPU Confidential Computing & Device Attestation. The summary here:

Blackwell is the first GPU with TEE-I/O: it extends the Trusted Execution Environment to the GPU and across the NVLink/NVSwitch fabric, eliminating the PCIe-bounce bottleneck that constrained Hopper's single-GPU confidential mode.
Native encryption of the PCIe/NVLink interfaces and HBM, with keys that system software cannot access, plus attestation, at what NVIDIA describes as nearly identical throughput to unencrypted mode.
Protects model IP and data in use, enabling confidential training, inference, and federated learning. The decisive feature when running sensitive weights on shared or rented infrastructure (cloud and cost).
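The idea of key release gated on a passing verdict can be sketched without any vendor SDK. This is an illustrative outline only: `Verdict`, `KeyBroker`, and their fields are hypothetical, and a real deployment would obtain the verdict from an attestation verifier after validating signed evidence, which is deliberately out of scope here.

```python
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical verifier result; field names are illustrative only."""
    nonce: bytes             # echo of the challenge issued by the key broker
    cc_mode_enabled: bool    # GPU reported confidential mode as on
    measurements_ok: bool    # firmware/driver measurements matched reference values


class KeyBroker:
    """Releases a model-decryption key only after a fresh, passing verdict."""

    def __init__(self, key: bytes):
        self._key = key
        self._pending = set()  # outstanding single-use nonces

    def challenge(self) -> bytes:
        nonce = secrets.token_bytes(32)
        self._pending.add(nonce)
        return nonce

    def release(self, verdict: Verdict) -> bytes:
        # Find the matching outstanding nonce using a constant-time comparison.
        match = next(
            (n for n in self._pending if hmac.compare_digest(n, verdict.nonce)), None
        )
        if match is None:
            raise PermissionError("unknown or replayed nonce")
        self._pending.discard(match)  # each nonce is usable once
        if not (verdict.cc_mode_enabled and verdict.measurements_ok):
            raise PermissionError("attestation verdict failed")
        return self._key
```

The design choice worth noting is that the nonce is consumed before the verdict is evaluated, so a failed attempt cannot be retried with the same challenge.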

Isolation support by GPU tier

The hardware isolation primitives above exist only on specific tiers; consumer GPUs have none of them. Match the tenancy model to the silicon (GPU generations, RTX & workstation):

MIG is available only on a subset of datacenter and pro GPUs: A100/A30, H100/H200, the Blackwell B-series (B200/B300, GB200/GB300), and the RTX PRO 6000 Blackwell (up to four isolated instances). It is not available on GeForce, nor on A40/A10, L40S, or RTX 6000 Ada.
vGPU is restricted to datacenter and workstation-pro GPUs (NVIDIA RTX vWS and the datacenter vGPU editions). It is not offered on GeForce.
Confidential Computing requires Hopper or newer; Blackwell adds TEE-I/O (the first GPU to extend the TEE across NVLink). Ampere (A100) has no confidential mode, and consumer GPUs have none.
ECC memory ships on datacenter and pro cards (A100, H100/H200, B-series, RTX PRO 6000 with ECC GDDR7); GeForce (RTX 5090/4090) has no ECC.

Consequence: hard multi-tenant isolation is not available on consumer GPUs. A GeForce node can run time-slicing or MPS only (no MIG partition, no vGPU, no ECC, no confidential mode), so it is unfit for untrusted multi-tenant sharing and belongs on dedicated single-tenant duty.
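One way to make this matrix enforceable is to encode it as data and let placement logic query it. The sketch below is an assumption-laden illustration: `Caps`, `CAPS`, and the two helper functions are hypothetical names, the table records only what this page states (with `None` meaning "not stated here, verify against vendor documentation"), and unknown models are treated conservatively as having no hardware isolation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Caps:
    mig: Optional[bool]
    vgpu: Optional[bool]
    confidential: Optional[bool]
    ecc: Optional[bool]


# True/False only where this page states it; None means "not stated, verify".
CAPS = {
    "A100": Caps(mig=True, vgpu=None, confidential=False, ecc=True),
    "H100": Caps(mig=True, vgpu=None, confidential=True, ecc=True),
    "B200": Caps(mig=True, vgpu=None, confidential=True, ecc=True),
    "RTX PRO 6000 Blackwell": Caps(mig=True, vgpu=True, confidential=None, ecc=True),
    "L40S": Caps(mig=False, vgpu=None, confidential=None, ecc=None),
    "GeForce RTX 5090": Caps(mig=False, vgpu=False, confidential=False, ecc=False),
}


def hard_isolation_modes(model: str) -> list:
    """List the hardware isolation modes known to be available on a GPU model."""
    caps = CAPS.get(model)
    if caps is None:
        return []  # unknown silicon: assume nothing until verified
    # Only a confirmed True counts; None (unverified) is treated as unavailable.
    return [name for name, ok in (("MIG", caps.mig), ("vGPU", caps.vgpu)) if ok]


def fit_for_untrusted_sharing(model: str) -> bool:
    """A GPU is fit for untrusted sharing only if some hard isolation mode exists."""
    return bool(hard_isolation_modes(model))
```

For example, `fit_for_untrusted_sharing("GeForce RTX 5090")` is `False`, which matches the consequence stated above: such a node belongs on dedicated single-tenant duty.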

The out-of-band and firmware planes

The BMC (see provisioning and scheduling) is a root-equivalent backdoor if exposed. Put the OOB network on a separate VLAN or physical segment (see networking fabric), kill default credentials, prefer Redfish over HTTPS (TLS) to legacy IPMI, patch BMC firmware, and never expose the BMC to the internet.
Secure/measured boot, signed VBIOS/GSP, and firmware attestation (SPDM) guard the supply chain. Unverified firmware is a persistent, below-the-OS compromise.
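The BMC items above lend themselves to an automated baseline check. The following sketch audits a dictionary shaped like a Redfish `ManagerNetworkProtocol` resource, where each protocol object carries a `ProtocolEnabled` flag; that shape reflects the DMTF schema as understood here and should be confirmed against the BMC in use. The function name `audit_bmc` and the two boolean inputs (results of separate, hypothetical probes) are assumptions introduced for illustration.

```python
def audit_bmc(network_protocol: dict, default_creds_work: bool, internet_reachable: bool) -> list:
    """Return a list of hardening findings; an empty list means the baseline passed."""

    def enabled(proto: str) -> bool:
        # A missing protocol object is treated as disabled.
        return bool(network_protocol.get(proto, {}).get("ProtocolEnabled", False))

    findings = []
    if enabled("IPMI"):
        findings.append("legacy IPMI is enabled; prefer Redfish over HTTPS")
    if enabled("HTTP"):
        findings.append("plaintext HTTP is enabled")
    if not enabled("HTTPS"):
        findings.append("HTTPS is not enabled for Redfish")
    if enabled("Telnet"):
        findings.append("Telnet is enabled")
    if default_creds_work:
        findings.append("default credentials are still accepted")
    if internet_reachable:
        findings.append("BMC is reachable from the internet")
    return findings
```

In practice the dictionary would come from a GET on the manager's network-protocol resource over the segmented OOB network, and a non-empty result would fail the node's admission into the pool.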

Fabric and data segmentation

IB partitioning (PKeys) isolates tenants on InfiniBand (see networking fabric); VLAN/VXLAN/VRF does the same on Ethernet. Keep the compute, storage, and management fabrics separate.
Kubernetes: network policies, pod security, restrict privileged GPU containers, control image provenance (sign at build, verify at admission), and treat the device-plugin/DRA path as a surface (Kubernetes for GPUs).
Data at rest: encrypted storage (see storage and data), with keys and credentials held in a dedicated secrets manager or KMS.
Export controls on B300/GB300 (the Blackwell platform) are a compliance dimension for sovereign-AI and cross-border procurement scenarios (cloud and cost).
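To make the PKey item above concrete, here is a minimal sketch of per-tenant PKey allocation and the membership rule that decides whether two ports can talk. It relies on the InfiniBand convention that the low 15 bits of a PKey identify the partition and the top bit marks full membership, with two limited members unable to communicate. The function names and the allocation scheme are hypothetical; a real fabric would configure partitions through the subnet manager.

```python
FULL_MEMBER_BIT = 0x8000   # top bit set means full membership
PARTITION_MASK = 0x7FFF    # low 15 bits identify the partition
DEFAULT_PARTITION = 0x7FFF # base value of the default partition


def allocate_pkeys(tenants: list, start: int = 0x0001) -> dict:
    """Give each tenant its own partition, with full membership for its ports."""
    pkeys = {}
    for offset, tenant in enumerate(tenants):
        base = start + offset
        # 0x0000 is invalid and 0x7FFF is the default partition; avoid both.
        if not 0x0001 <= base < DEFAULT_PARTITION:
            raise ValueError("PKey space exhausted or invalid start value")
        pkeys[tenant] = base | FULL_MEMBER_BIT
    return pkeys


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Two ports can talk if they share a partition and at least one is a full member."""
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    one_full_member = bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)
    return same_partition and one_full_member
```

With `allocate_pkeys(["tenant-a", "tenant-b"])`, the two tenants land in different partitions, so `can_communicate` returns `False` for any pair of their ports, which is the flat-fabric failure this section is meant to prevent.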

Don't-miss checklist

Multi-tenant on shared GPUs = MIG or vGPU, never time-slicing.
OOB/BMC segmented, no defaults, not internet-reachable; Redfish+TLS over IPMI.
Blackwell CC for sensitive weights/data on shared or rented infra, with attestation in the flow.
PKey/VLAN tenant isolation on the fabric; management plane separated.
Signed firmware and secure boot; firmware supply chain controlled.

Failure modes

Time-slicing used for multi-tenant: cross-tenant memory exposure and noisy neighbours.
Exposed BMC with default credentials: full node compromise, out-of-band, invisible to the OS.
Flat fabric: one tenant reaches another tenant's nodes.
Unverified firmware: persistent low-level compromise surviving reinstalls.
Weights left plaintext in HBM on rented infra because CC was not enabled.

Open questions & validation

MIG and vGPU isolation guarantees: what each does and does not protect against.
Blackwell confidential-computing attestation flow end-to-end.
A BMC/Redfish hardening baseline and firmware-attestation (SPDM) check.
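As a starting point for the firmware-attestation item, the comparison step can be sketched as matching reported measurement digests against golden reference values. This is only the final comparison, under stated assumptions: the component names, the hex-digest format, and the function `verify_measurements` are hypothetical, and validating the signature and certificate chain on the SPDM measurement response is assumed to have happened beforehand.

```python
import hmac


def verify_measurements(reported: dict, golden: dict) -> list:
    """Return names of components whose digest is missing, unexpected, or mismatched.

    Both dicts map a component name to a hex digest string.
    """
    failures = []
    for component, expected_hex in golden.items():
        actual_hex = reported.get(component)
        if actual_hex is None:
            failures.append(component)  # expected component did not report
            continue
        # Normalise case, then compare in constant time.
        if not hmac.compare_digest(actual_hex.lower(), expected_hex.lower()):
            failures.append(component)
    # Anything reported that has no golden value is also treated as a failure.
    failures.extend(sorted(set(reported) - set(golden)))
    return failures
```

An empty result would feed a passing verdict; any non-empty result should quarantine the node, since unverified firmware survives OS reinstalls.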

References

NVIDIA confidential computing: https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/
MIG user guide (isolation): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
NVIDIA vGPU documentation: https://docs.nvidia.com/vgpu/index.html
NVIDIA Blackwell architecture (first TEE-I/O capable GPU): https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
A100 (MIG up to 7, ECC, no confidential computing): https://www.nvidia.com/en-us/data-center/a100/
H100 (MIG, built-in confidential computing): https://www.nvidia.com/en-us/data-center/h100/
RTX PRO 6000 Blackwell (MIG up to 4, vGPU, ECC GDDR7): https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/
Redfish / DMTF (OOB management security): https://www.dmtf.org/standards/redfish
InfiniBand switching / partitioning (UFM, PKeys): https://www.nvidia.com/en-us/networking/infiniband-switching/