Dynamic and fractional GPU sharing

Scope: sharing a GPU by real, changing demand instead of a fixed partition. Covers fractional allocation (a memory ceiling plus a compute share) with schedulers like HAMi and KAI, rightsizing requests to measured utilisation, scaling idle inference toward zero, and why a memory reservation, not compute, is usually what blocks packing. This is the dynamic layer above the static primitives: time-slicing, MPS, MIG, and DRA.

What it is

Static GPU sharing splits a card once and leaves it: a fixed set of MIG profiles, a fixed time-slicing replica count, an MPS thread percentage. Dynamic and fractional sharing allocates by current demand and adjusts as demand moves. Two ideas combine:

Fractional allocation. A pod requests part of a GPU (a memory ceiling and a compute share) and a scheduler enforces the limit, then packs many fractional pods onto one card. The device plugin's whole-integer nvidia.com/gpu cannot express a fraction, so fractional schedulers add their own resources.
Rightsizing and elasticity. Requests track measured utilisation rather than a guessed peak, idle replicas scale toward zero, and fractions or partitions are re-derived as traffic, batch size, and model mix change.

The open-source building blocks:

HAMi (Heterogeneous AI Computing Virtualization Middleware, a CNCF Sandbox project) virtualises a GPU along two axes, device memory and core percentage, with no MIG required. A pod asks for nvidia.com/gpumem (memory in MB) or nvidia.com/gpumem-percentage, and nvidia.com/gpucores (compute, where each unit is 1% of the card). HAMi enforces the memory ceiling in-container with libvgpu.so, which intercepts the calls between libcudart and libcuda so that an over-limit allocation fails with a CUDA out-of-memory error; the compute cap is enforced by time-slicing, so measured utilisation oscillates around the target rather than holding it exactly. (HAMi, device-core usage)
KAI Scheduler (open-sourced Apache-2.0 from the NVIDIA Run:ai platform, also a CNCF Sandbox project) adds fractional GPU (a GPU-memory fraction requested by pod annotation), gang scheduling with an automatic podgrouper, hierarchical fair-share queues, bin-packing with workload consolidation, and DRA support. (KAI Scheduler, GPU sharing)
DRA (Dynamic Resource Allocation, GA in Kubernetes 1.34) is the Kubernetes-native substrate. A pod references a ResourceClaim against a DeviceClass and constrains it with attribute filters ("a GPU with at least 40 GB", a MIG profile), so partial, shared, and topology-aware requests are first-class instead of an integer count (device plugin vs DRA). DRA models and requests devices; it does not itself carve the fraction, so it composes with the mechanisms above rather than replacing them. (Kubernetes v1.34 DRA GA)

Managed platforms (NVIDIA Run:ai, and third parties such as ScaleOps) wrap the same ideas with autotuning and dashboards; the underlying mechanisms are the open-source ones above. An earlier operator, Nebuly nos, did dynamic MIG repartitioning but is effectively dormant (last commit April 2024), so prefer DRA or KAI for new work. (nos)

The rightsizing loop

flowchart LR
  M["Measure: DCGM per-pod memory + compute"] --> R["Rightsize: request = observed usage"]
  R --> P["Pack: fractional scheduler bin-packs pods"]
  P --> S{"Idle?"}
  S -->|"no"| M
  S -->|"yes"| Z["Scale replicas toward zero"]
  Z -->|"request arrives"| W["Wake (cold start)"]
  W --> M

Why it's needed (and when)

Why. Static sharing strands GPUs in three recurring patterns.

Idle slices. A workload sized for its peak sits in an oversized MIG profile or a whole GPU and uses a fraction of it most of the time.
Memory blocks packing. The binding constraint on a shared GPU is usually memory, not SMs. A server that reserves most of HBM up front, for example a vLLM instance whose --gpu-memory-utilization claims its fraction at startup (default 0.9, 0.92 on recent versions) and holds it for the life of the process, leaves no room for a co-tenant even while its own compute sits idle between requests (KV cache management). For two servers to share a card, their fractions must sum to at most about 1.0.
Static inflexibility. A fixed split cannot follow batch size, context length, model version, or a daily traffic curve; every change is a manual reconfigure and, for MIG, an operation that needs the GPU free.

The economic version: ten services that each use a tenth of a GPU still pin ten whole GPUs under integer requests, when real demand is closer to one or two. Fractional packing plus rightsizing is how that gap closes (GPU consumption models).
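The gap can be made concrete with a small packing sketch. It is illustrative only: `gpus_needed_integer` and `gpus_needed_fractional` are hypothetical helpers, demand is expressed as a whole-number percentage of one card, and first-fit decreasing stands in for whatever bin-packing policy the scheduler actually applies.

```python
import math


def gpus_needed_integer(demands_pct):
    """Whole-GPU requests: each service pins at least one full card."""
    return sum(max(1, math.ceil(d / 100)) for d in demands_pct)


def gpus_needed_fractional(demands_pct, capacity_pct=100):
    """First-fit decreasing: place each demand on the first card with room."""
    free = []  # remaining capacity per card, in percent
    for demand in sorted(demands_pct, reverse=True):
        if demand > capacity_pct:
            raise ValueError("demand exceeds one card; use whole GPUs")
        for i, room in enumerate(free):
            if demand <= room:
                free[i] = room - demand
                break
        else:  # no existing card had room, so open a new one
            free.append(capacity_pct - demand)
    return len(free)


services = [10] * 10                      # ten services at 10% of a card each
print(gpus_needed_integer(services))      # 10
print(gpus_needed_fractional(services))   # 1
print(gpus_needed_fractional([12] * 10))  # 2, once each request carries headroom
```

The third call shows why the realistic answer is "one or two": as soon as each request includes some headroom, the ten services no longer fit on a single card.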

When to use it.

Many small or bursty inference services, notebooks, or dev workloads with variable, sub-GPU demand.
A fleet you want to bin-pack onto fewer GPUs, reclaiming idle capacity that has been measured rather than guessed.
Inference with quiet periods, where scaling replicas to zero saves real money.

When not to.

A single large training job: give it whole GPUs (or full MIG slices). Fractional sharing only adds overhead and contention (distributed training).
Hard isolation for untrusted tenants: HAMi and KAI enforce limits in software and share a driver or context. For blast-radius and side-channel isolation use hardware MIG; for control-plane isolation add namespaces, ResourceQuota, or stronger tenant boundaries such as virtual clusters or separate clusters (security and multi-tenancy).
Latency-critical paths that cannot absorb a cold start (see failure modes).

How it works and how to operate it

Reference templates, not hardware-tested. Pin versions and validate on one node before a fleet roll.

Rightsize from measured utilisation

Measure actual per-pod GPU memory and compute, set the fractional request to observed usage, and let the scheduler pack. The measurement substrate is DCGM: the DCGM exporter runs as a DaemonSet, exposes utilisation, memory, and profiling metrics (SM occupancy, Tensor-Core activity) to Prometheus, and attributes each GPU to a pod through the kubelet pod-resources API (monitoring GPUs with DCGM, observability and monitoring). One caveat: the stock exporter was not built to attribute utilisation under fractional sharing, so a sharing-aware exporter or per-workload accounting is needed once a GPU is shared. There is no official Kubernetes spec for GPU rightsizing; it is DCGM for measurement, plus a fractional scheduler, plus a bin-packing policy.
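As a rough illustration of setting the request from observed usage, the sketch below turns a window of per-pod memory samples into a memory request. The helper names, the nearest-rank percentile, the 15% headroom, and the 256 MB rounding step are all assumptions chosen for the example, not part of DCGM or of any scheduler; the samples stand in for whatever per-pod memory series your exporter provides.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rightsize_mb(memory_samples_mb, pct=99, headroom=0.15, round_to_mb=256):
    """Request = high percentile of observed memory, plus headroom, rounded up."""
    peak = percentile(memory_samples_mb, pct)
    target = peak * (1 + headroom)
    return math.ceil(target / round_to_mb) * round_to_mb


# Hypothetical per-pod GPU memory samples in MB over a representative window.
samples = [5200, 5400, 5350, 6100, 5900, 5600]
print(rightsize_mb(samples))  # 7168
```

Sizing from a high percentile rather than the mean keeps bursts inside the request; the window still has to cover a representative traffic period for the result to mean anything.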

Fractional allocation with HAMi

Once HAMi is installed, a pod caps memory and compute directly:

# Reference template, not hardware-tested. Requires HAMi; values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: fractional-infer
spec:
  containers:
    - name: server
      image: your-registry/vllm:pinned
      resources:
        limits:
          nvidia.com/gpu: 1          # one virtual GPU
          nvidia.com/gpumem: 8000    # hard memory ceiling in MB (~8 GiB)
          nvidia.com/gpucores: 30    # compute cap: 30% of the card

An allocation past the nvidia.com/gpumem ceiling fails with a CUDA out-of-memory error inside the container rather than starving a neighbour. Use nvidia.com/gpumem-percentage instead of an absolute MB value when the pod may land on cards of different memory sizes. (HAMi device-core usage)
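The trade-off between an absolute value and a percentage is simple arithmetic, sketched below. The card sizes are illustrative and the helper names are hypothetical; HAMi does its own accounting, and this only shows what each style of request translates to on different cards.

```python
def share_of_card(request_mb, card_mb):
    """Percentage of a card that an absolute memory request occupies."""
    return 100 * request_mb / card_mb


def ceiling_mb(percent, card_mb):
    """Memory ceiling in MB implied by a percentage request on a given card."""
    return card_mb * percent // 100


# Illustrative card sizes in MB, not a statement about any specific product.
cards_mb = {"24 GB card": 24_000, "40 GB card": 40_000, "80 GB card": 80_000}

for name, card_mb in cards_mb.items():
    fixed = share_of_card(8_000, card_mb)  # a fixed 8000 MB request
    scaled = ceiling_mb(20, card_mb)       # a 20% request
    print(f"{name}: 8000 MB is {fixed:.1f}% of the card; 20% is {scaled} MB")
```

A percentage keeps the packing ratio constant across cards but changes the absolute ceiling, so a server with a fixed working set still needs the ceiling on the smallest card to cover it.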

Fractional allocation with KAI

KAI expresses a GPU-memory fraction (or an explicit MB reservation) via a pod annotation and packs fractional pods with its bin-packing and consolidation policies; its automatic podgrouper means you do not hand-write PodGroup resources the way Volcano requires. Fractions and gang scheduling coexist, which matters when the same cluster runs both packed inference and multi-pod training. (KAI GPU sharing)

Scale idle inference toward zero

Knative Serving scales to zero with autoscaling.knative.dev/min-scale: "0"; at zero, the Activator buffers the incoming request, starts a pod, and proxies once it is ready (Knative scale-to-zero). KEDA scales to and from zero on external signals with a ScaledObject:

# Reference template. Scale a vLLM Deployment to zero, wake on queued work.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm
  minReplicaCount: 0                  # scale to zero when idle
  maxReplicaCount: 8
  triggers:
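    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumption: replace with your Prometheus endpoint
        # At zero replicas no vLLM pod exports this metric, so the wake signal
        # has to come from a component that stays up (a gateway or queue).
        query: sum(vllm:num_requests_waiting)   # wake when requests queue
        threshold: "1"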

The cost is cold start: image pull, CUDA init, and weight load. Mitigate with a warm floor of one replica for latency-critical paths, node-local model caches and image pre-pull, and a fast loader such as the open-source NVIDIA Run:ai Model Streamer, which streams weights concurrently into GPU memory and is integrated in vLLM and SGLang. (KEDA, Run:ai Model Streamer, vLLM integration)
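The warm-floor decision reduces to comparing a measured cold start against the latency the first request can tolerate. The sketch below is an illustration under stated assumptions: the function names are hypothetical and every number is a placeholder for your own measurement.

```python
def cold_start_seconds(image_pull_s, cuda_init_s, weight_load_s):
    """Total time from zero replicas to a pod that can answer."""
    return image_pull_s + cuda_init_s + weight_load_s


def warm_floor(cold_start_s, first_request_budget_s):
    """Minimum replica count: 0 if a cold start fits the budget, else 1."""
    return 0 if cold_start_s <= first_request_budget_s else 1


# Placeholder timings in seconds; substitute values you have measured.
measured = cold_start_seconds(image_pull_s=40, cuda_init_s=5, weight_load_s=60)
print(measured)                                          # 105
print(warm_floor(measured, first_request_budget_s=120))  # 0
print(warm_floor(measured, first_request_budget_s=2))    # 1
```

Image pre-pull and a faster weight loader shrink the first and third terms, which is what can move a service from a floor of one replica back to zero.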

Dynamic MIG and DRA

Where hardware isolation matters but demand shifts, repartition MIG as the pending workload changes rather than fixing one geometry (see MIG for the static base, and combine MIG with time-slicing for isolation plus oversubscription). DRA is the maintained way to express this: partitionable devices and attribute-filtered claims via the NVIDIA DRA driver for GPUs (k8s-dra-driver-gpu), installed alongside the GPU Operator, which still handles drivers and MIG. Verify the driver's GPU/MIG allocation path against the release you deploy, as parts of it may still be labelled experimental. (NVIDIA DRA driver)
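Conceptually, an attribute-filtered claim is a predicate over the devices a driver publishes. The toy model below is plain Python, not the DRA API: the `Device` fields, the device names, and the `first_match` helper are hypothetical, and real claims carry their selectors in the ResourceClaim.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: int
    mig_profile: Optional[str] = None  # None means a whole, unpartitioned GPU


def first_match(devices, selector: Callable[[Device], bool]) -> Optional[Device]:
    """Return the first published device that satisfies the claim's filter."""
    return next((d for d in devices if selector(d)), None)


# Hypothetical inventory: one partitioned card and one whole card.
published = [
    Device("gpu-0-mig-0", memory_gb=10, mig_profile="1g.10gb"),
    Device("gpu-0-mig-1", memory_gb=20, mig_profile="2g.20gb"),
    Device("gpu-1", memory_gb=80),
]

at_least_40gb = first_match(published, lambda d: d.memory_gb >= 40)
small_slice = first_match(published, lambda d: d.mig_profile == "1g.10gb")
print(at_least_40gb.name)  # gpu-1
print(small_slice.name)    # gpu-0-mig-0
```

The model only selects among devices that already exist, which mirrors the division of labour in the text: the claim describes what is wanted, and a separate mechanism has to create the partition or enforce the fraction.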

The DIY temptation, and why to resist most of it

Node labels and taints route workloads to GPU nodes but do not share a GPU; a ResourceQuota caps a namespace's GPU requests but does not subdivide a card. Both are legitimate complements to a real sharing mechanism, not substitutes for one. Beyond HAMi, user-space multiplexers such as GVirtuS and LibVF.IO are experimental or stale, and container-cgroup or runtime hacks lack memory enforcement, fault isolation, and observability. The operational burden of a hand-rolled sharing layer usually exceeds the cost of the maintained options above, or of buying hardware that supports MIG.

Validated usage & tests

Reference templates, not hardware-tested. Output shapes are described; no numbers are invented.

Memory limit holds. Run a container that tries to allocate more than its nvidia.com/gpumem ceiling. Expect a CUDA out-of-memory in that container, with co-tenants unaffected.
Packing works. Schedule two fractional pods whose memory requests sum to less than the card's memory, and confirm both reach Running on the same physical GPU (check the nvidia.com/gpu device-ID attribution via the pod-resources API; nvidia-smi reference).
Scale-to-zero and wake. Idle the service and confirm replicas fall to zero; send one request and confirm a pod starts and answers. Time the cold start; it sets your warm-floor decision.
Rightsizing is real. Compare DCGM per-pod utilisation against the requested fraction. Persistent under-use means the request is still oversized; frequent OOM or throttling means it is undersized.
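The last check can be expressed as a simple classification of observed peak against the request. The thresholds below (under 60% of the request counts as oversized, over 95% as undersized) are arbitrary assumptions for illustration, and `classify_request` is a hypothetical helper rather than anything a scheduler provides.

```python
def classify_request(requested, observed_peak, low=0.60, high=0.95):
    """Label a request by how much of it the workload actually used."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    ratio = observed_peak / requested
    if ratio < low:
        return "oversized"
    if ratio > high:
        return "undersized"
    return "ok"


# Memory in MB; the same comparison applies to a compute percentage.
print(classify_request(requested=8000, observed_peak=3000))  # oversized
print(classify_request(requested=8000, observed_peak=7900))  # undersized
print(classify_request(requested=8000, observed_peak=6500))  # ok
```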

Failure modes

Cold-start latency spike. Scaling from zero adds image pull, CUDA init, and weight load to the first request. Keep a warm replica for latency-critical paths and use a fast loader (inference SLO breach).
Memory reservation blocks packing. A pod that reserves most of HBM (the vLLM --gpu-memory-utilization case) prevents co-location even at zero compute. Size the reservation to real KV demand (KV cache management).
Overpacking, then noisy neighbour or OOM. Software limits are not hardware isolation; oversubscribing memory or compute degrades every co-tenant. If you need guarantees, move to MIG.
Rightsizing on stale metrics. Sizing from a quiet window under-provisions the next burst. Rightsize on a representative window and leave headroom.
Dynamic MIG needs the GPU free. Repartitioning a MIG geometry requires no running workloads on the card; drain first, as with any MIG reconfigure.
Unmaintained tooling. Pre-DRA operators (for example nos) have not tracked recent MIG and DRA changes; prefer maintained paths.
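For the memory-reservation failure mode, sizing to real KV demand can start from the usual per-token KV-cache estimate for a transformer: one key and one value vector per layer. The sketch below is a back-of-envelope calculation under stated assumptions; the model shape, the token count, the overhead allowance, and both helper names are hypothetical, and the result is a starting point to validate against real traffic.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def memory_fraction(weights_gb, kv_tokens, bytes_per_token, overhead_gb, hbm_gb):
    """Fraction of HBM covering weights, the KV cache, and runtime overhead."""
    kv_gb = kv_tokens * bytes_per_token / 1e9
    return (weights_gb + kv_gb + overhead_gb) / hbm_gb


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
fraction = memory_fraction(
    weights_gb=16,              # hypothetical model weights
    kv_tokens=64_000,           # tokens cached across all concurrent requests
    bytes_per_token=per_token,
    overhead_gb=2,              # assumed allowance for activations and buffers
    hbm_gb=80,
)
print(per_token)           # 131072
print(round(fraction, 2))  # 0.33
```

A reservation near a third of the card, rather than most of it, is what leaves room for a co-tenant in this example.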

References

HAMi (fractional GPU by memory + core %, CNCF Sandbox): https://github.com/Project-HAMi/HAMi · resource semantics: https://project-hami.io/docs/userguide/nvidia-device/specify-device-core-usage/
HAMi-core (in-container enforcement, libvgpu.so hijacking libcudart/libcuda): https://github.com/Project-HAMi/HAMi-core
NVIDIA KAI Scheduler (open-sourced from Run:ai, fractional GPU + gang scheduling): https://github.com/NVIDIA/KAI-Scheduler · https://github.com/NVIDIA/KAI-Scheduler/blob/main/docs/gpu-sharing/README.md
Kubernetes DRA GA in v1.34: https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/ · concepts: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
NVIDIA DRA driver for GPUs (k8s-dra-driver-gpu): https://github.com/NVIDIA/k8s-dra-driver-gpu
Knative scale-to-zero: https://knative.dev/docs/serving/autoscaling/scale-to-zero/ · KEDA: https://keda.sh/
NVIDIA Run:ai Model Streamer (cold-start reduction): https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/ · vLLM: https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer/
Monitoring GPUs in Kubernetes with DCGM (rightsizing measurement): https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/
vLLM gpu_memory_utilization (per-instance HBM reservation): https://docs.vllm.ai/en/stable/configuration/optimization/
Context, GPU sharing surveys: https://scaleops.com/blog/kubernetes-gpu-sharing/ · https://www.vcluster.com/blog/diy-gpu-sharing-in-kubernetes
Nebuly nos (dynamic MIG; dormant): https://github.com/nebuly-ai/nos