MPS (multi-process service)

Scope: CUDA Multi-Process Service (MPS), a software space-sharing mechanism that lets many cooperative processes run CUDA kernels concurrently on one GPU through a shared server context. Covers what it isolates (and what it does not), how to start, stop, and provision it, and the failure mode where one client's fatal fault propagates to every client on the GPUs it shares.

What it is

MPS is a runtime service that funnels CUDA work from multiple host processes into a single server context per GPU, so kernels from different processes can execute concurrently on different SMs instead of being context-switched. It is the software space-sharing alternative to plain time-slicing (one context active at a time, so work is serialized) and to MIG (hardware partitions). NVIDIA describes it as "a lightweight runtime service" for "cooperative multi-process CUDA" workloads. (NVIDIA MPS overview)

Three components:

Control daemon (nvidia-cuda-mps-control): starts/stops servers and brokers client connections.
Server (nvidia-cuda-mps-server): runs under the same $UID as its clients, "owns the CUDA context on the GPU and uses it to execute GPU operations for its client application processes." (Tools and Interface Reference)
Client runtime: linked into every CUDA process; transparently redirects work to the server.

All three communicate over named pipes and UNIX domain sockets in a shared pipe directory. (Architecture)

On Volta and later ("Volta MPS"), each client gets its own GPU address space: "MPS client processes have fully isolated GPU address spaces." Volta MPS clients also submit work directly to the GPU rather than passing it through the server. Pre-Volta MPS shared one address space across clients. (When to Use MPS, Architecture)

Key distinction from MIG: MPS is software space-sharing with no hardware isolation of compute or memory bandwidth. There are no enforced SM partitions, no memory-bandwidth QoS, and, critically, only limited fault containment (see Failure modes). For hardware-isolated multi-tenant slices, use MIG.

Why it's needed (and when)

Use MPS when a single process cannot saturate the GPU and the processes are cooperative (same trust domain, ideally same user). NVIDIA's guidance:

"MPS is useful when each application process does not generate enough work to saturate the GPU."
Strong-scaling: "These cases arise in strong-scaling situations, where the compute capacity ... is increased while the problem size is held fixed."
Low occupancy: "if the application shows a low GPU occupancy because of a small number of threads-per-grid, performance improvements may be achievable with MPS."

(When to Use MPS)

Canonical fits:

MPI ranks sharing a node's GPUs: multiple ranks per GPU, each issuing small kernels.
Many small inference processes that individually leave SMs idle.
Latency-sensitive co-location where time-slicing's context-switch and serialization cost is the bottleneck.

Do not reach for MPS when:

The workload is a single large training job: give it the whole GPU; MPS only adds overhead.
You need isolation between untrusted tenants: MPS shares one server context and has no memory-bandwidth QoS and only limited fault containment. Use MIG.

MPS vs MIG vs time-slicing

| Mechanism | Sharing model | Compute isolation | Memory isolation | Fault isolation | Typical use |
| --- | --- | --- | --- | --- | --- |
| Time-slicing | Temporal: one context, round-robin | None | None (shared, no limits) | None | Bursty/dev, simplest |
| MPS | Spatial: one shared server context, concurrent kernels | Soft (optional thread-% cap, not enforced bandwidth) | Per-client address space (Volta+); optional pinned-mem cap | Limited: fault contained to shared GPUs, not per-client | Cooperative MPI / many small procs |
| MIG | Hardware partitions | Hard (dedicated SM slices) | Hard (dedicated memory + L2, bandwidth QoS) | Hard (per-instance) | Multi-tenant, strict QoS |

Rule of thumb: time-slice for simple bursty sharing, MPS for throughput on cooperative same-user workloads, MIG when isolation or QoS is a requirement. They compose: MPS can run inside a MIG instance to pack cooperative processes into a hardware-isolated slice. (When to Use MPS, MIG vs MPS vs time-slicing in Kubernetes)

How it's installed & managed

MPS ships with the NVIDIA driver: nvidia-cuda-mps-control and nvidia-cuda-mps-server are installed alongside it (CUDA driver), so there is no separate package to install. Verify the binary is present:

which nvidia-cuda-mps-control

Reference template, not hardware-tested.

Multi-user system: run the daemon as root; it auto-spawns one server per $UID. Set the GPU to exclusive compute mode first so nothing else grabs the context:

export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

Single-user system: run the daemon under the same UID as the clients, and point clients at the same pipe/log directories:

export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

(Common Tasks)

-d starts the daemon in background mode; -f runs it in the foreground. Both require "enough privilege (e.g. root)." (nvidia-cuda-mps-control(1))

Stop MPS (drains servers, then exits the daemon):

echo quit | nvidia-cuda-mps-control

quit [-t TIMEOUT]: "Shutdown the MPS control daemon process and all MPS servers." If you set exclusive mode earlier, return the GPU to the default compute mode afterward:

nvidia-smi -i 0 -c DEFAULT

(Common Tasks, nvidia-cuda-mps-control(1))

Reference template, not hardware-tested.

Environment variables that matter

| Variable | Effect |
| --- | --- |
| `CUDA_MPS_PIPE_DIRECTORY` | Directory of the named pipes / UNIX sockets used by control, server, and clients. Must match between daemon and clients (single-user). |
| `CUDA_MPS_LOG_DIRECTORY` | Where the daemon writes `control.log` (server status, commands, start/stop notices). Daemon-only. |
| `CUDA_VISIBLE_DEVICES` | Which GPUs the control daemon (and clients) may use. |
| `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` | Volta+ only. "sets the portion of the available threads that can be used by the client contexts." A soft SM cap; the limit is rounded internally to a hardware-supported thread count, so the effective share may differ slightly from the value requested. |
| `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` | "limits the amount of GPU memory that is allocatable by CUDA APIs by the client process." |
| `CUDA_DEVICE_MAX_CONNECTIONS` | Preferred number of host-to-device work-queue connections per client. |

(Tools and Interface Reference, nvidia-cuda-mps-control(1))

Provisioning compute and memory at the daemon

The control daemon accepts commands on stdin. Cap the default per-client SM share (the default is 100, meaning no cap), and optionally set a per-device memory ceiling for each client:

# all servers spawned after this get ~50% of SMs by default
echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control

# cap allocatable device memory on device 0 (example value)
echo "set_default_device_pinned_mem_limit 0 8G" | nvidia-cuda-mps-control

Reference template, not hardware-tested. Pick the percentage and memory ceiling for your hardware and workload.
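
One way to pick the memory ceiling is to split the device's memory evenly across the expected number of clients after holding back some headroom. The sketch below does only that arithmetic. The helper name `mem_limit_per_client`, the equal-split policy, and the headroom figure are illustrative assumptions rather than NVIDIA guidance, and it assumes the `M` qualifier is close enough to the MiB figure that `nvidia-smi` reports.

```bash
#!/bin/sh
# Hypothetical helper (not part of MPS): equal-split memory ceiling.
# Usage: mem_limit_per_client TOTAL_MIB CLIENTS [RESERVE_MIB]
# Prints a value such as "20224M" for set_default_device_pinned_mem_limit.
mem_limit_per_client() {
    total_mib=$1
    clients=$2
    reserve_mib=${3:-0}          # headroom left unassigned (assumption)

    # Reject a zero or negative client count.
    [ "$clients" -gt 0 ] || return 1

    usable=$((total_mib - reserve_mib))
    [ "$usable" -gt 0 ] || return 1

    # Integer division rounds down, so the shares never exceed the usable total.
    echo "$((usable / clients))M"
}

# Example wiring (needs a running daemon, so it is left commented out):
#   total=$(nvidia-smi -i 0 --query-gpu=memory.total --format=csv,noheader,nounits)
#   echo "set_default_device_pinned_mem_limit 0 $(mem_limit_per_client "$total" 4 1024)" \
#       | nvidia-cuda-mps-control
```

For instance, `mem_limit_per_client 81920 4 1024` prints `20224M`. Not hardware-tested; treat the result as a starting point and measure.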

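The thread percentage can be set at three levels, from broadest to most specific: the daemon default (CUDA_MPS_ACTIVE_THREAD_PERCENTAGE in the daemon's environment, or set_default_active_thread_percentage, applied to servers spawned afterwards), per server via set_active_thread_percentage <PID> <percentage>, and per client via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE in the client's own environment. "The default is 100." (Tools and Interface Reference)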
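
The levels described above can be sketched as a small helper that validates a percentage and prints the matching control command or environment assignment. The function names `mps_pct_ok` and `mps_pct_cmd` and the level keywords are hypothetical, and restricting input to whole numbers from 1 to 100 is a conservative assumption of this sketch, not a documented limit.

```bash
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of MPS).

# mps_pct_ok PCT: succeed only for a whole number in 1..100 (our assumption).
mps_pct_ok() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}

# mps_pct_cmd LEVEL PCT [SERVER_PID]
#   default -> control command for servers spawned afterwards
#   server  -> control command for one running server (needs its PID)
#   client  -> environment assignment for a single client process
mps_pct_cmd() {
    level=$1
    pct=$2
    pid=${3:-}

    mps_pct_ok "$pct" || { echo "bad percentage: $pct" >&2; return 1; }

    case "$level" in
        default) echo "set_default_active_thread_percentage $pct" ;;
        server)
            [ -n "$pid" ] || { echo "server level needs a PID" >&2; return 1; }
            echo "set_active_thread_percentage $pid $pct" ;;
        client)  echo "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=$pct" ;;
        *)       echo "unknown level: $level" >&2; return 1 ;;
    esac
}

# Example wiring (needs a running daemon; ./my_client is a placeholder):
#   mps_pct_cmd default 50 | nvidia-cuda-mps-control
#   env "$(mps_pct_cmd client 25)" ./my_client
```

Keeping the command construction separate from the pipe into `nvidia-cuda-mps-control` lets the validation be checked on a machine without a GPU.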

Useful control commands: get_server_list, start_server -uid <user id>, get_client_list <PID>, set_default_active_thread_percentage <percentage>, get_default_active_thread_percentage, set_active_thread_percentage <PID> <percentage>, set_default_device_pinned_mem_limit <dev> <value>, get_default_device_pinned_mem_limit <dev>, quit. (Tools and Interface Reference)

Architecture flow:

flowchart LR
  C1["Client proc A - CUDA runtime"] -->|"UNIX socket"| CTL["nvidia-cuda-mps-control daemon"]
  C2["Client proc B - CUDA runtime"] -->|"UNIX socket"| CTL
  CTL -->|"spawns per UID"| SRV["nvidia-cuda-mps-server: one shared GPU context"]
  C1 -->|"submit work"| SRV
  C2 -->|"submit work"| SRV
  SRV -->|"concurrent kernels on different SMs"| GPU["GPU"]

Validated usage & tests

Reference templates, not hardware-tested. Output shapes are described; no specific numbers are invented.

Confirm the daemon is up. A live control daemon answers and lists running servers (empty until a client connects):

echo "get_server_list" | nvidia-cuda-mps-control

Expect: nothing (no clients yet) or one or more server PIDs once a CUDA process attaches. If the command hangs or errors, the daemon is not running or CUDA_MPS_PIPE_DIRECTORY does not match.
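
Because the probe can hang when the daemon is absent or the pipe directory is wrong, a scripted check may want a time limit. The following is a minimal sketch assuming GNU `timeout` is available and that the control CLI exits non-zero when it cannot reach the daemon (not verified here). `mps_daemon_up` and the `MPS_CTL` override variable are hypothetical names.

```bash
#!/bin/sh
# Hypothetical helper: probe the MPS control daemon without hanging forever.
# MPS_CTL is our own override variable (handy for testing), not an MPS setting.
# Returns 0 if the control CLI answered, 2 if the binary is missing, and
# another non-zero status on error or timeout (124 from GNU timeout).
mps_daemon_up() {
    ctl=${MPS_CTL:-nvidia-cuda-mps-control}
    limit=${1:-5}                 # seconds to wait before giving up

    command -v "$ctl" >/dev/null 2>&1 || return 2

    # Assumption: the CLI exits non-zero when it cannot reach the daemon.
    echo get_server_list | timeout "$limit" "$ctl" >/dev/null 2>&1
}

# Example (commented out because it needs the NVIDIA driver installed):
#   if mps_daemon_up 5; then echo "daemon answered"; else echo "no answer"; fi
```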

Confirm the server process exists once a client is running:

pgrep -a nvidia-cuda-mps

Expect: the nvidia-cuda-mps-control process and, after a client attaches, an nvidia-cuda-mps-server line.

See the MPS server in nvidia-smi. Under MPS, the nvidia-cuda-mps-server process shows up in the process table and MPS contexts are flagged in the Type column:

nvidia-smi

Expect: an nvidia-cuda-mps-server row plus at least one M+C (MPS + Compute) entry, rather than only plain C rows. Which rows carry the M+C type depends on GPU generation and driver version; on Volta and later it is typically the client processes, with the server itself listed as C. Either way, the server row and the M marker are the visible signature that MPS is active rather than plain time-slicing. See nvidia-smi reference.
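
For a scripted version of this check, keying on the server's process name is less fragile than parsing the human-readable table. The sketch below filters the CSV form of the compute-process list. `has_mps_server` is a hypothetical helper name, and it assumes the server appears in `--query-compute-apps` output under its binary name.

```bash
#!/bin/sh
# Hypothetical helper: succeed if a "pid, process_name" CSV stream on stdin
# contains a process whose name is (or ends in /) nvidia-cuda-mps-server.
has_mps_server() {
    awk -F', *' '
        $2 ~ /(^|\/)nvidia-cuda-mps-server$/ { found = 1 }
        END { exit !found }
    '
}

# Example (commented out because it needs a GPU and the NVIDIA driver):
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader \
#       | has_mps_server && echo "MPS server present"
```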

List a server's clients (PID from get_server_list):

echo "get_client_list <server_pid>" | nvidia-cuda-mps-control

Expect: the PIDs of every client currently attached to that server.

Smoke-test concurrency. Launch two or more small CUDA processes (for example two short framework jobs, see GPU frameworks) with the MPS env vars exported, then watch utilization:

nvidia-smi dmon -s u

Expect: with MPS, both processes make forward progress simultaneously and steady-state SM utilization is higher than running them serially under time-slicing. Do not assume a specific speedup. It depends entirely on kernel size and occupancy; measure on your hardware (fabric bring-up & benchmarking).
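
A smoke test along these lines can be driven by a small launcher that starts every client in the background and waits for all of them. This is a minimal sketch: `run_concurrent` is a hypothetical name, the client commands are placeholders, and it relies on the MPS variables already being exported so the children inherit them.

```bash
#!/bin/sh
# Hypothetical launcher: run each argument as a shell command in the
# background, wait for all of them, and fail if any client failed.
run_concurrent() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &            # children inherit exported MPS variables
        pids="$pids $!"
    done

    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done

    [ "$failed" -eq 0 ] || { echo "$failed client(s) failed" >&2; return 1; }
}

# Example (placeholder client commands; watch `nvidia-smi dmon -s u` meanwhile):
#   export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
#   run_concurrent "./client_a" "./client_b"
```

The launcher only proves that the clients ran side by side and exited cleanly; whether they actually overlapped on the GPU still has to be read from the utilization trace.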

Failure modes

Brief; each links its runbook.

A client crash can take down co-tenants. This is the defining MPS hazard. On Volta+, "A fatal GPU fault generated by a Volta MPS client process will be contained within the subset of GPUs shared between all clients with the fatal fault-causing GPU," and is "reported to all the clients running on the subset of GPUs in which the fatal fault is contained," while "Clients running on other GPUs remain unaffected." In short: a fatal fault poisons the shared server for every client on the GPUs it touches, not just the offender. This is why MPS is for cooperative, same-trust workloads only. (No matching runbook yet; see operational runbooks.) (When to Use MPS)
Pre-Volta is harsher. "On pre-Volta MPS, the MPS server shuts down after encountering a fatal fault. On Volta MPS, the MPS server becomes ACTIVE again after all faulting clients have disconnected." On Volta, recovery therefore requires every faulting client to disconnect; on pre-Volta, the server itself goes down. (Architecture)
Pipe-directory mismatch. Client and daemon disagree on CUDA_MPS_PIPE_DIRECTORY, so clients either fail to connect or silently bypass MPS and run as ordinary, time-sliced CUDA processes. Symptom: no nvidia-cuda-mps-server row in nvidia-smi, and only plain C rows with no M+C entry.
Compute mode not set. Without EXCLUSIVE_PROCESS on a multi-user box, a non-MPS process can grab the context and block the MPS server. Confirm with nvidia-smi -q | grep "Compute Mode".
No memory-bandwidth QoS. MPS does not partition memory bandwidth; a bandwidth-heavy client starves co-tenants. If you need QoS, the answer is MIG, not MPS.
A memory-hungry co-tenant defeats the point. MPS space-shares SMs, but it does not create separate memory pools by default. A client that reserves most of HBM up front (for example a vLLM server sizing its KV cache with --gpu-memory-utilization) leaves no room for the co-tenants MPS was meant to pack, even while its own SMs sit idle. On a shared GPU the binding constraint is usually memory, not compute: cap per-client memory (set_default_device_pinned_mem_limit, above) or right-size the reservation. See KV cache management and dynamic and fractional GPU sharing.
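
Two of the failure modes above, the pipe-directory mismatch and the unset compute mode, lend themselves to a quick scripted triage. The helpers below are a sketch under stated assumptions: the function names are hypothetical, reading `/proc/<pid>/environ` is Linux-only and needs the same user or root, `/tmp/nvidia-mps` is taken as the default pipe directory, and the `Exclusive_Process` spelling of the `compute_mode` query value is assumed rather than verified on hardware.

```bash
#!/bin/sh
# Hypothetical triage helpers (names are illustrative, not part of MPS).

# Print CUDA_MPS_PIPE_DIRECTORY from a NUL-separated environ file,
# for example /proc/<daemon_pid>/environ.
pipe_dir_of() {
    tr '\0' '\n' < "$1" | sed -n 's/^CUDA_MPS_PIPE_DIRECTORY=//p'
}

# Succeed if this shell and the daemon resolve to the same pipe directory.
# An unset variable falls back to /tmp/nvidia-mps (assumed default).
same_pipe_dir() {
    daemon_dir=$(pipe_dir_of "$1")
    [ "${daemon_dir:-/tmp/nvidia-mps}" = "${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}" ]
}

# Succeed if the single compute-mode value on stdin is Exclusive_Process
# (matched case-insensitively; the exact spelling is an assumption).
is_exclusive_process() {
    tr 'A-Z' 'a-z' | grep -Eq '^[[:space:]]*exclusive[_ ]process[[:space:]]*$'
}

# Example (commented out because it needs a running daemon and a GPU):
#   daemon_pid=$(pgrep -o -f nvidia-cuda-mps-control)
#   same_pipe_dir "/proc/$daemon_pid/environ" || echo "pipe directory mismatch"
#   nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader \
#       | is_exclusive_process || echo "GPU 0 is not in EXCLUSIVE_PROCESS mode"
```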

References

NVIDIA Multi-Process Service (overview): https://docs.nvidia.com/deploy/mps/index.html
When to Use MPS (memory protection, error containment, thread %): https://docs.nvidia.com/deploy/mps/when-to-use-mps.html
MPS Architecture (fatal-fault behavior, address spaces, sockets): https://docs.nvidia.com/deploy/mps/architecture.html
Appendix — Tools and Interface Reference (env vars, control commands, defaults): https://docs.nvidia.com/deploy/mps/appendix-tools-and-interface-reference.html
Appendix — Common Tasks (start/stop commands, single- vs multi-user): https://docs.nvidia.com/deploy/mps/appendix-common-tasks.html
nvidia-cuda-mps-control(1) man page: https://man.archlinux.org/man/extra/nvidia-utils/nvidia-cuda-mps-control.1.en
GPU sharing in Kubernetes — MIG vs MPS vs time-slicing (context only): https://scaleops.com/blog/kubernetes-gpu-sharing/