Markdown

vLLM-Omni: disaggregated serving for any-to-any multimodal models¶

Scope: vLLM-Omni (arXiv 2602.02204), the vLLM-project system that serves any-to-any multimodal models by splitting them into independently served stages: the stage-graph abstraction, the disaggregated execution backend (a vLLM engine per autoregressive stage, a dedicated diffusion engine for DiT stages), streaming stage output, and the unified connector that moves embeddings, hidden states, and media tensors between stages. This is a different disaggregation axis from the prefill/decode split of disaggregated inference: stages here are heterogeneous model components, not phases of one model. The sizing logic for balancing pools carries over from disaggregation rate matching; the general serving context is inference serving.

Benchmark numbers are the paper's (two 80 GB accelerators, vLLM 0.12.0 era) and were not reproduced here. Install and API snippets are reference templates from the project docs as of 2026-07; the project moves fast (releases track upstream vLLM), so verify against the repo. The numpy example is executed and asserted.

flowchart TB
  IN["Inputs: text + image / video / audio"] --> ENC["Multimodal encoders<br/>(separate stage or folded into the LLM stage)"]
  ENC --> THINK["AR stage: Thinker LLM<br/>(vLLM engine, own pool, own batching)"]
  THINK -->|"Thinker2Talker transfer fn"| TALK["AR stage: Talker LLM<br/>(vLLM engine, audio codec tokens)"]
  THINK --> TXT["Text output"]
  TALK -->|"Talker2Vocoder transfer fn,<br/>plus streaming stage output"| VOC["Generator stage: Vocoder / DiT<br/>(diffusion engine, own pool)"]
  VOC --> WAV["Audio / image / video output"]
  ORCH["Orchestrator: request routing,<br/>stage management"] -.-> THINK
  ORCH -.-> TALK
  ORCH -.-> VOC
  CONN["Unified connector: queues / shared memory,<br/>Ray + Mooncake (TCP or RDMA) across nodes"] -.-> TALK
  CONN -.-> VOC

What it is¶

vLLM-Omni extends vLLM to models whose output is not just text. Any-to-any models bundle several heterogeneous components into one pipeline: Qwen-Omni chains a Thinker LLM (text tokens), a Talker LLM (audio codec tokens), and a vocoder (waveforms; a DiT in Qwen2.5-Omni, a lightweight CNN in Qwen3-Omni); GLM-Image feeds a 9B autoregressive LLM into a 7B single-stream DiT; BAGEL routes understanding and generation through separate Mixture-of-Transformers experts; LongCat-Flash-Omni pairs a 560B MoE backbone with an LSTM/CNN audio decoder.² Engines built for one paradigm cannot express these pipelines: vLLM and SGLang encapsulate a step-centric AR decode loop, Diffusers-style stacks serve DiT denoising, and neither represents coordinated execution across both. Developers who wire the components together by hand lose continuous batching, chunked prefill, and per-component resource allocation, which is exactly the performance the serving frameworks exist to provide.¹

vLLM-Omni's answer is a stage graph: nodes are model stages (AR LLM, DiT, CNN and other generators), and edges are user-defined stage-transfer functions that transform and route intermediate data. Each stage runs on its own execution engine with its own scheduler, batching, memory budget, and parallelism configuration, and a unified connector moves data between them, so the whole pipeline is disaggregated end to end.¹ The project is an official vLLM-project repository (released November 2025, first stable release 0.14.0 in February 2026) and tracks upstream vLLM releases; as of mid-2026 it ships an OpenAI-compatible server, tensor/pipeline/data/expert parallelism per stage, and CUDA, ROCm, MUSA, NPU, and XPU backends.⁵

Why use it¶

Monolithic multimodal serving wastes most of the hardware. Against the default Transformers implementation on two 80 GB accelerators, vLLM-Omni cuts job completion time by 61.6% on Qwen2.5-Omni and by 91.4% on Qwen3-Omni (real-time factor down 61.4% and 90.7% respectively). Per-stage throughput shows where the gain lives: 12.97x Thinker and 7.98x Talker tokens/s on Qwen3-Omni, whose 30B Thinker amortizes the optimized execution pipeline far better than Qwen2.5-Omni's 7B.³
Two-stage AR+generator models gain too. BAGEL image generation at 1024x1024 drops from 23.12 s to 9.64 s per job for text-to-image (2.40x) and from 41.39 s to 11.12 s for image editing (3.72x); MiMo-Audio text-to-speech falls from a real-time factor of 1.39 to 0.60 without execution-graph compilation and 0.12 with it, an 11.58x speedup.³
The diffusion engine stands alone as well. On DiT-only workloads (Qwen-Image and Qwen-Image-Edit at 1024x1024; Wan2.2 text/image-to-video at 480x640, 80 frames), it outperforms Diffusers by 1.26x overall by reusing vLLM's operator optimizations and attention backends plus DiT-specific ones (SAGE and TurboAttention, TeaCache and cache-dit caching, RingAttention context parallelism, Ulysses sequence parallelism).³
Transfer overhead is noise. The unified connector moves Thinker-to-Talker payloads in 5.49 ms over shared memory (8.28 ms over Mooncake) and Talker-to-Vocoder in 0.53 ms (3.34 ms), against end-to-end inference measured in tens of seconds.⁴

When to use it (and when not)¶

Use it for any model whose pipeline contains more than one generator: Thinker-Talker audio chat, AR-plus-DiT image generation and editing, TTS stacks, and video generation. The supported-model list (Qwen3-Omni, BAGEL, GLM-Image, MiMo-Audio, Qwen-Image, Wan2.2, FLUX, Cosmos, plus TTS families) is the practical gate; check it first.⁵
Use it when stages have visibly different appetites: the paper's running example allocates more memory to the 30B Thinker but more compute and parallelism to the smaller, compute-hungry Talker, which is only expressible when stages are separate engines.¹
For text-only models, plain vLLM is the right tool; the omni layer adds a stage graph you do not need, and the prefill/decode split of disaggregated inference already covers the phase axis (vLLM-Omni's connector remains compatible with encode-prefill-decode disaggregation within a stage).
Mind the maturity gradient. The stage abstraction requires the model's pipeline logic to be expressed as forward, preprocess, and transfer functions; a model family not yet ported needs that development investment (see How to develop with it).
Single-GPU hobby deployments still work (stages time-share one device) but forfeit the resource-allocation benefits that motivate the design.

Architecture¶

An orchestrator manages stage execution and request routing. Each stage runs an independent engine: vLLM for AR stages (own scheduler, KV manager, model runner, with chunked prefill and execution-graph compilation inherited), and a dedicated diffusion engine for DiT stages. Users express a model as three function types: a step-centric forward per stage, a preprocess per stage that reconstructs stage inputs and runs every iteration (the Talker re-concatenates Thinker hidden states at each decode step), and one-shot stage-transfer functions on graph edges (Thinker2Talker, Talker2Vocoder). A per-request dictionary carries intermediate state that transfer and preprocess functions read and update.¹

Streaming stage output lets a downstream stage start on partial upstream results: the vocoder begins synthesizing as soon as the Talker emits its first codec tokens, which overlaps stage execution and cuts time-to-first-token for the final modality. The unified connector generalizes vLLM's KV-transfer interface to embeddings, hidden states, and audio/image tensors: inline control queues for small payloads and shared memory for large ones on a single node; Ray orchestration with a Mooncake-based TCP/RDMA connector (a put/get interface with metadata in the control plane) across nodes. The same layer handles intra-stage transfers (prefill/decode KV, encoder-to-prefill multimodal cache), so the phase-level disaggregation this KB covers elsewhere composes with the stage-level one.¹

The scheduling problem the stage graph exposes is pool sizing per stage, and its arithmetic is the same rate-matching logic as prefill/decode pool sizing, one dimension higher. The executed model below validates the four operational claims: pipelined throughput is the bottleneck stage's capacity (confirmed by fluid simulation), rebalancing a fixed instance budget raises it, the monolithic pattern is strictly worse on identical hardware, and a traffic-mix shift moves the bottleneck so a stale allocation loses to a re-matched one:

# omni_stages.py - validated: the stage-disaggregation arithmetic behind
# vLLM-Omni's stage graph. First-order fluid model of heterogeneous pipeline
# scheduling; validates the scheduling claims, not a rerun of the paper.
from itertools import product

import numpy as np


def capacities(service_s: list[float], instances: list[int]) -> list[float]:
    """Per-stage capacity in requests/s: instances over per-request service time."""
    assert len(service_s) == len(instances) and all(n >= 1 for n in instances)
    return [n / s for s, n in zip(service_s, instances)]


def pipeline_throughput(service_s: list[float], instances: list[int]) -> float:
    """Steady-state pipelined (disaggregated) throughput: the bottleneck capacity."""
    return min(capacities(service_s, instances))


def monolithic_throughput(service_s: list[float], replicas: int) -> float:
    """Monolithic baseline (the paper's Transformers pattern): each request holds
    a whole pipeline replica end to end, so stages never overlap across requests."""
    assert replicas >= 1
    return replicas / sum(service_s)


def simulate(service_s: list[float], instances: list[int], rate: float,
             steps: int = 3000) -> np.ndarray:
    """Fluid queues (requests) per stage over 1 s steps; column j is stage j."""
    q = np.zeros(len(service_s))
    hist = np.zeros((steps, len(service_s)))
    for t in range(steps):
        inflow = rate
        for j, (s, n) in enumerate(zip(service_s, instances)):
            q[j] += inflow
            served = min(q[j], n / s)
            q[j] -= served
            inflow = served
        hist[t] = q
    return hist


def best_allocation(service_s: list[float], budget: int) -> tuple[int, ...]:
    """Brute-force optimal instance split for a fixed budget (>=1 per stage)."""
    allocs = [a for a in product(range(1, budget + 1), repeat=len(service_s))
              if sum(a) == budget]
    return max(allocs, key=lambda a: pipeline_throughput(service_s, list(a)))


# Workload A (audio-chat mix). Per-instance service seconds per request:
# thinker (AR text) 0.4, dit (denoising) 1.0, vocoder (waveform) 0.2.
SVC_A = [0.4, 1.0, 0.2]

# 1) The fluid simulation confirms the analytic bottleneck capacity.
alloc = [2, 3, 2]                                    # 7 instances
tp = pipeline_throughput(SVC_A, alloc)               # min(5.0, 3.0, 10.0) = 3.0 req/s
assert tp == 3.0
assert simulate(SVC_A, alloc, rate=2.9)[-1].max() < 1.0        # bounded below capacity
h = simulate(SVC_A, alloc, rate=3.5)                            # 0.5 req/s over capacity
assert np.all(np.diff(h[10:, 1]) > 0) and h[-1, 1] > 1e3        # only the DiT queue grows
assert h[-1, 0] < 1.0 and h[-1, 2] < 1.0

# 2) Rebalancing the same 7 instances (vocoder pool 2 -> 1, DiT pool 3 -> 4)
# raises end-to-end throughput by a third: 3.0 -> 4.0 req/s.
alloc_rm = [2, 4, 1]
tp_rm = pipeline_throughput(SVC_A, alloc_rm)         # min(5.0, 4.0, 5.0) = 4.0 req/s
assert tp_rm == 4.0 and tp_rm > tp
assert best_allocation(SVC_A, 7) == (2, 4, 1)        # and it is the optimum

# 3) Disaggregation vs the monolithic pattern on the same 7 instances: a
# monolithic replica pins one instance per component (3 each, 2 replicas fit,
# one instance idles) and overlaps nothing, so it serves 2/1.6 = 1.25 req/s.
tp_mono = monolithic_throughput(SVC_A, replicas=7 // len(SVC_A))
assert np.isclose(tp_mono, 1.25) and tp_rm > 3 * tp_mono

# 4) Mix shift: long-text traffic (thinker 0.4 -> 1.2 s/req) moves the
# bottleneck from the DiT stage to the thinker under the stale allocation,
# and a different split of the same budget is now strictly better.
SVC_B = [1.2, 1.0, 0.2]
caps_a = capacities(SVC_A, alloc_rm)
caps_b = capacities(SVC_B, alloc_rm)
assert int(np.argmin(caps_a)) == 1                   # workload A: DiT-bound
assert int(np.argmin(caps_b)) == 0                   # workload B: thinker-bound
tp_stale = pipeline_throughput(SVC_B, alloc_rm)
best_b = best_allocation(SVC_B, 7)
assert best_b != tuple(alloc_rm)
assert pipeline_throughput(SVC_B, list(best_b)) > tp_stale

print(f"workload A: alloc {alloc} -> {tp:.2f} req/s; rebalanced {alloc_rm} -> {tp_rm:.2f} req/s")
print(f"monolithic 2-replica baseline: {tp_mono:.2f} req/s ({tp_rm / tp_mono:.1f}x slower than disaggregated)")
print(f"mix shift: bottleneck stage {int(np.argmin(caps_a))} -> {int(np.argmin(caps_b))}, "
      f"stale {tp_stale:.2f} req/s vs re-matched {pipeline_throughput(SVC_B, list(best_b)):.2f} req/s ({best_b})")
print("all stage-disaggregation assertions passed")

Output of the run: workload A goes from 3.00 to 4.00 req/s when the same 7 instances are rebalanced (and brute force confirms (2, 4, 1) is optimal), the monolithic two-replica baseline manages 1.25 req/s (3.2x slower than the disaggregated 4.00), and the mix shift moves the bottleneck from stage 1 to stage 0, where the stale allocation's 1.67 req/s loses to a re-matched 2.50 req/s at (3, 3, 1). The paper's Qwen3-Omni time decomposition shows why sizing matters: the Talker dominates end-to-end latency because audio needs far more tokens than text (on video inputs, an average of 841.6 input tokens produce 150.9 text but 545.4 audio tokens).³

How to use it¶

Reference templates from the project docs (as of 2026-07; the project pins against specific upstream vLLM versions, so take the exact pin from the current quickstart):

# Install: vLLM first (version per the quickstart matrix), then vLLM-Omni.
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install vllm==0.24.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni && uv pip install -e .

# Reference template (unexecuted): offline generation through the Omni entrypoint.
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Tongyi-MAI/Z-Image-Turbo")        # stage_configs_path=... for custom splits
outputs = omni.generate("a cup of coffee on the table")
outputs[0].request_output.images[0].save("coffee.png")

Serving uses the standard vLLM CLI with the omni flag (vllm serve <model> --omni --port 8091) and an OpenAI-compatible API; generation parameters for visual stages (resolution, steps, guidance, seed) ride in extra_body. Per-stage runtime configuration (parallelism strategy, memory budget, device placement per stage) is what the paper's evaluation tunes: Thinker tensor-parallel across both accelerators, Talker on one device, Vocoder on the other.³

How to develop with it¶

Porting a model means writing its stage graph: a forward per stage in the familiar step-centric style, preprocess functions where a stage consumes upstream state each iteration, and transfer functions on the edges. Multimodal encoders can be a separate stage or folded into the first LLM stage (the Qwen implementation folds them, following vLLM). The discipline that keeps a port maintainable is treating the per-request intermediate dictionary as the only channel between stages; logic that bypasses it re-couples stages and silently breaks disaggregated placement. For the DiT side, the diffusion engine already carries the attention backends, denoising caches, and sequence/context parallelism, so a new visual stage inherits those optimizations rather than reimplementing them (each model family still ships its own pipeline implementation).¹

How to maintain it¶

Track the release train. Releases align with upstream vLLM versions (0.16.0 rebased onto vLLM v0.16.0; 0.22.0 aligns with vLLM 0.22); pin the pair and upgrade them together, re-running a fixed benchmark per stage type on every bump.⁵
Re-tune stage allocation on model or traffic change. The executed example above is the failure in miniature: an allocation tuned for one mix quietly becomes the wrong one; alert on per-stage queue depth divergence, which the fluid model shows is monotone and fast.
Watch connector transports. Shared-memory transfers on one node and Mooncake across nodes have different failure and latency profiles (5.49 ms vs 8.28 ms on the paper's slower edge, Thinker2Talker); a topology change that silently moves an edge from one to the other is measurable but easy to miss.⁴
Keep an eye on model coverage. Support lands per family and per release (Cosmos3 and DreamZero world models, MiniCPM-o 4.5, MOSS-TTS in 0.22.0); the supported-models page, not the README headline, is the source of truth.⁵

Running it in production¶

Size each stage pool by rate matching against the traffic mix, exactly as for prefill/decode pools: measure per-stage service time at production batch sizes, set the instance ratio to balance stage capacities, and re-derive when the mix moves (audio-heavy vs image-heavy requests bottleneck different stages, as the executed example asserts). Streaming stage output is the TTFT lever: for interactive audio, confirm the vocoder consumes partial Talker output rather than full sequences. Cross-node placement puts bandwidth-heavy edges (hidden states, media tensors) on RDMA-backed Mooncake transport; per-edge connector configuration means one slow edge does not force the whole graph onto the network. For capacity math, the per-modality metrics are the ones the paper reports: real-time factor and JCT for audio, JCT for visual generation, per-stage tokens/s for diagnosis.³ The OpenAI-compatible server makes the fleet look like any other vLLM deployment to the routing layer.

Failure modes¶

Stage-pool imbalance. One undersized stage starves the pipeline while other pools idle; the symptom is queue growth in exactly one stage (validated above). Fix by re-matching the instance ratio, not by scaling everything.
Traffic-mix drift. An allocation tuned for chat traffic degrades when image-editing requests arrive; the bottleneck stage moves. Monitor per-stage saturation, not just end-to-end latency.
Batching-regime mismatch across stages. AR stages thrive on continuous batching while DiT stages batch over denoising steps; forcing one scheduler policy across both (as monolithic implementations do) forfeits most of the measured gains.
Transfer edges on the wrong transport. A hidden-state edge accidentally routed over TCP instead of shared memory or RDMA adds per-request latency on every token boundary; per-edge connector settings deserve review in any topology change.
Version skew between vLLM core and the omni layer. The stage engines embed vLLM internals; mixing an untested pair produces subtle scheduler and KV-manager breakage rather than clean errors. Upgrade as a pinned pair.⁵
Per-stage OOM from mixed workloads. Memory budgets are per stage; a long-context request through the Thinker and a high-resolution request through the DiT stress different pools, so headroom must be provisioned per stage, not globally.

References¶

Yin et al., vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models (arXiv 2602.02204): https://arxiv.org/abs/2602.02204
Hugging Face paper page: https://huggingface.co/papers/2602.02204
vLLM-Omni repository (vllm-project): https://github.com/vllm-project/vllm-omni
vLLM-Omni documentation: https://docs.vllm.ai/projects/vllm-omni/en/latest/
vLLM repository: https://github.com/vllm-project/vllm
Mooncake (KV-cache-centric transfer engine used by the cross-node connector): https://github.com/kvcache-ai/Mooncake

vLLM-Omni (arXiv 2602.02204): stage graph (nodes are AR/DiT/CNN stages, edges are transfer functions); per-stage engines with independent scheduling, batching, memory budgets, and parallelism; preprocess functions run every iteration (Talker re-concatenates Thinker hidden states per decode step), transfer functions once per request; orchestrator for routing; diffusion engine with flash/SAGE/TurboAttention, TeaCache and cache-dit, RingAttention context parallelism, Ulysses sequence parallelism; streaming stage output for TTFT; unified connector generalizing KV transfer to embeddings, hidden states, and media tensors (queues plus shared memory on-node; Ray plus Mooncake TCP/RDMA across nodes; also carries prefill/decode KV and encoder multimodal cache, EPD-compatible); hardware support via vLLM's plugin architecture. ↩↩↩↩↩↩
Architectures surveyed in the paper: Qwen-Omni Thinker-Talker (vocoder is a DiT in Qwen2.5-Omni, CNN-based in Qwen3-Omni); GLM-Image (semantic-VQ encoder, 9B GLM-4 AR LLM, 7B single-stream DiT); BAGEL (Mixture-of-Transformers with separate understanding/generation experts); LongCat-Flash-Omni (560B MoE plus LSTM/CNN audio decoder); Step-Audio (130B LLM, DiT flow-matching, neural vocoder). ↩
Paper evaluation (two 80 GB accelerators, 24 CPU cores, 192 GB RAM, vLLM 0.12.0; librispeech_asr / food101 / ucf101-subset, first 100 queries each; Thinker TP across both devices, Talker on device-1, Vocoder on device-0): Qwen2.5-Omni RTF -61.4%, JCT -61.6%, Thinker TPS 1.29x, Talker TPS 1.97x; Qwen3-Omni RTF -90.7%, JCT -91.4%, Thinker TPS 12.97x, Talker TPS 7.98x (30B vs 7B Thinker; baseline lacks graph compilation); Talker dominates latency (video inputs: 841.6 avg input tokens, 150.9 text out, 545.4 audio out); BAGEL 1024x1024 on VBench: T2I 23.12 s to 9.64 s (2.40x), I2I 41.39 s to 11.12 s (3.72x); MiMo-Audio on SeedTTS: RTF 1.39 to 0.60, 0.12 with graph compilation (11.58x); diffusion engine vs Diffusers 1.26x overall (Qwen-Image, Qwen-Image-Edit, Wan2.2 T2V/I2V at 480x640, 80 frames); BAGEL and MiMo-Audio ran on a single 80 GB accelerator. ↩↩↩↩↩↩
Paper Table 1 (Qwen2.5-Omni): Thinker2Talker 5.49 ms shared memory / 8.28 ms Mooncake; Talker2Vocoder 0.53 ms / 3.34 ms; negligible against tens-of-seconds inference. ↩↩
vllm-project/vllm-omni README (verified 2026-07): community release 2025-11; 0.14.0 first stable (2026-02); 0.16.0 rebases onto upstream vLLM v0.16.0; 0.22.0 (2026-06) aligns with vLLM 0.22 and adds Cosmos3/DreamZero world models, MiniCPM-o 4.5, MOSS-TTS, and VeRL-Omni RL integration; OmniConnector-based full disaggregation with dynamic resource allocation; OpenAI-compatible server; TP/PP/DP/EP; CUDA, ROCm, MUSA, NPU, XPU backends; supported families include Qwen3-Omni, Cosmos, HunyuanImage, BAGEL, Qwen3-TTS, VoxCPM2, Ming-Omni-TTS, CosyVoice3, Qwen-Image, Wan2.2, FLUX. ↩↩↩↩↩