
Multi-LoRA / adapter serving

Scope: serving many LoRA adapters over a single shared base model, batching requests that use different adapters together and paging adapters in and out at request time, so one deployment serves hundreds of finetunes at nearly the cost of one. This page covers the serving-side payoff of LoRA finetuning, the systems techniques behind it (S-LoRA, Punica), and the vLLM API that makes it practical.

Blocks marked REFERENCE TEMPLATE use real APIs but were not run here; pin versions and validate before production use.

What it is

A LoRA adapter is a small low-rank delta (megabytes) over a frozen base (SFT/LoRA). Multi-LoRA serving keeps one base model resident and applies a per-request adapter, batching requests for different adapters in the same forward pass. Two systems established the technique:

  • S-LoRA serves thousands of adapters on a single GPU via Unified Paging (adapters and KV cache share one memory pool) and custom heterogeneous-batching CUDA kernels; up to ~4× the throughput of naive per-adapter serving [1].
  • Punica is a multi-tenant design whose SGMV (Segmented Gather Matrix-Vector) kernel batches computation across different adapters against a single base copy; ~12× throughput over prior systems at ~2ms/token added latency [2].

vLLM implements this natively (enable_lora), so in practice you configure it rather than build it.

Why use it

  • One base, many finetunes. Serve hundreds of per-customer or per-task adapters at roughly the memory of a single model: the adapters are MBs, the base is the only large resident weight.
  • Cost and utilization. Batching across adapters keeps the GPU busy instead of running one under-utilized deployment per finetune (S-LoRA ~4×, Punica ~12× throughput) [1][2].
  • Hot-swap at runtime. Add, remove, and route adapters without redeploying the base. This is the natural serving target for a fleet of LoRA finetunes and a per-tenant data flywheel.
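
To make "the adapters are MBs" concrete, the sketch below counts the parameters of one adapter on Llama-3.1-8B. The layer shapes (hidden size 4096, 32 layers, 8 KV heads of dimension 128) come from the published model config. The choice of target modules (attention q/k/v/o only), BF16 storage, and the rounded 8e9 base parameter count are assumptions for illustration; an adapter that also targets the MLP projections will be larger.

```python
# adapter_size.py -- back-of-envelope: LoRA adapter bytes vs. base bytes (assumed targets: q, k, v, o).
HIDDEN, LAYERS = 4096, 32            # Llama-3.1-8B config
KV_DIM = 8 * 128                     # 8 KV heads x head_dim 128 (grouped-query attention)
BYTES_BF16 = 2

# (in_features, out_features) of each targeted projection in one decoder layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_params(rank: int) -> int:
    """A is (rank x in) and B is (out x rank): rank * (in + out) params per wrapped layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * LAYERS

def adapter_mb(rank: int) -> float:
    return lora_params(rank) * BYTES_BF16 / 1e6

BASE_PARAMS = 8.0e9                  # rounded assumption; the real count is slightly higher
base_gb = BASE_PARAMS * BYTES_BF16 / 1e9

if __name__ == "__main__":
    for r in (8, 16, 64):
        print(f"rank {r:>2}: {lora_params(r):>11,} params, {adapter_mb(r):6.1f} MB")
    print(f"base: ~{base_gb:.0f} GB; 100 rank-16 adapters: ~{100 * adapter_mb(16) / 1e3:.1f} GB")
```

Under these assumptions a rank-16 adapter is about 13.6M parameters (roughly 27 MB in BF16) against a base of about 16 GB, so even a hundred such adapters add only a few GB of host storage.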

When to use it (and when not)

  • Use multi-LoRA serving when you have many LoRA adapters over one base and want to serve them cost-efficiently (multi-tenant SaaS, per-task or per-customer finetunes).
  • Prefer merging the adapter into the base when you serve one finetune at high volume: merge_and_unload() folds the adapter into the weights so there is zero per-request LoRA overhead (SFT/LoRA, model merging).
  • All adapters must share the base and fit under the configured max_lora_rank; you cannot mix adapters trained on different bases in one server.
  • Full finetunes cannot be served this way: they are whole models, not deltas; use separate deployments or model merging.

Architecture

flowchart LR
  R1["Req → adapter A"] --> SCHED["Scheduler: batch across adapters"]
  R2["Req → adapter B"] --> SCHED
  R3["Req → base (no adapter)"] --> SCHED
  SCHED --> BASE["Base forward (shared weights)"]
  BASE --> SGMV["Per-adapter LoRA (SGMV / gather kernel)"]
  POOL["Adapter pool (CPU RAM)"] -.->|"page in on demand"| SGMV
  SGMV --> OUT["Batched responses"]

How to use it

vLLM serves adapters both offline and as an OpenAI-compatible server. Offline, enable LoRA and pass a LoRARequest per prompt:

# REFERENCE TEMPLATE (needs vllm + a GPU) - not run here. Verify flags on your vLLM version.
# multi_lora_offline.py: one base, many adapters, an adapter chosen per request.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enable_lora=True, max_loras=4, max_lora_rank=16)   # max_loras = adapters live per batch
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# LoRARequest(human_name, globally_unique_int_id, adapter_path_or_hf_id)
out_sql  = llm.generate("Translate to SQL: ...", sampling,
                        lora_request=LoRARequest("sql", 1, "/adapters/sql"))
out_supp = llm.generate("Answer the ticket: ...", sampling,
                        lora_request=LoRARequest("support", 2, "/adapters/support"))

As a server, register adapters with --lora-modules; each adapter then appears as a selectable model:

# REFERENCE TEMPLATE (needs a vLLM build with LoRA support) - not run here. Pin the vLLM version.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 --max-lora-rank 16 \
  --lora-modules sql=/adapters/sql support=/adapters/support
# Select the adapter by naming it in the request's `model` field (base is served too).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "sql", "prompt": "Translate to SQL: ...", "max_tokens": 256}'

The one guarantee both entry points rely on: naming an adapter per request (via LoRARequest offline, or the model field over HTTP) must apply exactly that adapter's low-rank delta, "base" must apply none, and a mixed stream served this way must be identical to serving each request one at a time. The numpy-only block below models that name-to-adapter dispatch and asserts the equivalence, plus an adversarial case where an unknown adapter name errors instead of silently falling back to the base:

# route_select.py -- validated: naming an adapter per request routes to the right delta; numpy only.
import numpy as np
rng = np.random.default_rng(7)
D_in, D_out, r = 8, 6, 2
W = rng.standard_normal((D_out, D_in)).astype(np.float32)                 # single shared base copy
def make(alpha):                                                          # (A: r x in, B: out x r, alpha)
    return (rng.standard_normal((r, D_in)).astype(np.float32),
            rng.standard_normal((D_out, r)).astype(np.float32), float(alpha))
registry = {"sql": make(16.0), "support": make(32.0)}                     # --lora-modules name=path table
def forward(x, name):                                                     # name is None -> base is served too
    if name is None:
        return W @ x
    A, B, alpha = registry[name]
    return W @ x + (alpha / r) * (B @ (A @ x))

reqs = [(rng.standard_normal(D_in).astype(np.float32), n)                 # a mixed request stream
        for n in ["sql", "support", None, "sql", None, "support"]]
served = np.stack([forward(x, n) for x, n in reqs])                       # dispatch by the request's model field

# 1) Selecting "sql" applies exactly the sql delta, never support's, and differs from base.
x0 = reqs[0][0]
Asql, Bsql, asql = registry["sql"]
assert np.allclose(forward(x0, "sql"), W @ x0 + (asql / r) * (Bsql @ (Asql @ x0)), atol=1e-6)
assert not np.allclose(forward(x0, "sql"), forward(x0, "support"), atol=1e-4)  # adapters distinguishable
assert np.allclose(forward(x0, None), W @ x0, atol=1e-6)                       # base == no adapter

# 2) Equivalence to a slow reference: one-at-a-time serving == table dispatch, row for row.
ref = np.empty_like(served)
for i, (x, n) in enumerate(reqs):
    if n is None:
        ref[i] = W @ x
    else:
        A, B, alpha = registry[n]
        ref[i] = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(served, ref, atol=1e-6)

# 3) Adversarial: a request naming an unregistered adapter must raise, not silently serve the base.
try:
    forward(x0, "does-not-exist")
    raised = False
except KeyError:
    raised = True
assert raised, "unknown adapter name must error, not serve the base silently"

print(f"OK: per-request select routes to the right delta; served==one-at-a-time "
      f"(max diff {np.abs(served-ref).max():.1e}); unknown-name raises")

Run output:

OK: per-request select routes to the right delta; served==one-at-a-time (max diff 0.0e+00); unknown-name raises

How the batching works (SGMV)

The throughput win comes from computing many adapters' LoRA math in one batched pass instead of looping per request. The kernel is SGMV (Segmented Gather Matrix-Vector): tokens are sorted so each adapter's tokens are contiguous, each adapter's B @ (A @ x) runs as one segment added to the shared base output, and results scatter back to token order. The transform is exact, so a mixed-adapter batch produces the same output as running each request separately. This runnable model asserts that equivalence (and that the alpha/r scale is load-bearing):

# sgmv_equiv.py -- validated: grouped (SGMV) mixed-adapter LoRA == per-token loop; numpy only.
import numpy as np
rng = np.random.default_rng(0)
D_in, D_out, r = 16, 12, 4
W = rng.standard_normal((D_out, D_in)).astype(np.float32)              # shared base weight

def make(alpha):
    return (rng.standard_normal((r, D_in)).astype(np.float32),        # lora_A (r x in)
            rng.standard_normal((D_out, r)).astype(np.float32),       # lora_B (out x r)
            float(alpha))
adapters = {0: make(32), 1: make(16)}                                 # effective scale = alpha / r

def lora_fwd(x, aid):
    A, B, alpha = adapters[aid]
    return W @ x + (alpha / r) * (B @ (A @ x))                        # W'x = Wx + (alpha/r)*B*A*x

def sgmv(batch):                                                      # grouped: one matmul per adapter group
    out = np.empty((len(batch), D_out), dtype=np.float32)
    for aid in {a for _, a in batch}:                                # segment by adapter id
        idx = [i for i, (_, a) in enumerate(batch) if a == aid]      # gather this adapter's tokens
        X = np.stack([batch[i][0] for i in idx]).T
        A, B, alpha = adapters[aid]
        Y = W @ X + (alpha / r) * (B @ (A @ X))                      # segmented gather matrix-vector
        for j, i in enumerate(idx):
            out[i] = Y[:, j]                                         # scatter back to token order
    return out

batch = [(rng.standard_normal(D_in).astype(np.float32), aid) for aid in [1, 1, 0, 1, 0]]
ref = np.stack([lora_fwd(x, aid) for x, aid in batch])               # per-token reference
grouped = sgmv(batch)
assert np.allclose(ref, grouped, atol=1e-5)                         # heterogeneous batch == per-request

# Boundary: a homogeneous batch (all tokens on one adapter) is still exact -- one full segment.
homo = [(x, 1) for x, _ in batch]
assert np.allclose(sgmv(homo), np.stack([lora_fwd(x, 1) for x, _ in homo]), atol=1e-5)

# Adversarial 1: the alpha/r scale is load-bearing -- dropping it must break equivalence.
def sgmv_noscale(batch):
    out = np.empty((len(batch), D_out), dtype=np.float32)
    for aid in {a for _, a in batch}:
        idx = [i for i, (_, a) in enumerate(batch) if a == aid]
        X = np.stack([batch[i][0] for i in idx]).T
        A, B, _ = adapters[aid]
        Y = W @ X + (B @ (A @ X))                                   # bug: missing alpha/r
        for j, i in enumerate(idx):
            out[i] = Y[:, j]
    return out
assert not np.allclose(sgmv_noscale(batch), ref, atol=1e-3)        # wrong scale is detected

# Adversarial 2: a mis-scattered (permuted) result is NOT accepted as equal -- ordering is load-bearing.
bad = grouped.copy()
bad[[0, 2]] = bad[[2, 0]]                                           # swap two rows across adapter groups
assert not np.allclose(bad, ref, atol=1e-3)                        # scatter-back order must be exact

print(f"max|grouped-ref|={np.abs(ref-grouped).max():.1e}; homogeneous-batch exact; "
      f"missing-scale and mis-scatter both detected")

Run output:

max|grouped-ref|=2.3e-05; homogeneous-batch exact; missing-scale and mis-scatter both detected

SGLang normalizes adapter module names to the base model's fused granularity first (per-projection q_proj/k_proj/v_proj adapters are concatenated into a qkv_proj adapter, matching the fused base layer from weight loading), then wraps each base layer with a LoRA-aware layer that adds the adapter output to the base output [3].
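
One way to see why concatenating per-projection adapters into a fused qkv_proj adapter is valid: stacking the three A matrices row-wise and placing the three B matrices on a block diagonal gives the same output as applying each adapter separately and concatenating. The sketch below checks that identity with toy sizes and the standard library only. It illustrates the algebra, not SGLang's actual buffer layout, which the source does not detail; the alpha/r scale is omitted because it can be folded into each B.

```python
# fused_qkv.py -- sketch: per-projection q/k/v LoRA == one LoRA on a fused qkv_proj; stdlib only.
import random

random.seed(0)
D_IN, R = 6, 2
OUT = {"q_proj": 6, "k_proj": 3, "v_proj": 3}        # toy widths; k/v narrower, as under GQA

def rand(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# Separate adapters: A is (R x D_IN), B is (out x R), one pair per projection.
A = {n: rand(R, D_IN) for n in OUT}
B = {n: rand(o, R) for n, o in OUT.items()}
x = [random.gauss(0, 1) for _ in range(D_IN)]

# Reference: apply each projection's delta on its own, then concatenate in q, k, v order.
separate = [y for n in OUT for y in matvec(B[n], matvec(A[n], x))]

# Fused: stack the A matrices row-wise -> (3R x D_IN); put each B on the block diagonal
# -> (sum(out) x 3R), so each output slice only sees its own adapter's R intermediates.
A_fused = [row for n in OUT for row in A[n]]
B_fused = []
for k, n in enumerate(OUT):
    for row in B[n]:
        B_fused.append([0.0] * (k * R) + row + [0.0] * ((len(OUT) - 1 - k) * R))

fused = matvec(B_fused, matvec(A_fused, x))
max_diff = max(abs(a - b) for a, b in zip(separate, fused))
assert max_diff < 1e-12                              # fused adapter == three separate adapters
```

The block-diagonal B is mostly zeros, which is why a real implementation would keep the three segments separate in memory rather than materialize the padded matrix.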

How to integrate with it

Three knobs govern capacity and correctness:

  • max_lora_rank must be at least the rank of the largest adapter, or that adapter will not load. Set it to your fleet's max rank (over-provisioning costs memory).
  • max_loras caps how many distinct adapters can be in a single batch; raise it for more concurrency, at the cost of per-batch memory and kernel overhead.
  • max_cpu_loras sizes the CPU-RAM adapter cache from which vLLM pages adapters to the GPU on demand (the same host-memory adapter store idea S-LoRA pairs with Unified Paging). Size it to your working set of hot adapters so requests rarely wait on a disk load.
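
The three knobs above can be checked before the server starts. The sketch below reads each adapter's rank from the `r` field that PEFT writes to adapter_config.json and reports any adapter that would not fit. The function names, the adapter directory layout, and the rule that the CPU cache should be at least as large as the per-batch slot count are assumptions of this sketch; confirm the exact constraints on your pinned vLLM version.

```python
# preflight_lora.py -- sketch: check fleet ranks against the configured knobs before launch.
import json
from pathlib import Path

def read_rank(adapter_dir: str) -> int:
    """PEFT stores the LoRA rank as `r` in adapter_config.json."""
    cfg = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    return int(cfg["r"])

def preflight(adapter_dirs: dict[str, str], max_lora_rank: int,
              max_loras: int, max_cpu_loras: int) -> list[str]:
    """Return human-readable problems; an empty list means the knobs cover this fleet."""
    problems = []
    ranks = {name: read_rank(path) for name, path in adapter_dirs.items()}
    for name, rank in sorted(ranks.items()):
        if rank > max_lora_rank:
            problems.append(f"{name}: rank {rank} > max_lora_rank {max_lora_rank}")
    # Assumption: a CPU cache smaller than the per-batch slot count cannot back a full batch.
    if max_cpu_loras < max_loras:
        problems.append(f"max_cpu_loras {max_cpu_loras} < max_loras {max_loras}")
    return problems
```

Running this in CI against the adapter store turns a load-time failure into a deploy-time one, and `max(ranks.values())` is the smallest max_lora_rank that admits the whole fleet.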

Adapters can be loaded dynamically over the API (behind a flag) so the fleet changes without a restart; route each request to its adapter by tenant/task upstream. To wire this into an application: register the initial adapter set with --lora-modules at start, add or remove adapters over the admin API as tenants come and go (see the dynamic-load cookbook entry), and put an adapter selector at the edge that maps each request to a model name (offline, a LoRARequest id). Keep the id-to-name mapping stable: LoRARequest ids are globally unique integers, and reusing an id for a different adapter routes traffic to the wrong delta.
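
A minimal sketch of the edge selector and the stable id mapping described above. `AdapterRegistry`, `select_model`, and the tenant names are hypothetical names introduced for illustration, not part of vLLM. The registry is in-memory only; a real deployment would persist it so ids survive restarts.

```python
# adapter_selector.py -- sketch: tenant -> served model name, and adapter name -> stable integer id.
class AdapterRegistry:
    """Hands out an integer id once per adapter name and never reassigns it."""
    def __init__(self):
        self._ids: dict[str, int] = {}
        self._next = 1

    def id_for(self, name: str) -> int:
        if name not in self._ids:
            self._ids[name] = self._next
            self._next += 1              # ids only grow, so a retired adapter's id is not reused
        return self._ids[name]

def select_model(tenant: str, routes: dict[str, str], served: set[str]) -> str:
    """Map a tenant to the `model` name to send; fail loudly rather than fall back to the base."""
    if tenant not in routes:
        raise LookupError(f"no adapter route for tenant {tenant!r}")
    name = routes[tenant]
    if name not in served:
        raise LookupError(f"adapter {name!r} is routed but not registered on the server")
    return name
```

Offline, the pair `(name, registry.id_for(name))` supplies the first two arguments of LoRARequest; over HTTP, the return value of `select_model` goes in the request's model field.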

How to run it in production

The production decision is per adapter, not per fleet: serve it multi-LoRA, or fold it into the base.

  • Keep it multi-LoRA while many adapters share the base and each sees moderate traffic: one deployment, hot-swap per request, and the storage cost of an adapter is megabytes on top of the one resident base.
  • Merge a single hot adapter into the base with merge_and_unload() (see the cookbook) when one finetune dominates traffic: the fused checkpoint serves as a normal model with zero per-request LoRA overhead. Merging into a 4-bit (QLoRA) base needs care: fold into the dequantised/BF16 base, not the quantised one (SFT/LoRA, model merging).

The correctness guarantee that makes this a free choice is that merging is mathematically equivalent to serving the adapter separately: a single matmul against W + (alpha/r)*B*A produces the same output as the two-matmul multi-LoRA path, to floating-point tolerance. Merge a hot adapter for efficiency without changing its behaviour, and gate any merged checkpoint behind the same eval you trained it against before promotion. The numpy-only block below checks that equivalence numerically, bounds the added rank, catches a merge of the wrong adapter, and locates the storage crossover where a per-adapter store stops being small relative to a full base copy:

# merge_equiv.py -- validated: merging a hot adapter == serving it multi-LoRA; numpy only.
import numpy as np
rng = np.random.default_rng(3)
D_in, D_out, r, n = 32, 24, 4, 10
W = rng.standard_normal((D_out, D_in))
A = rng.standard_normal((r, D_in))
B = rng.standard_normal((D_out, r))
X = rng.standard_normal((n, D_in))
alpha = 8.0

def delta(A, B, alpha, r):             return (alpha / r) * (B @ A)                  # the folded low-rank update
def serve_multilora(X, W, A, B, a, r): return X @ W.T + (a / r) * (X @ A.T) @ B.T    # base + per-request adapter
def serve_merged(X, W, A, B, a, r):    return X @ (W + delta(A, B, a, r)).T          # single fused matmul

y_split  = serve_multilora(X, W, A, B, alpha, r)
y_merged = serve_merged(X, W, A, B, alpha, r)
assert np.allclose(y_split, y_merged, atol=1e-10)                   # merged == multi-LoRA, to float64 tolerance

# The fold adds at most rank r of new directions to the base (it is a rank-r update).
assert np.linalg.matrix_rank(delta(A, B, alpha, r)) <= r

# Equivalence to a slow reference: an explicit per-row loop must match the vectorised merged serve.
W_eff = W + delta(A, B, alpha, r)
ref = np.stack([W_eff @ X[i] for i in range(n)])
assert np.allclose(y_merged, ref, atol=1e-10)                      # vectorised == per-row reference

# Adversarial: merging the WRONG adapter (another adapter's weights) must NOT match the intended output.
Bad = rng.standard_normal((D_out, r))                              # a different adapter's B
assert not np.allclose(serve_merged(X, W, A, Bad, alpha, r), y_merged, atol=1e-6)

# Boundary: the storage crossover. A separate adapter adds r*(D_in+D_out) params on top of one shared
# base; a merged copy is a whole D_out*D_in base. Keeping adapters separate is the win exactly while
# a per-adapter store is smaller than a full base, i.e. r < (D_in*D_out)/(D_in+D_out).
def per_adapter(r): return r * (D_in + D_out)
full_base = D_out * D_in
r_cross = full_base // (D_in + D_out)                               # crossover rank
assert per_adapter(r) < full_base                                  # small adapter cheaper than a 2nd base
assert per_adapter(r_cross) < full_base <= per_adapter(r_cross + 1) # store stops being small past r_cross

print(f"OK: merged==multi-LoRA (max diff {np.max(np.abs(y_split-y_merged)):.1e}); "
      f"rank(delta)={np.linalg.matrix_rank(delta(A,B,alpha,r))}<=r={r}; wrong-adapter detected; "
      f"per-adapter {per_adapter(r)} < base {full_base}; crossover r={r_cross}")

Run output:

OK: merged==multi-LoRA (max diff 2.8e-14); rank(delta)=4<=r=4; wrong-adapter detected; per-adapter 224 < base 768; crossover r=13

Operationally: pre-warm the hot working set at start so the first real request does not pay a cold host-to-GPU load, keep per-adapter observability (requests, latency, and cache hit rate by adapter) so a mis-routed or thrashing adapter is visible, and size max_cpu_loras to that working set. Serve the merged model, or the base-plus-adapters set, on the same serving stack you use for OSS models.
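
One simple way to pick the pre-warm set is to take the most-requested adapters from a recent traffic window, capped at the pool size. The sketch below does that and reports how much of the window the set would have covered. `hot_set`, `coverage`, and the shape of the input (a list of adapter names, one per request) are assumptions of this sketch; requests for the bare base should be filtered out before calling it.

```python
# prewarm_set.py -- sketch: choose which adapters to load at start from recent traffic.
from collections import Counter

def hot_set(recent_adapters: list[str], capacity: int) -> list[str]:
    """Most-requested adapters in a recent window, capped at the pool size."""
    if capacity < 1:
        raise ValueError("capacity must be >= 1")
    return [name for name, _ in Counter(recent_adapters).most_common(capacity)]

def coverage(recent_adapters: list[str], warmed: list[str]) -> float:
    """Fraction of the window's requests whose adapter would already have been resident."""
    if not recent_adapters:
        return 0.0
    warm = set(warmed)
    return sum(a in warm for a in recent_adapters) / len(recent_adapters)
```

If coverage stays low even at the largest pool you can afford, the working set does not fit one replica, which is the signal to shard adapters across replicas.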

How to maintain it

Maintenance is mostly about lineage and working-set hygiene, because an adapter is only valid against the exact base it was trained on and a mis-sized pool silently costs latency:

  • Version every adapter by (base checkpoint, r, alpha, target_modules, data hash). Swapping the base under an adapter changes behaviour with no error; enforce base lineage per server so a mismatched adapter is rejected, not served as garbage.
  • Re-evaluate after any rank or scaling change. Raising r (and lora_alpha) can fix under-fitting but can also over-fit small sets; serve with the adapter's trained lora_alpha/scaling, since a train/serve scaling mismatch changes outputs.
  • Manage the working set as churn happens. vLLM (and SGLang) keep adapters in a CPU pool and page hot ones to GPU slots under LRU eviction; when the hot set outgrows the pool, requests thrash on cold loads. Size max_cpu_loras to the working set, and shard adapters across replicas by routing when one replica's hot set no longer fits.
  • Keep the source adapter for any merged checkpoint. The adapter's A and B factors cannot be recovered from the merged weights alone, so the merge cannot be reproduced without it, and folding into a 4-bit base is a known silent-quality-loss trap (merge into BF16, see "How to run it in production" and model merging).
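
The versioning and lineage rules above can be sketched as a small fingerprint plus an admission check. `Lineage`, `admit`, and the 12-character version string are hypothetical choices for illustration; the fields mirror the tuple named in the first bullet.

```python
# adapter_lineage.py -- sketch: version an adapter by its lineage and reject a base mismatch.
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Lineage:
    base_checkpoint: str                 # e.g. a model id plus revision
    r: int
    alpha: float
    target_modules: tuple[str, ...]
    data_hash: str

    def version(self) -> str:
        """Stable short hash over every field; module order does not matter."""
        payload = {
            "base": self.base_checkpoint, "r": self.r, "alpha": self.alpha,
            "targets": sorted(self.target_modules), "data": self.data_hash,
        }
        blob = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

def admit(adapter: Lineage, server_base: str) -> str:
    """Refuse an adapter whose base differs from the one this server runs."""
    if adapter.base_checkpoint != server_base:
        raise ValueError(f"adapter built on {adapter.base_checkpoint!r}, "
                         f"server runs {server_base!r}")
    return adapter.version()
```

Because alpha and r are part of the hash, a change to either produces a new version, which gives the re-evaluation rule a concrete trigger.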

The pool's paging is an LRU cache: a request for a resident adapter is a hit; a miss pays a cold load and, once the pool is full, evicts the least-recently-used adapter. That is the arithmetic behind sizing max_cpu_loras. The numpy-only block below models it, checks it against a slow last-seen-scan reference, checks that misses are monotone in pool size on a fixed trace, and stresses the under-provisioned case where a round-robin over more adapters than the pool holds misses every request:

# lru_pool.py -- validated: LRU adapter paging (max_cpu_loras / S-LoRA Unified Paging); numpy only.
import numpy as np
from collections import OrderedDict

def run_lru(trace, capacity):
    """Serve `trace` (adapter ids) through an LRU pool of size `capacity`; return (hits, misses, evictions)."""
    assert capacity >= 1
    pool, hits, misses, evictions = OrderedDict(), 0, 0, 0
    for aid in trace:
        if aid in pool:
            pool.move_to_end(aid); hits += 1                       # resident: mark most-recently-used
        else:
            misses += 1                                           # cold load (disk->host for the CPU pool)
            if len(pool) >= capacity:
                pool.popitem(last=False); evictions += 1          # evict least-recently-used
            pool[aid] = True
    return hits, misses, evictions

def ref_lru(trace, capacity):
    """Slow reference: recency by explicit last-seen scan, no OrderedDict."""
    resident, hits, misses, ev, seen = [], 0, 0, 0, {}
    for t, aid in enumerate(trace):
        if aid in resident:
            hits += 1
        else:
            misses += 1
            if len(resident) >= capacity:
                lru = min(resident, key=lambda a: seen[a]); resident.remove(lru); ev += 1
            resident.append(aid)
        seen[aid] = t
    return hits, misses, ev

# Equivalence to the slow reference on a random tenant-routed trace.
rng = np.random.default_rng(11)
trace = [int(a) for a in rng.integers(0, 6, size=200)]            # 6 hot adapters, 200 requests
for cap in (1, 2, 3, 6, 10):
    assert run_lru(trace, cap) == ref_lru(trace, cap), f"LRU mismatch at capacity {cap}"

# Monotonic: a bigger pool never causes more misses (under a fixed trace).
miss_by_cap = [run_lru(trace, c)[1] for c in range(1, 8)]
assert all(b <= a for a, b in zip(miss_by_cap, miss_by_cap[1:])), "more capacity must not add misses"

# Capacity >= distinct adapters -> every adapter loads once, then all hits (working set fits).
distinct = len(set(trace))
h, m, e = run_lru(trace, distinct)
assert m == distinct and e == 0 and h == len(trace) - distinct    # cold-load each once, no eviction
assert h + m == len(trace)                                        # every request accounted for

# Adversarial thrash: pool of 2 but round-robin over 3 adapters -> 100% miss, evict every step once full.
thrash = [0, 1, 2] * 20
ht, mt, et = run_lru(thrash, 2)
assert ht == 0 and mt == len(thrash) and et == len(thrash) - 2    # every request misses; steady-state evict

print(f"OK: LRU==reference for caps {{1,2,3,6,10}}; misses monotone in capacity {miss_by_cap}; "
      f"fit-working-set -> {m} loads/0 evict; thrash -> {mt}/{mt} miss")

Run output:

OK: LRU==reference for caps {1,2,3,6,10}; misses monotone in capacity [172, 134, 107, 73, 40, 6, 6]; fit-working-set -> 6 loads/0 evict; thrash -> 60/60 miss

How to scale it

The throughput win is heterogeneous batching: many requests for different adapters share one base forward pass, so utilization stays high even with a long tail of rarely-used adapters (S-LoRA, Punica) [1][2]. Beyond one replica, this becomes a routing problem: send requests for the same adapter to the same replica so its adapter stays hot in GPU memory, an adapter-aware variant of LLM request routing. Adapter memory lives in a CPU pool and pages to GPU (max_cpu_loras); the base and KV cache dominate GPU memory as in any serving deployment (KV-cache management, continuous batching).

Two SGLang refinements matter at scale. A chunked SGMV backend splits long per-adapter segments (when most tokens in a batch share one adapter) so a single hot adapter cannot monopolize a kernel and create long-tail latency. Overlap loading (--enable-lora-overlap-loading) prefetches not-yet-resident adapters on a dedicated CUDA stream during the window where the previous batch is still running on the GPU, so a cold adapter's host-to-GPU copy is hidden rather than paid on the critical path [3].
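
Adapter-affinity routing can be sketched with rendezvous (highest-random-weight) hashing: each adapter goes to the replica with the highest hash score, so the same adapter always lands on the same replica, and removing a replica moves only the adapters it was serving. This is one possible scheme, swapped in for illustration; the source does not prescribe a routing algorithm, and `pick_replica` and the replica names are hypothetical.

```python
# adapter_affinity.py -- sketch: route each adapter to a stable replica via rendezvous hashing.
import hashlib

def _score(adapter: str, replica: str) -> int:
    digest = hashlib.sha256(f"{adapter}|{replica}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pick_replica(adapter: str, replicas: list[str]) -> str:
    """Return the replica with the highest score for this adapter."""
    if not replicas:
        raise ValueError("no replicas available")
    return max(replicas, key=lambda rep: _score(adapter, rep))

if __name__ == "__main__":
    replicas = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]
    for name in ("sql", "support", "legal"):
        print(name, "->", pick_replica(name, replicas))
```

Requests for the bare base carry no adapter and can go to any replica. Pure hashing ignores load, so a very hot adapter may still need to be pinned to several replicas or merged into its own deployment.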

Cookbook (common use cases)

1. Serve N per-tenant adapters: one vllm serve --enable-lora deployment with --lora-modules for each tenant; route by model name (above).

2. Merge instead, for a single hot adapter

# REFERENCE TEMPLATE (needs peft, transformers, torch) - not run here. Pin the versions.
# One high-QPS finetune: fold the adapter into the base so there is no per-request LoRA cost.
# The equivalence and rank bound this relies on are validated numerically in "How to run it in production".
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

merged = AutoPeftModelForCausalLM.from_pretrained("/adapters/sql").merge_and_unload()
merged.save_pretrained("./merged-sql")     # then serve ./merged-sql as a normal model
# Save the base tokenizer alongside the weights so the directory is servable on its own.
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./merged-sql")

3. Dynamic adapter load: enable runtime adapter loading and register/unregister adapters over the admin API as the tenant set changes, without restarting the base.
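
A sketch of the dynamic-load calls, under the assumption that your vLLM version matches the documented behaviour at the time of writing: runtime updates are enabled by setting the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING=True on the server, which exposes POST /v1/load_lora_adapter and POST /v1/unload_lora_adapter. Verify both against your pinned version. The helper names are hypothetical, and the block only builds the requests; nothing is sent until you pass one to `urllib.request.urlopen`.

```python
# dynamic_lora.py -- sketch: build runtime load/unload requests for a vLLM server; sends nothing.
import json
from urllib.request import Request

def load_request(base_url: str, name: str, path: str) -> Request:
    """Register adapter `name` from `path` (a path as seen by the server process)."""
    body = json.dumps({"lora_name": name, "lora_path": path}).encode()
    return Request(f"{base_url}/v1/load_lora_adapter", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

def unload_request(base_url: str, name: str) -> Request:
    """Unregister adapter `name`; later requests naming it should then fail."""
    body = json.dumps({"lora_name": name}).encode()
    return Request(f"{base_url}/v1/unload_lora_adapter", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

if __name__ == "__main__":
    req = load_request("http://localhost:8000", "legal", "/adapters/legal")
    print(req.get_method(), req.full_url, req.data.decode())
```

These endpoints change what the server will serve, so keep them reachable only from an admin network rather than exposing them alongside the public completion routes.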

Failure modes

  • max_lora_rank too small. An adapter with rank above the configured max fails to load; set max_lora_rank to the fleet maximum.
  • Adapter/base mismatch. An adapter trained on a different base produces garbage; enforce base lineage per server.
  • Too many hot adapters. A working set larger than max_cpu_loras/GPU capacity thrashes paging and adds latency; shard adapters across replicas by routing.
  • Serving one adapter at high QPS unmerged. Per-request LoRA overhead is wasted when a single finetune dominates traffic; merge it into the base instead (model merging).
  • Cold-start latency. The first request for a cold adapter waits on a disk/host load; pre-warm hot adapters.
  • Rank/scaling mismatch. Wrong lora_alpha/scaling between train and serve changes behaviour; serve with the adapter's trained config.

References

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters: https://arxiv.org/abs/2311.03285
  • Punica: Multi-Tenant LoRA Serving (SGMV kernel): https://arxiv.org/abs/2310.18547
  • vLLM, LoRA adapters (enable_lora, LoRARequest, --lora-modules): https://docs.vllm.ai/en/latest/features/lora.html
  • PEFT (adapter train + merge_and_unload): https://huggingface.co/docs/peft
  • Changyi, "How Does SGLang Support LoRA?" (SGMV chunked backend, memory pool, overlap loading): https://changyi.fun/posts/sglang_lora_en/

Related: SFT/LoRA · Model weight loading · Model merging · Inference serving · Continuous batching · KV-cache management · LLM request routing · Serving open-weight models · Fine-tuning and post-training · Glossary


  1. Sheng et al., S-LoRA, which serves thousands of LoRA adapters on one GPU via Unified Paging (adapters + KV cache in one pool) and custom heterogeneous-batching CUDA kernels; up to ~4× throughput over naive serving. https://arxiv.org/abs/2311.03285 

  2. Chen et al., Punica, a multi-tenant LoRA serving system whose SGMV kernel batches computation across different adapters against a single base copy; ~12× throughput at ~2ms/token added latency. https://arxiv.org/abs/2310.18547 

  3. Changyi, "How Does SGLang Support LoRA?": SGLang separates base and adapter compute and sums them, batches mixed adapters with SGMV, and keeps adapters in a CPU pool paged to GPU slots with LRU eviction; the default chunked SGMV backend splits long per-adapter segments to avoid long-tail latency, and --enable-lora-overlap-loading prefetches adapters on a dedicated CUDA stream during the CPU-schedule / GPU-execute overlap window. MoE LoRA needs a wrapper per base-layer type and is still landing (SGLang PR #14105). https://changyi.fun/posts/sglang_lora_en/