Software performance engineering for FMware¶
Scope: treating performance as a first-class, continuous discipline for foundation-model-powered software (FMware), not a post-deployment afterthought. This page covers what software performance engineering (SPE) means for FMware, why performance goals (throughput, latency) are SLOs that decide user satisfaction and cost, the four challenge areas a team must engineer (cognitive architecture, communication protocols, tuning and optimization, deployment), and how to keep performance from degrading over time. It is the discipline layer over the concrete levers already in this KB: inference serving, goodput, SLOs for inference, agentic systems, and SRE/MLOps practices.
This distills a challenges-and-vision paper (Zhang et al., 2411.09580); the practices it points to are cross-linked to the KB pages that detail them. The Python example is executed and asserted (numpy); it validates the capacity relation, not any specific serving stack.
```mermaid
flowchart LR
    U["User request"] --> CA["Cognitive architecture: agents, steps, semantic cache"]
    CA --> COMM["Communication: structured output + response parsing"]
    COMM --> TUNE["Tuning: batching, quantization, speculative decode, model routing"]
    TUNE --> DEP["Deployment: autoscaling, hardware use, memory swap"]
    DEP --> SLO["Meet throughput + latency SLOs"]
    SLO -.->|"continuous SPE: monitor and prevent degradation"| CA
```
What it is¶
Software performance engineering for FMware is the practice of ensuring an FM-powered application meets its performance goals, principally throughput and latency, which are SLAs/SLOs whose breach means unhappy users or unaffordable cost. FMware is often a compound system, not a single model call: it spans a taxonomy from Promptware (software built mostly from natural-language prompts that call FMs directly) to Agentware (autonomous agents that use tools, hold memory, and communicate), so its performance is the performance of a whole graph of AI and classical components, not one inference. The paper frames the discipline around four challenge areas [1]:
- Cognitive architecture design. The structural design of how AI components interact, reason, and interface with classical software. It is where a semantic cache (vector-similarity reuse of prior prompts and completions) can eliminate redundant inference calls, and where structured formats (JSON, tables) simplify the graph and cut error handling.
- Communication protocols. How components exchange requests and results, including parsing FM outputs reliably and cheaply (for example a CPU-side robust parser catching malformed output at a 0.01% error rate rather than re-invoking the model).
- Tuning and optimization. The inference-level and application-level levers: batching, quantization, speculative decoding, model routing (send easy queries to cheaper models), and caching.
- Deployment. Getting hardware utilization right: autoscaling to demand, avoiding the idle-GPU cost of over-provisioning, and memory techniques such as GPU memory swap.
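The semantic cache named under cognitive architecture can be sketched in a few lines. This is a minimal illustration, not the paper's design: the `SemanticCache` class, the 0.95 threshold, and the three-dimensional vectors are all hypothetical, and a real system would obtain embeddings from an embedding model and use a vector index rather than a linear scan.

```python
import numpy as np


class SemanticCache:
    """Toy semantic cache: reuse a stored completion when a new prompt's
    embedding is close enough (cosine similarity) to a cached prompt's."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.keys = []              # unit-normalized prompt embeddings
        self.values = []            # cached completions

    @staticmethod
    def _unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v)  # assumes a non-zero embedding

    def get(self, embedding):
        if not self.keys:
            return None
        sims = np.stack(self.keys) @ self._unit(embedding)  # cosine similarities
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, embedding, completion):
        self.keys.append(self._unit(embedding))
        self.values.append(completion)


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05, 0.0])   # near-duplicate prompt: reuse, no FM call
miss = cache.get([0.0, 1.0, 0.0])    # unrelated prompt: fall through to the model
```

The threshold is the design knob: set it too low and the cache returns answers to questions that were not asked; set it too high and it rarely hits.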
Why use it¶
- Performance-as-afterthought is expensive. When throughput and latency are considered only after a prototype ships, the fix is costly post-deployment re-engineering; SPE moves the concern earlier, where it is cheap to address [1].
- FMware is compute-hungry. FM inference dominates cost and latency, so efficient hardware use is not optional; the gap in served cost between a tuned and an untuned cognitive architecture can reach an order of magnitude.
- It degrades without attention. Prompts, models, traffic mix, and component graphs all drift, so performance must be engineered continuously (a regression gate and monitoring) or it decays back below the SLO [1].
- It is a goals problem, not just a kernel problem. The KB's GPU and kernel pages make each call fast; SPE decides which calls happen, how they batch, and whether the end-to-end request meets its latency budget.
When to use it (and when not)¶
- Use it as soon as an FMware prototype is on the path to production: set throughput/latency SLOs, and engineer the four challenge areas against them before, not after, launch.
- Use it continuously for anything in production: wire performance into CI and monitoring so drift is caught early (perf-regression CI, observability).
- Skip the heavy machinery for a genuine throwaway prototype with no users and no cost ceiling; premature optimization of a demo wastes effort.
- Do not reduce it to GPU tuning. Faster kernels do not fix an architecture that makes ten sequential FM calls where two cached ones would do; the biggest wins are usually architectural.
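The "ten sequential calls versus two" point can be made concrete with a small latency model. This is a sketch under stated assumptions: each step has a fixed latency, steps with no dependency between them run in parallel, and the step names and the 0.5 s and 0.25 s figures are hypothetical.

```python
def critical_path_latency(steps):
    """steps: {name: (latency_s, [dependency names])}.
    Returns end-to-end latency, assuming a step starts once all of its
    dependencies have finished and independent steps run in parallel."""
    finish = {}

    def done(name):
        if name not in finish:
            latency, deps = steps[name]
            finish[name] = latency + max((done(d) for d in deps), default=0.0)
        return finish[name]

    return max(done(name) for name in steps)


# Ten FM calls chained one after another, 0.5 s each (hypothetical numbers).
ten_calls = {f"call{i}": (0.5, [f"call{i-1}"] if i else []) for i in range(10)}
# Same graph after a 2x per-call speedup from kernel or serving tuning.
ten_calls_fast = {name: (lat / 2, deps) for name, (lat, deps) in ten_calls.items()}
# Re-architected graph: only two chained calls remain, at the untuned 0.5 s each.
two_calls = {"call0": (0.5, []), "call1": (0.5, ["call0"])}

print(critical_path_latency(ten_calls))       # 5.0
print(critical_path_latency(ten_calls_fast))  # 2.5
print(critical_path_latency(two_calls))       # 1.0
```

Under these assumptions, halving per-call latency leaves the ten-call graph slower than the untuned two-call graph, which is the sense in which the architectural change dominates.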
Architecture¶
The four challenge areas map onto the request path in order: a request enters the cognitive architecture (which decides how many FM/agent steps run and what the semantic cache can short-circuit), moves through communication (structured output and parsing between components), is served under tuning levers (batching, quantization, speculative decoding, routing), on infrastructure governed by deployment (autoscaling and hardware utilization), all measured against throughput and latency SLOs. The loop closes with continuous monitoring that feeds regressions back into the architecture. The load-bearing relationship across the whole path is capacity: throughput, latency, and concurrency are not independent knobs.
How to use it¶
Start from the capacity relation that ties the SLOs together. Little's Law states that the average number of requests in the system equals throughput times average time in system (L = lambda * W); it holds for any stable system regardless of the arrival or service distribution, and it is what tells you whether a latency SLO and a throughput target are jointly achievable on a given concurrency budget. This runnable simulation checks it, measuring L independently of W, and shows the overload case the paper warns about, where demand exceeds capacity and latency diverges:
```python
# littles_law.py: validated. Little's Law L = lambda * W, with L measured INDEPENDENTLY by
# time-sampling the number in system (not derived from W); overload makes latency diverge. numpy only.
import numpy as np


def simulate(lam, mu, n=200_000, seed=0):  # single FIFO server: arrival rate lam, service rate mu
    rng = np.random.default_rng(seed)
    arr = np.cumsum(rng.exponential(1 / lam, n))  # Poisson arrivals
    svc = rng.exponential(1 / mu, n)  # service times
    dep = np.empty(n)
    free = 0.0
    for i in range(n):
        start = max(arr[i], free)
        dep[i] = start + svc[i]
        free = dep[i]
    W = dep - arr  # time in system per request
    t = np.linspace(arr[1000], dep[-1000], 4000)  # sample interior times (skip warm-up/cool-down)
    L = float(np.mean(np.searchsorted(arr, t, side="right")  # count N(t) = arrived - departed,
                      - np.searchsorted(dep, t, side="right")))  # measured independently of W
    return L, n / dep[-1], W.mean()  # L (time-sampled), throughput lambda, mean W


L, lam, W = simulate(0.8, 1.0)  # utilization 0.8 (stable)
assert abs(L - lam * W) / (lam * W) < 0.02  # Little's Law: independent count ~= lambda * W
assert abs(L - 0.8 / (1 - 0.8)) < 0.2  # both match M/M/1 theory rho/(1-rho) = 4.0
_, _, w_ok = simulate(0.5, 1.0)  # under capacity
_, _, w_over = simulate(1.5, 1.0)  # demand > capacity (overload)
assert w_over > 10 * w_ok  # latency diverges once demand exceeds service rate
print(f"L={L:.2f} lambda*W={lam * W:.2f} mean-wait stable={w_ok:.2f} overload={w_over:.0f}")
```
The practical reading: pick the SLO (a mean latency W and a throughput lambda), and Little's Law fixes the average concurrency L you must provision for. The law relates averages only, so a p99 latency target also depends on the latency distribution and has to be checked by measurement. If the offered load pushes utilization toward one, latency runs away, so admission control and autoscaling exist to keep the system on the stable side (inference QoS and admission control, SLOs for inference).
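As a worked instance of that sizing step, the sketch below turns a throughput target and a mean latency into a concurrency budget and a replica count, and uses the M/M/1 mean-latency formula W = 1 / (mu - lambda) to show how latency grows as utilization approaches one. The helper names, the per-replica slot count, and the 0.75 utilization ceiling are illustrative assumptions, not values from the paper; note that Little's Law takes the mean latency as its input.

```python
import math


def required_concurrency(throughput_rps, mean_latency_s):
    """Little's Law: average requests in flight L = lambda * W."""
    return throughput_rps * mean_latency_s


def replicas_needed(throughput_rps, mean_latency_s, slots_per_replica, max_utilization=0.75):
    """Replicas needed to hold the in-flight load while keeping headroom.
    slots_per_replica (concurrent requests one replica can serve) and
    max_utilization are hypothetical planning inputs."""
    in_flight = required_concurrency(throughput_rps, mean_latency_s)
    return math.ceil(in_flight / (slots_per_replica * max_utilization))


def mm1_mean_latency(lam, mu):
    """Mean time in system for an M/M/1 queue; only defined when lam < mu."""
    if lam >= mu:
        return math.inf  # unstable: the queue grows without bound
    return 1.0 / (mu - lam)


in_flight = required_concurrency(50, 2.0)              # 50 req/s at 2 s mean latency
replicas = replicas_needed(50, 2.0, slots_per_replica=16)
for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {rho:.2f}: mean latency {mm1_mean_latency(rho, 1.0):.1f} service times")
```

With these inputs the budget is 100 requests in flight, and each replica is planned at 12 of its 16 slots, so 9 replicas are needed. The loop shows why the headroom matters: mean latency doubles between utilization 0.8 and 0.9 and grows tenfold again by 0.99.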
How to develop with it¶
Engineer the four challenge areas, each grounded in a concrete KB lever:
- Cognitive architecture. Minimize and cache FM calls before optimizing them. A semantic cache reuses prior completions for similar prompts by vector similarity, cutting redundant inference and latency; prefix/KV reuse does the same at the token level (KV cache management). Route by difficulty so easy queries hit cheaper models (LLM request routing), and prefer structured outputs to shrink the graph.
- Communication protocols. Make components exchange machine-parseable, verifiable output (JSON/schema), and parse it cheaply on the CPU rather than re-invoking the model on malformed responses (constrained decoding). Structured formats also improve FM accuracy and reduce error-handling code.
- Tuning and optimization. Apply the serving levers against the SLO: continuous batching (including application-level batching by workflow dependency for multi-query latency wins), quantization, speculative decoding, and disaggregated prefill/decode. These change the per-call cost that the architecture then multiplies.
- Deployment. Right-size hardware: autoscale to demand and attack the idle-GPU cost of low utilization (capacity planning, dynamic and fractional GPU sharing), and use memory techniques (GPU memory swap) to fit more work per device.
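For the communication-protocols lever, a CPU-side tolerant parser might look like the sketch below. It is one possible approach, not the paper's parser: `parse_fm_json` and its repair rules (take the outermost braces, drop trailing commas) are assumptions chosen for illustration, and the example response text is made up. It uses only the standard-library `json` and `re` modules.

```python
import json
import re


def parse_fm_json(text, required_keys=()):
    """Try to recover a JSON object from an FM response without re-invoking the model.
    Returns the parsed dict, or None if the caller should fall back (retry, default, error)."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]  # drops chatty text around the object
    # Second attempt removes trailing commas before } or ]; this is a heuristic
    # and could alter a string value that happens to contain ",}" or ",]".
    repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
    for attempt in (candidate, repaired):
        try:
            obj = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and all(key in obj for key in required_keys):
            return obj
    return None


response = 'Sure! Here is the result:\n{"label": "refund", "confidence": 0.9,}\nLet me know.'
parsed = parse_fm_json(response, required_keys=("label", "confidence"))
print(parsed)  # {'label': 'refund', 'confidence': 0.9}
```

Returning `None` rather than raising keeps the expensive path (a second FM call) an explicit decision by the caller, which makes the fallback rate easy to count and monitor.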
How to maintain it¶
Performance decays, so SPE is a standing loop, not a launch checklist. Set explicit throughput and latency SLOs and track them continuously with an error budget (SLO/SLI catalog, inference serving SLOs); measure goodput (useful throughput under the latency bound), not raw tokens per second, because a system can raise throughput while missing its latency SLO (goodput). Gate changes on a performance-regression test in CI so a new prompt, model, or component graph cannot silently blow the budget (perf-regression CI), and monitor the whole compound path, not just the model call, since an agentic graph's latency is the sum of its sequential steps, or its longest dependency chain when steps run in parallel (observability, agent observability). Because prompts and traffic mixes drift, re-profile periodically and feed regressions back into the cognitive architecture.
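The goodput measurement and the CI gate described above can be sketched as two small functions. The function names, the 2.0 s SLO, the 10% tolerance, and the sample latencies are all hypothetical; a real gate would read latencies from a load test rather than from literals.

```python
import numpy as np


def goodput(latencies_s, window_s, slo_s):
    """Requests per second that completed within the latency SLO."""
    latencies = np.asarray(latencies_s, dtype=float)
    return float(np.sum(latencies <= slo_s)) / window_s


def regression_gate(baseline_s, candidate_s, slo_s, tolerance=0.10):
    """Pass only if the candidate's p99 meets the SLO and is no more than
    `tolerance` worse than the baseline's p99."""
    base_p99 = np.percentile(baseline_s, 99)
    cand_p99 = np.percentile(candidate_s, 99)
    return bool(cand_p99 <= slo_s and cand_p99 <= base_p99 * (1 + tolerance))


SLO_S, WINDOW_S = 2.0, 10.0
config_a = [1.0] * 100               # 10 req/s, every request inside the SLO
config_b = [1.5] * 60 + [3.0] * 90   # 15 req/s, but 90 requests miss the SLO

print(len(config_a) / WINDOW_S, goodput(config_a, WINDOW_S, SLO_S))  # 10.0 10.0
print(len(config_b) / WINDOW_S, goodput(config_b, WINDOW_S, SLO_S))  # 15.0 6.0

print(regression_gate(config_a, [1.05] * 100, SLO_S))               # True: small drift
print(regression_gate(config_a, [1.0] * 95 + [3.0] * 5, SLO_S))     # False: tail breaks the SLO
```

Configuration B has the higher raw throughput and the lower goodput, which is the case the paragraph warns about.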
How to run it in production¶
Deployment is where the capacity math meets real hardware and money. Autoscale on demand so utilization stays high without breaching latency, and treat low hardware utilization as a direct cost leak rather than a comfort margin. Keep the system on the stable side of Little's Law with admission control and load shedding under bursts (QoS and admission control), scale out decode with disaggregated inference, and use GPU memory swap and fractional GPU sharing to pack more work per device. Run the incident lifecycle (detect, localize, mitigate, confirm) for latency and throughput breaches through the operations layer (agentic AIOps, inference SLO-breach runbook). The whole point is that these deployment controls are engineered against the SLOs from the start, so production is where a continuous SPE program pays off instead of where firefighting begins.
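To illustrate how admission control keeps an overloaded system on the stable side, the sketch below reruns a single-server FIFO queue at 1.5x capacity but rejects any request that arrives while a fixed number are already in the system. It is a toy model, not a serving-stack implementation: `simulate_with_shedding` and the cap of 5 are assumptions for illustration, and real admission control would also weigh priorities and per-tenant quotas.

```python
from collections import deque

import numpy as np


def simulate_with_shedding(lam, mu, max_in_system, n=50_000, seed=0):
    """Single FIFO server with a cap on requests in the system.
    Arrivals that find the system full are shed (rejected immediately).
    Returns (mean time in system for admitted requests, fraction shed)."""
    rng = np.random.default_rng(seed)
    arrivals = np.cumsum(rng.exponential(1 / lam, n))  # Poisson arrivals
    services = rng.exponential(1 / mu, n)              # service times
    in_system = deque()   # departure times of admitted requests not yet finished
    free = 0.0            # time at which the server next becomes idle
    times, shed = [], 0
    for arrive, service in zip(arrivals, services):
        while in_system and in_system[0] <= arrive:    # retire finished requests
            in_system.popleft()
        if len(in_system) >= max_in_system:            # full: shed instead of queueing
            shed += 1
            continue
        start = max(arrive, free)
        free = start + service
        in_system.append(free)
        times.append(free - arrive)
    return float(np.mean(times)), shed / n


mean_w, shed_frac = simulate_with_shedding(1.5, 1.0, max_in_system=5)
print(f"overload with shedding: mean time in system={mean_w:.2f}, shed fraction={shed_frac:.2f}")
```

With the cap in place, an admitted request waits behind at most four others, so its mean time in system stays bounded by a few service times instead of growing with the length of the run; the price is that a share of the offered load is rejected and must be retried, queued upstream, or absorbed by autoscaling.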
Failure modes¶
- Performance as an afterthought. Deferring throughput/latency until after launch turns a config change into a re-architecture; set SLOs and engineer against them up front.
- Optimizing the call, not the graph. Faster kernels do not fix an architecture that makes many avoidable FM calls; cut and cache calls first.
- Throughput without the latency bound. Reporting tokens per second while missing p99 latency is a vanity metric; measure goodput under the SLO.
- Running near utilization one. Pushing offered load to capacity makes latency diverge (the overload case above); leave headroom and shed load.
- Idle-GPU over-provisioning. Fixed, over-sized fleets waste money at low utilization; autoscale and share devices.
- No continuous gate. Without a perf-regression test and monitoring, prompt/model/traffic drift silently erodes the SLO until users notice.
References¶
- Zhang, Chang, Leung, Thangarajah, Chen, Lutfiyya, Hassan, Software Performance Engineering for Foundation Model-Powered Software (arXiv 2411.09580): https://arxiv.org/abs/2411.09580
- Shekhar et al., Towards Optimizing the Costs of LLM Usage (model routing by difficulty, arXiv 2402.01742): https://arxiv.org/abs/2402.01742
- Run:ai, GPU Memory Swap (deployment memory technique): https://www.run.ai/blog/gpu-memory-swap
- Little's Law (queueing relation L = lambda * W): https://en.wikipedia.org/wiki/Little%27s_law
Related: Inference serving · Goodput · SLOs for inference · Inference QoS and admission control · LLM request routing · Continuous batching · Disaggregated inference · Agentic systems · Agent harness architecture · SRE/MLOps practices · Glossary
[1] Zhang et al., Software Performance Engineering for Foundation Model-Powered Software (arXiv 2411.09580): FMware must meet throughput/latency goals (SLAs/SLOs) or incur user dissatisfaction and cost; performance is often an afterthought, forcing costly post-deployment optimization, and continuous performance engineering is needed to prevent degradation. The four SPE challenges are cognitive architecture design (how AI components interact, reason, and interface with classical software; semantic caching; structured formats), communication protocols (efficient, reliable response parsing), tuning and optimization (batching, quantization, speculative decoding, model routing, caching), and deployment (autoscaling, hardware utilization, GPU memory swap). FMware spans Promptware to Agentware.