Orchestration and the control plane

Scope: how to supervise one or many agents so that policy, budget, identity, retries, and state are decided above the loop instead of inside the model. The organising idea is a split between a data plane (model inference and tool execution) and a control plane (the gates that run before any action). The model may propose; the control plane must decide. This page builds on the harness, enforces the security policy, and is observed through agent observability.

Code and configs here are reference templates; pin versions and validate before relying on them.

flowchart LR
  REQ["User request"] --> ORCH["Orchestrator (loop)"]
  ORCH -->|"proposed action"| DECIDE["decide(action)"]
  subgraph CP["Control plane"]
    AUTH["1. authorize (Cedar)"] --> MUT["2. mutate (rewrite)"]
    MUT --> BUD["3. budget"]
    BUD --> RET["4. retries"]
    RET --> ALLOW["5. allow (bind identity, spend, audit)"]
  end
  DECIDE --> CP
  ALLOW -->|"verdict ALLOW / REWRITE"| TOOL["Data plane: tool / model call"]
  AUTH -->|"verdict DENY (replan signal)"| ORCH
  BUD -->|"verdict DENY (over budget)"| ORCH
  RET -->|"verdict DENY (retries exhausted)"| ORCH
  TOOL -->|"observation"| ORCH

Overview

A naive agent loop hands the model decisions it cannot see the facts for: whether an action is permitted, whether it fits the budget, whose authority it runs under, and whether to retry. Put those decisions in the data plane and nothing can say no before an action runs. The fix is to model orchestration as a control plane: a fixed sequence of gates that evaluates every proposed tool call or sub-agent spawn before it executes, and returns either an approved (possibly rewritten) action or a structured denial the model can replan against.¹

This is the same separation that distributed systems already use. Kubernetes does not let a pod schedule itself; an admission controller and a scheduler decide. The agent analogue puts the model in the data plane and the gates in the control plane.¹

Core knowledge

Data plane vs control plane

Data plane (the model proposes and acts): model inference, tool execution, retrieval, code generation, API calls.
Control plane (the layer decides): policy and authority, identity and tenancy, routing and scheduling, budget and quotas, durable state, evaluation, retries, observability and audit.¹

The failure signature when the split is missing is concrete: an agent retries forever on a refusal it reads as a network error, runs up an unbounded bill, or acts under a generic service account that erases who asked.

The decide() gate chain

Every proposed action runs through one function before it executes. The gates run in a fixed order, and any gate can stop the chain:¹

1. Authorize. A policy engine answers "may this run?" under a deny-by-default policy. Cedar is a good fit because its policies are small, named, and analysable.
2. Mutate. Rewrite a permitted-but-unsafe action into a safe one (for example, inject a LIMIT into an unbounded SQL query) rather than denying it outright.
3. Budget. Check the action's estimated cost against the remaining budget; deny if it would overspend.
4. Retries. Apply the retry and backoff policy; distinguish a transient failure from a policy denial so the model does not loop on a hard no.
5. Allow. Bind a short-lived brokered identity to the action, spend the budget, and write an audit span before the data plane runs.
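
The retries gate depends on telling a hard no apart from a transient fault. The sketch below shows one way to do that in Python. The exception names `PolicyDenied` and `TransientError` and the helper `call_with_retries` are hypothetical, not part of any library, and the backoff parameters are placeholders to tune.

```python
import random
import time


class PolicyDenied(Exception):
    """Hard no from the control plane; retrying cannot change the outcome."""


class TransientError(Exception):
    """Timeouts, rate limits, dropped connections; safe to retry."""


def call_with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry fn() on TransientError only, with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PolicyDenied:
            # Never retried: surface the denial so the model can replan.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

A denial propagates after exactly one attempt, while a transient failure is retried up to `max_attempts` times. Passing `sleep` as a parameter keeps the helper testable without real delays.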

A minimal shape: an Action(user, tool, tool_input, cost_usd) goes in, a Decision(verdict, reason, tool_input, identity, span) comes out, where verdict is ALLOW, DENY, or REWRITE. The reason string carries the named policy rule so a denial reads as a replan signal, not a bare 403.¹
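
The shape above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `PERMITTED` set is a hypothetical stand-in for a Cedar policy evaluation, the `Budget` class, the reason strings, the `LIMIT 100` default, and the identity and span formats are all invented for the example, and the regex check for an existing LIMIT is deliberately naive.

```python
import re
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    REWRITE = "REWRITE"


@dataclass(frozen=True)
class Action:
    user: str
    tool: str
    tool_input: dict
    cost_usd: float


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    reason: str
    tool_input: dict
    identity: Optional[str] = None
    span: Optional[str] = None


# Hypothetical stand-in for a policy engine: permitted (user, tool) pairs.
PERMITTED = {("alice", "sql.query"), ("alice", "fs.read")}


class Budget:
    def __init__(self, limit_usd: float):
        self.remaining = limit_usd


def decide(action: Action, budget: Budget, attempt: int = 1, max_attempts: int = 3) -> Decision:
    def deny(reason: str) -> Decision:
        return Decision(Verdict.DENY, reason, action.tool_input)

    # 1. Authorize: deny by default and name the rule that fired.
    if (action.user, action.tool) not in PERMITTED:
        return deny("policy:default-deny")

    # 2. Mutate: bound an unbounded SELECT instead of refusing it.
    tool_input = dict(action.tool_input)
    sql = tool_input.get("sql", "")
    unbounded_select = (
        action.tool == "sql.query"
        and sql.lstrip().lower().startswith("select")
        and not re.search(r"\blimit\b", sql, re.IGNORECASE)  # naive check
    )
    if unbounded_select:
        tool_input["sql"] = sql.rstrip().rstrip(";") + " LIMIT 100"

    # 3. Budget: deny before the spend, not after.
    if action.cost_usd > budget.remaining:
        return deny("budget:would-overspend")

    # 4. Retries: stop once the attempt count exceeds the policy.
    if attempt > max_attempts:
        return deny("retries:exhausted")

    # 5. Allow: spend the budget, bind an identity, record an audit span.
    budget.remaining -= action.cost_usd
    identity = f"brokered:{action.user}:{uuid.uuid4().hex[:8]}"  # placeholder token
    span = f"execute_tool {action.tool}"
    if unbounded_select:
        return Decision(Verdict.REWRITE, "mutate:inject-limit", tool_input, identity, span)
    return Decision(Verdict.ALLOW, "policy:permit", tool_input, identity, span)
```

With a budget of 1.00, `decide(Action("alice", "sql.query", {"sql": "SELECT * FROM orders;"}, 0.25), budget)` returns a REWRITE decision whose SQL is `SELECT * FROM orders LIMIT 100` and leaves 0.75 remaining. The same call for a user outside `PERMITTED` returns DENY with the reason `policy:default-deny` and spends nothing.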

Wiring it into a real harness

The gate belongs in the framework's in-process pre-tool hook, not a network sidecar, so the deny reason flows back into the loop as text the model can act on. The Claude Agent SDK exposes a PreToolUse hook that returns a structured permissionDecision: deny envelope; the same Cedar policy file can drive a Rust harness, a TypeScript harness (via cedar-wasm), and a Python harness (via cedarpy), evaluated identically.² The control plane maps cleanly onto existing infrastructure: admission-control hooks for policy, a filter-then-score scheduler for routing, a consensus store such as etcd for durable state with no split-brain, and OpenTelemetry GenAI spans (invoke_agent, execute_tool) for audit.¹
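
As a rough illustration of how a denial might re-enter the loop as text, the helper below translates a control-plane verdict into a PreToolUse-style hook result. The helper name `to_pretooluse_output` is hypothetical. The field names (`hookSpecificOutput`, `hookEventName`, `permissionDecision`, `permissionDecisionReason`) and the convention that an empty result lets the call proceed are assumptions about the hook output shape; verify them against the pinned SDK version. Returning a rewritten input is left out because how a hook hands back modified input depends on the SDK version.

```python
def to_pretooluse_output(verdict: str, reason: str) -> dict:
    """Translate a control-plane verdict into a PreToolUse-style hook result.

    Field names are assumptions about the hook output shape, not a verified API.
    """
    if verdict == "DENY":
        return {
            "hookSpecificOutput": {
                "hookEventName": "PreToolUse",
                "permissionDecision": "deny",
                # The named rule travels with the denial so the model can replan.
                "permissionDecisionReason": (
                    f"Denied by {reason}. Replan without this action."
                ),
            }
        }
    # Assumed convention: an empty result expresses no objection.
    return {}
```

The point of the design is that the reason string names the rule, so the model reads a denial as an instruction to replan and not as an outage.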

Multi-agent: the same gates, applied to spawns

Parallel sub-agents add three decisions the model also cannot see: how deep and wide to spawn, where data may live, and the running spend. A governed sub-agent harness routes each spawn, tool call, and inference through the same decide() chain. Admission is sequential so each spawn sees the real remaining budget, while execution runs in parallel under a semaphore. A router picks the model per role with a filter-then-score pass (capability filter, then policy filter, then a cost-dominated score). Each sub-agent runs in a reversible workspace (a git worktree) so its actions can be discarded.¹ Cedar bounds the fan-out directly: depth and width caps, a positive-remaining-budget requirement, and hard forbids on irreversible or out-of-residency actions.
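
Sequential admission with parallel execution can be sketched with `asyncio`. Everything named here is hypothetical: the `Spawn` record, the `govern` function, the cap defaults, and the reason strings are illustrative, and `run_subagent` is a placeholder for real sub-agent work in an isolated workspace. In a full harness the caps would come from the policy engine instead of function arguments.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Spawn:
    role: str
    depth: int
    est_cost_usd: float


async def run_subagent(spawn: Spawn) -> str:
    # Placeholder for the real sub-agent run.
    await asyncio.sleep(0)
    return f"{spawn.role}: done"


async def govern(spawns, budget_usd, max_depth=2, max_width=4, max_parallel=2):
    admitted, denied = [], []
    remaining = budget_usd

    # Admission is sequential: each spawn is checked against what earlier
    # admissions left, so two spawns can never claim the same budget.
    for s in spawns:
        if s.depth > max_depth:
            denied.append((s.role, "spawn:max-depth"))
        elif len(admitted) >= max_width:
            denied.append((s.role, "spawn:max-width"))
        elif s.est_cost_usd > remaining:
            denied.append((s.role, "budget:would-overspend"))
        else:
            remaining -= s.est_cost_usd
            admitted.append(s)

    # Execution is parallel, bounded by a semaphore.
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(s: Spawn) -> str:
        async with sem:
            return await run_subagent(s)

    results = await asyncio.gather(*(guarded(s) for s in admitted))
    return list(results), denied, remaining
```

Given three spawns estimated at 0.40 each against a 1.00 budget, the first two are admitted and the third is denied with `budget:would-overspend`, because admission sees the budget already reduced by the earlier spawns.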

Coordination is not solved

Multi-agent systems fail in ways single agents do not: handoffs lose context, sub-agents duplicate or contradict each other, and no one owns the final synthesis. Failure-attribution work reports recall dropping sharply once work crosses an agent boundary, and a published taxonomy (MAST, Cemri et al.) catalogues the recurring multi-agent failure modes. Treat multi-agent as a cost to justify, not a default.³

Don't-miss checklist

Route every tool call and every sub-agent spawn through one decide() chain; no side doors.
Make denials structured, named replan signals, never bare network errors.
Enforce a hard budget cap with per-action cost estimates; deny before overspend.
Bind a short-lived brokered identity at allow-time; never act under a shared service account (see policy and governance).
Keep durable state in a consensus store, not in agent memory.
Justify every additional agent against the coordination cost it adds.

Failure modes

Data plane in charge. The model decides policy and budget; nothing can refuse an action before it runs.
Sidecar denials. A 403 from a network proxy is indistinguishable from an outage, so the model retries or gives up instead of replanning.
Unbounded consumption. No budget gate; an agent loops and runs up a large bill (OWASP LLM10).
Identity collapse. Every action runs as one service account; the audit log cannot say who asked.
Spawn storms. No depth or fan-out cap; sub-agents recurse and exhaust budget or context.
Lost handoffs. Context does not survive the jump between agents; work is duplicated or dropped.

Open questions & validation

Reliable cross-agent context handoff remains an open problem; measure handoff recall on the target workload before trusting multi-agent.
Cost estimation per action is approximate; validate the budget gate against real token accounting.
Policy completeness: prove the deny-by-default policy is not overly permissive (see policy and governance).

References

OWASP Top 10 for LLM Applications (LLM06 Excessive Agency, LLM10 Unbounded Consumption): https://genai.owasp.org/llm-top-10/
Cemri et al., Why Do Multi-Agent LLM Systems Fail (MAST): https://arxiv.org/abs/2503.13657
Cedar policy language: https://www.cedarpolicy.com/
Kubernetes admission controllers: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
Anthropic, Building effective agents: https://www.anthropic.com/research/building-effective-agents

¹ Control-plane model: the data plane is model inference and tool execution; the control plane decides policy, identity, routing, budget, state, retries, and audit. A single decide(action) gate chain (authorize, mutate, budget, retries, allow) runs before any action; denials carry the named policy rule as a replan signal. Multi-agent spawns run through the same chain with sequential budget admission and parallel execution.
² Authorization belongs in the in-process pre-tool hook (Claude Agent SDK PreToolUse, or equivalent) so the deny reason re-enters the loop; one Cedar policy file can drive Rust, TypeScript (cedar-wasm), and Python (cedarpy) harnesses identically.
³ Cemri et al. taxonomise multi-agent LLM failure modes; handoff-recall degradation across agent boundaries is a recurring, unsolved coordination problem.