Governing self-modifying agents
Scope: the controls required once an agent optimizes itself, editing its own prompt, tools, or middleware at runtime. The governance question shifts from "is the agent safe" to "is the optimizer safe," and ordinary change management does not fit. This is the control layer over the capability described in self-improving harnesses; it composes with the policy gate and the control plane.
Contracts and configs here are reference templates; pin versions and validate before relying on them.
flowchart LR
    OPT["Optimizer loop: observe, diagnose, propose"] --> CONTRACT["Change contract (predicted delta + falsifier + rollback + expiry)"]
    CONTRACT --> SHADOW["Shadow harness on real traffic"]
    SHADOW --> PROMO{"Two-track promotion"}
    PROMO -->|"auto / human-gated"| LIVE["Promote"]
    PROMO -->|"hard-blocked"| STOP["Refuse (outside the agent)"]
    INV["Invariants the optimizer cannot reach"] --> PROMO
Overview
A meta-agent that rewrites its own harness changes production behaviour without a human writing the change. CI/CD-shaped change management is mismatched on every axis: the unit of change is a prompt or tool edit rather than a code commit, the reviewer is a model rather than a person, the test contract and rollback path differ from those of a code deploy, and the author is the system itself. The optimizer therefore needs its own control plane, plus one hard floor it can never cross. [1]
Core knowledge
What the optimizer can touch
A self-improving harness can edit a defined set of components: the system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configurations, and long-term memory. Governance starts by enumerating that surface, because anything editable is something that can regress or be subverted. [1]
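One way to make the enumerated surface concrete is a default-deny allowlist. The sketch below is illustrative only: the component list comes from the paragraph above, while the identifiers (`Component`, `EDITABLE_SURFACE`, `is_editable`) and the string values are hypothetical names chosen for this example.

```python
from enum import Enum


class Component(Enum):
    """Harness components the text lists as editable by the optimizer."""
    SYSTEM_PROMPT = "system_prompt"
    TOOL_DESCRIPTIONS = "tool_descriptions"
    TOOL_IMPLEMENTATIONS = "tool_implementations"
    MIDDLEWARE = "middleware"
    SKILLS = "skills"
    SUB_AGENT_CONFIGS = "sub_agent_configs"
    LONG_TERM_MEMORY = "long_term_memory"


# Assumption: default-deny. Anything not enumerated here is immutable.
EDITABLE_SURFACE = frozenset(c.value for c in Component)


def is_editable(target: str) -> bool:
    """Return True only if the target is an enumerated editable component."""
    return target in EDITABLE_SURFACE
```

A proposed edit whose target is not in the set (for example a hypothetical `metric_store`) would be rejected before any contract is evaluated.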
Change contracts
Every proposed self-edit carries a machine-readable change contract: the component it targets, the predicted effect (with a confidence interval), a falsifier test that states what must not regress and by how much, a rollback plan (revert pointer plus any state cleanup), and an expiry (promote within a window or auto-revert). A change without a contract does not ship. The contract is what makes an automated change auditable and reversible. [1]
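A change contract could be represented as a small immutable record with a validator. This is a minimal sketch, not a reference schema: the fields mirror the five elements named above, but every field name, the `Falsifier` type, and the specific validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Falsifier:
    metric: str            # metric that must not regress
    max_regression: float  # largest tolerated drop, in metric units


@dataclass(frozen=True)
class ChangeContract:
    component: str                            # harness component the edit targets
    predicted_delta: float                    # predicted effect on the target metric
    confidence_interval: tuple[float, float]  # (low, high) around predicted_delta
    falsifier: Falsifier                      # what must not regress, and by how much
    revert_pointer: str                       # identifier of the version to restore
    state_cleanup: tuple[str, ...]            # cleanup steps to run on rollback
    expires_at: datetime                      # promote before this, else auto-revert


def validate(contract: ChangeContract, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the contract is well-formed."""
    problems = []
    low, high = contract.confidence_interval
    if not contract.component:
        problems.append("missing target component")
    if not (low <= contract.predicted_delta <= high):
        problems.append("predicted delta lies outside its confidence interval")
    if contract.falsifier.max_regression < 0:
        problems.append("falsifier tolerance must be non-negative")
    if not contract.revert_pointer:
        problems.append("missing revert pointer")
    if contract.expires_at <= now:
        problems.append("contract already expired")
    return problems
```

Under this sketch, a gate would refuse any edit for which `validate` returns a non-empty list, which is one way to implement "a change without a contract does not ship."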
Two-track promotion with hard floors
Route every proposed change to one of three outcomes: two promotion tracks (auto-promote for low-risk, contract-satisfying edits; human-gated for medium-risk edits) and a hard-blocked track for changes that must never be automated. The hard-blocked decision is enforced outside the agent, so the optimizer cannot route around it. [1]
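The routing decision might look like the sketch below. It assumes the change contract has already been checked, and it assumes a string risk label is supplied by something other than the optimizer. The names `Track`, `HARD_BLOCKED_TARGETS`, and `route`, and the choice to fail closed on any unrecognised risk level, are illustrative assumptions rather than part of the source design.

```python
from enum import Enum


class Track(Enum):
    AUTO_PROMOTE = "auto_promote"
    HUMAN_GATED = "human_gated"
    HARD_BLOCKED = "hard_blocked"


# Hypothetical names for targets that must never be changed automatically.
HARD_BLOCKED_TARGETS = frozenset(
    {"governance", "invariants", "contract_checker", "metric_store"}
)


def route(target: str, risk: str) -> Track:
    """Pick a track for a contract-satisfying edit; fail closed when unsure."""
    if target in HARD_BLOCKED_TARGETS:
        return Track.HARD_BLOCKED
    if risk == "low":
        return Track.AUTO_PROMOTE
    if risk == "medium":
        return Track.HUMAN_GATED
    # Assumption: any other risk label (including "high") is refused.
    return Track.HARD_BLOCKED
```

For the hard-blocked outcome to hold, this function would need to run in a service the optimizer cannot edit or bypass.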
Shadow before promote
Run the proposed harness against real traffic in shadow, comparing its behaviour distribution to the incumbent's before promotion. A change that looks good on a held-out set can still shift the live distribution; shadow evaluation catches that. It must run on a fixed substrate so that the measured delta reflects the change and not drift (see Evaluating agents). [1]
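As a rough illustration of comparing behaviour distributions, the sketch below computes the total variation distance between the categorical outcomes (for example, which action each harness took) of the incumbent and the candidate on the same shadowed requests. The metric choice, the function names, and the `max_shift` threshold of 0.1 are assumptions for this example; a real comparison would also need a significance test and a fixed substrate.

```python
from collections import Counter


def total_variation(incumbent: list[str], candidate: list[str]) -> float:
    """Total variation distance between two empirical categorical distributions."""
    if not incumbent or not candidate:
        raise ValueError("both samples must be non-empty")
    p, q = Counter(incumbent), Counter(candidate)
    n_p, n_q = len(incumbent), len(candidate)
    labels = set(p) | set(q)
    # Counter returns 0 for labels that one side never produced.
    return 0.5 * sum(abs(p[label] / n_p - q[label] / n_q) for label in labels)


def shadow_passes(incumbent: list[str], candidate: list[str],
                  max_shift: float = 0.1) -> bool:
    """Hypothetical gate: allow promotion only if the shift stays under max_shift."""
    return total_variation(incumbent, candidate) <= max_shift
```

Identical outcome lists give a distance of 0 and fully disjoint ones give 1, so the threshold expresses how much behavioural shift a promotion may introduce.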
Invariants and signed external metrics
Define formal invariants the optimizer cannot reach (for example, the policy and identity controls must remain enforced), and keep the success metrics it optimizes against in a separate, signed store so the optimizer cannot grade its own work or move the goalposts. An optimizer that owns its own scoreboard will reward-hack it. [1]
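One possible shape for a signed metric record uses an HMAC from Python's standard library. This is a sketch under stated assumptions: the key is held by the metric store and the promotion gate and is never available to the optimizer, and the function names and JSON canonicalisation are choices made for this example. Because HMAC is symmetric, any verifier can also sign; an asymmetric signature scheme would avoid that if verifiers should not hold the signing key.

```python
import hashlib
import hmac
import json


def sign_metrics(metrics: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the metrics."""
    payload = json.dumps(metrics, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_metrics(metrics: dict, signature: str, key: bytes) -> bool:
    """Check the tag in constant time; any edit to the metrics invalidates it."""
    expected = sign_metrics(metrics, key)
    return hmac.compare_digest(expected, signature)
```

With this in place, a metrics record that has been altered after signing fails verification, so the gate can refuse to promote on it.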
The recursion floor
The non-negotiable floor: the optimizer must never be able to edit its own governance, the invariants, the change-contract checker, or the metric store. This cannot be enforced by the harness it is editing; it has to be enforced outside, by capability-based security, so the editable surface structurally excludes the controls. Without this floor, every other control is advisory. [1]
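The sketch below shows only the default-deny decision an external write gate might make. It is not a full capability system. The directory layout in `EDITABLE_ROOTS` and the function name `write_allowed` are hypothetical. Real enforcement would rest on the optimizer's credentials simply lacking write access to the governance components, and this check does not handle symlinks.

```python
import posixpath

# Hypothetical layout: the only roots the optimizer is granted write access to.
EDITABLE_ROOTS = (
    "harness/prompts",
    "harness/tools",
    "harness/middleware",
    "harness/skills",
    "harness/subagents",
    "harness/memory",
)


def write_allowed(path: str) -> bool:
    """Default-deny: allow a write only under an enumerated editable root."""
    # Collapse "." and ".." segments so traversal cannot escape a granted root.
    normalized = posixpath.normpath(path)
    if normalized.startswith("/") or normalized.startswith(".."):
        return False
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in EDITABLE_ROOTS
    )
```

Because governance, invariants, the contract checker, and the metric store live outside every granted root in this layout, no path the optimizer submits can resolve to them.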
Don't-miss checklist
- Enumerate exactly which harness components the optimizer may edit; everything else is immutable.
- Require a change contract (predicted delta, falsifier, rollback, expiry) on every self-edit.
- Use two-track promotion; enforce the hard-blocked track outside the agent.
- Shadow every change on real traffic against a fixed substrate before promoting.
- Keep invariants and success metrics outside the optimizer's reach, in a signed store.
- Enforce the recursion floor with capability-based security, not with the harness being edited.
Failure modes
- Ungoverned self-edit. A change ships with no contract; a regression reaches production unattributed.
- Reward hacking. The optimizer grades itself and games the metric instead of improving behaviour.
- Recursion breach. The optimizer edits its own guardrails, and every downstream control becomes advisory.
- Drift mistaken for gain. A change is promoted on a moving substrate; the apparent improvement is noise.
- No rollback. A promoted change cannot be reverted cleanly because the contract lacked a rollback plan.
Open questions & validation
- Self-modifying-agent governance is early; test the recursion floor adversarially rather than assuming it holds.
- Predicted-delta confidence intervals are often miscalibrated; validate them against realized outcomes.
- Behavioural-distribution monitoring needs a baseline; confirm the shadow comparison is statistically sound.
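Validating predicted-delta confidence intervals against realized outcomes can start with a simple coverage check: across past promoted changes, how often did the realized delta fall inside the predicted interval? The sketch below assumes each record is a `(ci_low, ci_high, realized_delta)` tuple; the function name and record shape are illustrative.

```python
def interval_coverage(records: list[tuple[float, float, float]]) -> float:
    """Fraction of records whose realized delta fell inside the predicted interval."""
    if not records:
        raise ValueError("need at least one record")
    hits = sum(1 for low, high, realized in records if low <= realized <= high)
    return hits / len(records)
```

If the optimizer states 90% intervals but observed coverage is far lower, its predicted deltas are overconfident and its contracts deserve less trust when choosing a promotion track.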
References
- Externalization in LLM agents (survey): https://arxiv.org/abs/2604.08224
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
- ISO/IEC 42001 (AI management system): https://www.iso.org/standard/81230.html
- EU AI Act (official explorer): https://artificialintelligenceact.eu/
Related: Self-improving harnesses · Agent policy engine · Orchestration & control plane · Evaluating agents · Harness architecture · Agentic systems
[1] Governing a self-modifying agent means governing the optimizer: enumerate the editable component surface; require a change contract (predicted delta, falsifier test, rollback, expiry) per edit; use two-track promotion with a hard-blocked track enforced outside the agent; shadow on real traffic against a fixed substrate; keep invariants and success metrics outside the optimizer's reach; and enforce a recursion floor (the optimizer cannot edit its own governance) via capability-based security.