Loop engineering: designing systems that prompt your agents
Scope: the practice of building scheduled, self-feeding loops around agents, one floor above the harness. The harness arms a single run (tools, allowed actions, what counts as done); the loop makes runs discover their own work, verify it, persist state, and wake again without a human in the inner cycle. This page covers the four-layer stack and how failure blast radius grows with each layer, the five moves of one turn, the six parts that realize them, generator/evaluator separation, the five loop anti-patterns, two production case studies, the four silent costs, and the discipline that keeps the builder in control. The inner model-tool cycle itself is the agent loop; its token cost is agentic loop economics; loops that optimize their own scaffolding are self-improving harnesses, and loops that search ML configurations are autonomous experimentation loops.
Distilled from the June 2026 wave of practitioner writing: Addy Osmani's essay that named the practice, and the HuaShu Orange Book working note that systematized it. Product specifics quoted here (command names, version gates, scheduler intervals) are that note's June 2026 claims; verify against current tool documentation before relying on them. The Python block below is executed and asserted; the skill and evaluator snippets are reference templates.
```mermaid
flowchart TB
    P["Prompt engineering<br/>the words for one exchange"] --> C["Context engineering<br/>what fills the window right now"]
    C --> H["Harness engineering<br/>arming a single run: tools, actions, done"]
    H --> L["Loop engineering<br/>scheduling on the harness: run itself, over and over"]
    L -.->|"each layer up, a mistake survives more turns before discovery"| P
```
Overview
Loop engineering is the practice of replacing oneself as the person who prompts the agent and designing the system that does it instead. The earlier "XX engineering" terms (prompt, context, harness) all assume a human at the keyboard directing the agent line by line; loop engineering deletes that assumption. The practitioner moves from inside the loop to outside it: what they write is no longer words for the agent but a thing that automatically sends words to the agent, on a schedule, with its own output feeding the next round.
The term surfaced in a single week of June 2026. A post by Peter Steinberger (author of OpenClaw) argued that one should no longer be prompting coding agents but designing the loops that prompt them; Boris Cherny (Claude Code lead at Anthropic) said at nearly the same moment that his job had become writing loops that prompt Claude; and Addy Osmani named the practice in writing on June 7, pulling both together. The timing was not a coincidence: coding agents had become reliable enough to finish non-trivial tasks unattended, scheduling primitives had appeared in the major harnesses, and per-run cost had fallen far enough that rerunning on a timer stopped looking wasteful.
The load-bearing intuition is about blast radius. The same underlying mistake (say, an agent misreading what a function returns) is caught immediately at the prompt layer, shows up as one wrong answer at the context layer, and produces one reviewable diff at the harness layer. At the loop layer it is written into a state file, read back the next morning as established fact, and built upon for days; by the time anyone looks, the wrong assumption is load-bearing. The cost of a mistake scales with the number of turns it survives before someone catches it, and a loop is, by construction, a machine for maximizing the number of turns. Everything else on this page (the evaluator, the caps, the human checkpoint) exists to shorten the distance between a mistake and its discovery.
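The scaling claim can be made concrete with a toy model. The sketch below assumes, purely for illustration, that every turn from the mistake up to the next check adds exactly one artifact built on it, and that checks fire on turns that are multiples of `check_every`; both the function name and the counting rule are hypothetical.

```python
def artifacts_to_unwind(mistake_turn: int, check_every: int) -> int:
    """Count artifacts built on a mistake before the next scheduled check.

    Toy assumption: turns are numbered from 1, checks fire on multiples of
    `check_every`, and each turn from the mistake through the check turn
    adds one dependent artifact.
    """
    if mistake_turn < 1 or check_every < 1:
        raise ValueError("turn numbers and check interval must be positive")
    # First check turn at or after the mistake (ceiling division).
    next_check = -(-mistake_turn // check_every) * check_every
    return next_check - mistake_turn + 1


# Checked every turn, a mistake costs one artifact; checked weekly on a
# daily loop, the same mistake made on day 1 costs seven.
print(artifacts_to_unwind(1, 1), artifacts_to_unwind(1, 7))  # 1 7
```

Under this model the only lever is the check interval, which is the point of the paragraph above: the mistake is the same, and only the distance to its discovery changes.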
Core knowledge
The four-layer stack
The layers are not replacements for one another; they stack, and each minds something one size larger than the layer below.
| Layer | What it minds | Core question |
|---|---|---|
| Prompt engineering | Writing one good prompt | What should the model be told |
| Context engineering | The window's contents now | What to retrieve, summarize, clear |
| Harness engineering | Arming a single run | Which tools, which actions, what counts as done |
| Loop engineering | Scheduling on the harness | How to make it run itself, over and over |
Each layer fails differently, so the check that can say "no" must be installed in a different place. A bad prompt is caught on the spot; bad context produces a confidently wrong answer; a bad harness action ends with a visible diff. A bad loop runs while its builder sleeps, changes code nobody looked at, and feeds its own errors into the next round. The higher the layer, the farther the human is from the scene, and the longer mistakes pile up. The real difficulty of loop engineering is therefore never building the loop; it is putting something inside the loop that can stop it.
The five moves of one turn
A turn of a loop is not idle spinning. It finds work worth doing, hands the work to an agent, verifies the result, saves state, and decides the next step. Drop any one move and the loop will not turn, or will turn in place.
```mermaid
flowchart LR
    D["Discovery<br/>find this turn's work"] --> HO["Handoff<br/>one isolated worktree per task"]
    HO --> V{"Verification<br/>an independent check that can say no"}
    V -->|"pass"| PER["Persistence<br/>PR, ticket, state file on disk"]
    V -->|"reject"| HO
    PER --> S["Scheduling<br/>the next turn wakes on its own"]
    S --> D
```
The canonical small example is Osmani's morning triage loop. An automation fires each morning and invokes a triage skill that reads yesterday's failing CI runs, still-open issues, and recent commits, then writes actionable findings to a state file. Each finding worth doing gets an isolated git worktree; one sub-agent drafts the fix and a second reviews it against the project's tests. A connector opens the pull request and updates the ticket; anything uncertain goes to an inbox for a human; the state file survives so the next day picks up where this one stopped.
- Discovery decides what this turn should do, and it sets the ceiling on the loop's value: surface worthless work and the other four moves execute beautifully in service of nothing. The discovery logic belongs in a maintained skill, not a wall of instructions pasted into a schedule nobody updates.
- Handoff moves each task into an isolated working directory so parallel agents do not edit the same files. The cleaner the task is cut, the easier verification and merging become.
- Verification is the move most often skipped and least affordable to skip. The agent that wrote the code grades its own homework too softly; a dedicated reviewer with different instructions catches what the author talked itself into. A loop without a real check is an agent nodding at itself.
- Persistence lands results somewhere that survives the conversation: a PR, a ticket, a state file, an inbox. A loop's memory cannot live in a context window that gets flushed.
- Scheduling turns one run into a loop: a timer or event trigger, plus carried-over state, means unfinished work continues tomorrow without a button press.
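The five moves can be sketched as the skeleton of a single turn. This is a minimal illustration, not any tool's API: the `Turn` class, its callables, and the state layout (`done`, `inbox`) are all hypothetical stand-ins, and scheduling is represented only by the fact that the returned state is what the next timer-fired turn receives.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    """One turn of a loop; every callable is a hypothetical stand-in."""
    discover: Callable[[dict], list]       # state -> tasks worth doing
    generate: Callable[[str], str]         # task -> draft (the writer)
    verify: Callable[[str, str], bool]     # (task, draft) -> pass or reject
    max_attempts: int = 2

    def run(self, state: dict) -> dict:
        done = dict(state.get("done", {}))
        inbox = list(state.get("inbox", []))
        for task in self.discover(state):                # Discovery
            if task in done or task in inbox:
                continue                                 # already handled
            for _ in range(self.max_attempts):
                draft = self.generate(task)              # Handoff
                if self.verify(task, draft):             # Verification
                    done[task] = draft                   # Persistence
                    break
            else:
                inbox.append(task)                       # uncertain: human inbox
        # Scheduling: the caller stores this and passes it to the next turn.
        return {"done": done, "inbox": inbox}


turn = Turn(
    discover=lambda state: ["fix-ci", "flaky-test"],
    generate=lambda task: f"patch for {task}",
    verify=lambda task, draft: task != "flaky-test",     # reviewer rejects one
)
state = turn.run({})
print(state)
```

Running `turn.run(state)` a second time returns the same state: finished work is not redone and the rejected task stays in the inbox rather than being retried forever.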
The six parts
The moves describe what happens in a turn; the parts are what must exist for it to turn at all, and they map onto the moves nearly one to one.
| Part | What it is | Realizes |
|---|---|---|
| Automations | Runs off a schedule or trigger | Scheduling |
| Worktrees | Isolated git working directories per parallel agent | Handoff |
| Skills | Project knowledge made permanent in one file | Discovery |
| Connectors | MCP hookups to trackers, databases, APIs, chat | Persistence, Discovery |
| Sub-agents | The writer separated from the judge | Verification |
| Memory | Persistent state on disk, outside any conversation | Persistence |
Three of these deserve expansion. Skills pay off what the note calls intent debt: the recurring cost of re-explaining what the project is, what the rules are, and where the traps are. A skill file can be reused and maintained; a pasted prompt rots in a cron job. Memory is distinct from context: context is what the agent sees this round and is flushed on refresh; memory persists across rounds and days, in markdown files or a board the loop commits back. The agent forgets; the repo does not. Connectors (built on the Model Context Protocol) decide the loop's radius of vision: a loop that can only see the filesystem is a tiny loop. See context and memory for the inner-loop view of the same split.
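The memory part can be as small as one file on disk. The sketch below is one possible shape, assuming a JSON state file named `loop_state.json` with a `findings` list (both hypothetical); it writes through a temporary file and renames it into place so a crash mid-write does not leave a half-written state for the next turn to read as fact.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read memory from disk; a missing file means a first run, not an error."""
    if not path.exists():
        return {"findings": []}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, path)  # replaces the old file in one step


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "loop_state.json"

    # Turn 1: discover something and persist it.
    state = load_state(state_file)
    state["findings"].append("finding-00")
    save_state(state_file, state)

    # Turn 2: a fresh process with a flushed context still sees it.
    print(load_state(state_file)["findings"])  # ['finding-00']
```

In a real loop the file would live in the repository and be committed back, which is what makes it reviewable as well as durable.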
Generator and evaluator
Asked to grade its own output, an agent tends to praise it, even when the quality is plainly mediocre. This is not a capability problem but a structural one: the context in which the code was written is already full of the reasons it was written that way, so the author sees its own chain of self-persuasion rather than the result. Inside a loop the flaw compounds every round.
The tractable fix, observed by Anthropic engineer Prithvi Rajasekaran while building long-running applications, is not to make the generator self-critical but to tune a separate skeptical evaluator: a different agent, with different instructions, sometimes a different model, that looks at the output from scratch. The pattern borrows from GANs (one network builds, one picks faults). Three calibrations matter:
- The evaluator should act, not just read. An evaluator that only reads code judges "does this look right." Hooked to a browser via a Playwright MCP connector, it can open the page, click, screenshot, and inspect the DOM, judging "does it run right."
- Default to doubt. A common community calibration instructs the evaluator to assume the code is broken until proven otherwise.
- Swap the model where possible. The same model with new instructions often keeps its blind spots.
A reference evaluator specification, condensed from the note (reference template, adapt per project):
```text
ROLE: adversarial reviewer. ASSUME the code is BROKEN until proven otherwise.
CHECK in order: (1) does it run (execute, do not read); (2) run the tests and
paste real output; (3) edge cases the author skipped; (4) does behavior match
the ticket. Judge behavior via tools, not intent. VERDICT: PASS only if every
check holds; otherwise REJECT with each reason listed.
```
The completion decision belongs to neither the generator's session nor the evaluator's: run-until-condition primitives (Claude Code's /goal, per the note; Codex reaches the same result via automation reruns plus a judge) have a fresh model check the stop condition after each turn. This is the maker-checker principle, decades old in banking: the party doing the work and the party approving it must differ. A loop's floor is its evaluator: the generator's level decides what the loop can produce, and the evaluator's level decides what it will not let through.
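A minimal sketch of an evaluator that executes rather than reads, wired into a bounded maker-checker loop. Everything here is illustrative: `executing_evaluator` and `run_until_pass` are hypothetical names, the drafts are hard-coded strings standing in for generator output, and `exec` stands in for what would be a sandboxed run in an isolated worktree or container.

```python
from typing import Callable, Iterable, Optional


def executing_evaluator(source: str, cases: list, func_name: str) -> list:
    """Return rejection reasons; an empty list means PASS.

    The draft is executed, not read. `exec` is a stand-in for a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)                       # (1) does it run
    except Exception as err:
        return [f"does not run: {err!r}"]
    func = namespace.get(func_name)
    if not callable(func):
        return [f"{func_name} is not defined"]
    reasons = []
    for args, expected in cases:                      # (2) run the tests
        try:
            got = func(*args)
        except Exception as err:
            reasons.append(f"{func_name}{args} raised {err!r}")
            continue
        if got != expected:
            reasons.append(
                f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return reasons


def run_until_pass(drafts: Iterable[str], check: Callable[[str], list],
                   max_attempts: int) -> tuple:
    """Maker-checker: `check` decides when to stop, never the author."""
    history = []
    accepted: Optional[str] = None
    for attempt, draft in enumerate(drafts):
        if attempt >= max_attempts:
            break                                     # retry cap reached
        reasons = check(draft)
        history.append(reasons)
        if not reasons:
            accepted = draft
            break
    return accepted, history


drafts = [
    "def add(a, b): return a - b",   # reads plausibly, behaves wrongly
    "def add(a, b): return a + b",
]
cases = [((1, 2), 3), ((0, 0), 0)]
accepted, history = run_until_pass(
    drafts, lambda src: executing_evaluator(src, cases, "add"), max_attempts=3)
print(accepted)
print(history[0])
```

The first draft passes the `(0, 0)` case and fails only on `(1, 2)`, which is the kind of defect a read-only review can wave through; the listed reason is what a REJECT verdict would hand back to the generator.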
The structure, executed in miniature
The two structural claims above (persistence is what makes progress cumulative; an independent executed check is what keeps defects from shipping) are bookkeeping, so they can be validated without any model. The block below is executed and asserted: an amnesiac loop spends the same effort as a persistent one but regrinds one turn's capacity forever, and a pipeline whose only gate is the author's self-grade ships every defect while an evaluator that executes a test ships none, on identical drafts.
```python
# loop_moves.py: validates two structural claims of loop engineering in miniature,
# with no LLM involved. (1) Persistence: a loop that does not write state outside
# the turn makes no cumulative progress; it regrinds one turn's capacity forever.
# (2) Verification: if the only check is the author's self-grade (assumed to
# approve its own defects, the note's empirical claim), every defect ships; an
# independent evaluator that executes a test ships none of them. stdlib only.
from __future__ import annotations

import random


def run_turns(items: list[str], turns: int, capacity: int,
              persistent: bool) -> tuple[set[str], list[str]]:
    """Scheduled turns over a work queue. `durable` stands in for the state file;
    an amnesiac loop starts every turn from a flushed context instead."""
    durable: set[str] = set()
    performed: list[str] = []
    for _ in range(turns):
        state = set(durable) if persistent else set()
        for item in [i for i in items if i not in state][:capacity]:
            performed.append(item)
            state.add(item)
        if persistent:
            durable |= state
    return durable, performed


items = [f"finding-{i:02d}" for i in range(12)]

done, work = run_turns(items, turns=4, capacity=3, persistent=True)
assert done == set(items), "persistent loop finishes all 12 items in 12/3 = 4 turns"
assert sorted(work) == sorted(items), "each item is worked exactly once, none redone"

done, work = run_turns(items, turns=4, capacity=3, persistent=False)
assert done == set(), "nothing outlives the flushed context window"
assert set(work) == set(items[:3]), "distinct items ever touched = one turn's capacity"
assert len(work) == 12, "same effort spent (12 work units), all of it redone work"

rng = random.Random(0)
drafts = [rng.random() < 0.3 for _ in range(200)]  # True = a defective draft
self_defects = sum(drafts)  # self-grade approves all of its own drafts
gated = [d for d in drafts if not d]  # evaluator runs the defect test and rejects
assert self_defects > 0, "some drafts are defective under this seed"
assert sum(gated) == 0, "no defect survives an executed independent check"
assert sum(gated) < self_defects, "evaluator ships strictly fewer defects, same drafts"
assert len(gated) == 200 - self_defects, "evaluator rejects exactly the defective drafts"

print(f"persistent: 12/12 done in 4 turns; amnesiac: 0 durable, {len(set(work))} "
      f"distinct items reworked; self-graded defects shipped: {self_defects}/200; "
      f"evaluator-gated defects shipped: 0/{len(gated)}")
```
Output: persistent: 12/12 done in 4 turns; amnesiac: 0 durable, 3 distinct items reworked; self-graded defects shipped: 52/200; evaluator-gated defects shipped: 0/148. The simulation takes the note's empirical claim (authors approve their own defects) as an assumption and validates its arithmetic consequence: with identical drafts, only the independent executed check changes what ships.
Loops in production
One engineer's morning. Osmani's triage loop, described above: one person, one machine, the grunt work done each morning. The detail worth flagging is that the automation invokes a named skill rather than a block of instructions glued into the schedule, so the discovery logic is versioned and maintained.
Stripe's Minions. The enterprise-scale case, described by Stripe engineer Steve Kaliski on the How I AI podcast: more than 1,300 machine-written pull requests merged per week. The trigger is light (@ the bot in Slack, or an emoji reaction). What makes it reliable is the stretch before the model wakes: a deterministic orchestrator assembles context first, scanning links, pulling from Jira, and locating relevant code via Sourcegraph plus MCP, because letting the LLM find its own context is the least controllable part. The pipeline interleaves hard-coded gates with LLM steps: the agent writes code, a hard-coded step runs the linter (the agent cannot skip it), the agent fixes the lint, a hard-coded step commits. Anything deterministic logic can solve never goes to a probabilistic model. Minions is a fork of the open-source agent Goose, not a stronger model: its core claim is that reliability comes from the quality of the constraints, not model size. Sandboxes are Devbox environments on EC2 run as cattle, swapped at will, so an engineer can spin up several minions in parallel. Every one of those weekly PRs is still reviewed by a human: the human did not leave, but changed desks, from writing to reviewing.
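The interleaving of hard-coded gates with model steps can be sketched in a few lines. This is not Stripe's implementation: the function names, the context dictionary, the `TICKET-1` identifier, and the toy lint rule (no trailing whitespace) are all assumptions, and `fake_write` and `fake_fix` stand in for the model. The point shown is structural: the orchestrator, not the agent, decides that the linter runs and whether anything is committed.

```python
from typing import Callable

Ctx = dict  # hypothetical context record passed between steps


def assemble_context(ctx: Ctx) -> Ctx:
    """Deterministic: gather context before the model wakes (stubbed here)."""
    return {**ctx, "context": f"ticket text for {ctx['ticket']}"}


def lint_gate(ctx: Ctx) -> Ctx:
    """Deterministic gate with a toy rule: no trailing whitespace on any line."""
    bad = [n for n, line in enumerate(ctx["code"].splitlines(), 1)
           if line != line.rstrip()]
    return {**ctx, "lint_errors": bad}


def run_pipeline(ctx: Ctx, write: Callable[[Ctx], Ctx],
                 fix: Callable[[Ctx], Ctx], max_fix_rounds: int = 2) -> Ctx:
    ctx = assemble_context(ctx)      # hard-coded step
    ctx = write(ctx)                 # model step
    ctx = lint_gate(ctx)             # hard-coded step the model cannot skip
    rounds = 0
    while ctx["lint_errors"] and rounds < max_fix_rounds:
        ctx = lint_gate(fix(ctx))    # model fixes, then the gate runs again
        rounds += 1
    # Hard-coded commit decision: only a clean lint result is committed.
    return {**ctx, "committed": not ctx["lint_errors"]}


def fake_write(ctx: Ctx) -> Ctx:
    return {**ctx, "code": "x = 1  \ny = 2\n"}      # line 1 has trailing spaces


def fake_fix(ctx: Ctx) -> Ctx:
    cleaned = "\n".join(line.rstrip() for line in ctx["code"].splitlines())
    return {**ctx, "code": cleaned + "\n"}


result = run_pipeline({"ticket": "TICKET-1"}, fake_write, fake_fix)
print(result["committed"], result["lint_errors"])  # True []
```

If the fix step never clears the lint errors, the loop stops after `max_fix_rounds` and `committed` stays false, so a stuck model cannot push past the gate by retrying.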
Scheduling options. Per the note (June 2026 values; verify against current docs): local recurring runs and desktop scheduled tasks support minute-level intervals and can see local files, but stop when the machine turns off; cloud schedules run with the machine off at the cost of a coarser minimum interval (about an hour) and a clean clone each run. The rule is mechanical: work glued to the local machine (watching a dev server every minute) must run locally; work that should survive a closed laptop lid (scanning open issues at 3 a.m. and opening PRs) belongs on a cloud or CI schedule. A mature loop often uses both: local for tight inner checks, cloud for the overnight sweep. Conflating "rerun a few extra rounds while present" with "run while absent" is how loops thought to be autonomous quietly stop.
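The placement rule is mechanical enough to write down. The sketch below encodes it under stated assumptions: the function name is hypothetical, and the 60-minute cloud minimum is only the note's approximate June 2026 figure, exposed as a parameter so it can be replaced with whatever the current documentation says.

```python
def choose_schedule(needs_local_machine: bool, must_survive_machine_off: bool,
                    interval_minutes: int,
                    cloud_min_interval_minutes: int = 60) -> str:
    """Return "local" or "cloud" for one piece of scheduled work.

    The default cloud minimum is an assumption taken from the note's
    "about an hour"; verify it against the scheduler actually in use.
    """
    if needs_local_machine and must_survive_machine_off:
        raise ValueError("conflicting needs: split this into two loops")
    if needs_local_machine:
        return "local"
    if must_survive_machine_off:
        if interval_minutes < cloud_min_interval_minutes:
            raise ValueError("interval is tighter than the cloud minimum")
        return "cloud"
    # No hard constraint: tight intervals only fit locally.
    return "local" if interval_minutes < cloud_min_interval_minutes else "cloud"


print(choose_schedule(True, False, 1))          # watch a dev server: local
print(choose_schedule(False, True, 24 * 60))    # overnight issue sweep: cloud
```

The two error cases are the useful part: work that needs both the local machine and a closed laptop lid is really two loops, and a sub-hour interval that must survive shutdown has no home under the assumed cloud minimum.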
The note also cautions that widely circulated figures (such as "around 90% of Claude Code is written by itself") are secondhand summaries; the cases above trace to firsthand sources. And the capabilities are not product-specific: scheduling, run-until-condition, parallel isolation, sub-agents, external connectors, and explicit skills exist across toolchains under different names. The question for a team is whether all six are present, not which brand provides them.
The four silent costs
A loop that runs itself makes mistakes by itself, and four costs accrue without sounding any alarm while it runs.
- Verification debt. Every unread merged PR converts saved time into unverified output waiting in the gap between "runs" and "right," coming due all at once on some shipping morning. Guard: the independent evaluator.
- Comprehension rot. The codebase grows faster than the mental map of it; the gap is invisible until a bug lands in a corner nobody ever read. Guard: read a sample of the loop's output daily and force an explanation of each sampled change.
- Cognitive surrender. The more reliable the loop, the easier it is to stop having an opinion and take whatever it hands back. Guard: the loop can execute, but it cannot decide; the human must remain able to say "this is wrong."
- Token blowout. The only cost that hits the bill directly: helpers, retries, and overnight rounds mean one idle bug can burn a night's quota. Guard: hard caps (per-run budget, daily budget, max retries) set before the loop first runs unattended. The per-turn mechanics behind this cost are covered in agentic loop economics.
The four reinforce one another: unverified output erodes understanding, eroded understanding invites surrender, surrender lets the loop run longer and spend more, and the spend produces more unverified output. The note's worked example: a loop opens twenty green-tested PRs overnight, three carry subtle errors the tests miss (verification debt); the human merges all twenty unread, so their model of the codebase lags by twenty changes (comprehension rot); the smooth run convinces them to stop reading tomorrow's batch (surrender); free retries triple the bill (blowout). The three buried errors now live in a codebase the human no longer fully understands, guarded by a human who has stopped looking.
Operational discipline and the economics of judgment
Loops make generation nearly free: code, plans, fixes, and PRs become abundant, and one engineer with a good loop produces a small team's output. What stays scarce is judgment, the ability to tell "looks reasonable" from "is right"; the loop can generate a hundred candidates but chooses only on plausibility. The amplifier cuts both ways: a good decision is executed a hundredfold, and so is a bad one, with no slow gear left to catch it mid-flight. An engineer whose value was mechanical labor finds it evaporating; one whose value is judgment finds it multiplied. The same loop, built by two people, yields opposite outcomes six months later, and the difference is one or two checkpoints.
Three standing practices follow. Read a representative sample of the loop's output every day and explain each sampled change; failure to explain is the precise signal that the mental map has fallen behind. Cap before shipping: per-run budget, daily budget, and retry limits are circuit breakers, not cost optimization, and a loop without caps has delegated spending authority to its own bugs. Keep one door open: at least one checkpoint where the loop pauses for a human, kept permanently, not as scaffolding to remove once trusted; the pause is what keeps the human capable of intervening at all. For the control-plane view of these guardrails, see governing self-modifying agents.
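The caps described above behave like a circuit breaker: they are checked before spending, not reported afterwards. A minimal sketch, assuming token counts are known before each call; the class names, fields, and numbers are hypothetical and not tied to any provider's billing API.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised to stop the loop; the caller should halt, not retry."""


@dataclass
class Caps:
    per_run_tokens: int
    daily_tokens: int
    max_retries: int


class CircuitBreaker:
    def __init__(self, caps: Caps) -> None:
        self.caps = caps
        self.spent_today = 0
        self.spent_run = 0
        self.retries = 0

    def start_run(self) -> None:
        """Reset per-run counters; the daily total carries over."""
        self.spent_run = 0
        self.retries = 0

    def charge(self, tokens: int) -> None:
        """Check before spending so a cap is never overshot."""
        if self.spent_run + tokens > self.caps.per_run_tokens:
            raise BudgetExceeded("per-run cap")
        if self.spent_today + tokens > self.caps.daily_tokens:
            raise BudgetExceeded("daily cap")
        self.spent_run += tokens
        self.spent_today += tokens

    def retry(self) -> None:
        if self.retries >= self.caps.max_retries:
            raise BudgetExceeded("retry cap")
        self.retries += 1


breaker = CircuitBreaker(Caps(per_run_tokens=1000, daily_tokens=1500, max_retries=2))
breaker.start_run()
breaker.charge(600)
breaker.charge(400)          # exactly at the per-run cap: allowed
try:
    breaker.charge(1)        # one token over: the run stops here
except BudgetExceeded as err:
    print("stopped:", err)   # stopped: per-run cap
```

Because the exception is raised before the counters move, a bug that retries in a tight loop hits the cap and stays there instead of burning the night's quota.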
Don't-miss checklist
- Name the discovery source: what does the loop read on a timer (CI, issues, commits, an inbox)? If a human still hands it work, the expensive part is not automated.
- Put discovery logic in a versioned skill, not in the schedule definition.
- Name the state file: which disk artifact holds cross-round memory, and who commits it?
- Install an independent evaluator that executes (runs tests, clicks the page), defaults to doubt, and can reject; completion is judged by a fresh model, maker-checker style.
- One isolated worktree per parallel task; the cost of skipping isolation stays hidden until several agents run at once.
- Set token caps (per run, per day, max retries) before the first unattended run.
- Keep one human review point permanently: PRs opened, never auto-merged; uncertainty lands in an inbox.
- Grow in the safe order: prove the evaluator catches real mistakes before adding parallelism; a loop earns more agents by first stopping a single bad one.
- Read a daily sample of the loop's output and explain it; budget the reading time as part of the loop's cost.
Failure modes
Each anti-pattern is one of the five moves skipped; they cluster, because a team careless about one check is usually careless about the rest.
- The nodding loop (verification skipped). The same agent writes and approves; plausible mistakes accumulate at machine speed. Symptom: hundreds of turns with not one self-rejection, which is statistically implausible for real work and therefore strong evidence that no real check exists. Fix: the generator/evaluator split.
- The amnesiac loop (persistence skipped). Results live only in a flushed context window; each morning starts from the same place, rediscovering or conflicting with yesterday's work. Symptom: no cumulative progress. Fix: a state file on disk.
- The manual loop (scheduling skipped). Four good moves, no trigger; it is a script the human runs by hand and forgets. Symptom: the last run was the day it was demoed. Fix: a timer or event trigger that does not depend on memory.
- The blind loop (discovery skipped). The human still spends every morning deciding what the loop should do; the doing is automated, the finding is not, and finding is often the expensive part. Fix: move discovery into a skill.
- The tangled loop (handoff skipped). Parallel agents share one working directory and their edits collide. Symptom: appears only under parallelism; a single-agent loop looks fine. Fix: one worktree per task.
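Because each anti-pattern is a skipped move, a loop description can be checked against the list mechanically. The sketch below assumes a hypothetical loop spec with the keys shown; it is a review aid, not a format any tool defines. It flags the tangled loop only when more than one agent runs, matching the observation that a single-agent loop looks fine.

```python
def diagnose(spec: dict) -> list:
    """Map missing moves to the anti-pattern names; all spec keys are hypothetical."""
    findings = []
    if not spec.get("discovery_source"):
        findings.append("blind loop: no discovery source")
    if spec.get("parallel_agents", 1) > 1 and not spec.get("worktree_per_task"):
        findings.append("tangled loop: parallel agents share a working directory")
    if not spec.get("independent_evaluator"):
        findings.append("nodding loop: the author approves its own work")
    if not spec.get("state_file"):
        findings.append("amnesiac loop: no state outside the context window")
    if not spec.get("trigger"):
        findings.append("manual loop: no timer or event trigger")
    return findings


demo_loop = {
    "discovery_source": "failing CI runs",
    "parallel_agents": 3,
    "worktree_per_task": False,
    "independent_evaluator": True,
    "state_file": None,
    "trigger": "daily timer",
}
for finding in diagnose(demo_loop):
    print(finding)
```

For `demo_loop` this reports the tangled and amnesiac loops; a spec with all five moves present returns an empty list.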
Open questions and validation
- Which numbers hold up. The Stripe and triage cases trace to firsthand accounts; headline figures circulating around the practice are mostly secondhand summaries. Treat any adoption statistic as unverified unless the primary source is named.
- Version and feature drift. Command names, version gates, and scheduler minimums quoted from the June 2026 note (local recurring runs, run-until-condition primitives, cloud schedules) change quickly; validate against the current documentation of whatever toolchain is in use.
- Evaluator quality is unmeasured in most deployments. The nodding-loop symptom (zero rejections) is a necessary check but not sufficient; a useful practice is seeding known-bad changes and measuring the evaluator's catch rate, the same fixed-substrate discipline as agent evaluation.
- Same-model evaluators. The note reports that swapping instructions without swapping the model often preserves blind spots; how much of the generator/evaluator gain survives when both roles run the same base model is an open, measurable question.
- Where the loop boundary sits. A loop that edits its own skills or promotion rules crosses into self-improving harness territory and needs the corresponding governance controls; the line between "loop that does work" and "loop that changes how work is done" is the right place for a hard human gate.
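The seeding practice mentioned above can be sketched as a small measurement harness. The assumptions are loud ones: candidates are plain Python callables standing in for changes, the evaluator is any callable returning `True` for PASS, and the seeded defects and test cases are invented for illustration.

```python
def measure_evaluator(evaluator, seeded_bad: list, known_good: list) -> dict:
    """Run the evaluator over seeded-bad and known-good candidates."""
    caught = sum(1 for candidate in seeded_bad if not evaluator(candidate))
    false_rejects = sum(1 for candidate in known_good if not evaluator(candidate))
    return {
        "catch_rate": caught / len(seeded_bad),
        "false_reject_rate": false_rejects / len(known_good),
    }


# Candidates: implementations of absolute value, some deliberately broken.
seeded_bad = [
    lambda x: x,                          # wrong for negatives
    lambda x: -x,                         # wrong for positives
    lambda x: abs(x) if x != 0 else 1,    # wrong only at zero
]
known_good = [abs, lambda x: x if x >= 0 else -x]

cases = [(3, 3), (-3, 3)]                 # note: zero is never tested


def executing_evaluator(candidate) -> bool:
    return all(candidate(arg) == expected for arg, expected in cases)


def nodding_evaluator(candidate) -> bool:
    return True                           # approves everything


print(measure_evaluator(executing_evaluator, seeded_bad, known_good))
print(measure_evaluator(nodding_evaluator, seeded_bad, known_good))
```

The executing evaluator catches two of the three seeded defects and misses the zero case, so its rejections are real but its coverage is not complete; the nodding evaluator scores a catch rate of zero. A number like this is what turns "we have an evaluator" into something that can be tracked.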
References
- Addy Osmani, "Loop Engineering" (the essay that named the practice, June 2026): https://addyo.substack.com/p/loop-engineering
- Addy Osmani, "Loop Engineering" (blog edition): https://addyosmani.com/blog/loop-engineering/
- HuaShu Orange Books, "Loop Engineering: Stop Asking Me What It Is" (v260615, June 2026), the working note this page distills, including the five moves, six parts, anti-patterns, and case studies: https://huasheng.ai/orange-books
- Peter Steinberger's writing on agent loops and OpenClaw: https://steipete.me
- Goose, the open-source agent framework Stripe's Minions forks: https://github.com/block/goose
- Model Context Protocol (connectors): https://modelcontextprotocol.io
Related: Agent loop · Harness architecture · Self-improving harnesses · Agentic loop economics · Governing self-modifying agents · Evaluating agents · Agentic AIOps · Local coding agents · Context and memory · Use as an agent skill