Skill optimization: training agent skills in text space (SkillOpt)
Scope: optimizing a single agent skill document with training-loop discipline, as formulated by SkillOpt (arXiv 2605.23904). This page covers the skill as the external trainable state of a frozen model, the rollout/reflection/edit/gate cycle, the textual learning-rate budget, the rejected-edit buffer, the epoch-wise slow/meta update, and what transfers across models, harnesses, and benchmarks. The optimization target here is the skill file; searching over the surrounding scaffold is covered in Self-improving harnesses, and installing this KB as a skill is covered in Use as an agent skill.
The numpy example below is executed and asserted (algorithm-shape simulation, no LLM calls). The workflow snippets describing SkillOpt itself are reference descriptions of the paper's released pipeline; pin the repo version before relying on them.
```mermaid
flowchart TB
    ROLL["Rollout batch on train split (frozen target model, skill in context)"] --> REFL["Minibatch reflection by optimizer model (failures first, then successes)"]
    REFL --> MERGE["Hierarchical merge and ranking of add/delete/replace edits"]
    MERGE --> CLIP["Clip to edit budget L_t (textual learning rate, cosine decay)"]
    CLIP --> CAND["Candidate skill (protected slow-update region untouched)"]
    CAND --> GATE{"Strictly better on held-out selection split?"}
    GATE -->|"yes"| CUR["New current skill; best exported as best_skill.md"]
    GATE -->|"no, ties rejected"| BUF["Rejected-edit buffer: edits tried plus score drop"]
    BUF --> REFL
    CUR --> ROLL
    CUR --> SLOW["Epoch boundary: slow/meta update, still gated"]
    SLOW --> ROLL
```
What it is
SkillOpt is a text-space optimizer for agent skills: it trains the skill document, not the model. A skill is a natural-language policy inserted into the agent context before execution (prepended to the system prompt in direct chat, or rendered as persistent procedural memory such as a per-task SKILL.md in a coding harness). SkillOpt holds the target model, harness, and evaluator fixed and treats the skill as the external state of the agent: given train, selection, and test splits, it repeatedly samples scored rollout batches from the train split, has a separate frontier optimizer model reflect on failures and successes in minibatches, and proposes structured add/delete/replace edits to the one skill file.1
The deep-learning analogy is operational, not decorative. The edit budget L_t is a textual learning rate: after hierarchical merging, the optimizer ranks the edit pool and clips it to the top L_t edits per step (default 4, cosine-decayed to a floor of 2), so consecutive skill versions stay close enough that the optimization history remains meaningful. The held-out gate plays the role of validation: a candidate skill is accepted only when its selection-split score is strictly greater than the incumbent's (ties are rejected), and only the best accepted skill is exported as best_skill.md. Rejected edits are not discarded: an epoch-local rejected-edit buffer records what was tried and the score drop it caused, and later reflection calls receive it as negative feedback. An epoch-wise slow/meta update compares the same training items under the previous and current epoch's skill, writes longitudinal guidance into a protected region of the document that step-level edits cannot overwrite, and passes through the same gate.2
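The budget-and-clip step can be sketched in a few lines. This is a minimal illustration, assuming a half-cosine decay from 4 to 2 across the run and ranking by support count alone; the paper's exact schedule formula and ranking criteria are not reproduced here, and `Edit`, `edit_budget`, and `clip_edits` are hypothetical names rather than the released pipeline's API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical merged edit; `support` counts the minibatches that proposed it."""
    op: str       # "add", "delete", or "replace"
    text: str
    support: int


def edit_budget(step: int, total_steps: int, start: int = 4, floor: int = 2) -> int:
    """Assumed half-cosine decay of the edit budget L_t from `start` down to `floor`."""
    if total_steps <= 1:
        return start
    progress = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    value = floor + 0.5 * (start - floor) * (1 + math.cos(math.pi * progress))
    return max(floor, round(value))


def clip_edits(pool: list[Edit], budget: int) -> list[Edit]:
    """Keep the `budget` best-supported edits; sorted() is stable, so ties keep pool order."""
    return sorted(pool, key=lambda e: e.support, reverse=True)[:budget]


pool = [
    Edit("add", "Check cell formats before writing.", support=5),
    Edit("replace", "Return only the final answer.", support=3),
    Edit("delete", "Obsolete rule about retries.", support=1),
    Edit("add", "Re-read the question before answering.", support=4),
    Edit("add", "Log every tool call.", support=2),
]
budgets = [edit_budget(t, 10) for t in range(10)]  # starts at 4, ends at 2
kept = clip_edits(pool, edit_budget(9, 10))        # final step keeps only 2 edits
```

Because the clip keeps only a handful of edits, consecutive skill versions differ by at most L_t local changes, which is what keeps a rejected candidate attributable to specific edits.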
The deployed artifact is a compact best_skill.md of roughly 300 to 2,000 tokens. Deployment adds zero optimizer calls and zero weight updates: the frozen target model simply runs with a better document in context.
Why use it
- It is the strongest measured no-weight-update adaptation in its study. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied-best on all 52 evaluated cells, beating human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill.3
- Large gains where procedure matters. On GPT-5.5 direct chat the six-benchmark average rises from 58.8 to 82.3 (+23.5 points); SpreadsheetBench goes 41.8 to 80.7 and OfficeQA 33.1 to 72.1, where strict tool and answer-format discipline dominates. Inside agentic loops the same interface lifts GPT-5.5 by +24.8 (Codex) and +19.1 (Claude Code) over no skill.3
- The artifact is auditable and cheap to ship. Final skills stayed under 2,000 tokens and were assembled from only 1 to 4 accepted edits per benchmark; a practitioner can read the entire deployed adaptation in minutes, which no weight-space method offers.4
- Training cost is paid once and amortized. Cost per absolute test point ranged from 0.6M training tokens (SpreadsheetBench) to 46.4M (DocVQA); after export the skill is free at inference time and reusable across nearby models and harnesses.4
- Weak targets gain the most. Averaged over benchmarks, GPT-5.4-nano gains +26.7 points and Qwen3.5-4B +19.2; a compact skill supplies procedural knowledge small models do not hold in weights.3
When to use it (and when not)
- Use it when the task has a reliable scorer (exact match, executable checks, verifiers) and a stable task distribution: the whole loop is propose-and-test against a held-out split, so the gate is only as good as the score.
- Use it when weight adaptation is unavailable (closed frontier models) or not worth the cost (one skill trained offline by a strong optimizer model can be reused across model scales and harnesses).
- Do not use it for open-ended domains where success is subjective or expensive to judge; the paper flags this as the main applicability limit, and a noisy gate silently converts the loop into unvalidated self-editing.7
- Do not expect one skill to cover heterogeneous domains. SkillOpt intentionally trains a single portable document, not a skill library; disjoint procedures need separate runs or a different architecture.
- Budget for the training loop. One-off tasks rarely repay tens of millions of rollout tokens; the method is for skills that will be reused.
Architecture
One optimization step runs forward, backward, update, gate:
- Forward: the frozen target model executes a rollout batch (default 40 tasks) from the train split with the current skill; the harness records messages, tool calls, observations, verifier feedback, and benchmark context.
- Backward: the optimizer model splits trajectories into failure and success reflection minibatches (default size 8, 16 analyst workers in parallel, up to 3 refinement rounds), because single trajectories yield anecdotal fixes while minibatches expose recurring procedural errors. Failure minibatches propose corrective rules; success minibatches protect behavior that already works.
- Update: proposals are merged hierarchically (failure corrections take priority; duplicates and contradictions filtered; each merged edit carries a support count), ranked, and clipped to the budget L_t. Patch mode applies four atomic operations (append, insert_after, replace, delete); rewrite mode is the looser alternative.
- Gate: the candidate is scored on the selection split with the same frozen model and harness, and accepted only on strict improvement; every step also writes an edit_apply_report.json so the provenance of every line in best_skill.md is recoverable.1,2
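The four patch operations are simple enough to sketch directly. The following is an illustration only, assuming the skill is held as a list of lines and each operation targets an exact anchor line; the `PatchOp` record, the anchor-matching rule, and the report fields are assumptions, not the schema of the released pipeline or of edit_apply_report.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchOp:
    """Hypothetical edit record; the released pipeline's schema may differ."""
    kind: str         # "append" | "insert_after" | "replace" | "delete"
    text: str = ""    # new line (unused by delete)
    anchor: str = ""  # exact existing line the op targets (unused by append)


def apply_patch(skill: list[str], ops: list[PatchOp]) -> tuple[list[str], list[dict]]:
    """Apply ops in order to a copy of the skill; return the new lines and a per-op report."""
    lines = list(skill)
    report = []
    for op in ops:
        applied = True
        if op.kind == "append":
            lines.append(op.text)
        elif op.anchor not in lines:
            applied = False  # anchor missing: skip rather than guess a location
        else:
            i = lines.index(op.anchor)
            if op.kind == "insert_after":
                lines.insert(i + 1, op.text)
            elif op.kind == "replace":
                lines[i] = op.text
            elif op.kind == "delete":
                del lines[i]
            else:
                applied = False  # unknown operation kind
        report.append({"kind": op.kind, "anchor": op.anchor, "applied": applied})
    return lines, report


skill = ["Read the task.", "Answer quickly.", "Retry on any error."]
ops = [
    PatchOp("replace", "Answer in the required format.", anchor="Answer quickly."),
    PatchOp("insert_after", "Verify units before answering.", anchor="Read the task."),
    PatchOp("delete", anchor="Retry on any error."),
    PatchOp("append", "Stop after the final answer."),
    PatchOp("replace", "Unused text.", anchor="No such line."),  # skipped: anchor missing
]
new_skill, report = apply_patch(skill, ops)
```

Recording a skip instead of raising keeps one stale anchor from discarding the rest of a candidate, and the per-op report is the kind of record that makes each line of a deployed skill traceable to the edit that produced it.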
Two slower mechanisms sit above the step loop. The slow update samples 20 training items per epoch under the previous and current skill, groups outcomes into improvements, regressions, persistent failures, and stable successes, and writes a concise guidance block into the protected SLOW_UPDATE_START/SLOW_UPDATE_END region. The meta skill is optimizer-side only: a summary of which edit patterns helped or failed, prepended to future optimizer prompts and never shipped with the agent. Removing both dropped SpreadsheetBench from 77.5 to 55.0 in the ablations, the largest single degradation; removing only the rejected-edit buffer cost 1.6 to 4.6 points per benchmark.5
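The slow update's four-way grouping amounts to comparing each sampled item's outcome under the two skills. A minimal sketch, assuming binary pass/fail outcomes keyed by a task id (the paper's scores may be graded, and `group_outcomes` is a hypothetical helper):

```python
def group_outcomes(prev: dict[str, bool], curr: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket items scored under both skills by how their outcome changed."""
    groups: dict[str, list[str]] = {
        "improvements": [],
        "regressions": [],
        "persistent_failures": [],
        "stable_successes": [],
    }
    for task_id in sorted(prev.keys() & curr.keys()):  # only items scored under both skills
        before, after = prev[task_id], curr[task_id]
        if after and not before:
            groups["improvements"].append(task_id)
        elif before and not after:
            groups["regressions"].append(task_id)
        elif not before:
            groups["persistent_failures"].append(task_id)  # failed under both skills
        else:
            groups["stable_successes"].append(task_id)     # passed under both skills
    return groups


prev_epoch = {"t1": False, "t2": True, "t3": False, "t4": True}
curr_epoch = {"t1": True, "t2": False, "t3": False, "t4": True}
groups = group_outcomes(prev_epoch, curr_epoch)
```

Regressions and persistent failures are the groups that step-level reflection tends to miss, since each step only sees its own rollout batch.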
How to use it
The published pipeline (code at the reference link) is harness-agnostic through an adapter interface: an adapter builds train and evaluation batches, injects the current skill into the agent context, runs the native harness, and returns scored trajectories. The same optimizer therefore drives direct chat, Codex-style, and Claude Code-style loops, and all three consume the same best_skill.md format. Defaults that matter when reproducing the paper: four epochs, rollout batch 40, reflection minibatch 8, L_t = 4 with cosine decay to 2, strict-improvement gating, slow update over 20 sampled tasks, patch edit mode, and a 2:1:7 train/selection/test split at split_seed=42.1
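A deterministic 2:1:7 split is straightforward to reproduce in spirit. The sketch below assumes a seeded shuffle over sorted task ids with floor-rounded train and selection sizes; only the ratio and the seed value 42 come from the text above, and the released pipeline's actual shuffling procedure may differ, so splits produced here will not necessarily match the paper's.

```python
import random


def split_tasks(task_ids: list[str], ratios: tuple[int, int, int] = (2, 1, 7),
                seed: int = 42) -> tuple[list[str], list[str], list[str]]:
    """Return (train, selection, test) as disjoint lists covering every task id."""
    ids = sorted(task_ids)            # canonical order: input order cannot change the split
    random.Random(seed).shuffle(ids)  # local RNG, leaves the global random state alone
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_sel = len(ids) * ratios[1] // total
    # Any rounding remainder falls into the test split.
    return ids[:n_train], ids[n_train:n_train + n_sel], ids[n_train + n_sel:]


tasks = [f"task-{i:03d}" for i in range(100)]
train_split, sel_split, test_split = split_tasks(tasks)
```

Keeping the selection split disjoint from the train split is the property that matters: the gate only measures generalization if no rollout item is reused for acceptance.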
What the gate and the budget buy is testable without any LLM. The following executed example represents a skill as a feature vector whose score mixes true transferable quality with split-specific idiosyncrasy (the analogue of dev-set quirks an editor can overfit). It compares ungated greedy self-revision against a SkillOpt-shaped loop, and asserts the gated loop is monotone on held-out score, that rejected edits leave the skill untouched, and that only the ungated editor ends with a real train/validation gap:
```python
# skillopt_sim.py - validated: held-out gating plus a bounded edit budget beats
# ungated greedy self-revision. Algorithm-shape simulation in numpy (no LLMs):
# the "skill" is a feature vector, task idiosyncrasy is exploitable noise.
import numpy as np

rng = np.random.default_rng(42)
D = 24  # skill feature dimensions
S_STAR = rng.normal(0.0, 1.0, D)  # ideal procedural skill


def make_tasks(n: int) -> np.ndarray:  # one idiosyncrasy vector per task
    return rng.normal(0.0, 1.0, (n, D))


TRAIN, VAL = make_tasks(40), make_tasks(40)  # rollout split vs selection split


def score(skill: np.ndarray, tasks: np.ndarray) -> float:
    quality = float(np.exp(-np.sum((skill - S_STAR) ** 2) / D))  # true transferable quality
    idio = 0.25 * float(np.mean(np.tanh(tasks @ skill)))  # exploitable, split-specific
    return quality + idio


def propose(skill: np.ndarray, k: int) -> np.ndarray:
    """Reflection analogue: a noisy edit biased toward the ideal skill, touching k coords."""
    direction = 0.5 * (S_STAR - skill) + rng.normal(0.0, 0.8, D)
    mask = np.zeros(D)
    mask[rng.choice(D, size=k, replace=False)] = 1.0
    return skill + 0.4 * direction * mask


def ungated_run(steps: int) -> np.ndarray:  # accepts any train-score gain, no budget
    skill = np.zeros(D)
    for _ in range(steps):
        cand = propose(skill, D)  # unbounded: every coordinate may move
        if score(cand, TRAIN) > score(skill, TRAIN):
            skill = cand
    return skill


def gated_run(steps: int) -> tuple[np.ndarray, list[float], int]:
    """SkillOpt shape: cosine-decayed edit budget, strict held-out acceptance."""
    skill = np.zeros(D)
    val_curve = [score(skill, VAL)]
    rejected = 0
    for t in range(steps):
        budget = max(2, int(round(2 + 2 * np.cos(np.pi * t / (2 * steps)))))  # 4 -> 2
        cand = propose(skill, budget)
        before = skill.copy()
        if score(cand, VAL) > score(skill, VAL):  # strictly better or rejected
            skill = cand
        else:
            rejected += 1
            assert np.array_equal(skill, before), "rejected edit mutated the skill"
        val_curve.append(score(skill, VAL))
    return skill, val_curve, rejected


greedy = ungated_run(60)
gated, curve, n_rej = gated_run(60)
gap_greedy = score(greedy, TRAIN) - score(greedy, VAL)
gap_gated = score(gated, TRAIN) - score(gated, VAL)

assert all(b >= a for a, b in zip(curve, curve[1:])), "gated val score must be monotone"
assert n_rej > 0, "no edit was ever rejected; the gate was not exercised"
assert gap_greedy > 0.05, f"ungated should overfit train idiosyncrasy, gap={gap_greedy:.3f}"
assert gap_greedy > 2 * abs(gap_gated), "gating should shrink the train/val gap"

print(f"ungated: train={score(greedy, TRAIN):.3f} val={score(greedy, VAL):.3f} gap={gap_greedy:.3f}")
print(f"gated: train={score(gated, TRAIN):.3f} val={score(gated, VAL):.3f} gap={gap_gated:.3f} "
      f"rejected={n_rej}/60 val {curve[0]:.3f}->{curve[-1]:.3f}")
```
Output: ungated: train=0.909 val=0.845 gap=0.064 and gated: train=0.843 val=0.847 gap=-0.004 rejected=36/60 val 0.504->0.847. The gated loop rejects 60 percent of proposals and still matches the ungated editor on held-out score, with no train-side inflation. That rejection rate mirrors the paper's edit economy: gains of up to +39 points came from just 1 to 4 accepted edits, with the bulk of the optimizer's search rejected at the gate.4
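The simulation above only counts rejections. SkillOpt additionally keeps them as negative feedback, which can be sketched as a small epoch-local buffer. This is an illustration under assumptions: the `RejectedEditBuffer` class, its fields, and the rendered feedback text are hypothetical, not the released pipeline's format.

```python
from dataclasses import dataclass, field


@dataclass
class RejectedEditBuffer:
    """Epoch-local record of edits that failed the gate (hypothetical structure)."""
    entries: list[tuple[str, float]] = field(default_factory=list)

    def record(self, edit_summary: str, incumbent_score: float, candidate_score: float) -> None:
        """Store what was tried and how far the selection score dropped (0.0 for a tie)."""
        self.entries.append((edit_summary, incumbent_score - candidate_score))

    def as_feedback(self, limit: int = 5) -> str:
        """Render the most damaging rejections as text for the next reflection call."""
        worst = sorted(self.entries, key=lambda e: e[1], reverse=True)[:limit]
        return "\n".join(f"- rejected: {summary} (score drop {drop:.3f})"
                         for summary, drop in worst)

    def reset(self) -> None:
        """Clear at the epoch boundary, since the buffer is epoch-local."""
        self.entries.clear()


buffer = RejectedEditBuffer()
incumbent = 0.70
for summary, candidate in [("add rule A", 0.65), ("replace rule B", 0.70), ("append rule C", 0.74)]:
    if candidate > incumbent:  # strict gate: ties are rejected
        incumbent = candidate
    else:
        buffer.record(summary, incumbent, candidate)
feedback = buffer.as_feedback()
```

Here the tie on rule B lands in the buffer alongside the outright regression on rule A, so the next reflection call sees both as directions not worth retrying.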
How to develop with it
- Write the scorer first. The gate needs r(s) in [0, 1] per task from the benchmark's native evaluator; if the score is noisy, size the selection split so a strict improvement outruns the noise before trusting any acceptance.
- Give the optimizer real traces, not just answers. The Codex adapter feeds back a compact execution trace so the optimizer learns from what the agent actually did; reflection over final answers alone reproduces anecdotal prompt tweaking.
- Start from a minimal seed skill. Initial skills in the paper were 16 to 516 tokens; the loop grows them to 379 to 1,995 tokens. Seeding with a bloated hand-written document wastes budget on deletions.
- Prefer patch mode. Bounded local edits preserve continuity and keep the rejected-edit history meaningful; the no-budget ablation ("without lr") lost ground on all three ablated benchmarks.5
- Use the strongest optimizer model available for training. Optimizer strength is a training-time lever with zero deployment cost; a target-matched optimizer still recovered 56 to 74 percent of the frontier optimizer's gain, so the loop, not distillation, carries much of the value.6
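The advice under "Write the scorer first" to size the selection split against score noise can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a pass/fail scorer and treats the incumbent and candidate evaluations as independent binomial samples, which is a simplification (in practice both are scored on the same items); `min_selection_size` is a hypothetical helper, and this rule of thumb is not from the paper.

```python
import math


def min_selection_size(pass_rate: float, min_gain: float, z: float = 1.0) -> int:
    """Smallest n for which a gain of `min_gain` is at least z standard errors
    of the difference between two independent pass rates near `pass_rate`."""
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    if min_gain <= 0.0:
        raise ValueError("min_gain must be positive")
    variance = pass_rate * (1.0 - pass_rate)  # per-task Bernoulli variance
    # SE of the difference is sqrt(2 * variance / n); solve min_gain >= z * SE for n.
    return math.ceil(2.0 * z * z * variance / min_gain ** 2)


n_loose = min_selection_size(pass_rate=0.5, min_gain=0.125)           # one standard error
n_strict = min_selection_size(pass_rate=0.5, min_gain=0.125, z=2.0)   # two standard errors
```

The required size grows with the inverse square of the gain, so halving the smallest improvement worth detecting quadruples the selection split.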
How to maintain it
- Keep the audit trail. edit_apply_report.json plus the gated history make every line of the deployed skill attributable; treat the skill file like reviewed code, with the report as its blame log.
- Re-gate after any environment change. A skill is validated against one model, harness, and task distribution; on upgrade, rerun the selection split rather than assuming the artifact carries over, then let transfer results set expectations: cross-model rows were uniformly positive, and a Codex-trained spreadsheet skill moved to Claude Code at +59.7 points over that harness's no-skill baseline (22.1 to 81.8).6
- Watch for staleness the same way as for context and memory: the skill encodes training-distribution heuristics, and the paper's limitation section says exactly this; schedule periodic re-evaluation on fresh held-out data.
- Version the protected region separately. The slow-update block carries cross-epoch lessons; a maintenance edit that flattens the document and loses the protected-region contract removes the mechanism whose ablation cost 22.5 points on SpreadsheetBench.5
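A maintenance edit can be checked mechanically for whether it kept the protected block intact. The sketch below assumes the markers appear as the bare strings SLOW_UPDATE_START and SLOW_UPDATE_END somewhere in the file (their exact surrounding syntax in the released skill format is not specified here), and `protected_block` and `preserves_protected_region` are hypothetical helpers.

```python
from typing import Optional

START, END = "SLOW_UPDATE_START", "SLOW_UPDATE_END"


def protected_block(skill_text: str) -> Optional[str]:
    """Return the text between the two markers, or None if the region is missing or malformed."""
    start = skill_text.find(START)
    end = skill_text.find(END)
    if start == -1 or end == -1 or end < start:
        return None
    return skill_text[start + len(START):end].strip()


def preserves_protected_region(old_skill: str, new_skill: str) -> bool:
    """True when the new skill carries the old protected block unchanged."""
    old_block = protected_block(old_skill)
    if old_block is None:
        return True  # nothing to protect yet (no slow update has run)
    return protected_block(new_skill) == old_block


old = "Rule 1.\nSLOW_UPDATE_START\nPrefer exact cell references.\nSLOW_UPDATE_END\nRule 2."
step_edit = old.replace("Rule 2.", "Rule 2, revised.")                # touches only step-level text
flattened = "Rule 1.\nPrefer exact cell references.\nRule 2."         # markers lost
ok_edit = preserves_protected_region(old, step_edit)
bad_edit = preserves_protected_region(old, flattened)
```

Running a check like this in review catches the flattening case even when the guidance text itself survives, since the contract depends on the markers as much as on the content.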
How to run it in production
- Deployment is file distribution. The output is a static best_skill.md consumed identically by direct chat and coding harnesses; ship it like any other reviewed artifact, with provenance and rollback (the previous accepted skill is a complete fallback).
- Zero inference overhead, bounded context cost. The artifact adds under 2,000 tokens of context and no extra model calls; budget it like a system-prompt change, not like a model change.
- Gate skill promotion like an eval gate. Treat acceptance of a new skill version as a release decision on held-out data, the same discipline as evaluating agents, and defend the scorer against gaming per evaluation integrity, since the skill optimizer will exploit any scoring loophole the same way an RL policy does.
- Skills that edit themselves in production are a different risk class. SkillOpt trains offline and ships a frozen document; if the loop is ever run against live traffic, the controls in governing self-modifying agents apply.
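The provenance, rollback, and promotion points above can be combined into a small release record. This is a sketch under assumptions, not part of the SkillOpt pipeline: `skill_manifest` and `may_promote` are hypothetical helpers, the manifest fields are illustrative, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
import hashlib
from typing import Optional


def skill_manifest(skill_text: str, selection_score: float,
                   previous_sha256: Optional[str] = None) -> dict:
    """Describe one released skill version; `previous_sha256` is the rollback target."""
    return {
        "sha256": hashlib.sha256(skill_text.encode("utf-8")).hexdigest(),
        "approx_tokens": len(skill_text.split()),  # crude proxy: whitespace-separated words
        "selection_score": selection_score,
        "previous_sha256": previous_sha256,
    }


def may_promote(candidate: dict, incumbent: dict, max_tokens: int = 2000) -> bool:
    """Release check: strictly better held-out score and within the context budget."""
    return (candidate["selection_score"] > incumbent["selection_score"]
            and candidate["approx_tokens"] <= max_tokens)


v1 = skill_manifest("Read the task. Answer in the required format.", selection_score=0.62)
v2 = skill_manifest("Read the task. Verify units. Answer in the required format.",
                    selection_score=0.71, previous_sha256=v1["sha256"])
promote_v2 = may_promote(v2, v1)
```

Chaining each manifest to the hash of its predecessor gives a rollback path that does not depend on file names or timestamps.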
Failure modes
- Selection-split overfitting. The gate optimizes toward D_sel; a small or unrepresentative selection split lets edits pass that fail on test. The paper's epoch checkpoints tracked held-out test alongside selection to confirm the gate selects generalizing skills; replicate that check.
- Noisy scores promoting junk edits. With a noisy evaluator, "strictly greater" is a coin flip near the margin; average repeated trials or enlarge the split until the acceptance threshold exceeds score noise.
- Skill bloat without a budget. Unbounded rewrites erase working rules and stack incompatible instructions; the without-budget ablation regressed on all three benchmarks, and bloat also erodes the auditability that justifies the method.5
- Assuming prompt playbooks transfer. Cross-model and cross-harness transfer was positive in the paper, but cross-benchmark gains were small (+1.3 to +3.7 on Omni-MATH), and a skill tuned to one model's failure modes can be dead weight elsewhere; re-gate on the target cell before shipping.6
- Reward hacking of the scorer. The optimizer model will encode format tricks that satisfy the evaluator rather than the task; dense rubric or exact-match scorers with adversarial review of accepted edits are the countermeasure (see evaluation integrity).
- Silent gate bypass. Any code path that applies an edit without a selection-split win (for convenience, or on ties) converts the method back into uncontrolled self-revision, the failure class the strict gate exists to prevent.
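For the noisy-score failure mode, one possible acceptance rule averages repeated trials and requires the mean gain to clear a noise margin. This is a sketch of a common statistical heuristic, not the paper's gate (which is strict improvement on the selection score); `noisy_gate` and the `z` threshold are assumptions. It needs at least two trials per skill, and with a deterministic scorer it reduces to the strict-improvement rule, ties included.

```python
import statistics


def noisy_gate(incumbent_trials: list[float], candidate_trials: list[float],
               z: float = 1.0) -> bool:
    """Accept only if the mean gain exceeds z standard errors of the difference in means."""
    gain = statistics.mean(candidate_trials) - statistics.mean(incumbent_trials)
    std_err = (statistics.variance(incumbent_trials) / len(incumbent_trials)
               + statistics.variance(candidate_trials) / len(candidate_trials)) ** 0.5
    return gain > z * std_err


incumbent_runs = [0.70, 0.72, 0.68]
within_noise = noisy_gate(incumbent_runs, [0.71, 0.69, 0.73])   # gain of about 0.01
clear_gain = noisy_gate(incumbent_runs, [0.80, 0.82, 0.78])     # gain of about 0.10
tie = noisy_gate([0.7, 0.7], [0.7, 0.7])                        # zero variance, zero gain
```

The first candidate is rejected even though its mean is higher, which is the behaviour a plain strict comparison on a single noisy run cannot provide.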
References
- Yang et al., SkillOpt: Executive Strategy for Self-Evolving Agent Skills (arXiv 2605.23904): https://arxiv.org/abs/2605.23904
- Hugging Face paper page: https://huggingface.co/papers/2605.23904
- SkillOpt code (Microsoft): https://aka.ms/SkillOpt
- Agrawal et al., GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (arXiv 2507.19457): https://arxiv.org/abs/2507.19457
- Yuksekgonul et al., TextGrad: Automatic "Differentiation" via Text (arXiv 2406.07496): https://arxiv.org/abs/2406.07496
- Alzubi et al., EvoSkill: Automated Skill Discovery for Multi-Agent Systems (arXiv 2603.02766): https://arxiv.org/abs/2603.02766
Related: Self-improving harnesses · Use as an agent skill · Context and memory · Evaluating agents · Evaluation integrity and anti-gaming · Governing self-modifying agents · Autonomous experimentation loops
1. SkillOpt (arXiv 2605.23904), Sections 3.1-3.4 and 4: skill as external state of a frozen model; rollout batch 40, reflection minibatch 8 with 16 analyst workers and up to 3 refinement rounds; hierarchical merge with failure priority; patch ops append/insert_after/replace/delete; L_t = 4 cosine-decayed to 2 with constant/linear/cosine/autonomous schedules; four epochs; 2:1:7 splits at split_seed=42; adapters for direct chat, Codex, and Claude Code harnesses.
2. SkillOpt, Sections 3.5 and 4.2: acceptance requires strictly greater selection-split score (ties rejected); edit_apply_report.json records per-edit accept/skip; rejected edits and their score drops feed an epoch-local buffer; the slow update writes into a protected SLOW_UPDATE region that step edits cannot touch, and still passes the gate.
3. SkillOpt, Table 1 and Section 4.1: best or tied on 52/52 (model, benchmark, harness) cells; GPT-5.5 direct chat average 58.8 to 82.3 (+23.5), +5.4 over the per-cell oracle baseline; +24.8 on Codex and +19.1 on Claude Code over no skill; per-model direct-chat average gain about +17.6 across GPT-5.5/5.4/5.4-mini/5.4-nano/5.2, Qwen3.5-4B, Qwen3.6-35B-A3B; SearchQA 77.7 to 87.3, SpreadsheetBench 41.8 to 80.7, OfficeQA 33.1 to 72.1, DocVQA 78.8 to 91.2, LiveMathematicianBench 37.6 to 66.9, ALFWorld 83.6 to 95.5 on GPT-5.5.
4. SkillOpt, Table 6 and Section 4.4: final skills 379 to 1,995 tokens (median about 920) from initial 16 to 516; 1 to 4 accepted edits per benchmark (OfficeQA +39.0 from one edit); training cost 0.6M to 46.4M tokens per absolute test point; learned rules are procedural, not instance-specific.
5. SkillOpt, Tables 2-3 and Section 4.2: no-budget ("without lr") scores 84.6/75.7/57.3 vs bounded-budget defaults; removing the rejected-edit buffer costs 1.6/4.6/2.4 points on SearchQA/SpreadsheetBench/LiveMath; removing slow update plus meta skill drops SpreadsheetBench 77.5 to 55.0; results robust to rollout batch, minibatch, and schedule choices.
6. SkillOpt, Tables 4-5 and Section 4.3: all cross-model rows positive (some transferred skills beat in-domain references); Codex to Claude Code SpreadsheetBench 22.1 to 81.8 (+59.7), reverse direction +43.6; OlympiadBench to Omni-MATH +3.7/+1.8/+1.3 across GPT-5.4/mini/nano; target-matched optimizer recovers 56 to 74 percent of the frontier optimizer's gain.
7. SkillOpt, Appendix B: requires reliable feedback signals; training cost amortized only on reuse; single-skill design may not fit heterogeneous domains; learned heuristics are training-distribution specific and need held-out evaluation before transfer.