Autonomous experimentation loops
Scope: the closed loop that automates ML experimentation. A proposer suggests the next experiment (a hyperparameter set or a source-code change), a bounded trial runs it, an evaluator returns a scalar, and the loop keeps the winner and proposes again. It covers the proposer spectrum (grid/random, Bayesian samplers, LLM-as-optimizer, LLM code mutation, hybrid), how trials are executed and cancelled, and how to keep a long campaign comparable. This page is the training-automation counterpart to AI-assisted performance optimization and to self-improving harnesses; the evaluator side is covered in evaluation integrity and the trial-cancellation side in learning-curve extrapolation.
The runnable Python blocks below (the loop keep/discard rule, the pruning contract, the comparability guard, and the discovery evaluator) are executed and asserted, including adversarial cases; they need only numpy or the standard library. The Optuna, Modal, and SkyPilot snippets are reference templates: pin versions and validate before production use.
What it is
An autonomous experimentation loop treats "run an ML experiment" as a repeatable action and closes a control loop around it: propose, run a bounded trial, evaluate, keep or discard, propose again, with every trial appended to a history the proposer can read. It is the same generate-then-verify-then-refine structure behind AI-assisted kernel work (AI-assisted performance optimization), lifted from single-kernel autotuning up to whole training runs. The proposer is the only part that changes across systems; the loop, the frozen evaluator, the keep/discard rule, and the trial infrastructure are shared machinery. Classical AutoML fixed the proposer to a sampler; the current wave lets an LLM propose, which extends the search space from a numeric grid to arbitrary source-code edits.
The two design decisions that dominate every system are what a proposal is (a point in a hyperparameter space, or an edit to the training code) and who owns the evaluator (never the proposer; see evaluation integrity).
Why use it
- It turns experimentation into an automated search. Once "run one experiment" is an action with a scalar reward, the whole campaign is a loop that runs unattended overnight instead of a human tweaking configs by hand.
- The proposer is swappable behind one interface. The same loop hosts a grid baseline, a Bayesian sampler, or an LLM diff-proposer, so you can start cheap and escalate without rewriting the harness.
- It widens the search space. Letting an LLM propose lifts search from a numeric grid to arbitrary source-code edits, reaching designs a sampler cannot express (new function bodies, new algorithms).
- It has documented wins. OPRO-optimized prompts beat human-designed prompts by up to 8% on GSM8K and up to 50% on Big-Bench Hard [2]; FunSearch discovered new cap-set constructions and better bin-packing heuristics [1]; AlphaEvolve reports gains on data-center scheduling, matrix-multiplication algorithms, and GPU kernels [3].
- It is auditable and replayable. Every trial (kept or discarded) lands in a ledger with its inputs, score, and artifact, so a reported result is reproducible and the search itself is inspectable.
When to use it (and when not)
Match the proposer to what actually needs to change:
- Grid or random search when the space is a handful of numeric knobs. Random search is competitive with early Bayesian methods on high-dimensional spaces and is trivially parallel, so it is the honest baseline you must beat before claiming anything smarter helps.
- A Bayesian, bandit, or evolutionary sampler when the space is numeric but large enough that modeling the score surface pays off (HPO territory).
- An LLM over hyperparameters when the task is easier to describe in natural language than to grid, or when past configurations and their scores are useful context for the next guess [2].
- An LLM over code when the thing that must change is the training code itself (a new function body, a new algorithm), not a number in a config [1][3][4].
- Hybrid when both axes are available: start in the cheap regime and escalate to code mutation only when the param search stalls.
When not to reach for it, or a narrower variant:
- Do not skip the baseline. If a grid or random control has not been run, an LLM proposer's apparent gain is search budget, not intelligence. Whether an LLM proposer beats a strong Bayesian sampler is task-dependent and still contested; validate on the target workload, not a generic benchmark.
- Do not run code mutation without isolation. A code-mutating loop executes untrusted, model-written code; without a container or micro-VM per trial it is a remote-code-execution hazard, not a search.
- Mind the search-axis boundary. DSPy's optimizers use an LLM proposer but search over prompt instructions and few-shot demonstrations, not training hyperparameters or source code. Same LLM-as-proposer idea, different search axis; do not conflate the two [10].
Architecture
The loop is a control cycle: a proposer draws the next candidate, a bounded trial executes it, a frozen evaluator scores it, the keep/discard rule updates the best, and the whole record feeds the next proposal. A doomed trial is cancelled early and still recorded. Only the proposer block changes across systems.
flowchart LR
subgraph PROP["Proposer (pick one)"]
S["Sampler<br/>grid / random / Bayesian / evolutionary"]
LP["LLM over params<br/>(reads history)"]
LD["LLM over code<br/>(unified diff)"]
end
PROP --> EXEC["Bounded trial<br/>(train under a budget)"]
EXEC --> EVAL["Evaluator<br/>(frozen metric)"]
EVAL --> KEEP{"Beats best?"}
KEEP -->|"yes"| ADOPT["Keep: version artifact"]
KEEP -->|"no"| DROP["Discard"]
ADOPT --> HIST["Experiment history / ledger"]
DROP --> HIST
HIST -->|"context for next proposal"| PROP
EXEC -.->|"doomed run"| CANCEL["Early-stop / cancel"]
CANCEL -.-> HIST
Proposers form a ladder from narrow-and-cheap to broad-and-expensive, and every rung plugs into the same downstream loop:
- Grid / random. Exhaustive combinations or seeded uniform draws. Strong, honest baselines; random search is competitive with early Bayesian methods on high-dimensional spaces and is trivially parallel.
- Bayesian / bandit / evolutionary samplers. TPE, CMA-ES, Gaussian-process Bayesian optimization, and population-based training use past scores (through a surrogate model of the score surface or an evolving population) to propose the next point. This is the classical HPO core of Ray Tune and Optuna; the proposer is a numeric sampler, never a language model [7].
- LLM over hyperparameters. The optimization task is described in natural language and the LLM proposes the next configuration from a prompt containing past configurations and their scores. OPRO ("LLM as optimizer") applied exactly this to linear regression, the travelling-salesman problem, and prompt strings; its optimized prompts beat human-designed ones by up to 8% on GSM8K and up to 50% on Big-Bench Hard [2].
- LLM over code. The proposal is a source-code change (a new function body or a unified diff against a mutable file) executed in the next trial. FunSearch evolves the body of a Python function against a frozen evaluator and a program database, and discovered new constructions for the cap-set problem and better bin-packing heuristics [1]. AlphaEvolve generalizes this to an evolutionary agent that edits whole programs with a Gemini ensemble, reporting gains on data-center scheduling, matrix-multiplication algorithms, and GPU kernels [3]. AIDE frames ML engineering itself as code optimization and runs a tree search over solutions, drafting, debugging, and benchmarking code toward a target metric [4].
- Hybrid. Start in the cheap regime (param search) and escalate to code mutation only when the param search stalls, falling back if code proposals keep failing to apply. This spends LLM budget where a numeric sampler is exhausted, and is the practical default when both axes are available.
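The hybrid rung's escalate-and-fall-back rule can be sketched as a small state machine. This is a minimal illustration, not a prescribed implementation: the class name `HybridPolicy`, the thresholds `stall_patience` and `max_apply_failures`, and the mode labels are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class HybridPolicy:
    """Hypothetical escalation policy: decides which proposer regime runs next."""
    stall_patience: int = 5       # param trials without improvement before escalating
    max_apply_failures: int = 3   # consecutive code proposals that fail to apply
    mode: str = "param"           # start in the cheap regime
    _stalled: int = 0
    _failures: int = 0

    def record(self, *, improved: bool, applied: bool = True) -> str:
        """Update counters from the last trial; return the mode for the next one."""
        if self.mode == "param":
            self._stalled = 0 if improved else self._stalled + 1
            if self._stalled >= self.stall_patience:
                self.mode, self._failures = "code", 0      # param search stalled
        else:
            self._failures = 0 if applied else self._failures + 1
            if self._failures >= self.max_apply_failures:
                self.mode, self._stalled = "param", 0      # code proposals keep failing
        return self.mode


policy = HybridPolicy(stall_patience=2, max_apply_failures=2)
modes = [
    policy.record(improved=True),
    policy.record(improved=False),
    policy.record(improved=False),                  # second stall: escalate
    policy.record(improved=False, applied=False),
    policy.record(improved=False, applied=False),   # second failed apply: fall back
]
print(modes)
```

Keeping the policy separate from both proposers means the loop only asks "which regime next?" and the proposers stay swappable.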
How to run the loop
Strip out the proposer and every system reduces to the same skeleton: draw a candidate, execute it under a budget, score it against a fixed metric, record the result, and let the record inform the next draw. Writing the loop this way keeps the proposer swappable: a grid baseline, a Bayesian sampler, and an LLM diff-proposer all satisfy the same interface.
The runnable block below is that skeleton, executed and asserted. It checks that the kept winner equals the brute-force optimum of the search (equivalence to a slow reference), that best-so-far is monotone in the objective direction, and, as the adversarial case, that a proposal advertising a fake self-reported score cannot game the loop because scoring goes through a frozen evaluator the proposal cannot reach:
# Proposer-agnostic experimentation loop, executed and asserted (numpy-only).
# Asserts: kept winner == brute-force optimum (equivalence), best-so-far is
# monotone, and an evaluator-capture attempt (fake self-reported score) is ignored.
from __future__ import annotations
import numpy as np
def run_loop(candidates, evaluate, *, maximize):
    """Draw candidates in order, score each with the frozen evaluator, keep the
    winner honoring the objective direction. Returns (best_cand, best_score, history)."""
    better = (lambda s, b: b is None or s > b) if maximize else (lambda s, b: b is None or s < b)
    best_cand, best_score, history = None, None, []
    for cand in candidates:
        score = float(evaluate(cand))  # FROZEN metric: a closure the proposal cannot edit
        kept = better(score, best_score)
        if kept:
            best_cand, best_score = cand, score
        history.append((cand, score, kept))
    return best_cand, best_score, history
rng = np.random.default_rng(0)
grid = rng.uniform(-5.0, 5.0, size=200)
def objective(x):  # peaked at x = 1.3
    return -((x - 1.3) ** 2)
# happy path: the kept winner equals the brute-force argmax (equivalence check)
best_cand, best_score, history = run_loop(grid, objective, maximize=True)
brute = grid[int(np.argmax([objective(x) for x in grid]))]
assert best_cand == brute, "kept winner must equal brute-force optimum"
assert best_score == objective(brute)
# monotonicity: best-so-far is strictly increasing on kept trials, never on drops
running = -np.inf
for _, score, kept in history:
    if kept:
        assert score > running, "a kept trial must strictly improve best-so-far"
        running = score
    else:
        assert score <= running, "a discarded trial must not beat best-so-far"
# direction flip: the same loop minimizing must find the true minimum
mn_cand, _, _ = run_loop(grid, lambda x: (x - 1.3) ** 2, maximize=False)
assert mn_cand == grid[int(np.argmin([(x - 1.3) ** 2 for x in grid]))]
# adversarial: an evaluator-capture attempt. A candidate that advertises a fake
# score is powerless, because run_loop scores it with the frozen evaluator.
class CheatCandidate(float):
    fake_score = 1e9
honest = run_loop([CheatCandidate(2.0)], objective, maximize=True)[1]
assert honest == objective(2.0), "frozen evaluator must ignore self-reported score"
assert honest != CheatCandidate.fake_score
print(f"kept={best_cand:.4f} score={best_score:.6f} brute={brute:.4f} "
      f"equivalence=True monotone=True capture_ignored=True")
The two design decisions surface directly in this code: evaluate is a frozen closure the proposal cannot touch, and candidates may equally be numeric points or code diffs, since run_loop never inspects their internals.
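The loop that takes a fixed iterable of candidates does not yet feed the history back to the proposer. A history-aware variant might look like the sketch below. The `Proposer` protocol, `RandomProposer`, `PerturbBestProposer`, and `run_with_proposer` are hypothetical names chosen for illustration; the toy "jitter the current best" proposer stands in for any sampler or LLM that reads past trials.

```python
from __future__ import annotations

import random
from typing import Any, Callable, Protocol

# (candidate, score, kept): one record per trial, kept or discarded.
History = list[tuple[Any, float, bool]]


class Proposer(Protocol):
    """Hypothetical interface: anything with propose(history) can drive the loop."""

    def propose(self, history: History) -> Any: ...


class RandomProposer:
    """Seeded uniform baseline; ignores the history entirely."""

    def __init__(self, low: float, high: float, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._low, self._high = low, high

    def propose(self, history: History) -> float:
        return self._rng.uniform(self._low, self._high)


class PerturbBestProposer(RandomProposer):
    """Toy history-aware proposer: jitter the current best, clipped to the range."""

    def __init__(self, low: float, high: float, sigma: float = 0.5, seed: int = 0) -> None:
        super().__init__(low, high, seed)
        self._sigma = sigma

    def propose(self, history: History) -> float:
        kept = [cand for cand, _, was_kept in history if was_kept]
        if not kept:
            return super().propose(history)  # nothing to exploit yet
        guess = kept[-1] + self._rng.gauss(0.0, self._sigma)  # last kept == best so far
        return min(self._high, max(self._low, guess))


def run_with_proposer(proposer: Proposer, evaluate: Callable[[Any], float],
                      *, n_trials: int, maximize: bool):
    """Keep/discard loop in which every draw sees the history so far."""
    best_cand, best_score, history = None, None, []
    for _ in range(n_trials):
        cand = proposer.propose(history)
        score = float(evaluate(cand))  # scoring stays outside the proposer
        kept = best_score is None or (score > best_score if maximize else score < best_score)
        if kept:
            best_cand, best_score = cand, score
        history.append((cand, score, kept))
    return best_cand, best_score, history
```

Swapping `RandomProposer(-5.0, 5.0)` for `PerturbBestProposer(-5.0, 5.0)` changes nothing else in the call, which is the point of keeping the proposer behind one method.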
How to choose and integrate a proposer
Pick a rung of the ladder above by what needs to change, then plug it in behind the proposer.propose(history) interface. The high-value integration is the LLM-over-code proposer, because it is the one that can invent an algorithm rather than tune a number, and it is also the one whose output must be certified before it is trusted.
The core contract there is: a code-mutating loop keeps an evolved program only when the frozen evaluator certifies it is correct against a reference and a genuine improvement on the cost being minimized. The runnable block below validates exactly that contract on a concrete AlphaEvolve-style artifact, a 2x2 matrix multiply that uses 7 scalar multiplications instead of the naive 8 (Strassen). It asserts the discovered scheme reproduces the naive product over random matrices (equivalence), that the evaluator accepts it because it is correct and cheaper, and, adversarially, that a faster-but-incorrect candidate is rejected no matter how cheap it claims to be:
# Validates the CORE math of an LLM-over-code proposal (FunSearch / AlphaEvolve):
# a discovery is kept only if a FROZEN evaluator certifies it correct against a
# reference AND cheaper. Concrete artifact: a 2x2 matmul using 7 mults, not 8.
# Asserts equivalence to the naive product, acceptance of the real discovery, and
# rejection of a faster-but-wrong candidate. numpy-only, executed and asserted.
from __future__ import annotations
import numpy as np
def naive_2x2(A, B):  # reference: 8 scalar multiplications
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    return np.array([[a * e + b * g, a * f + b * h],
                     [c * e + d * g, c * f + d * h]]), 8
def strassen_2x2(A, B):  # "discovered" candidate: 7 multiplications
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h); m2 = (c + d) * e; m3 = a * (f - h); m4 = d * (g - e)
    m5 = (a + b) * h; m6 = (c - a) * (e + f); m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]]), 7
def wrong_fast(A, B):  # adversarial: cheaper but incorrect
    return A * B, 4  # elementwise, not a matmul
def frozen_evaluator(candidate, incumbent_cost, trials=2000):
    """The evaluator the proposer cannot edit: certify correctness vs the naive
    reference over random inputs, then require a strictly lower cost."""
    rng = np.random.default_rng(7)
    for _ in range(trials):
        A = rng.standard_normal((2, 2)); B = rng.standard_normal((2, 2))
        got, cost = candidate(A, B)
        ref, _ = naive_2x2(A, B)
        if not np.allclose(got, ref, rtol=1e-12, atol=1e-12):
            return False, cost  # incorrect -> reject regardless of speed
    return cost < incumbent_cost, cost
# equivalence: the 7-mult scheme reproduces the naive product exactly (to fp tol)
rng = np.random.default_rng(0)
for _ in range(10000):
    A = rng.standard_normal((2, 2)); B = rng.standard_normal((2, 2))
    fast, _ = strassen_2x2(A, B); ref, _ = naive_2x2(A, B)
    assert np.allclose(fast, ref, rtol=1e-12, atol=1e-12), "discovered scheme must match reference"
# the frozen evaluator accepts the real discovery (correct AND 7 < 8)
accept, cost = frozen_evaluator(strassen_2x2, incumbent_cost=8)
assert accept is True and cost == 7, "correct-and-cheaper discovery must be accepted"
# adversarial: a faster-looking but WRONG candidate is rejected by the evaluator,
# even though its self-claimed cost (4) beats the incumbent
rej, _ = frozen_evaluator(wrong_fast, incumbent_cost=8)
assert rej is False, "evaluator must reject an incorrect candidate no matter how cheap"
# boundary: a correct candidate that is NOT cheaper is not an improvement
same, same_cost = frozen_evaluator(naive_2x2, incumbent_cost=8)
assert same is False and same_cost == 8, "no strict improvement -> not adopted"
print(f"discover: equivalence=True accepted_cost={cost} wrong_rejected=True "
      f"no_improvement_rejected=True")
A useful scope boundary when wiring proposers: DSPy's optimizers use an LLM proposer but search over prompt instructions and few-shot demonstrations, not training hyperparameters or source code. Same LLM-as-proposer idea, different search axis [10].
How to execute and cancel trials
Each trial is a bounded unit of work: a training run under a wall-clock or step budget that emits a metric. Two execution details recur.
Per-trial isolation. Code-mutating loops run untrusted, model-written code, so each trial needs an isolated execution context: a container or micro-VM per trial. Serverless GPU platforms make each trial a fresh GPU container (gpu='A100'-style) with its own image and lifecycle [8]; multi-cloud orchestrators provision (or reuse) a cluster per job from a resource spec [9]. This "per-trial container on a rented GPU" pattern is what lets a campaign fan out across capacity it does not own (GPU consumption models, provider landscape).
Cooperative cancellation. When a trial is forecast to lose, stopping it cooperatively (the trial polls a control signal and exits at a safe point) is cleaner than kill -9: it lets the trial flush partial results and frees the slot deterministically. This is the Optuna pruning contract (the trial calls report() then should_prune() and raises) generalized to a remote container, and the decision itself comes from learning-curve extrapolation [7].
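The polling pattern can be sketched independently of any framework. In this illustration the names `ControlSignal`, `run_trial`, `metric_at`, and `on_report` are hypothetical, and the stop rule in `controller` is a toy assumption; a remote setup would replace the in-process flag with whatever control channel the platform offers.

```python
from __future__ import annotations

import threading
from typing import Callable


class ControlSignal:
    """Hypothetical stop flag; threading.Event makes it safe to set from another thread."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def request_stop(self) -> None:
        self._stop.set()

    def stop_requested(self) -> bool:
        return self._stop.is_set()


def run_trial(signal: ControlSignal, metric_at: Callable[[int], float], max_steps: int,
              on_report: Callable[[int, float], None]) -> dict:
    """Run up to max_steps rungs; after each one, publish the metric and poll the signal."""
    partial: list[float] = []
    for step in range(max_steps):
        value = metric_at(step)       # stand-in for one bounded rung of training
        partial.append(value)
        on_report(step, value)        # the controller may request a stop in response
        if signal.stop_requested():   # safe point: partial results are already recorded
            return {"status": "cancelled", "steps_done": step + 1, "partial": partial}
    return {"status": "completed", "steps_done": max_steps, "partial": partial}


signal = ControlSignal()


def controller(step: int, value: float) -> None:
    # Toy rule (an assumption): give up if the metric is still below 0.5 at step 2.
    if step >= 2 and value < 0.5:
        signal.request_stop()


result = run_trial(signal, metric_at=lambda s: 0.1 * s, max_steps=10, on_report=controller)
print(result["status"], result["steps_done"])
```

The trial returns its partial metrics even when cancelled, so the ledger can still record the doomed run.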
The runnable block below validates the pruning decision itself: the Optuna-style median rule that keeps trials beating the rung's median and tells the rest to exit. It asserts a doomed trial is pruned, a clear winner survives, the boundary trial exactly at the median is kept (not pruned), a warm-up or empty rung never prunes blind, and the vectorized rule matches an independently coded slow reference across a randomized sweep in both objective directions:
# Cooperative-cancellation / pruning contract, executed and asserted (numpy-only).
# Optuna-style median rule at one rung: prune a trial strictly worse than the
# median of completed peers; keep it otherwise; never prune blind. Asserts a
# doomed trial is pruned, a winner survives, the exact-median boundary is kept,
# warm-up/empty rungs are safe, and it matches a hand-rolled reference both ways.
from __future__ import annotations
import numpy as np
def should_prune(value, peers, *, maximize, warmup_ok):
    if not warmup_ok or len(peers) == 0:
        return False  # no basis yet: never prune blind
    med = float(np.median(peers))
    return value < med if maximize else value > med  # strictly worse than median -> prune
def reference_median(xs):  # hand-rolled median, no np.median call
    s = sorted(float(x) for x in xs); n = len(s); mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0
def reference_prune(value, peers, maximize, warmup_ok):  # independently-coded oracle
    if not warmup_ok or len(peers) == 0:
        return False
    med = reference_median(peers)
    return value < med if maximize else value > med
peers = np.array([0.30, 0.40, 0.50, 0.60, 0.70])  # median = 0.50
assert should_prune(0.31, peers, maximize=True, warmup_ok=True) is True  # doomed: pruned
assert should_prune(0.69, peers, maximize=True, warmup_ok=True) is False  # winner: survives
assert should_prune(0.50, peers, maximize=True, warmup_ok=True) is False  # boundary == median: kept
assert should_prune(0.01, peers, maximize=True, warmup_ok=False) is False  # warm-up: safe
assert should_prune(0.01, np.array([]), maximize=True, warmup_ok=True) is False  # empty: safe
rng = np.random.default_rng(1)  # equivalence to the slow reference
for _ in range(5000):
    k = int(rng.integers(1, 12)); pr = rng.uniform(0.0, 1.0, size=k)
    v = float(rng.uniform(0.0, 1.0)); mx = bool(rng.integers(0, 2))
    assert should_prune(v, pr, maximize=mx, warmup_ok=True) == reference_prune(v, pr, mx, True)
loss_peers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # loss: lower is better, median = 3.0
assert should_prune(4.9, loss_peers, maximize=False, warmup_ok=True) is True
assert should_prune(1.1, loss_peers, maximize=False, warmup_ok=True) is False
print("prune: doomed=pruned winner=kept boundary=kept warmup=safe empty=safe "
      "reference_equiv=True both_directions=True")
In a real HPO framework you do not hand-write this; Optuna implements it. The pruning contract is a reference template (Optuna is not vendored here; pin the version). The trial code raises, rather than the framework killing it:
# Optuna pruning contract (reference template; pin optuna, validate before prod).
# The CORE median-rule math this relies on is validated by the numpy block above.
# build_model, train_one_rung, final_metric, and max_steps are placeholders you supply.
import optuna
def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    model = build_model(lr)
    for step in range(max_steps):
        val = train_one_rung(model, step)  # partial metric at this rung
        trial.report(val, step)  # publish to the pruner
        if trial.should_prune():  # the configured pruner's decision across peers
            raise optuna.TrialPruned()  # trial exits cooperatively; no kill -9
    return final_metric(model)
study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100)
Diff-proposing and hybrid proposers usually stay serial even when param trials run K-at-a-time, because concurrent code edits fight over the same mutable file; batch the LLM into one call that returns K diverse proposals rather than K racing edits.
How to run it in production
Choose the trial runtime by the capacity you have. On serverless GPU, each trial is a fresh container with its own image and lifecycle, so an untrusted model-written trial cannot touch the host or its siblings. On rented clusters, a multi-cloud orchestrator provisions or reuses a cluster per job from a resource spec, which is what lets a campaign fan out across capacity it does not own. Both are reference templates; pin the platform versions.
# Modal: one fresh GPU container per trial (reference template; pin modal).
# train_under_budget and evaluate are placeholders you supply.
import modal
app = modal.App("experiment-loop")
@app.function(gpu="A100", timeout=3600)  # per-trial isolation + a hard wall-clock cap
def run_trial(candidate: dict) -> float:
    artifact = train_under_budget(candidate)  # untrusted model-written code runs HERE, not on the host
    return evaluate(artifact)  # scalar back to the driver
# SkyPilot: provision or reuse a cluster per job from a resource spec
# (reference template; pin skypilot). One trial per launched task.
resources:
  accelerators: A100:8
run: |
  python trial.py --candidate "$CANDIDATE_JSON"
Two production guardrails that the templates above make room for but do not enforce on their own:
- Cap every trial and the whole campaign. Per-trial cloud GPUs with no wall-clock or spend cap turn an open-ended search into an open-ended bill; the timeout=3600 above bounds one trial, and a campaign-level budget (max-trials, max wall-clock, or no-improvement patience) bounds the run.
- Keep code-mutating trials isolated and the evaluator out of reach. The container is the security boundary for untrusted model-written code; the evaluator and its gold data live outside the proposer's blast radius (evaluation integrity).
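A campaign-level budget combining the three stop conditions named above (max trials, max wall-clock, no-improvement patience) might be sketched as follows. `CampaignBudget` and its field names are hypothetical, and the injectable `clock` is an assumption made so the guard can be exercised without waiting on real time.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CampaignBudget:
    """Hypothetical campaign guard: stop on trial count, wall-clock, or patience."""
    max_trials: int
    max_seconds: float
    patience: int                                  # trials without improvement
    clock: Callable[[], float] = time.monotonic    # injectable for tests
    trials: int = 0
    since_improvement: int = 0
    started: float = field(init=False)

    def __post_init__(self) -> None:
        self.started = self.clock()

    def record(self, improved: bool) -> None:
        """Call once per finished trial, kept or discarded."""
        self.trials += 1
        self.since_improvement = 0 if improved else self.since_improvement + 1

    def stop_reason(self) -> str | None:
        """Return why the campaign should stop, or None to keep going."""
        if self.trials >= self.max_trials:
            return "max_trials"
        if self.clock() - self.started >= self.max_seconds:
            return "max_wall_clock"
        if self.since_improvement >= self.patience:
            return "no_improvement"
        return None


fake_now = [0.0]  # fake clock so the example is deterministic
budget = CampaignBudget(max_trials=10, max_seconds=60.0, patience=2, clock=lambda: fake_now[0])
budget.record(improved=True)
budget.record(improved=False)
first = budget.stop_reason()    # one stale trial: keep going
budget.record(improved=False)
second = budget.stop_reason()   # two stale trials: patience exhausted
print(first, second)
```

The driver would check `stop_reason()` before launching each trial; a spend cap could be added as a fourth condition in the same place.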
How to keep a long campaign comparable
A campaign that runs for days across heterogeneous rented GPUs will silently compare runs that are not comparable: different hardware, different budgets, a shifted dataset. Three guards keep it honest.
- Record a hardware and budget fingerprint per trial and refuse (or flag) comparisons across mismatched fingerprints; a "best" that only won because it ran on a faster GPU or a longer budget is not a real winner.
- Persist the winner as a versioned artifact (params/diff, metrics, and the model path) so any reported result is replayable. Goodput is the honest throughput metric here: useful, comparable results per GPU-hour, not raw trials attempted.
- Log discarded trials, not just kept ones. The discard record is what a smart proposer learns from and what makes the run auditable.
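One way to keep both kept and discarded trials replayable is an append-only JSON-lines ledger. The sketch below is an assumption about layout, not a required schema: `append_trial`, `read_ledger`, the field names, and the artifact paths are all hypothetical, and an in-memory buffer stands in for a real file.

```python
from __future__ import annotations

import io
import json
from typing import Any, TextIO


def append_trial(ledger: TextIO, *, trial_id: int, proposal: Any, score: float,
                 kept: bool, gpu: str, max_steps: int, artifact: str | None = None) -> None:
    """Append one trial, kept or discarded, as a single JSON line."""
    record = {
        "trial_id": trial_id,
        "proposal": proposal,        # params dict or diff text
        "score": score,
        "kept": kept,
        "fingerprint": {"gpu": gpu, "max_steps": max_steps},
        "artifact": artifact,        # model path for kept trials, None otherwise
    }
    ledger.write(json.dumps(record, sort_keys=True) + "\n")


def read_ledger(ledger: TextIO) -> list[dict]:
    """Re-read every record from the start of the ledger."""
    ledger.seek(0)
    return [json.loads(line) for line in ledger if line.strip()]


ledger = io.StringIO()  # stand-in for an append-only file on disk
append_trial(ledger, trial_id=0, proposal={"lr": 1e-3}, score=0.71, kept=True,
             gpu="A100", max_steps=1000, artifact="runs/0/model.pt")
append_trial(ledger, trial_id=1, proposal={"lr": 1e-2}, score=0.68, kept=False,
             gpu="A100", max_steps=1000)
append_trial(ledger, trial_id=2, proposal={"lr": 3e-3}, score=0.83, kept=True,
             gpu="A100", max_steps=1000, artifact="runs/2/model.pt")

records = read_ledger(ledger)
winner = max((r for r in records if r["kept"]), key=lambda r: r["score"])
print(len(records), winner["trial_id"], winner["artifact"])
```

Because the fingerprint is stored per record, a later audit can group the ledger by fingerprint before ranking anything.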
The runnable block below validates the comparability guard. It asserts that a same-fingerprint cohort ranks to its true winner, that ranking across mismatched fingerprints is refused, that the guard is symmetric in argument order, and, as the adversarial case, that a phantom winner (a lower score that only looks competitive because it ran on a bigger budget or a faster GPU) is rejected rather than silently crowned:
# Hardware/budget fingerprint comparability guard, executed and asserted
# (standard library only). Asserts: same-fingerprint cohort ranks to its true
# winner; ranking across mismatched fingerprints is REFUSED; the phantom winner
# (worse score, bigger budget or faster GPU) is rejected; and the guard is
# symmetric in order.
from __future__ import annotations
from dataclasses import dataclass
@dataclass(frozen=True)
class Trial:
    score: float
    gpu: str
    max_steps: int  # the compute budget the score was earned under
def comparable(a, b):  # same hardware AND same budget
    return a.gpu == b.gpu and a.max_steps == b.max_steps
def pick_winner(trials, *, maximize):
    """Rank only within a single comparable cohort. Raise if the pool mixes
    fingerprints, rather than reporting a winner across incomparable trials."""
    ref = trials[0]
    if not all(comparable(ref, t) for t in trials):
        raise ValueError("incomparable fingerprints: refusing to rank across GPU/budget")
    key = max if maximize else min
    return key(trials, key=lambda t: t.score)
# honest cohort: all A100, all 1000 steps -> the top score wins
cohort = [Trial(0.71, "A100", 1000), Trial(0.83, "A100", 1000), Trial(0.68, "A100", 1000)]
assert pick_winner(cohort, maximize=True).score == 0.83
# adversarial phantom: a lower score that only "wins" because it ran 4x longer
phantom = Trial(0.80, "A100", 4000)
refused = False
try:
    pick_winner(cohort + [phantom], maximize=True)
except ValueError:
    refused = True
assert refused, "must refuse to rank a bigger-budget trial against the cohort"
# adversarial phantom #2: a faster GPU makes an inferior score look competitive
refused_gpu = False
try:
    pick_winner(cohort + [Trial(0.82, "H100", 1000)], maximize=True)
except ValueError:
    refused_gpu = True
assert refused_gpu, "must refuse to rank across GPU types"
# symmetry: the verdict cannot depend on argument order
assert comparable(cohort[0], phantom) == comparable(phantom, cohort[0]) == False
assert comparable(cohort[0], cohort[1]) is True
# within its OWN comparable cohort the phantom is a legitimate winner
big_cohort = [phantom, Trial(0.90, "A100", 4000)]
assert pick_winner(big_cohort, maximize=True).score == 0.90
print("fingerprint: cohort_winner=0.83 phantom_budget_refused=True "
      "phantom_gpu_refused=True symmetric=True own_cohort_ok=True")
How to maintain and scale it
- Keep the loop proposer-agnostic. Adding a proposer (a new sampler, a stronger model) should be a swap behind the one interface, never a rewrite of the loop, the evaluator, or the ledger.
- Scale param trials wide, keep diff proposers serial. Param search fans out K-at-a-time across per-trial containers; diff and hybrid proposers stay serial and batch the LLM into one call that returns K diverse proposals, so concurrent edits never collide on the mutable file.
- Re-baseline as you scale. A larger fleet lets more trials run, which inflates any no-baseline "win"; the grid/random control has to scale with the campaign so the bar you clear stays honest.
- Watch the ledger for drift. Over a long campaign the held-out set can leak into training across iterations, and fingerprints can diverge as the fleet churns; the discard log and the fingerprint guard are what surface both before they poison the reported winner.
Don't-miss checklist
- Write the loop proposer-agnostic; keep the proposer swappable behind one interface.
- The proposer must never own or be able to edit the evaluator (evaluation integrity).
- Start with a grid/random baseline before claiming an LLM proposer helps; beating random is the real bar.
- Isolate every code-mutating trial (container / micro-VM); never run model-written code on the host.
- Bound every trial (wall-clock or steps) and cancel doomed ones cooperatively, not with a hard kill.
- Keep diff/hybrid proposals serial; parallelize param trials with one batched proposal call.
- Fingerprint hardware and budget per trial; refuse cross-incomparable "best" claims.
- Persist discarded trials and versioned winners so the campaign is replayable and auditable.
Failure modes
- No baseline. An LLM proposer is declared a win without a random-search control; the apparent gain is search budget, not intelligence.
- Evaluator capture. A code-mutating proposer edits, imports, or reads the scorer/gold data and games the metric, the headline risk of the whole pattern (evaluation integrity).
- Unbounded trials. Without a budget and early-stop, a few slow trials dominate wall-clock and the loop crawls (learning-curve extrapolation).
- Parallel diff races. K concurrent code edits collide on the mutable file; keep diff proposers serial.
- Incomparable comparisons. Trials on different GPUs/budgets are ranked together; the ledger reports a phantom winner.
- Contamination drift. A long campaign's held-out set leaks into training over iterations, inflating scores (evaluation integrity).
- Cost blow-out. Per-trial cloud GPUs with no wall-clock or spend cap turn an open-ended search into an open-ended bill.
Open questions & validation
- Whether an LLM proposer beats a strong Bayesian sampler is task-dependent and still contested; validate on the target workload, not a generic benchmark. MLE-bench found the best agent scaffold reached a Kaggle bronze medal in only 16.9% of 75 competitions: real capability, but far from solved [5].
- How far to let a loop mutate code before a human reviews is unsettled; open-ended self-editing raises the governance questions in self-improving harnesses.
- Reported gains from autonomous research systems (including end-to-end "AI scientist" pipelines that write code, run experiments, and draft papers) should be treated as candidates pending independent reproduction, not as settled results [6]. Verify claimed improvements end-to-end before promoting them.
References
- FunSearch, Romera-Paredes et al., "Mathematical discoveries from program search with large language models," Nature 625 (2024): https://www.nature.com/articles/s41586-023-06924-6 · code: https://github.com/google-deepmind/funsearch
- OPRO, Yang et al., "Large Language Models as Optimizers," arXiv:2309.03409: https://arxiv.org/abs/2309.03409 · code: https://github.com/google-deepmind/opro
- AIDE, Jiang et al. (Weco AI), "AIDE: AI-Driven Exploration in the Space of Code," arXiv:2502.13138: https://arxiv.org/abs/2502.13138 · code: https://github.com/WecoAI/aideml
- AlphaEvolve, Google DeepMind, "AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms" (2025): https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/
- MLE-bench, Chan et al. (OpenAI), "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering," arXiv:2410.07095: https://arxiv.org/abs/2410.07095 · code: https://github.com/openai/mle-bench
- The AI Scientist, Lu et al. (Sakana AI), arXiv:2408.06292: https://arxiv.org/abs/2408.06292
- Optuna, Akiba et al., "Optuna: A Next-generation Hyperparameter Optimization Framework," KDD 2019, arXiv:1907.10902: https://arxiv.org/abs/1907.10902
- DSPy (prompt/demo optimization): https://github.com/stanfordnlp/dspy
- Modal (per-function GPU containers): https://modal.com/docs/guide/gpu · SkyPilot (multi-cloud provisioning): https://docs.skypilot.co/en/latest/getting-started/quickstart.html
Related: AI-assisted performance optimization · Evaluation integrity & anti-gaming · Learning-curve extrapolation & early stopping · Self-improving harnesses · Reward design for RL · GRPO post-training recipe · RL libraries · GPU consumption models · Glossary
1. Romera-Paredes et al., FunSearch pairs a pretrained LLM that proposes a Python function body with a frozen systematic evaluator and a programs database sampled back into the prompt; it found new cap-set constructions and improved bin-packing heuristics. Nature 625 (2024).
2. Yang et al., OPRO describes the optimization task in natural language and has the LLM generate new solutions from a prompt containing prior solutions and their scores; best prompts beat human-designed ones by up to 8% (GSM8K) and up to 50% (Big-Bench Hard). arXiv:2309.03409.
3. Google DeepMind, AlphaEvolve orchestrates a Gemini-model ensemble to make direct code changes under an evolutionary loop, reporting improvements to data-center scheduling, matrix-multiplication algorithms, and GPU kernels (2025).
4. Jiang et al., AIDE frames ML engineering as code optimization and runs a tree search over solutions, drafting/debugging/benchmarking toward a user-defined metric; reports state-of-the-art on Kaggle, MLE-bench, and RE-Bench. arXiv:2502.13138.
5. Chan et al., MLE-bench curates 75 Kaggle ML-engineering competitions; the best setup (o1-preview with AIDE scaffolding) reached at least a bronze medal in 16.9% of competitions, and the paper studies resource scaling and pre-training contamination. arXiv:2410.07095.
6. Lu et al., The AI Scientist runs an end-to-end idea, code, experiment, paper, automated-review loop across ML subfields; treat its outputs as candidates pending independent review. arXiv:2408.06292.
7. Akiba et al., Optuna's define-by-run API pairs samplers (TPE/CMA-ES) with pruners; the cooperative-cancellation contract is trial.report(value, step) then if trial.should_prune(): raise TrialPruned(), so the user's trial code raises rather than the framework killing it. KDD 2019, arXiv:1907.10902.
8. Modal exposes a GPU container per function call (@app.function(gpu='A100')), handling image, cold-start, and lifecycle, which is the per-trial runtime an experimentation loop targets: https://modal.com/docs/guide/gpu.
9. SkyPilot provisions or reuses a cluster from a resource spec (resources: {accelerators: A100:8}) across clouds, a per-job runtime target for a search loop: https://docs.skypilot.co/en/latest/getting-started/quickstart.html.
10. DSPy optimizers (e.g. MIPROv2, GEPA) use an LLM to propose prompt instructions and few-shot demonstrations, an LLM proposer whose search axis is text components, not training hyperparameters or source code: https://github.com/stanfordnlp/dspy.