Experiment tracking & model registry¶
Scope: the MLOps backbone for post-training. It tracks every run's params, metrics, and artifacts; versions models in a registry with promotion stages; and records the data → code → config → checkpoint lineage that makes a model reproducible and auditable. It is the record-keeping layer under fine-tuning and post-training and under the eval gate: what turns "a checkpoint on someone's disk" into a governed, reproducible artifact.
The MLflow and W&B blocks below are reference templates on real APIs (neither library is vendored here); pin versions and validate before production use. The three numpy blocks (champion/challenger promotion, lineage content-addressing, sweep winner selection) are executed and asserted with the system python3, each including an adversarial or boundary case, and each validates the core math the API blocks above it rely on.
What it is¶
Two related systems, usually one tool:
- Experiment tracking logs, per run, the params (base model, learning rate, LoRA rank, dataset version, seed), the metrics (loss, and for RL also reward, KL, entropy, plus eval scores), the config, and the output artifacts (checkpoint/adapter). It is distinct from infra monitoring: tracking is about the run's learning, observability is about the cluster's health.
- A model registry is a versioned store of trained models with stages/aliases (e.g. staging → production, champion/challenger): the promotion record and the source of truth for what is deployable.
- Lineage ties them together: every registered model points back to the exact data, code commit, config, base model, and checkpoint that produced it.
MLflow (self-hostable; tracking + Model Registry) and Weights & Biases (hosted; tracking + Artifacts + Model Registry) are the standard tools.[1][2]
Why use it¶
- Reproducibility. A model you cannot reproduce is a liability; pinning data + code + config + seed + base is the only way to rebuild or debug it (SRE/MLOps practices). The lineage block below turns that tuple into a single content address so "the same run" is a checkable claim, not a memory.
- Comparability. Tracking makes runs comparable (which SFT mix, which DPO beta, which merge recipe actually won) instead of relying on memory and scattered logs.
- Governance. The registry plus the eval gate is the promotion control: an auditable record of what shipped, when, and why, with the ability to roll back to a prior version.
- Post-training specifics. RL runs need reward/entropy/KL as first-class tracked metrics to catch collapse (RLVR); every checkpoint should record its dataset version (data curation) and base model.
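To illustrate why reward, KL, and entropy are tracked as first-class metrics, the sketch below flags a step whose policy entropy has collapsed or whose KL to the reference has run away. The function name `collapse_flags` and both thresholds are hypothetical, chosen only for this example; real values depend on the model and the RL recipe.

```python
from __future__ import annotations


def collapse_flags(history: list[dict], *, min_entropy: float = 0.1,
                   max_kl: float = 1.0) -> list[str]:
    """Return warnings for the latest logged step.

    history: per-step metric dicts with keys "reward", "kl", "entropy".
    Both thresholds are illustrative assumptions, not recommended defaults.
    """
    if not history:
        return []
    last = history[-1]
    flags = []
    if last["entropy"] < min_entropy:
        flags.append(f"entropy collapse: {last['entropy']:.3f} < {min_entropy}")
    if last["kl"] > max_kl:
        flags.append(f"KL runaway: {last['kl']:.3f} > {max_kl}")
    return flags


healthy = [{"reward": 0.2, "kl": 0.02, "entropy": 1.4},
           {"reward": 0.5, "kl": 0.08, "entropy": 1.1}]
# The last step has the best reward so far, yet entropy and KL show the collapse.
collapsed = healthy + [{"reward": 0.9, "kl": 2.3, "entropy": 0.03}]

healthy_flags = collapse_flags(healthy)
collapsed_flags = collapse_flags(collapsed)
```

The collapsed step would look like the best one if only reward were logged, which is why the other two series need to sit in the same run view.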
When to use it (and when not)¶
- Track every run you might promote or need to reproduce, which in practice is all of them.
- Even a solo finetune benefits: you will need to know which run produced the good checkpoint and on what data.
- A registry earns its keep once more than one model/version reaches serving, or multiple people/tenants share the platform; below that, tracking + tagged artifacts may suffice.
- Do not conflate it with monitoring. Experiment metrics (loss/reward/eval) and production telemetry (latency/errors/saturation) are different systems with different retention and consumers.
Architecture¶
```mermaid
flowchart LR
    RUN["Training run"] -->|"params, metrics, artifacts"| TRK["Tracking server (MLflow / W&B)"]
    DATA["Dataset version"] -.->|"lineage"| TRK
    CODE["Code commit + config"] -.->|"lineage"| TRK
    TRK --> BEST["Best run"]
    BEST --> REG["Model registry (versioned)"]
    REG --> GATE{"Eval gate"}
    GATE -->|"pass → promote"| PROD["staging → production alias → serving"]
    GATE -->|"fail"| RUN
```
The tracking server is the append-only ledger of runs; the registry is the promotable subset with aliases; the eval gate is the one edge that moves a version toward production. The three numpy blocks on this page validate the three load-bearing edges of that diagram: BEST → REG (winner selection), the GATE → PROD alias swap (champion/challenger), and the DATA/CODE ⇢ TRK lineage links (content-addressed reproducibility).
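The registry half of that diagram can be pictured as an append-only list of versions plus a small map of movable aliases. The sketch below is a toy stand-in written for this page, not MLflow's or W&B's data model; `TinyRegistry` and the `s3://` URIs are hypothetical.

```python
from __future__ import annotations


class TinyRegistry:
    """Toy model registry: append-only versions plus movable aliases."""

    def __init__(self) -> None:
        self.versions: list[str] = []      # index i holds the artifact URI of version i + 1
        self.aliases: dict[str, int] = {}  # alias name -> version number

    def register(self, artifact_uri: str) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append(artifact_uri)
        return len(self.versions)

    def set_alias(self, alias: str, version: int) -> None:
        """Point an alias at an existing version; unknown versions are rejected."""
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """Return the artifact URI the alias currently points at."""
        return self.versions[self.aliases[alias] - 1]


reg = TinyRegistry()
v1 = reg.register("s3://models/qwen3-8b-sft/v1")   # hypothetical URI
v2 = reg.register("s3://models/qwen3-8b-sft/v2")   # hypothetical URI
reg.set_alias("production", v1)
reg.set_alias("production", v2)    # promotion: move the alias forward
promoted = reg.resolve("production")
reg.set_alias("production", v1)    # rollback: move the same alias back
rolled_back = reg.resolve("production")
```

Versions are never rewritten; promotion and rollback are both just a change to one alias entry, which is why serving should resolve the alias rather than pin a version number.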
How to use it¶
MLflow tracks a run and registers the result. Reference template (needs mlflow):
```python
# Reference template (requires mlflow; not run here). Pin the version.
# Track params/metrics/artifacts, then register the model.
import mlflow

mlflow.set_experiment("qwen3-8b-sft")
with mlflow.start_run() as run:
    mlflow.log_params({"base": "Qwen/Qwen3-8B", "lr": 2e-5,
                       "lora_r": 16, "dataset_version": "sft-mix-v3", "seed": 0})
    # ... training loop ...
    mlflow.log_metrics({"train/loss": 0.82, "eval/mmlu": 0.64, "eval/gsm8k": 0.71})
    mlflow.log_artifact("./adapter")  # or mlflow.transformers.log_model(...)

# promote the run's model into the registry as a new version
# (this URI assumes a model was logged under the artifact path "model", e.g. via
# log_model; log_artifact alone does not create one)
mlflow.register_model(f"runs:/{run.info.run_id}/model", "qwen3-8b-sft")
```
Weights & Biases tracks and versions artifacts with lineage. Reference template (needs wandb):
```python
# Reference template (requires wandb; not run here). Pin the version.
# Experiment tracking + artifact versioning.
import wandb

run = wandb.init(project="post-training",
                 config={"base": "Qwen/Qwen3-8B", "lr": 2e-5, "dataset_version": "sft-mix-v3"})
wandb.log({"train/loss": 0.82, "eval/mmlu": 0.64})  # stream metrics during training
art = wandb.Artifact("qwen3-8b-sft", type="model")
art.add_dir("./adapter")
run.log_artifact(art)  # versioned, lineage-linked to this run
```
The value of these calls is not the logging line but the lineage each one attaches: a registered model must resolve back to the exact data + code + config + base + seed that made it. That reduction (a full provenance record to a single reproducibility fingerprint) is the core algorithm, and it is validated below with numpy and the standard library only, no tracking server required:
```python
# Lineage as a content address, executed and asserted (numpy-only, stdlib hashlib).
# A registered model's reproducibility fingerprint is a hash over the FULL lineage
# (data version + code commit + resolved config + base model + seed). Two runs are
# "the same run" iff their fingerprints match. This is what makes a model rebuildable
# and detects silent drift.
# Asserts: identical lineage -> identical fingerprint (reproducible); ANY single-field
# drift (seed, dataset, config, base, code) -> a different fingerprint (drift/corruption
# detection); field order does not matter (canonicalization); and a run that logs
# metrics but omits a lineage field is REFUSED (metrics-without-lineage is unverifiable).
from __future__ import annotations
import hashlib, json
import numpy as np

REQUIRED = ("data_version", "code_commit", "config", "base_model", "seed")


def lineage_fingerprint(lineage: dict) -> str:
    """Canonical content address of a run's lineage. Refuses to fingerprint a run
    that is missing any required provenance field (broken lineage is unverifiable)."""
    missing = [k for k in REQUIRED if k not in lineage]
    if missing:
        raise ValueError(f"broken lineage, missing: {missing}")
    canonical = json.dumps({k: lineage[k] for k in REQUIRED},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


base = {"data_version": "sft-mix-v3", "code_commit": "a1b2c3d",
        "config": {"lr": 2e-5, "lora_r": 16}, "base_model": "Qwen/Qwen3-8B", "seed": 0}

# reproducibility: an identical lineage produces an identical fingerprint
fp = lineage_fingerprint(base)
assert lineage_fingerprint(dict(base)) == fp, "identical lineage must reproduce the fingerprint"

# canonicalization: field insertion order and nested-key order do not change the address
reordered = {"seed": 0, "base_model": "Qwen/Qwen3-8B",
             "config": {"lora_r": 16, "lr": 2e-5},
             "code_commit": "a1b2c3d", "data_version": "sft-mix-v3"}
assert lineage_fingerprint(reordered) == fp, "fingerprint must be order-independent"

# drift detection: mutating ANY one lineage field changes the fingerprint. This is the
# corruption-detection / adversarial case: "same run" claims fail if provenance drifted.
for field, drifted in [("seed", 1),
                       ("data_version", "sft-mix-v4"),
                       ("code_commit", "deadbee"),
                       ("base_model", "Qwen/Qwen3-14B"),
                       ("config", {"lr": 3e-5, "lora_r": 16})]:
    mutated = dict(base)
    mutated[field] = drifted
    assert lineage_fingerprint(mutated) != fp, f"drift in {field} must change the fingerprint"

# collision sanity: 200 distinct seeds yield 200 distinct fingerprints (no aliasing)
fps = {lineage_fingerprint({**base, "seed": int(s)}) for s in np.arange(200)}
assert len(fps) == 200, "distinct lineages must not collide"

# metrics-without-lineage is refused: logging scores but dropping a provenance field
# cannot be registered as reproducible.
broken = {k: v for k, v in base.items() if k != "config"}  # config dropped
try:
    lineage_fingerprint(broken)
    raise AssertionError("a run missing a lineage field must not fingerprint")
except ValueError as e:
    assert "config" in str(e), "the refusal must name the missing field"

print(f"lineage: fp={fp[:12]} reproducible=True order_invariant=True "
      f"drift_detected=5/5 no_collision=True broken_refused=True")
```
This is why the failure-modes section treats metrics without lineage as a broken run: a fingerprint that cannot be computed is a model that cannot be rebuilt.
How to develop with it¶
- Log the right things. Params: base, every hyperparameter, dataset version, seed. Metrics: loss plus reward/KL/entropy for RL and the full eval suite. Artifacts: the checkpoint/adapter, the resolved config, and a pointer to the dataset.
- Wire lineage, not just metrics. A run that logs scores but not its data/code version cannot be reproduced; record the dataset artifact (data curation) and the git commit. Concretely: log every field in `REQUIRED` above so the fingerprint is always computable.
- Use registry stages/aliases. Move versions staging → production on eval-gate pass; keep champion/challenger aliases so rollback is one alias change.
- Standardize run naming and tags (method: sft/dpo/grpo/merge; base; dataset) so runs are queryable and comparable.
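One way to make the naming convention in the last bullet mechanical is to derive both the run name and the tag set from the same four fields. This is a minimal sketch; `run_identity`, the `METHODS` set, and the exact name layout are assumptions made for illustration, not a convention either tool prescribes.

```python
from __future__ import annotations
import re

METHODS = {"sft", "dpo", "grpo", "merge"}  # the methods named in the bullet above


def run_identity(method: str, base: str, dataset: str, seed: int) -> tuple[str, dict]:
    """Build a queryable run name and a tag dict from the same fields."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}, expected one of {sorted(METHODS)}")
    # Slugify the base model id so the name is safe in URLs and file paths.
    base_slug = re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")
    name = f"{method}-{base_slug}-{dataset}-s{seed}"
    tags = {"method": method, "base": base, "dataset": dataset, "seed": str(seed)}
    return name, tags


name, tags = run_identity("dpo", "Qwen/Qwen3-8B", "sft-mix-v3", 0)
print(name)  # dpo-qwen-qwen3-8b-sft-mix-v3-s0
```

The name is for humans scanning a run list; the tags keep the original, unslugged values so filters such as "all DPO runs on this base" stay exact. With MLflow, a dict like this can be passed to `mlflow.set_tags` inside the run.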
Comparing a sweep is only sound when every run was scored on the same eval suite; otherwise the "winner" may just be the run that got the easier eval. The winner-selection rule, with that comparability guard, is validated here:
```python
# Sweep winner selection with a comparability guard, executed and asserted (numpy-only).
# Picking the best run of a hyperparameter sweep is only valid when every run was scored
# on the SAME eval suite (same fingerprint). The guard refuses to rank across mismatched
# suites, and rejects a "phantom winner" whose high score came from an easier/different eval.
# Asserts: the winner equals the brute-force argmax over the metric (equivalence to a slow
# reference); the objective direction is honored (loss-min vs acc-max); ranking across
# mismatched eval fingerprints is REFUSED; and a phantom winner scored on a different
# suite is rejected rather than crowned (adversarial). Vectorized selection matches an
# independently coded slow reference across a randomized sweep in both directions.
from __future__ import annotations
import numpy as np


def select_winner(runs, metric, *, maximize=True):
    """runs: list of dicts each with `eval_fingerprint` and a `metrics` dict.
    Refuse to compare runs from different eval suites; else return the best run
    on `metric` honoring the objective direction."""
    fps = {r["eval_fingerprint"] for r in runs}
    if len(fps) != 1:
        raise ValueError(f"incomparable runs, {len(fps)} eval fingerprints: {sorted(fps)}")
    scores = np.array([r["metrics"][metric] for r in runs], dtype=float)
    idx = int(np.argmax(scores) if maximize else np.argmin(scores))
    return runs[idx], scores[idx]


FP = "mmlu@abcd"  # the one held-out suite every comparable run must share


def mk(name, mmlu, fp=FP):
    return {"name": name, "eval_fingerprint": fp, "metrics": {"mmlu": mmlu}}


sweep = [mk("r0", 0.61), mk("r1", 0.66), mk("r2", 0.59), mk("r3", 0.64)]

# equivalence: the selected winner matches the brute-force argmax over the metric
win, sc = select_winner(sweep, "mmlu", maximize=True)
brute = max(sweep, key=lambda r: r["metrics"]["mmlu"])
assert win["name"] == brute["name"] == "r1" and sc == 0.66, "winner must equal brute-force argmax"

# direction: minimizing a loss metric picks the smallest, not the largest
loss_sweep = [{"name": n, "eval_fingerprint": FP, "metrics": {"loss": v}}
              for n, v in [("a", 0.9), ("b", 0.4), ("c", 0.7)]]
lwin, _ = select_winner(loss_sweep, "loss", maximize=False)
assert lwin["name"] == "b", "lower-is-better must pick the minimum loss"

# comparability guard: a sweep mixing two eval suites cannot be ranked at all
mixed = [mk("r0", 0.61), mk("phantom", 0.99, fp="easyeval@zzzz")]
try:
    select_winner(mixed, "mmlu", maximize=True)
    raise AssertionError("ranking across mismatched eval fingerprints must be refused")
except ValueError as e:
    assert "incomparable" in str(e)

# adversarial: the phantom (0.99 on a DIFFERENT, easier suite) must never be crowned over
# the honest cohort. Restricting to the shared-fingerprint cohort yields the true winner.
cohort = [r for r in mixed if r["eval_fingerprint"] == FP]
honest_win, honest_sc = select_winner(cohort, "mmlu", maximize=True)
assert honest_win["name"] == "r0" and honest_sc == 0.61, "phantom cross-suite score must not win"
assert honest_win["name"] != "phantom"

# equivalence under randomization: vectorized selection matches a pure-python slow
# reference over a random sweep, in BOTH objective directions
rng = np.random.default_rng(0)
for maximize in (True, False):
    for _ in range(500):
        vals = rng.uniform(0, 1, size=int(rng.integers(1, 12)))
        rs = [{"name": i, "eval_fingerprint": FP, "metrics": {"m": float(v)}}
              for i, v in enumerate(vals)]
        w, _ = select_winner(rs, "m", maximize=maximize)
        slow = (max if maximize else min)(range(len(vals)), key=lambda i: vals[i])
        assert w["name"] == slow, "vectorized winner must match slow reference"

print(f"sweep: winner={win['name']}@{sc} equivalence=True direction_ok=True "
      f"guard_refused=True phantom_rejected=True random_match=1000/1000")
```
Register only the winner; the losing runs stay in the tracking ledger for the record but never enter the registry.
How to integrate with it¶
Tracking and the registry sit at the seams of the post-training toolchain, so integration is mostly wiring three edges:
- Trainer to tracker. The training loop calls `log_params`/`log_metrics`/`log_artifact` (MLflow) or `wandb.log` + `log_artifact` (W&B), or uses `autolog` to capture framework metrics automatically.[1] For SFT/DPO/GRPO, stream the RL-specific metrics (reward, KL, entropy) alongside loss so collapse is visible in the same run view (RLVR).
- Data/code versioning to lineage. Data versioning (DVC, LakeFS) and code (git) supply the `data_version` and `code_commit` fields; the tracker records them so the fingerprint above is computable. Git + a data-version store + the config is the lineage triangle every registered model must close.
- Registry to the eval gate and serving. The eval gate reads a candidate version and, on pass, moves the `production` alias; the serving layer resolves that alias to fetch the checkpoint from object storage. The alias is the contract between "what the gate blessed" and "what serving loads", so both sides name the same alias.
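Where no data-versioning tool is in place yet, a `data_version` value can be derived from the dataset contents themselves. The sketch below is one such stand-in under stated assumptions: `dataset_version` is a hypothetical helper, the 16-character truncation is arbitrary, and this is not how DVC or LakeFS compute their identifiers.

```python
from __future__ import annotations
import hashlib
import tempfile
from pathlib import Path


def dataset_version(root: Path) -> str:
    """Content hash over every file under root (relative path + bytes), in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix().encode()
        data = path.read_bytes()  # fine for a sketch; stream in chunks for large files
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(rel).to_bytes(8, "big") + rel)
        h.update(len(data).to_bytes(8, "big") + data)
    return h.hexdigest()[:16]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.jsonl").write_text('{"prompt": "hi"}\n')
    v_before = dataset_version(root)
    v_same = dataset_version(root)       # unchanged data -> same version
    (root / "train.jsonl").write_text('{"prompt": "hi!"}\n')
    v_after = dataset_version(root)      # one-character edit -> new version
```

Hashing relative paths along with the bytes means a renamed file also produces a new version, which matches the intent that any change to the training data changes the lineage.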
How to run it in production¶
Make registration and promotion a pipeline step, not a manual afterthought, so the registry is always the source of truth. On eval-gate pass, the pipeline auto-registers the run's model and moves the production alias atomically:
```python
# Reference template (requires mlflow; not run here). Pin the version.
# In the pipeline, after the eval gate passes, register + alias the champion.
# `run_id` is supplied by the pipeline (the run that just passed the gate).
import mlflow

mv = mlflow.register_model(f"runs:/{run_id}/model", "qwen3-8b-sft")
client = mlflow.MlflowClient()
client.set_registered_model_alias("qwen3-8b-sft", "production", mv.version)  # atomic promotion
```
The correctness rule that call must obey is the champion/challenger contract: a new version takes the production alias only if it both passes the frozen gate and strictly beats the incumbent champion; otherwise the champion holds, and rollback is a one-alias change. That decision is validated here without a tracking server:
```python
# Champion/challenger promotion, executed and asserted (numpy-only). Models the
# registry alias swap: a challenger replaces the production champion ONLY if it
# passes the FROZEN eval gate AND strictly beats the champion on the gated metric.
# Asserts: a real winner is promoted, a self-reported score cannot game the gate
# (adversarial), a tie does not dethrone the incumbent (boundary), and a
# gate-failing challenger is refused even if it "beats" the champion.
from __future__ import annotations
import numpy as np


def evaluate_challenger(candidate, gate, *, maximize=True):
    """FROZEN gate: score the candidate on a held-out suite the run cannot edit.
    Returns (passed, score). `gate` is a closure mapping candidate -> score."""
    score = float(gate(candidate))  # frozen: run cannot self-report this
    passed = (score >= gate.threshold) if maximize else (score <= gate.threshold)
    return passed, score


def promote(champion_score, challenger, gate, *, maximize=True):
    """Return the production alias target: the challenger only if it PASSES the
    gate AND strictly improves on the champion; otherwise the champion holds."""
    passed, score = evaluate_challenger(challenger, gate, maximize=maximize)
    better = score > champion_score if maximize else score < champion_score
    return ("challenger" if (passed and better) else "champion"), score


# a frozen held-out eval: MMLU-style accuracy, higher is better, promote >= 0.60
def gate(candidate):
    return {"a": 0.58, "b": 0.66, "flat": 0.61}[candidate]
gate.threshold = 0.60

# happy path: challenger "b" (0.66) passes the gate and beats champion (0.61) -> promoted
target, score = promote(champion_score=0.61, challenger="b", gate=gate)
assert target == "challenger" and score == 0.66, "real winner must be promoted"

# adversarial: a challenger that ADVERTISES a fake score is powerless, because the
# gate scores it itself. "a" self-claims 0.99 but truly scores 0.58 (< threshold).
class Cheat(str):
    self_reported = 0.99

target_c, score_c = promote(champion_score=0.61, challenger=Cheat("a"), gate=gate)
assert target_c == "champion", "frozen gate must ignore a self-reported score"
assert score_c == 0.58 and score_c != Cheat.self_reported

# boundary: an exact tie does not dethrone the incumbent (strict improvement only).
# "flat" scores 0.61 == champion; it passes the gate but is not strictly better.
target_t, score_t = promote(champion_score=0.61, challenger="flat", gate=gate)
assert target_t == "champion" and score_t == 0.61, "a tie must not dethrone the champion"

# gate-failing-but-better: a candidate that would beat a LOW champion still cannot
# ship if it fails the absolute gate. Champion 0.50, challenger "a"=0.58 > 0.50 but
# 0.58 < 0.60 threshold -> refused. Promotion needs BOTH conditions, not either.
target_g, _ = promote(champion_score=0.50, challenger="a", gate=gate)
assert target_g == "champion", "a gate failure blocks promotion even if it beats the champion"

# direction flip: for a loss-style gate (lower is better) the same logic holds
def loss_gate(c):
    return {"x": 0.9, "y": 0.4}[c]
loss_gate.threshold = 0.5  # promote only if loss <= 0.5

tgt_m, _ = promote(champion_score=0.6, challenger="y", gate=loss_gate, maximize=False)
assert tgt_m == "challenger", "lower-is-better gate must promote the smaller loss"
tgt_bad, _ = promote(champion_score=0.6, challenger="x", gate=loss_gate, maximize=False)
assert tgt_bad == "champion", "a high-loss candidate must be refused"

print(f"promote: winner=challenger@{score} cheat_ignored=True tie_held=True "
      f"gate_blocks_beat=True direction_ok=True")
```
Give each team a project with RBAC, and keep the eval gate on the promotion edge so no version reaches production ungated (SRE/MLOps practices).
How to maintain it¶
- Back up the artifact store. Artifacts (checkpoints/adapters) live in object storage; losing that bucket loses the models even if the tracking database survives. The tracking DB holds pointers, not the weights, so both must be backed up.
- Keep lineage closed. Periodically assert that every `production`-aliased version resolves to a computable fingerprint (all `REQUIRED` fields present) and that its dataset/commit still exist. A version whose data or code has been garbage-collected has broken lineage and can no longer be rebuilt.
- Prune and retain deliberately. Experiment runs and production telemetry have different retention: keep the registered/aliased versions and their lineage indefinitely, but expire the long tail of losing runs and raw metric streams on a schedule so the store does not grow unbounded.
- Guard against seed/config drift. An unlogged seed or an unpinned config makes "the same run" irreproducible; the drift-detection assertions above are exactly the check to run in CI against a claimed reproduction.
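The "keep lineage closed" check can be sketched as a periodic audit over the aliased versions. This is an illustration under assumptions: `audit_lineage` is a hypothetical helper, `REQUIRED` is copied from the lineage block earlier on this page, and the two "still exists" sets stand in for real lookups against the data store and the git remote.

```python
from __future__ import annotations

REQUIRED = ("data_version", "code_commit", "config", "base_model", "seed")


def audit_lineage(aliased: dict[str, dict], existing_data: set[str],
                  existing_commits: set[str]) -> dict[str, list[str]]:
    """Map each aliased version to its lineage problems; an empty dict means all closed."""
    problems: dict[str, list[str]] = {}
    for version, lineage in aliased.items():
        issues = [f"missing {k}" for k in REQUIRED if k not in lineage]
        if "data_version" in lineage and lineage["data_version"] not in existing_data:
            issues.append("dataset garbage-collected")
        if "code_commit" in lineage and lineage["code_commit"] not in existing_commits:
            issues.append("commit unreachable")
        if issues:
            problems[version] = issues
    return problems


aliased = {
    "v7": {"data_version": "sft-mix-v3", "code_commit": "a1b2c3d",
           "config": {"lr": 2e-5}, "base_model": "Qwen/Qwen3-8B", "seed": 0},
    # v8 never logged its config, and its dataset version no longer exists
    "v8": {"data_version": "sft-mix-v2", "code_commit": "a1b2c3d",
           "base_model": "Qwen/Qwen3-8B", "seed": 0},
}
report = audit_lineage(aliased, existing_data={"sft-mix-v3"},
                       existing_commits={"a1b2c3d"})
print(report)  # {'v8': ['missing config', 'dataset garbage-collected']}
```

Run on a schedule, a non-empty report is the signal to restore the missing dataset or commit before retention policies make the gap permanent.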
How to scale it¶
Self-host MLflow (tracking server + a database + object storage for artifacts) or use hosted W&B; either way, artifacts live in object storage that must be backed up. Losing the artifact store loses the models. For a platform, give each team a project with RBAC, and make registration a pipeline step: on eval-gate pass, the pipeline auto-registers and promotes, so the registry is always the source of truth rather than a manual afterthought (SRE/MLOps practices). Data versioning (DVC, LakeFS) and code (git) complete the lineage triangle.
Cookbook (common use cases)¶
1. Auto-register on eval-gate pass: the pipeline template in how to run it in production registers the run's model and then moves the production alias with a single alias update; the champion/challenger block there is the correctness rule that the promotion obeys.
2. Compare a hyperparameter sweep: log each run with the same param schema and the same eval fingerprint, then select the winner with the guarded select_winner above (query/sort by eval/mmlu in the UI does the same thing interactively) and register only that version. The comparability guard is what stops a run that got an easier eval from winning by accident.
3. Reproduce a shipped model: from the registry version, read its lineage (dataset version + git commit + config) and re-run; recompute lineage_fingerprint on the rebuild and assert it equals the recorded one. A model that cannot be rebuilt to the same fingerprint has broken lineage.
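When a rebuild's fingerprint does not match the recorded one, the hash says that something drifted but not what. A field-level diff over the two lineage records answers that. The helper below is a hypothetical sketch (`lineage_diff` and the `ABSENT` marker are names invented here), operating on the same lineage dict shape used in the fingerprint block.

```python
from __future__ import annotations

ABSENT = "<absent>"  # marker for a field present on one side only


def lineage_diff(recorded: dict, rebuilt: dict) -> dict[str, tuple]:
    """Return {field: (recorded_value, rebuilt_value)} for every field that differs."""
    diff = {}
    for key in sorted(set(recorded) | set(rebuilt)):
        a, b = recorded.get(key, ABSENT), rebuilt.get(key, ABSENT)
        if a != b:
            diff[key] = (a, b)
    return diff


recorded = {"data_version": "sft-mix-v3", "code_commit": "a1b2c3d",
            "config": {"lr": 2e-5, "lora_r": 16},
            "base_model": "Qwen/Qwen3-8B", "seed": 0}
# A claimed reproduction that silently changed the seed and the LoRA rank.
rebuilt = {**recorded, "seed": 1, "config": {"lr": 2e-5, "lora_r": 32}}

drift = lineage_diff(recorded, rebuilt)
print(sorted(drift))  # ['config', 'seed']
```

The fingerprint stays the pass/fail check; the diff is only the debugging aid for a failed reproduction.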
Failure modes¶
- Untracked runs. The good checkpoint becomes unreproducible: no record of the data, code, or config that made it.
- Metrics without lineage. Logging scores but not the dataset/code version breaks reproduction; log both. Validated above: a lineage missing any `REQUIRED` field refuses to fingerprint.
- Registry without an eval gate. Promotion becomes ungoverned; gate every promotion on evals. Validated above: `promote` never returns the challenger on a gate failure, even when it beats the champion.
- Comparing across eval suites. Picking a sweep winner from runs scored on different suites crowns the run that got the easier eval; the comparability guard above refuses that ranking and rejects the phantom winner.
- Unbacked artifact store. If the object store holding checkpoints is not backed up, a bucket loss wipes the models.
- Tracking used as monitoring (or vice versa). Conflating experiment metrics with production telemetry gives both the wrong retention and the wrong alerts (observability).
- Seed/config drift. Unlogged seeds or an unpinned config make "the same run" irreproducible; the drift assertions above detect exactly this.
References¶
- MLflow (tracking + Model Registry): https://mlflow.org/docs/latest/
- Weights & Biases (tracking + Artifacts + Model Registry): https://docs.wandb.ai/
- DVC (data/model versioning): https://dvc.org/doc · LakeFS (data versioning): https://docs.lakefs.io/
Related: Fine-tuning and post-training · LLM evaluation harness · SRE/MLOps practices · Training-data curation · Storage & data platform · Observability · Model merging · RLVR · Glossary
[1] MLflow, open-source experiment tracking (`log_params`/`log_metrics`/`log_artifact`, `autolog`) plus a Model Registry with versioned models, stage/alias-based promotion, and lineage. https://mlflow.org/docs/latest/
[2] Weights & Biases, experiment tracking (`wandb.init`/`wandb.log`), Artifacts for dataset/model versioning and lineage, and a Model Registry. https://docs.wandb.ai/