
Runbook: untrusted reward-function onboarding

Scope: onboarding customer-supplied reward functions, verifiers, reward models, and environments onto a managed RL training service without letting tenant code reach the trainer process, other tenants' data, or platform credentials. Covers reward-type classification, the isolation boundary, resource and failure contracts, score-integrity gates, and the red-team verification pass.

Run this when custom rewards are about to be enabled on the service, when a tenant's reward code hangs or OOMs a run, or as the security review before GA. Severity: security-critical. Reward code is arbitrary tenant code executing adjacent to model weights, training data, and service credentials; a completions-plus-scores channel is also a data-exfiltration channel if egress is open.

The templates below are reference templates built on real APIs; pin versions and validate them before production use.

A reward function is the one place a training API executes logic the platform did not write, on every training step, thousands of times per run. The isolation spectrum (process, container, gVisor syscall interception, Firecracker/Kata microVM) is covered in sandboxing and isolation; tenant separation on the GPU platform itself, in security and multi-tenancy; what a well-designed reward looks like, in reward design; and the verifier-based special case, in RLVR. This runbook is the operational bridge: the contract that lets those rewards run without trusting them.

Trigger

Custom rewards are being enabled on a managed RL training API (new feature, new tenant tier, or GA security review).
A tenant's reward code misbehaved in production: a run stalled inside reward computation, a reward worker OOMed or forked itself to death, or reward latency became the training bottleneck.
A reward-integrity alarm fired: constant or degenerate scores, NaN rewards reaching the trainer, or a reward-hacking canary tripped (evaluation integrity).

Pre-checks

Confirm the trainer/reward process boundary exists at all. If reward code is imported into the trainer process, stop here: that is the finding. No resource cap or timeout inside a shared process protects the weights, the optimizer state, or the credentials in that process's environment.
Inventory what the reward path can currently see: environment variables, mounted volumes, network routes, service accounts. The target state is: its own batch of completions, nothing else.
Record the run-recovery contract before changing failure semantics: what happens to an in-flight run when reward computation fails today (checkpoint recovery).
Check the tenant tier. Reward-model uploads and stateful environments carry more attack surface than pure text-scoring functions; confirm that the tier the tenant bought matches the reward type they submitted.
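
The inventory in the second pre-check can be started from inside the reward path itself. The following is a minimal sketch, assuming a Linux container: the `inventory` helper and the name patterns are illustrative, and the token path is the default Kubernetes service-account mount, which a cluster may have moved or disabled. It reports names and mount points only, never values.

```python
import os
import re
from pathlib import Path

# Hypothetical name patterns for credential-like variables; tune per platform.
SECRET_NAME = re.compile(r"(TOKEN|SECRET|PASSWORD|CREDENTIAL|API_?KEY)", re.IGNORECASE)
# Default Kubernetes service-account token mount (assumes the default is in use).
SA_TOKEN = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def inventory(env: dict[str, str],
              mounts_file: Path = Path("/proc/self/mounts")) -> dict[str, object]:
    """Report what the calling process can see; variable values are never returned."""
    suspicious = sorted(k for k in env if SECRET_NAME.search(k))
    mounts: list[str] = []
    try:
        lines = mounts_file.read_text().splitlines()  # Linux only; absent elsewhere
    except OSError:
        lines = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            mounts.append(parts[1])  # second field is the mount point
    return {
        "env_count": len(env),
        "credential_like_env": suspicious,
        "mount_points": mounts,
        "service_account_token_present": SA_TOKEN.exists(),
    }


report = inventory(dict(os.environ))
print(report["credential_like_env"], report["service_account_token_present"])
```

Anything this reports beyond the batch of completions is a gap against the target state. Network routes and cloud metadata endpoints need a separate probe that this sketch does not attempt.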

Flow

flowchart TB
    A["Tenant submits reward artifact"] --> B{"Classify: function, verifier,<br/>reward model, or environment?"}
    B --> C["Pick sandbox tier<br/>(container / gVisor / microVM)"]
    C --> D["Apply resource contract:<br/>CPU, memory, pids, timeout, no egress"]
    D --> E["Wire typed RPC boundary:<br/>completions in, scores out"]
    E --> F["Schema + integrity gates<br/>on every score batch"]
    F --> G{"Red-team checklist passes?"}
    G -->|"yes"| H["Enable for tenant, monitor per-cause failures"]
    G -->|"no"| I["Reject artifact with findings"]

Procedure

Classify the reward type; each step up widens the attack surface.
Pure function over (prompt, completion) text: the base case below suffices.
Verifier that executes further code (unit tests, a compiler, a proof checker, as in RLVR): the sandbox must confine arbitrary code that the verifier itself runs, so treat it as hostile-code tier and give the verifier its own scratch filesystem that is destroyed per batch.
Reward model (tenant-uploaded weights): add artifact scanning and signature/digest pinning on the upload (container-image provenance has the pattern), plus a GPU-serving decision: tenant reward models get their own inference pool, never the trainer's GPUs.
Environment (stateful, multi-step, agentic): state persists across calls, so isolation must be per-run, not per-batch, and reset/teardown becomes part of the contract; the agent-side controls in the policy engine apply to any tools the environment exposes.
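
The four cases above can be written down as data so that admission code and reviewers read the same table. This is a sketch only: `RewardType`, `IsolationPlan`, `plan_for`, and the control labels are hypothetical names that restate the classification, and the free-tier rule is taken from the isolation step that follows.

```python
from dataclasses import dataclass, replace
from enum import Enum


class RewardType(Enum):
    FUNCTION = "function"
    VERIFIER = "verifier"
    REWARD_MODEL = "reward_model"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class IsolationPlan:
    hostile_code_tier: bool          # gVisor/microVM rather than a locked-down container
    scope: str                       # lifetime of sandbox state: "batch" or "run"
    extra_controls: tuple[str, ...]  # controls added on top of the base case


# Illustrative labels; each entry restates one line of the classification.
PLANS = {
    RewardType.FUNCTION: IsolationPlan(False, "batch", ()),
    RewardType.VERIFIER: IsolationPlan(True, "batch", ("scratch_fs_destroyed_per_batch",)),
    RewardType.REWARD_MODEL: IsolationPlan(
        False, "batch", ("artifact_scan", "digest_pinning", "dedicated_inference_pool")),
    RewardType.ENVIRONMENT: IsolationPlan(
        True, "run", ("reset_teardown_contract", "tool_policy_engine")),
}


def plan_for(reward_type: RewardType, free_tier: bool = False) -> IsolationPlan:
    base = PLANS[reward_type]
    # Free-tier tenants are treated as hostile-code tier regardless of reward type.
    return replace(base, hostile_code_tier=base.hostile_code_tier or free_tier)


print(plan_for(RewardType.VERIFIER))
```
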
Isolate the execution. Reward code never imports into the trainer process. It runs in its own sandbox with a typed RPC boundary: the platform sends a batch of completions (read-only), the sandbox returns scores, nothing else crosses. Pick the tier from the isolation ladder: a locked-down container (non-root, no-new-privileges, seccomp profile, read-only rootfs) for lightly trusted tenants; gVisor or a Firecracker/Kata microVM for hostile-code tier (verifiers, environments, free-tier tenants). No GPU access by default. Reference pod-level template:

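# Reference template: reward-sandbox pod (verify against your cluster version).
# Pod Security Standards "restricted" profile plus explicit caps.
# Container-level fields (spec.containers[].securityContext and .resources):
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true       # the read-only rootfs named above
  seccompProfile: { type: RuntimeDefault }
  capabilities: { drop: ["ALL"] }
resources:
  limits: { cpu: "2", memory: 2Gi, ephemeral-storage: 1Gi }
# Pod-level field (spec.runtimeClassName); the RuntimeClass name is cluster-defined.
# runtimeClassName: gvisor   # or kata for the microVM tier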
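
The typed RPC boundary described in this step (completions in, scores out, nothing else) might look like the following on the platform side. It is a sketch under stated assumptions: JSON is used only to keep it stdlib-only, and `ScoreRequest`, `ScoreResponse`, the field names, and the 1 MiB size cap are hypothetical, not a prescribed wire format.

```python
import json
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRequest:
    batch_id: str
    completions: tuple[str, ...]   # immutable: the sandbox receives a copy, never a handle


@dataclass(frozen=True)
class ScoreResponse:
    batch_id: str
    scores: tuple[float, ...]


def encode_request(req: ScoreRequest) -> bytes:
    payload = {"batch_id": req.batch_id, "completions": list(req.completions)}
    return json.dumps(payload).encode("utf-8")


def decode_response(raw: bytes, req: ScoreRequest, max_bytes: int = 1 << 20) -> ScoreResponse:
    """Accept exactly {batch_id, scores}; anything else raises ValueError."""
    if len(raw) > max_bytes:
        raise ValueError("oversized response")
    obj = json.loads(raw.decode("utf-8"))
    if not isinstance(obj, dict) or set(obj) != {"batch_id", "scores"}:
        raise ValueError("unexpected fields")
    if obj["batch_id"] != req.batch_id:
        raise ValueError("batch_id mismatch")
    scores = obj["scores"]
    if not isinstance(scores, list) or len(scores) != len(req.completions):
        raise ValueError("wrong shape")
    out: list[float] = []
    for s in scores:
        if isinstance(s, bool) or not isinstance(s, (int, float)):
            raise ValueError("non-numeric score")
        try:
            f = float(s)
        except OverflowError as exc:      # integer too large for a float
            raise ValueError("score out of float range") from exc
        if not math.isfinite(f):          # json.loads accepts NaN/Infinity by default
            raise ValueError("non-finite score")
        out.append(f)
    return ScoreResponse(req.batch_id, tuple(out))


req = ScoreRequest("b1", ("completion a", "completion b"))
print(decode_response(b'{"batch_id": "b1", "scores": [0.5, 1]}', req))
```

Rejecting unknown fields is the point of the strict key check: an extra `debug` or `logs` field is exactly how "nothing else crosses" erodes over time.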

Enforce the resource contract. CPU and memory caps per sandbox; a pids cgroup limit so a fork bomb dies at the limit rather than at the node; a wall-clock timeout per score batch; a concurrency cap per tenant so one customer cannot monopolize the reward fleet. Network egress is off by default: an open egress path from code that sees every completion is an exfiltration channel. Rewards that legitimately need network (a web-search verifier) go through a documented exception: an egress allowlist proxy, per-tenant, logged.
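
CPU, memory, and pids limits are enforced by the kernel and the sandbox runtime, not by platform code, but the per-tenant concurrency cap lives in the dispatcher. The following is one possible shape for it, assuming a threaded dispatcher; `TenantConcurrencyCap` is a hypothetical helper, and rejecting excess work (rather than queueing it) is a design choice, not something this runbook mandates.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class TenantConcurrencyCap:
    """At most `limit` in-flight score batches per tenant; excess is rejected."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._lock = threading.Lock()
        self._in_flight: dict[str, int] = defaultdict(int)

    @contextmanager
    def slot(self, tenant_id: str):
        with self._lock:
            if self._in_flight[tenant_id] >= self._limit:
                raise RuntimeError(f"tenant {tenant_id} at concurrency cap")
            self._in_flight[tenant_id] += 1
        try:
            yield
        finally:
            # Release the slot even if the reward call raised or timed out.
            with self._lock:
                self._in_flight[tenant_id] -= 1


cap = TenantConcurrencyCap(limit=2)
with cap.slot("tenant-a"), cap.slot("tenant-a"):
    try:
        with cap.slot("tenant-a"):       # third concurrent batch for the same tenant
            pass
    except RuntimeError as exc:
        print("rejected:", exc)
    with cap.slot("tenant-b"):           # other tenants are unaffected
        print("tenant-b admitted")
```
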
Define failure semantics before the first failure. When reward code times out or crashes mid-run, the platform has three defensible policies: fail the batch and retry up to k times; mark the batch skipped and exclude it from the update; or pause the run for tenant action. Choose per product tier, but never silently substitute a constant (a zero reward is not neutral: it shifts the reward distribution and, under group-normalized methods like GRPO, rewrites every advantage in the group). Book every failure to a cause the customer can see: tenant timeout, tenant bad output, platform error. The distinction is what keeps reward-code faults out of the platform's error budget (training-platform SLOs).
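
The claim that a zero is not neutral can be checked with a few lines of arithmetic. The sketch below assumes the common group-normalized form, advantage = (reward - group mean) / group standard deviation, with a population standard deviation; real GRPO implementations differ in details such as the epsilon term. It compares excluding a failed sample with substituting 0.0 for it.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantage: (r - mean) / std over one prompt's samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0] * len(rewards)   # constant group: no learning signal
    return [(r - mu) / sigma for r in rewards]


valid = [0.8, 0.6, 0.7]                         # scores that came back from the sandbox
excluded = group_advantages(valid)              # failed sample dropped from the group
zero_filled = group_advantages(valid + [0.0])   # failed sample replaced by 0.0

for r, a_ex, a_zero in zip(valid, excluded, zero_filled):
    print(f"reward={r:.1f}  excluded={a_ex:+.2f}  zero_filled={a_zero:+.2f}")
```

With the zero substituted, the group mean drops from 0.7 to 0.525, so the 0.6 completion, which was below average among the valid scores, now receives a positive advantage. The substitution changed the sign of a gradient contribution for a sample that never failed.
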
Gate every score batch on schema and integrity. Validate shape, type, and finiteness before scores reach the trainer; track the rate of degenerate outputs (a constant reward is a dead gradient signal and usually a bug, not a preference); and keep reward-hacking canaries from evaluation integrity in the loop (a held-out probe set whose scores the tenant's code should not be able to inflate). The executed model below is the contract in miniature: deadline enforcement, schema rejection, and per-cause accounting (it validates the contract, not a product; real isolation lives in the sandbox tier above):

# reward_executor.py - executed: the reward-sandbox contract in miniature. A
# separate-process executor enforces a wall-clock deadline, validates the score
# schema before anything reaches the trainer, and books failures per cause so
# tenant reward-code faults are never billed as platform faults. Validates the
# contract (stdlib only); real isolation belongs to the sandbox tier, not here.
from __future__ import annotations

import math
import multiprocessing as mp
from typing import Callable, Optional

Batch = list[str]


def good_reward(batch: Batch) -> list[float]:
    return [float(len(c) % 7) / 7.0 for c in batch]


def nan_reward(batch: Batch) -> list[float]:
    return [float("nan")] * len(batch)


def hanging_reward(batch: Batch) -> list[float]:
    while True:
        pass


def wrong_shape_reward(batch: Batch) -> list[float]:
    return [0.5]                           # one score for a three-item batch


def _worker(fn: Callable[[Batch], list[float]], batch: Batch, q: "mp.Queue") -> None:
    q.put(fn(batch))


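def validate(scores: object, n: int) -> list[float]:
    """Schema gate: a list of n finite floats, nothing else reaches the trainer."""
    # Explicit raises, not assert: asserts are stripped under `python -O`,
    # which would silently disable the gate.
    if not isinstance(scores, list) or len(scores) != n:
        raise ValueError("wrong shape")
    out: list[float] = []
    for s in scores:
        # bool is a subclass of int; reject it so True/False cannot pass as scores.
        if isinstance(s, bool) or not isinstance(s, (int, float)) or not math.isfinite(float(s)):
            raise ValueError("non-numeric or non-finite score")
        out.append(float(s))
    return out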


def run_reward(fn: Callable[[Batch], list[float]], batch: Batch,
               timeout_s: float, ledger: dict[str, int]) -> Optional[list[float]]:
    """Deadline + schema + per-cause accounting. None = batch failed, caller
    decides retry/pause; scores are never silently substituted."""
    q: mp.Queue = mp.Queue()
    p = mp.Process(target=_worker, args=(fn, batch, q), daemon=True)
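    p.start()
    # join-before-get is safe only while the score list fits in the pipe buffer:
    # a child with a larger result stays alive in its queue feeder thread until
    # the parent reads, and would be misbooked below as a timeout.
    p.join(timeout_s)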
    if p.is_alive():                       # deadline: kill, book to the tenant
        p.terminate()
        p.join()
        ledger["tenant_timeout"] += 1
        return None
    try:
        scores = validate(q.get(timeout=1.0), len(batch))
    except Exception:                      # crash, empty queue, bad schema
        ledger["tenant_bad_output"] += 1
        return None
    ledger["ok"] += 1
    return scores


if __name__ == "__main__":
    # Guard required wherever multiprocessing uses the spawn or forkserver start
    # method: the child re-imports this module and must not re-run the demo.
    ledger = {"ok": 0, "tenant_timeout": 0, "tenant_bad_output": 0}
    batch = ["completion a", "longer completion b", "c"]

    scores = run_reward(good_reward, batch, timeout_s=5.0, ledger=ledger)
    assert scores is not None and len(scores) == 3
    assert all(0.0 <= s <= 1.0 and math.isfinite(s) for s in scores)

    assert run_reward(hanging_reward, batch, timeout_s=0.5, ledger=ledger) is None
    assert run_reward(nan_reward, batch, timeout_s=5.0, ledger=ledger) is None
    assert run_reward(wrong_shape_reward, batch, timeout_s=5.0, ledger=ledger) is None

    assert ledger == {"ok": 1, "tenant_timeout": 1, "tenant_bad_output": 2}, ledger
    print("ledger:", ledger)
    print("all reward-executor contract assertions passed")

Output of the run: ledger: {'ok': 1, 'tenant_timeout': 1, 'tenant_bad_output': 2} followed by all reward-executor contract assertions passed. The hanging function died at its deadline, the NaN and wrong-shape outputs never reached the trainer, and each failure landed in the tenant's column, not the platform's.

Scrub secrets and scope data. The sandbox environment carries no platform credentials, no service-account tokens beyond what the RPC boundary itself needs, and sees only the submitting tenant's run data. Reward-code logs are tenant-visible, so scrub platform internals from them; platform logs of reward calls scrub completion content unless the tenant opted into capture (identity and access for the authority model). If tenants ship reward code as container images, pin digests and require signatures (container-image provenance).
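
For the environment half of the secret scrub, an allowlist is safer than a denylist because unknown variables are dropped by default. The following is a minimal sketch, assuming the reward process is launched with `subprocess`; `ALLOWED_ENV`, `scrubbed_env`, and `run_reward_cmd` are hypothetical names, and the allowlist contents are platform-specific (some platforms need further variables just to start an interpreter).

```python
import subprocess

# Hypothetical allowlist: the only variables the reward process may inherit.
ALLOWED_ENV = ("PATH", "LANG", "TZ")


def scrubbed_env(parent_env: dict[str, str],
                 allowed: tuple[str, ...] = ALLOWED_ENV) -> dict[str, str]:
    """Allowlist, not denylist: anything not named in `allowed` is dropped."""
    return {k: v for k, v in parent_env.items() if k in allowed}


def run_reward_cmd(cmd: list[str], parent_env: dict[str, str], timeout_s: float) -> str:
    # env= replaces the child's environment entirely instead of extending it.
    result = subprocess.run(cmd, env=scrubbed_env(parent_env), capture_output=True,
                            text=True, timeout=timeout_s, check=True)
    return result.stdout


parent = {
    "PATH": "/usr/bin:/bin",
    "PLATFORM_API_TOKEN": "not-a-real-token",
    "KUBERNETES_SERVICE_HOST": "10.0.0.1",
}
print(scrubbed_env(parent))
```

This covers inherited variables only. Mounted service-account tokens and cloud metadata endpoints are closed at the pod and network layers, not here.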

Verification

Red-team checklist on a staging tenant passes: a fork bomb dies at the pids limit, not the node; an infinite loop dies at the batch timeout; egress to an external host is refused; host paths and other tenants' data are absent from the sandbox filesystem; an oversized or malformed score payload is rejected at the schema gate.
Failure accounting is visible per tenant: induced timeout, crash, and bad-output failures land in distinct, customer-visible counters, and none of them page the platform on-call.
The trainer survives reward chaos: with a deliberately flaky staging reward (random hangs and crashes), runs pause or retry per the chosen policy, checkpoints stay resumable, and no NaN or constant-substituted rewards appear in the training telemetry.
Overhead is bounded: reward wall-clock per batch at the chosen sandbox tier stays within the step-time budget; if the sandbox tier added unacceptable latency, revisit the tier choice, not the boundary.
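
The telemetry check in the third item (no NaN or constant rewards) and the degenerate-output rate tracked in the procedure can share one rolling monitor. This is a sketch with illustrative defaults: `DegenerateScoreMonitor`, the window size, and the spread threshold are assumptions, and it treats a batch as degenerate when any score is non-finite or all scores are effectively equal.

```python
import math
from collections import deque


class DegenerateScoreMonitor:
    """Rolling rate of score batches that carry no usable gradient signal."""

    def __init__(self, window: int = 100, spread_eps: float = 1e-9) -> None:
        self._flags: deque[bool] = deque(maxlen=window)
        self._eps = spread_eps

    def observe(self, scores: list[float]) -> bool:
        non_finite = any(not math.isfinite(s) for s in scores)
        # A single-item batch is trivially constant, so it is not flagged.
        constant = (not non_finite and len(scores) > 1
                    and (max(scores) - min(scores)) <= self._eps)
        degenerate = non_finite or constant
        self._flags.append(degenerate)
        return degenerate

    def rate(self) -> float:
        return sum(self._flags) / len(self._flags) if self._flags else 0.0


monitor = DegenerateScoreMonitor(window=4)
for batch_scores in ([0.1, 0.9], [0.5, 0.5, 0.5], [float("nan"), 0.2], [0.0, 1.0]):
    monitor.observe(batch_scores)
print(f"degenerate rate: {monitor.rate():.2f}")
```

During the chaos run, this rate should stay at zero for scores that reach the trainer; a nonzero value means a failed batch was substituted rather than retried, skipped, or paused.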

Rollback

Disable the tenant's custom reward at the feature-flag level; affected runs pause into a queued state rather than continuing with a substitute signal.
Revoke the sandbox artifact (image digest or uploaded package) so retries cannot re-materialize it, and re-admit only after the finding is fixed and re-verified.
Rotate anything that entered the blast radius: if a misconfiguration exposed credentials or another tenant's data to the sandbox, rotate those credentials and file the tenancy incident under security and multi-tenancy; custom rewards for that tenant are re-enabled only after the red-team checklist passes again.
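
The revocation step only works if every materialization, including retries, consults the revocation list. One way to sketch that check is shown below; `ArtifactRegistry` and its methods are hypothetical, and an in-memory dict stands in for whatever durable store the platform uses.

```python
import hashlib


class ArtifactRegistry:
    """Tracks revoked reward artifacts by SHA-256 digest (hypothetical helper)."""

    def __init__(self) -> None:
        self._revoked: dict[str, str] = {}   # digest -> finding that caused revocation

    @staticmethod
    def digest(artifact: bytes) -> str:
        return "sha256:" + hashlib.sha256(artifact).hexdigest()

    def revoke(self, digest: str, finding: str) -> None:
        self._revoked[digest] = finding

    def readmit(self, digest: str, red_team_passed: bool) -> None:
        # Re-admission is gated on re-verification, not on the fix alone.
        if not red_team_passed:
            raise PermissionError("re-admission requires a passing red-team checklist")
        self._revoked.pop(digest, None)

    def admit(self, artifact: bytes) -> bool:
        """Called on every materialization, including retries."""
        return self.digest(artifact) not in self._revoked


registry = ArtifactRegistry()
package = b"def reward(batch): return [0.0] * len(batch)"
registry.revoke(registry.digest(package), finding="egress attempt")
print("admitted after revoke:", registry.admit(package))
```

Keying on the content digest rather than a name or tag means a re-upload of the same bytes under a new label is still refused.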

Related runbooks:
Checkpoint-recovery runbook: resuming runs interrupted by reward-path failures.
Training-OOM runbook: when the memory pressure is in the trainer, not the reward sandbox.
Operational runbooks index.

References

gVisor documentation (user-space kernel, syscall interception): https://gvisor.dev/docs/
Kata Containers (microVM container runtime): https://katacontainers.io/
Firecracker microVM: https://firecracker-microvm.github.io/
Kubernetes Pod Security Standards (restricted profile): https://kubernetes.io/docs/concepts/security/pod-security-standards/
Kubernetes seccomp tutorial (RuntimeDefault and custom profiles): https://kubernetes.io/docs/tutorials/security/seccomp/
Linux cgroups v2 (pids, memory, cpu controllers): https://man7.org/linux/man-pages/man7/cgroups.7.html