
Agentic and tool-use RL

Scope: reinforcement-learning post-training that teaches an LLM to act (call tools, run code, take multi-turn steps in an environment) rather than answer in one shot. The rollout becomes a ReAct trajectory (think, act, observe, repeat), the reward is whether the task succeeded, and tool outputs are masked from the loss. This extends GRPO and the post-training stack (fine-tuning and post-training) to agents, and it is the workload that stresses rollout infrastructure hardest (async RL systems).

The framework and sandbox snippets (verl, TRL, sandboxes) are reference templates against real APIs; pin versions and validate them before production use. The three numpy blocks below (loss masking, ReAct-loop control, trajectory reward and group-relative advantage) are executed and asserted; the verl and E2B blocks are labelled reference templates.

What it is

Tool use is a trained skill: a model emits a structured request (a tool name and arguments), an orchestrator executes it, the result is appended to the context, and the model continues generating. RLHF and RLVR are how that behaviour is refined.¹ Agentic RL applies an online RL algorithm (GRPO, PPO) to multi-turn trajectories instead of single completions: each rollout interleaves model-generated reasoning and tool calls with environment-returned observations, ending in a verifiable outcome. The lineage runs from Toolformer (a model teaching itself to call a few APIs)³ and Gorilla (1,600+ APIs)⁴ to today's reasoning agents trained to search, browse, and execute code.

The three terms the field uses interchangeably are worth separating:¹

Tool use: the model emits a structured request; an orchestrator runs it; the result re-enters the context (tools and function calling).
Function calling: tool use where the arguments must match a declared JSON Schema, so calls parse and validate reliably.
Code execution: a special case where the tool is a code interpreter (Python), letting the model hand exact computation to a deterministic interpreter instead of approximating the result token by token.
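
To make the function-calling definition concrete, here is a minimal sketch of an orchestrator-side check that a call parses and matches its declared schema. The `web_search` tool, its fields, and `validate_call` are hypothetical names used for illustration, and the checker covers only required keys, unknown keys, and primitive types; it is not a full JSON Schema validator.

```python
import json

# Hypothetical tool declaration in the JSON Schema style used for function calling.
SEARCH_TOOL = {
    "name": "web_search",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer"},
        },
        "required": ["query"],
    },
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_call(raw, tool):
    """Return (ok, reason) for one raw tool-call string emitted by the model."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict) or call.get("name") != tool["name"]:
        return False, "unknown tool"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    schema = tool["parameters"]
    for key in schema["required"]:
        if key not in args:
            return False, f"missing required argument: {key}"
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            return False, f"unexpected argument: {key}"
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            return False, f"wrong type for {key}"
        if not isinstance(value, _PY_TYPES[spec["type"]]):
            return False, f"wrong type for {key}"
    return True, "ok"

good = '{"name": "web_search", "arguments": {"query": "verl multi-turn", "top_k": 3}}'
bad = '{"name": "web_search", "arguments": {"top_k": 3}}'
print(validate_call(good, SEARCH_TOOL))  # (True, 'ok')
print(validate_call(bad, SEARCH_TOOL))   # (False, 'missing required argument: query')
```

The fraction of calls that pass a check like this is one way to compute a tool-call validity rate during training.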

Why use it

It removes structural weaknesses. A model cannot answer "who is president today?" from frozen weights, but one search call can. Code execution turns "approximate pi to 50 digits" from a hallucination risk into a deterministic result.¹
It is the frontier RL workload. Reasoning models are increasingly trained to interleave thinking with actions (the ReAct pattern, reasoning and acting in one model),² pushing capability past what single-turn GRPO reaches.
Verifiable rewards extend naturally. Many agentic tasks have a programmatic success signal (the test suite passes, the answer matches, the file landed in the right directory), so reward design stays clean without a learned reward model (RLVR).

When to use it (and when not)

Use agentic RL when the task genuinely needs multi-step interaction with an environment (search, code execution, browser/API use) and you have a checkable success signal per trajectory.
Prefer single-turn GRPO when the task is answer-in-one-shot (maths, short reasoning); multi-turn rollouts cost far more and add no value there.
Prefer SFT on tool-call traces first: cold-start the format (valid JSON tool calls, the think/act structure) with supervised traces before RL; RL refines a policy that already emits well-formed calls.
Avoid if you cannot sandbox tool execution safely or cannot define trajectory success. An ungrounded or hackable reward is worse here, where the model has more degrees of freedom to game it (reward design).

Architecture

The rollout is a ReAct loop (agent loop): the policy reasons about what it needs, emits an action (a tool call), the orchestrator executes it, the observation is appended, and the loop repeats until the model emits a final answer or a turn limit is hit.² The generation server (vLLM/SGLang) drives the model tokens; a separate tool executor runs the search, API, or sandboxed code; a trajectory scorer turns the finished rollout into a reward; and the policy update applies a group-relative advantage through a per-token loss mask that zeroes every injected observation span. Weight sync closes the loop back to the generator.

flowchart LR
  P["Prompt / task"] --> GEN["Policy generates: think + tool call"]
  GEN --> ORCH["Orchestrator executes tool<br/>(search / code sandbox / API)"]
  ORCH --> OBS["Observation appended to context"]
  OBS -->|"continue, up to max turns"| GEN
  GEN -->|"final answer"| RWD["Trajectory reward<br/>(task success, tests pass)"]
  RWD --> ADV["Group-relative advantage"]
  ADV --> UPD["Policy update<br/>(tool-output tokens MASKED from loss)"]
  UPD -->|"weight sync"| GEN

Compared with a single-turn rollout, three things change in the trajectory itself:

Variable length. A trajectory may take one tool call or twenty, bounded only by the turn cap, so lengths vary wildly across a batch.
The environment is in the loop. Generation pauses on every tool call while the orchestrator runs a search, an API, or a code sandbox, then resumes.
Multiple calls per turn. A model can emit several tool calls in one generation step.¹
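
The "multiple calls per turn" case can be sketched as follows: one generation step is scanned for every tool call, each call is executed in emission order, and each result becomes an observation. The `<tool_call>...</tool_call>` wrapper, the registry of plain callables, and both function names are assumptions for illustration; real chat templates and executors differ.

```python
import json
import re

# Assumed wire format: each call is a JSON object wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_tool_calls(generation):
    """Extract every tool call from one generation step, in emission order.
    A malformed payload is kept as an error record rather than silently dropped."""
    calls = []
    for payload in TOOL_CALL_RE.findall(generation):
        try:
            call = json.loads(payload)
            name, args = call["name"], call["arguments"]
            if not isinstance(name, str) or not isinstance(args, dict):
                raise TypeError("name must be a string and arguments an object")
            calls.append({"name": name, "arguments": args})
        except (json.JSONDecodeError, KeyError, TypeError):
            calls.append({"error": "malformed tool call"})
    return calls

def execute_calls(calls, registry):
    """Run each call through a registry of callables (stand-ins for a real,
    sandboxed executor). Every failure becomes an observation string."""
    observations = []
    for call in calls:
        if "error" in call:
            observations.append(call["error"])
            continue
        tool = registry.get(call["name"])
        if tool is None:
            observations.append(f"unknown tool: {call['name']}")
            continue
        try:
            observations.append(str(tool(**call["arguments"])))
        except Exception as exc:  # tool failures are data for the policy, not crashes
            observations.append(f"tool error: {type(exc).__name__}")
    return observations

registry = {"add": lambda a, b: a + b, "upper": lambda text: text.upper()}
step = ('I need two results. '
        '<tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>'
        '<tool_call>{"name": "upper", "arguments": {"text": "ok"}}</tool_call>')
print(execute_calls(parse_tool_calls(step), registry))  # ['5', 'OK']
```

Returning errors as observations keeps the trajectory going, so the policy can see and recover from a bad call instead of the rollout aborting.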

The key detail: mask tool outputs from the loss

The single most important agentic-RL implementation point: tool-call results (the observation tokens the orchestrator appends) are masked out of the policy-gradient loss. The model did not generate them (the environment did), so training on them would teach the policy to predict tool outputs, which is wrong and destabilising. You compute the loss only over the tokens the policy actually produced (its reasoning and its tool-call requests), with the observation spans masked.¹ Concretely the trajectory carries a per-token loss_mask that is 0 over every injected tool-output span and 1 over model-generated tokens; the advantage is applied through that mask. This numpy block builds that mask, computes the masked loss, and asserts the properties it must have (equivalence to a slow reference, immunity to corrupted observation tokens, that forgetting the mask changes the answer, and that an all-observation trajectory raises instead of dividing by zero):

# agentic_loss_mask.py -- masked policy-gradient loss for a ReAct trajectory. numpy only.
# Tool-output (observation) tokens are masked out: the environment produced them,
# not the policy, so training on them would teach the model to predict tool results.
import numpy as np

def masked_pg_loss(logp, advantage, loss_mask):
    """Per-token REINFORCE loss -A * logp, averaged over MODEL tokens only.
    loss_mask is 1 on policy-generated tokens, 0 on injected tool-output spans."""
    logp, advantage, loss_mask = map(np.asarray, (logp, advantage, loss_mask))
    per_tok = -advantage * logp * loss_mask
    denom = loss_mask.sum()
    if denom == 0:
        raise ValueError("no model-generated tokens to train on")
    return per_tok.sum() / denom

def reference_loss(logp, advantage, mask):
    """Slow, explicit reference: iterate and skip masked positions."""
    total, n = 0.0, 0
    for lp, a, m in zip(logp, advantage, mask):
        if m:
            total += -a * lp
            n += 1
    return total / n

# A trajectory: [think, tool_call, <OBS obs obs>, think, answer].
#                 1      1          0   0   0       1      1      <- loss_mask
logp = np.array([-0.10, -0.20, -5.0, -6.0, -7.0, -0.30, -0.40])
adv  = np.full(7, 0.5)                       # constant trajectory advantage
mask = np.array([1, 1, 0, 0, 0, 1, 1], dtype=float)

# (1) equivalence to the slow reference implementation.
fast, slow = masked_pg_loss(logp, adv, mask), reference_loss(logp, adv, mask)
assert np.isclose(fast, slow), (fast, slow)

# (2) masked tokens contribute EXACTLY zero: corrupting only the observation
#     logprobs (huge, adversarial values) must not change the loss at all.
corrupt = logp.copy()
corrupt[2:5] = np.array([-1e6, -1e6, -1e6])          # poison the tool-output span
assert np.isclose(masked_pg_loss(logp, adv, mask), masked_pg_loss(corrupt, adv, mask))

# (3) FAILURE case the page warns about: forgetting the mask (all ones) lets the
#     environment tokens dominate the loss, provably a different, wrong number.
no_mask = np.ones_like(mask)
assert not np.isclose(masked_pg_loss(logp, adv, mask), masked_pg_loss(logp, adv, no_mask))

# (4) boundary: a trajectory that is ALL observation (no model tokens) must raise,
#     not silently divide by zero.
try:
    masked_pg_loss(logp, adv, np.zeros_like(mask))
    raise AssertionError("expected ValueError on all-masked trajectory")
except ValueError:
    pass

print("masked loss:", round(float(fast), 4), "| unmasked (wrong):",
      round(float(masked_pg_loss(logp, adv, no_mask)), 4))
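
The block above hard-codes its mask. In a real rollout the mask is usually derived from where each span came from, which might look like the sketch below. The `(source, token_ids)` segment layout and the `"model"` / `"env"` labels are assumptions for illustration; here the prompt would also be tagged `"env"`, since the policy did not generate it.

```python
import numpy as np

def build_loss_mask(segments):
    """segments: list of (source, token_ids) in trajectory order, where source
    is "model" (policy-generated) or "env" (prompt or injected observation).
    Returns flat (token_ids, loss_mask) arrays of equal length."""
    ids, mask = [], []
    for source, tokens in segments:
        if source not in ("model", "env"):
            raise ValueError(f"unknown segment source: {source!r}")
        ids.extend(tokens)
        mask.extend([1.0 if source == "model" else 0.0] * len(tokens))
    return np.array(ids, dtype=np.int64), np.array(mask, dtype=float)

# Same shape as the trajectory above: [think, tool_call, <OBS obs obs>, think, answer].
segments = [
    ("model", [11, 12]),       # think + tool call
    ("env",   [90, 91, 92]),   # observation appended by the orchestrator
    ("model", [13, 14]),       # think + answer
]
token_ids, loss_mask = build_loss_mask(segments)
assert loss_mask.tolist() == [1, 1, 0, 0, 0, 1, 1]
assert len(token_ids) == len(loss_mask)
```

Recording the source at append time is less error-prone than re-deriving span boundaries from the decoded text afterwards.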

How to use it

Frameworks expose a multi-turn / agentic rollout mode that runs the ReAct loop and returns a masked trajectory. verl drives multi-turn GRPO with a tool/agent loop and vLLM rollouts on Ray; pin the release and confirm keys on the repo. This is a reference template (keys drift across releases); the numpy block under it validates the loop-control math the max_turns key governs.

# verl multi-turn / agentic GRPO. Reference template: illustrative keys, verify
# against the installed release before use.
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.multi_turn.enable=true \
  actor_rollout_ref.rollout.multi_turn.max_turns=8 \
  actor_rollout_ref.rollout.multi_turn.tool_config_path=tools.yaml

max_turns bounds the loop: the rollout ends when the policy emits a final answer or the cap is reached, and a rollout cut off at the cap is truncated, not a genuine failure. This numpy block reproduces that control flow and asserts the boundary (answer on the last allowed turn counts as done) and the failure case (a task needing more turns than the cap is flagged truncated):

# react_loop.py -- ReAct rollout control: terminate on final answer OR max_turns.
# Validates the core loop math the verl `multi_turn.max_turns` config governs.
# numpy only (deterministic policy stub stands in for the generation server).
import numpy as np

def rollout(policy, max_turns):
    """Run the think/act/observe loop. Returns (n_turns, done, truncated).
    `policy(turn) -> ("answer"|"tool")` decides each step; the orchestrator would
    execute the tool and append the observation between steps."""
    assert max_turns >= 1
    for turn in range(1, max_turns + 1):
        if policy(turn) == "answer":
            return turn, True, False          # emitted a final answer
    return max_turns, False, True             # hit the cap -> truncated, not done

# Policy that answers on turn k (makes k-1 tool calls first).
def answers_on(k):
    return lambda turn: "answer" if turn >= k else "tool"

# (1) a short task terminates exactly when the answer is emitted, not truncated.
n, done, trunc = rollout(answers_on(3), max_turns=8)
assert (n, done, trunc) == (3, True, False)

# (2) BOUNDARY: answer lands on the very last allowed turn -> done, NOT truncated.
n, done, trunc = rollout(answers_on(8), max_turns=8)
assert (n, done, trunc) == (8, True, False)

# (3) FAILURE the page warns about: task needs more turns than the cap -> the loop
#     stops at max_turns and flags truncation, so the reward layer must not score
#     the cut-off trajectory as a genuine failure.
n, done, trunc = rollout(answers_on(20), max_turns=8)
assert (n, done, trunc) == (8, False, True)

# (4) a policy that never calls a tool answers on turn 1 (single-turn degenerates
#     to one step); an invalid max_turns=0 cap is rejected up front.
assert rollout(answers_on(1), 8) == (1, True, False)
raised = False
try:
    rollout(answers_on(1), 0)
except AssertionError:
    raised = True
assert raised, "max_turns=0 must be rejected"

print("turns to answer / done / truncated for cap=8, need=20:", (n, done, trunc))
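
One way to act on the truncation flag, sketched under assumptions: exclude capped rollouts from the group baseline and give them zero advantage, so they are neither rewarded nor punished. This is one possible policy rather than the only reasonable one; `truncation_aware_advantage` is a hypothetical helper, and dropping truncated rollouts entirely removes any pressure to finish within the cap, so some setups apply a mild penalty instead.

```python
import numpy as np

def truncation_aware_advantage(rewards, truncated, eps=1e-8):
    """Group-relative advantage computed over completed rollouts only.
    Truncated rollouts get advantage 0.0, so they contribute no gradient."""
    rewards = np.asarray(rewards, dtype=float)
    done = ~np.asarray(truncated, dtype=bool)
    adv = np.zeros_like(rewards)
    if done.sum() >= 2:                      # need at least two rollouts for a baseline
        r = rewards[done]
        adv[done] = (r - r.mean()) / (r.std() + eps)
    return adv

# Four rollouts for one task; the third hit max_turns before answering.
adv = truncation_aware_advantage([1.0, 0.0, 0.0, 1.0], [False, False, True, False])
assert adv[2] == 0.0                         # truncated: no learning signal either way
assert adv[0] > 0 and adv[1] < 0             # completed rollouts are still ranked
assert np.isclose(adv[0], adv[3])
print(np.round(adv, 3).tolist())
```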

How to integrate with it

The generation engine has to pause on each tool call, hand off to a tool executor, and resume, so integration is mostly wiring three components: the rollout engine (verl / TRL on vLLM/SGLang), a tool registry the model can call (tools and function calling), and a reward function that scores a finished trajectory. The reward is a trajectory scorer: for code tasks, run the model's final program in a sandbox and return pass/fail. The Python block below is a reference template (it depends on E2B / a container runtime and helper extractors); the runnable numpy block that follows validates the outcome-reward and group-relative-advantage math it computes.

# reward_agentic.py -- outcome reward for a code-execution trajectory.
# Reference template: run_in_sandbox / extract_final_code depend on E2B or a
# container runtime (never the host). Pin the sandbox client before use.
def reward_task_success(prompts, trajectories, tests, **kw):
    scores = []
    for traj, test in zip(trajectories, tests):
        program = extract_final_code(traj)           # the model's last code block
        passed = run_in_sandbox(program, test)        # E2B / container; never the host
        scores.append(1.0 if passed else 0.0)
    return scores

Outcome reward is all-or-nothing per trajectory; a group of trajectories for the same task is then normalised into advantages (GRPO), with no value network. This numpy block validates both, including the degenerate group (every rollout identical, so zero learning signal) and the truncation boundary (an empty trajectory scores 0.0, never a false success):

# agentic_reward.py -- trajectory outcome reward + group-relative advantage. numpy only.
# Outcome reward: run the agent's final program against tests, score 1.0 iff all pass.
# GRPO advantage: normalise a GROUP of trajectories for the same task; no value net.
import numpy as np

def outcome_reward(test_results):
    """test_results: list of per-trajectory bool arrays (one bool per unit test).
    Verifiable, outcome-based: reward 1.0 only when every test passes."""
    return np.array([1.0 if len(t) and all(t) else 0.0 for t in test_results])

def group_advantage(rewards):
    """GRPO: subtract the group mean and divide by group std (guarded)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# A group of 4 rollouts for one coding task; each ran a 3-case test suite.
group = [
    [True, True, True],     # solved -> reward 1.0
    [True, False, True],    # partial fail -> 0.0 (outcome reward is all-or-nothing)
    [True, True, True],     # solved -> 1.0
    [False, False, False],  # wrong -> 0.0
]
r = outcome_reward(group)
assert r.tolist() == [1.0, 0.0, 1.0, 0.0]

adv = group_advantage(r)
# (1) advantages are zero-centred within the group (GRPO baseline property).
assert np.isclose(adv.mean(), 0.0, atol=1e-6)
# (2) a solved trajectory gets positive advantage, a failed one negative.
assert adv[0] > 0 and adv[3] < 0
# (3) equal-reward trajectories get equal advantage.
assert np.isclose(adv[0], adv[2])

# (4) DEGENERATE group the failure-modes section warns about: every rollout
#     identical (all pass or all fail) -> zero learning signal, must not NaN.
for degen in ([[True]*3]*4, [[False]*3]*4):
    a = group_advantage(outcome_reward(degen))
    assert np.all(np.abs(a) < 1e-4) and not np.any(np.isnan(a))

# (5) boundary: an EMPTY trajectory (no tests ran, e.g. turn-limit truncation
#     before any code) scores 0.0, never a false success.
assert outcome_reward([[]]).tolist() == [0.0]

print("rewards:", r.tolist(), "| advantages:", np.round(adv, 3).tolist())

Beyond the pure outcome signal, optional shaping terms can help: format rewards (valid JSON tool calls), step penalties (discourage needless tool calls), and partial credit on sub-goals. Gate each one on correctness so the model cannot farm the shaping term without solving the task (reward design).
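
A minimal sketch of correctness-gated shaping, covering the format bonus and the step penalty (partial credit on sub-goals is left out). The function name, coefficients, and the `free_calls` allowance are illustrative assumptions, not recommended values. The shaping term is clipped so that any passing trajectory always outscores any failing one.

```python
def shaped_reward(passed, valid_call_fraction, n_tool_calls,
                  format_bonus=0.1, step_penalty=0.01, free_calls=3, cap=0.5):
    """Outcome reward with shaping that only applies once the task is solved.
    A failed trajectory scores 0.0 no matter how well-formed its calls were."""
    if not passed:
        return 0.0                                  # gate: no shaping without success
    bonus = format_bonus * valid_call_fraction      # reward well-formed tool calls
    penalty = step_penalty * max(0, n_tool_calls - free_calls)  # discourage extra calls
    shaping = max(-cap, min(cap, bonus - penalty))  # cap < 1 keeps success above failure
    return 1.0 + shaping

# Perfect formatting cannot rescue a failed trajectory.
assert shaped_reward(False, 1.0, 0) == 0.0
# A solved task with clean calls earns the bonus.
assert abs(shaped_reward(True, 1.0, 2) - 1.1) < 1e-9
# Even a very wasteful success stays above every failure.
assert shaped_reward(True, 0.0, 1000) == 0.5
```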

How to run it in production

Cold-start with SFT on tool-call traces, then run agentic GRPO. Agentic RL is the hardest case for the rollout systems in async RL systems, and production settings force two deployment decisions:

Sandbox every tool that runs model output. Code execution needs an isolated sandbox (a container or microVM per rollout) for safety; running model-generated code on the host is a security hole (agent sandboxing and isolation). Tool execution is itself a scheduled workload alongside the GPUs.
Orchestrate tools inside the rollout engine. The generation server (vLLM/SGLang) pauses, hands off to a tool executor, and resumes; the executor is on the critical path of every trajectory.
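
As a sketch of the "container per rollout" decision, the helper below builds a locked-down `docker run` command: no network, capped memory, CPU and process count, and a read-only filesystem. The image name, mount layout, limits, and `main.py` entry point are assumptions for illustration. `run_in_container` needs a local Docker daemon and is shown only for shape; a production executor would also stop the container itself on timeout, since killing the CLI client does not guarantee that. A microVM gives stronger isolation than a shared-kernel container.

```python
import subprocess

def sandbox_argv(image, workdir, memory="512m", cpus="1", pids=128):
    """Build the argv for one isolated, resource-limited run of /work/main.py."""
    return [
        "docker", "run", "--rm",
        "--network", "none",              # no outbound network from model code
        "--memory", memory,               # hard memory cap
        "--cpus", cpus,                   # CPU quota
        "--pids-limit", str(pids),        # blunt fork bombs
        "--read-only",                    # immutable root filesystem
        "--volume", f"{workdir}:/work:ro",
        "--workdir", "/work",
        image, "python", "main.py",
    ]

def run_in_container(image, workdir, timeout_s=30):
    """Return (passed, output). A timeout counts as a failed run."""
    try:
        proc = subprocess.run(sandbox_argv(image, workdir),
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stdout

argv = sandbox_argv("python:3.12-slim", "/tmp/rollout-0001")
print(" ".join(argv))
```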

How to maintain it

Judge a run by trajectories, not a single reward number. On top of the usual RL health metrics (reward, entropy, KL), track the agentic ones: average turns per trajectory, tool-call validity rate, and tool-error rate (observability). Correlate training reward with a held-out eval (agent evaluation); training reward that keeps climbing while held-out quality stalls usually signals reward hacking, so re-audit the reward and the sandbox check (reward design). Watch the truncation rate too: if many trajectories hit max_turns, the cap is cutting off genuinely hard tasks and skewing the reward, so retune it (the loop-control block above shows how truncation is flagged).
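
The agentic health metrics named above can be computed per batch with something like the sketch below. The record fields (`turns`, `tool_calls`, `invalid_calls`, `tool_errors`, `truncated`) are a hypothetical logging schema, not one defined by any framework.

```python
def trajectory_metrics(trajectories):
    """Aggregate per-batch agentic metrics from a list of trajectory records."""
    n = len(trajectories)
    if n == 0:
        raise ValueError("empty batch")
    calls = sum(t["tool_calls"] for t in trajectories)
    invalid = sum(t["invalid_calls"] for t in trajectories)
    errors = sum(t["tool_errors"] for t in trajectories)
    nan = float("nan")                       # rates are undefined with no tool calls
    return {
        "avg_turns": sum(t["turns"] for t in trajectories) / n,
        "tool_call_validity_rate": (calls - invalid) / calls if calls else nan,
        "tool_error_rate": errors / calls if calls else nan,
        "truncation_rate": sum(bool(t["truncated"]) for t in trajectories) / n,
    }

batch = [
    {"turns": 3, "tool_calls": 2, "invalid_calls": 0, "tool_errors": 1, "truncated": False},
    {"turns": 8, "tool_calls": 8, "invalid_calls": 2, "tool_errors": 0, "truncated": True},
]
metrics = trajectory_metrics(batch)
assert metrics["avg_turns"] == 5.5
assert metrics["truncation_rate"] == 0.5
print(metrics)
```

Logging these alongside reward makes it easier to tell a genuine capability gain from, for example, a policy that has learned to stop calling tools.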

How to scale it

The three trajectory properties above (variable length, environment in the loop, multiple calls per turn) make scaling the rollout the dominant cost:

The straggler problem is worse. Reasoning trajectories already run long; add variable tool latencies (a slow search, a 30-second code run) and one trajectory in a batch can stall the whole synchronous step. Asynchronous, disaggregated rollout is close to mandatory at scale (async RL systems, disaggregated inference).
Long, masked sequences. Trajectories with many observation spans are long; sequence packing and the loss mask must be handled together so masked tokens do not waste compute or distort length normalisation.
Tool execution is a fleet, not a call. A sandbox per rollout at batch scale is its own capacity-planning problem; the tool executor and the GPUs contend for the schedule.
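
A back-of-envelope model of the straggler problem: in a synchronous step every rollout slot is held until the slowest trajectory in its batch finishes, so utilisation is total work divided by slot time reserved. The durations below are made-up numbers, and the model ignores continuous batching and partial-rollout tricks; it only shows why a single slow tool call is so expensive.

```python
def sync_utilisation(durations, batch_size):
    """Fraction of reserved rollout-slot time spent on useful work when each
    batch waits for its slowest trajectory (synchronous step)."""
    if not durations or batch_size < 1 or len(durations) % batch_size:
        raise ValueError("durations must be a non-empty multiple of batch_size")
    busy = reserved = 0.0
    for start in range(0, len(durations), batch_size):
        batch = durations[start:start + batch_size]
        busy += sum(batch)                    # time slots actually spent working
        reserved += max(batch) * batch_size   # every slot is held until the straggler ends
    return busy / reserved

# Seconds per trajectory; one rollout in the first batch waits on a 30 s code run.
durations = [2, 2, 2, 30, 2, 2, 2, 2]
print(sync_utilisation(durations, batch_size=4))  # 0.34375
```

With uniform durations the same function returns 1.0, which is roughly the gap an asynchronous rollout design tries to close.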

Failure modes

Forgetting to mask tool outputs: the loss trains on environment tokens; the policy degrades and learns to hallucinate tool results. Mask every observation span (validated above: forgetting the mask changes the loss).
Reward hacking via tools: the model games the success check (for example prints the expected answer without solving, or calls a tool that trivially satisfies the reward). Verify the reward adversarially (reward design).
Unsandboxed code execution: running model-generated code on the host is a security hole; always isolate (container/microVM) and time-limit it (agent sandboxing and isolation).
Straggler-bound rollouts: long, tool-laden trajectories idle the batch on a synchronous loop; move to async/disaggregated rollout (async RL systems).
Malformed tool calls: without an SFT cold-start the policy emits invalid JSON; the orchestrator cannot parse it and the trajectory wastes a turn. SFT the format first (SFT and LoRA).
Turn-limit truncation skews reward: trajectories cut off at the max-turn cap look like failures; tune the cap and account for truncation in the reward (validated above: the loop flags truncation so the scorer can discount it).
Degenerate groups: a group where every rollout passes or every rollout fails carries zero advantage and no learning signal; vary task difficulty so groups are informative (GRPO variants).
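
For the degenerate-group failure mode, a simple mitigation is to detect uninformative groups before the update and either skip or resample them, similar in spirit to the dynamic sampling some GRPO variants use. The helper name and tolerance below are illustrative assumptions.

```python
def split_groups(group_rewards, tol=1e-8):
    """Partition group indices into (informative, degenerate).
    A group is degenerate when all its rewards are equal, or it is empty."""
    informative, degenerate = [], []
    for i, rewards in enumerate(group_rewards):
        if len(rewards) and max(rewards) - min(rewards) > tol:
            informative.append(i)
        else:
            degenerate.append(i)
    return informative, degenerate

groups = [
    [1.0, 1.0, 1.0, 1.0],   # task too easy: every rollout passes
    [1.0, 0.0, 1.0, 0.0],   # informative: mixed outcomes
    [0.0, 0.0, 0.0, 0.0],   # task too hard: every rollout fails
]
keep, drop = split_groups(groups)
assert keep == [1] and drop == [0, 2]
print(f"informative fraction: {len(keep) / len(groups):.2f}")
```

Tracking the informative fraction over time also gives a cheap signal for when the task mix has become too easy or too hard for the current policy.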

References

Reinforcement Learning from Human Feedback (Nathan Lambert, Manning MEAP), ch. 13 Tool Use and Function Calling (tool-use is a trained skill; tool-call and output tokens masked from the loss): https://rlhfbook.com
ReAct: Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629
Toolformer: Language Models Can Teach Themselves to Use Tools: https://arxiv.org/abs/2302.04761
Gorilla: Large Language Model Connected with Massive APIs: https://arxiv.org/abs/2305.15334
DeepSeekMath (GRPO): https://arxiv.org/abs/2402.03300
verl (multi-turn / agentic RL): https://github.com/volcengine/verl

¹ Reinforcement Learning from Human Feedback (Manning MEAP), ch. 13: tool use = the model emits a structured request, an orchestrator executes it, and results are appended to the context; function calling constrains arguments to a JSON Schema; code execution is tool use where the tool is an interpreter; and during training the tool-call and output tokens are typically masked from the loss.
² Yao et al., ReAct interleaves reasoning traces and task actions in one model so the two reinforce each other. https://arxiv.org/abs/2210.03629
³ Schick et al., Toolformer, a model trained to decide which APIs to call, when, and how. https://arxiv.org/abs/2302.04761
⁴ Patil et al., Gorilla, an LLM trained to emit correct calls across 1,600+ APIs. https://arxiv.org/abs/2305.15334