Skip to content
Markdown

Tools and function calling

Scope: how an agent does anything beyond emitting text. The model never executes a tool itself; it emits a structured request, the harness executes it, and the result re-enters the context. This page covers the tool-calling mechanism, how to define and design tools, the Model Context Protocol for sharing them, and code execution as the general-purpose escape hatch. Tools are invoked from the agent loop, constrained by the harness and the policy gate, and sandboxed per isolation.

Code and schemas here are reference templates; pin versions and validate before relying on them.

flowchart LR
  DEF["1. Tool definitions (JSON Schema)"] --> SEND["2. Send task + definitions to model"]
  SEND --> CALL["3. Model emits tool name + arguments"]
  CALL --> EXEC["4. Harness executes the function"]
  EXEC --> BACK["5. Result returns as a tool message"]
  BACK -->|"model continues"| SEND

Core knowledge

The mechanism

Tool calling is a five-step exchange: declare the tool definitions (a name, a description, and a JSON-Schema for the arguments), supply the matching functions, send the task and the definitions to the model, execute the call the model requests, and feed the result back as a tool-role message keyed to the call id. The model acts as a mediator between the request and the tools; it chooses and parameterises calls but the host system runs them.1 Function calling is the disciplined form of this, where arguments must satisfy a declared schema so calls parse and validate reliably.

Defining tools without hand-writing schemas

A schema can be generated from a typed function signature by inspecting its parameters and docstring, so the definition stays in sync with the code. A small wrapper (a @tool decorator or a function-tool adapter) bundles the callable with its generated definition and an async execute method, and the same wrapper can expose local functions and remote tools through one interface.2 Tools fall into three rough kinds: information-augmenting (search, retrieval), action-executing (write a file, send a request), and domain-specialised.

Designing tools the model can use

Tool quality drives agent reliability more than prompt wording. The rules that matter:2

  • Write the definition for a competent stranger: if an intern could use the tool from the schema alone, so can the model.
  • Use enums instead of several conflicting booleans, so invalid combinations cannot be expressed.
  • Do not make the model resupply an identifier the code already holds; pass it in the harness, not through the model.
  • Keep the tool set small (roughly twenty or fewer general tools). Every extra tool is another decision the model can get wrong, which is the same minimal-tool-set finding from harness architecture.

Tool use is also a trained skill, not only a prompting trick: models have been taught to call APIs (Toolformer) and to emit correct calls across thousands of APIs (Gorilla), and tool behaviour is refined further with agentic RL.34

Sharing tools: the Model Context Protocol

The Model Context Protocol (MCP) standardises how tools are exposed to agents through a host, client, and server split with stdio, HTTP, or WebSocket transports. A server can wrap a function into an MCP tool with a single decorator, and a client can load remote MCP tools and present them to the agent identically to local ones, so the agent does not care whether a tool runs in-process or across a network.25 MCP is how a tool written once becomes available to many agents.

Code execution as the general tool

A fixed tool catalogue cannot cover everything. Giving the agent a code interpreter (the CodeAct pattern) turns an open-ended action into a single tool: the model writes code, the harness runs it in a sandbox, and the output returns as an observation. Code execution also lets the model escape its probabilistic nature for exact computation. It must run isolated and time-limited, never on the host.2 Related is the Agent-Skills idea of progressive disclosure, where capabilities are revealed to the model only as needed rather than all at once, keeping the tool surface small.

Don't-miss checklist

  • Generate schemas from typed signatures so definitions never drift from the code.
  • Write definitions a stranger could use; prefer enums; never make the model resupply known identifiers.
  • Keep the tool set small and general; justify every addition.
  • Expose shared tools over MCP so they are reusable and transport-agnostic.
  • Treat code execution as a powerful tool that must be sandboxed and time-limited.
  • Validate tool arguments and return structured errors the loop can act on.

Failure modes

  • Vague definitions. The model misuses or avoids the tool; accuracy drops with high token use.
  • Schema drift. A hand-written schema diverges from the function; calls fail to parse or validate.
  • Tool sprawl. Too many narrow tools; the model picks wrong or stalls.
  • Identifier laundering. The model is asked to supply an id the code has, inviting hallucinated values.
  • Unsandboxed code execution. Model-written code runs on the host; one bad call compromises the machine.
  • Opaque failures. A tool throws and returns nothing structured; the model cannot recover.

Open questions & validation

  • Tool-call validity rate is the metric to watch; measure it per tool on real traffic (observability).
  • MCP server trust is a supply-chain question; treat third-party tool servers as untrusted input.
  • How many tools a given model handles before selection degrades is model-specific; test it.

References

  • Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools: https://arxiv.org/abs/2302.04761
  • Patil et al., Gorilla: Large Language Model Connected with Massive APIs: https://arxiv.org/abs/2305.15334
  • Model Context Protocol: https://modelcontextprotocol.io/
  • Song & Hur, Build an AI Agent (From Scratch), Manning Publications (MEAP), 2026.
  • OWASP Top 10 for LLM Applications (tool and plugin risks): https://genai.owasp.org/llm-top-10/

Related: The agent loop · Harness architecture · Sandboxing & isolation · Policy, guardrails & governance · Agentic & tool-use RL · Agentic systems


  1. Song & Hur (Ch 3): the model emits a structured tool name and arguments and acts as a mediator; the host executes the call and returns the result as a tool-role message keyed by call id. Five steps: definitions, functions, send, execute, reflect back. 

  2. Song & Hur, Build an AI Agent (From Scratch), Manning MEAP, 2026. Schemas generated from typed signatures, tool-design rules (intern test, enums, no identifier laundering, under ~20 tools), MCP host/client/server with FastMCP and load-mcp-tools, and CodeAct code execution in a sandbox with Agent-Skills progressive disclosure. 

  3. Schick et al., Toolformer: a model trained to decide which APIs to call, when, and how. 

  4. Patil et al., Gorilla: a model trained to emit correct calls across more than 1,600 APIs. 

  5. Model Context Protocol: an open standard (introduced by Anthropic) for exposing tools and context to agents over a host/client/server architecture.