Markdown

GenAI observability with OpenTelemetry¶

Scope: the OpenTelemetry semantic conventions for generative AI, the gen_ai.* namespace of span attributes, metrics, and events that standardizes what an LLM call, tool invocation, or agent run emits. This page covers the span taxonomy (inference, embeddings, retrieval, memory, tool, and agent spans), attribute requirement levels and the opt-in content-capture model, the metric instruments and their buckets, and the operational policy questions (cost accounting, PII, cardinality) that follow. Why trajectory capture matters at all is agent observability; cluster-level telemetry is observability and monitoring and the Prometheus/DCGM stack. This page is the convention layer in between: the names your instrumentation should emit.

The GenAI conventions carry Development status (not yet stable) and moved from the main semantic-conventions repository into a dedicated semantic-conventions-genai repository; attribute names below were verified against that repo as of 2026-07 and can still change. Instrumentation and dashboard snippets are reference templates from the OpenTelemetry blog walkthrough, unexecuted. The Python data-model example is executed and asserted.

flowchart LR
  subgraph APP["LLM app / agent harness"]
    AG["invoke_agent span"] --> CH["chat spans (one per LLM call)"]
    AG --> ET["execute_tool spans"]
  end
  CH --> SIG["gen_ai.* attributes, events,<br/>duration + token metrics"]
  ET --> SIG
  SIG --> OTLP["OTLP export (4317/4318)"]
  OTLP --> COL["Collector / backend<br/>(any OTLP-compatible)"]
  COL --> UI["Trace viewer + GenAI visualizer<br/>(chat-style rendering of captured content)"]

What it is¶

The OpenTelemetry Semantic Conventions for Generative AI define how GenAI operations are recorded so that any instrumented app, assistant, or framework produces telemetry any OTLP backend can interpret: the model being called, input and output token counts, finish reasons, and, when explicitly opted in, the full content of prompts, completions, tool calls, and tool results.¹ The conventions live in a dedicated repository (open-telemetry/semantic-conventions-genai) after moving out of the main semantic-conventions tree, and carry Development status: already in use, still evolving, shaped by the GenAI Semantic Conventions and Instrumentation SIG.²

The span taxonomy covers the whole agent loop. Inference spans represent one client call to a model, named {gen_ai.operation.name} {gen_ai.request.model} (for example chat gpt-4o), with well-known operation names chat, generate_content, text_completion, and embeddings. Agent and framework spans add create_agent, invoke_agent, invoke_workflow, plan, and execute_tool spans, so a whole agent turn is one tree: a top-level invoke_agent span with child chat spans for each LLM call and execute_tool spans for each tool invocation.¹ Retrieval and memory spans cover the RAG side. A span covers the logical operation including automatic retries, and carries CLIENT kind for out-of-process calls.

Real coding assistants already emit this telemetry: the blog walkthrough uses VS Code Copilot (traces, metrics, and events per agent interaction, enabled via github.copilot.chat.otel.enabled), and notes OpenAI Codex exports structured log events and OTel metrics while Claude Code exports metrics and log events with trace support in beta.¹

Why use it¶

Latency decomposition instead of guessing. An agent that takes 45 seconds to answer a simple question is the post's opening problem: without spans you cannot tell the model, a slow tool call, and a retry loop apart. The span tree answers it structurally.¹
One schema across providers and services. gen_ai.provider.name and gen_ai.request.model are required or conditionally required on every inference span, so a fleet mixing OpenAI, Anthropic, and self-hosted vLLM lands in one queryable namespace instead of per-vendor logging.
Token accounting is built in. gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on spans, plus the gen_ai.client.token.usage histogram split by gen_ai.token.type, give per-request cost estimation, token-hungry-prompt detection, and usage monitoring across models and agents.¹ The spec adds the operationally important detail that when a system reports both used and billable tokens, instrumentation MUST report billable ones.³
Privacy is the default, content is the option. By default no prompt content or tool arguments are captured, only metadata (model names, token counts, durations); enabling content capture populates gen_ai.system_instructions, gen_ai.input.messages, and gen_ai.output.messages with the full conversation, tool schemas, arguments, and results.¹
Cache and reasoning visibility. The conventions carry gen_ai.usage.cache_read.input_tokens and gen_ai.usage.cache_creation.input_tokens (provider-managed prompt caches) and gen_ai.usage.reasoning.output_tokens (extended thinking), exactly the quantities that dominate modern serving bills.

When to use it (and when not)¶

Use it for any LLM-powered service or agent heading to production: the conventions are the interoperable way to answer "which model, how many tokens, how long, which tools" across services and backends.
Use it to monitor assistants you did not write; Copilot, Codex, and Claude Code emit OTel today, so a platform team can watch fleet-wide coding-agent usage with a collector and no code changes.¹
Use content capture selectively. Full prompts and tool arguments are invaluable for debugging and evaluation but are sensitive data with real storage weight; the default-off design exists for a reason.
Do not treat it as evaluation. Telemetry records what happened, not whether it was good; quality gating is agent evaluation (the conventions do define a gen_ai.evaluation.result event for carrying evaluation scores alongside traces).
Do not hard-code against today's names without pinning. Development status means renames happen; instrument through libraries that track the conventions and record the schema version.

Architecture¶

Requirement levels do the design work. On an inference span, only gen_ai.operation.name and gen_ai.provider.name are Required. Conditionally required attributes appear when applicable: gen_ai.request.model if available, error.type on failure, gen_ai.conversation.id when a session exists, gen_ai.request.stream and gen_ai.output.type when relevant. The Recommended tier carries the operational payload: request parameters (gen_ai.request.temperature, gen_ai.request.max_tokens), response identity (gen_ai.response.model, gen_ai.response.id, gen_ai.response.finish_reasons), usage counts, gen_ai.response.time_to_first_chunk for streaming, and server.address. Content attributes (gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions, gen_ai.tool.definitions) sit in the Opt-In tier, structured as role-and-parts message JSON rather than raw provider payloads.²

Metrics are histograms with prescribed buckets. gen_ai.client.operation.duration (unit s, required) uses explicit boundaries from 0.01 to 81.92 seconds; gen_ai.client.token.usage (recommended) uses power-of-four-style boundaries from 1 to 67,108,864 tokens with gen_ai.token.type distinguishing input from output. Streaming adds gen_ai.client.operation.time_to_first_chunk and time_per_output_chunk; the server side defines gen_ai.server.request.duration, time_per_output_token, and time_to_first_token; workflow, agent, and tool durations (gen_ai.workflow.duration, gen_ai.invoke_agent.duration, gen_ai.execute_tool.duration) cover the orchestration layers.³

Events carry the verbose record. gen_ai.client.inference.operation.details is the event-shaped alternative for full call details, and gen_ai.evaluation.result attaches evaluation name, score value, label, and explanation to the trace. For content too large for attributes, the spans document describes uploading content to external storage and referencing it, and a dedicated section covers streaming-chunk capture.²

How to use it¶

The blog's walkthrough shape (reference templates, unexecuted; May 2026 versions, verify current settings): enable emission in the tool, point it at an OTLP endpoint, view in any OTLP backend. For VS Code Copilot the settings are github.copilot.chat.otel.enabled: true, github.copilot.chat.otel.captureContent for opt-in content, and github.copilot.chat.otel.otlpEndpoint (default http://localhost:4318).¹ A local viewer that needs no cloud account:

# Reference template (blog walkthrough): Aspire Dashboard as a local OTLP viewer.
docker run --rm -p 18888:18888 -p 4317:18889 -p 4318:18890 -d --name aspire-dashboard \
  -e ASPIRE_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS=true \
  mcr.microsoft.com/dotnet/aspire-dashboard:latest
# telemetry to http://localhost:4318, UI at http://localhost:18888

Ask the assistant a question, open Traces, and the invoke_agent tree appears with per-call chat spans (model, token counts, finish reasons) and execute_tool spans. On the Metrics page, gen_ai.client.operation.duration filtered by gen_ai.request.model compares model latencies, and gen_ai.client.token.usage filtered by gen_ai.token.type separates input from output consumption.¹

How to develop with it¶

The data model is worth internalizing before wiring an SDK, and it is checkable without one. The executed example below models spans with the convention's real attribute names, enforces requirement levels and types, and rolls token usage and cost up an agent trace. Note the aggregation policy: the semconv also recommends aggregate usage tokens on invoke_agent spans, so a rollup must sum usage from one level only; this example sums inference spans and therefore requires agent spans to omit usage:

# genai_semconv_model.py - validated: the GenAI semconv data model in miniature.
# Typed spans carry the convention's real attribute names; a validator enforces
# requirement levels and types; a trace aggregator rolls usage and cost up an
# agent trace without double counting. Models the conventions (Development
# status, verify names on upgrade); it does not run an OpenTelemetry SDK.
from __future__ import annotations

from typing import Any

REQUIRED: dict[str, type] = {
    "gen_ai.operation.name": str,     # Required per the inference span table
    "gen_ai.provider.name": str,      # Required per the inference span table
}
TYPED: dict[str, type] = {
    "gen_ai.request.model": str,
    "gen_ai.response.model": str,
    "gen_ai.usage.input_tokens": int,
    "gen_ai.usage.output_tokens": int,
    "gen_ai.response.finish_reasons": list,
}
INFERENCE_OPS = {"chat", "generate_content", "text_completion", "embeddings"}


def validate_span(attrs: dict[str, Any]) -> None:
    """Reject spans that violate requirement levels, types, or count sanity."""
    for key, typ in REQUIRED.items():
        assert key in attrs, f"missing required attribute: {key}"
    for key, typ in {**REQUIRED, **TYPED}.items():
        if key in attrs:
            assert isinstance(attrs[key], typ), f"wrong type for {key}"
    for key in ("gen_ai.usage.input_tokens", "gen_ai.usage.output_tokens"):
        if key in attrs:
            assert attrs[key] >= 0, f"negative token usage: {key}"


def aggregate_trace(spans: list[dict[str, Any]]) -> dict[str, int]:
    """Roll up usage across a trace, summing ONE level only (inference spans).
    Semconv also recommends aggregate usage on invoke_agent spans; this
    aggregator requires agent spans to omit it so nesting cannot double count."""
    total = {"llm_calls": 0, "tool_calls": 0, "input_tokens": 0, "output_tokens": 0}
    for span in spans:
        validate_span(span)
        op = span["gen_ai.operation.name"]
        if op in INFERENCE_OPS:
            total["llm_calls"] += 1
            total["input_tokens"] += span.get("gen_ai.usage.input_tokens", 0)
            total["output_tokens"] += span.get("gen_ai.usage.output_tokens", 0)
        elif op == "execute_tool":
            total["tool_calls"] += 1
        else:
            assert op in {"invoke_agent", "create_agent"}, f"unknown op {op}"
            assert "gen_ai.usage.input_tokens" not in span, \
                "this aggregator sums inference spans only: agent-level usage would double count"
    return total


def cost_usd(totals: dict[str, int], in_per_m: float, out_per_m: float) -> float:
    return (totals["input_tokens"] * in_per_m
            + totals["output_tokens"] * out_per_m) / 1e6


def span(op: str, **extra: Any) -> dict[str, Any]:
    base: dict[str, Any] = {"gen_ai.operation.name": op, "gen_ai.provider.name": "openai"}
    return {**base, **extra}


trace = [
    span("invoke_agent"),
    span("chat", **{"gen_ai.request.model": "gpt-4o",
                    "gen_ai.usage.input_tokens": 1200,
                    "gen_ai.usage.output_tokens": 300,
                    "gen_ai.response.finish_reasons": ["tool_calls"]}),
    span("execute_tool"),
    span("chat", **{"gen_ai.request.model": "gpt-4o",
                    "gen_ai.usage.input_tokens": 1800,
                    "gen_ai.usage.output_tokens": 450,
                    "gen_ai.response.finish_reasons": ["stop"]}),
]

totals = aggregate_trace(trace)
assert totals == {"llm_calls": 2, "tool_calls": 1,
                  "input_tokens": 3000, "output_tokens": 750}, totals
# Cost at $2.50/M input and $10/M output: 3000*2.5/1e6 + 750*10/1e6 = 0.015.
assert abs(cost_usd(totals, 2.50, 10.00) - 0.015) < 1e-12

# Adversarial 1: missing required attribute is rejected.
try:
    validate_span({"gen_ai.request.model": "gpt-4o"})
    raise SystemExit("span without required attributes must be rejected")
except AssertionError as err:
    assert "missing required" in str(err)

# Adversarial 2: wrong type (string token count) is rejected.
try:
    validate_span(span("chat", **{"gen_ai.usage.input_tokens": "1200"}))
    raise SystemExit("string token count must be rejected")
except AssertionError as err:
    assert "wrong type" in str(err)

# Adversarial 3: negative usage is rejected.
try:
    validate_span(span("chat", **{"gen_ai.usage.input_tokens": -5}))
    raise SystemExit("negative token usage must be rejected")
except AssertionError as err:
    assert "negative" in str(err)

# Adversarial 4: agent-level usage alongside leaf usage would double count
# under this one-level aggregation policy; rejected.
try:
    aggregate_trace([span("invoke_agent", **{"gen_ai.usage.input_tokens": 3000})])
    raise SystemExit("usage on an agent span must be rejected")
except AssertionError as err:
    assert "double count" in str(err)

print("trace totals:", totals)
print(f"trace cost: ${cost_usd(totals, 2.50, 10.00):.3f}")
print("all GenAI semconv model assertions passed")

Output: trace totals: {'llm_calls': 2, 'tool_calls': 1, 'input_tokens': 3000, 'output_tokens': 750}, trace cost: $0.015, all GenAI semconv model assertions passed. When instrumenting a real app, prefer maintained instrumentation (the opentelemetry-python-contrib tree carries GenAI instrumentation packages; see References) over hand-rolled spans, and render captured content with a GenAI-aware visualizer, since the raw message attributes are structured JSON that generic trace UIs display poorly.¹

How to maintain it¶

Pin the convention version. Development status means attribute names can move between releases; record the semconv version your instrumentation targets and treat upgrades like schema migrations, re-validating dashboards and alerts that query gen_ai.* names.
Track the repo move. The conventions relocated from open-telemetry/semantic-conventions (whose gen-ai pages now redirect) to open-telemetry/semantic-conventions-genai; links, code generators, and vendored schema files need the new source.²
Keep instrumentation and SIG feedback flowing. The conventions are explicitly shaped by real-world usage reports; when a provider returns usage your instrumentation cannot map, that is an upstream issue, not a local hack.
Re-check assistant emitters after upgrades. Copilot, Codex, and Claude Code emit different subsets (Claude Code traces were beta at the post's writing); an assistant upgrade can add or change emitted signals.¹

Running it in production¶

Cost accounting. Derive spend from gen_ai.client.token.usage split by gen_ai.token.type, gen_ai.request.model, and service; include cache-read tokens (cheaper) separately from ordinary input tokens, and remember the billable-tokens rule when reconciling with invoices.³ Per-agent budgets from loop economics hang off exactly these series.
Content-capture policy. Treat content capture as a data-classification decision, not a debug flag: default off in production, scoped on for specific services or sampled traces during incident investigation, with retention and access controls matching the sensitivity of prompts and tool arguments.
Sampling. Metadata-only spans are cheap; content events are not. Head-sample traces but keep metrics unsampled, and consider tail-sampling for error traces where captured content pays for itself.
Cardinality. gen_ai.request.model and gen_ai.response.model are bounded sets; free-form attributes (conversation ids, prompt names) belong on spans, not as metric attributes, or the metrics backend pays the cardinality bill.

Failure modes¶

Content capture leaking PII. Turning captureContent on globally ships user prompts, tool schemas, and tool results to the telemetry backend; the conventions default this off for exactly that reason. Scope, sample, and classify before enabling.¹
Token counts that do not match the bill. Counts come from provider responses or offline tokenizers; mixed sources disagree, and the spec's billable-tokens rule exists because used and billable can differ (caching, batching discounts). Reconcile one source of truth.³
High-cardinality metric attributes. Putting response ids or per-conversation ids on the token-usage metric explodes series counts; keep identity on spans and aggregates on metrics.
Semconv version skew across services. One service emitting an old attribute name splits dashboards silently; version pinning and a collector-level rename processor during migration are the fixes.
Streaming blind spots. A streamed response without gen_ai.request.stream and time_to_first_chunk metrics reads as one long opaque call; TTFT regressions hide inside it. Instrument chunk timing for anything user-facing.
Double counting in rollups. The semconv recommends usage both on inference spans and, as an aggregate, on invoke_agent spans; summing across every span counts the same tokens twice. Pick one level (this page's example uses inference spans) and enforce it, as validated above.

References¶

Newton-King, Inside the LLM Call: GenAI Observability with OpenTelemetry (OpenTelemetry blog, May 2026): https://opentelemetry.io/blog/2026/genai-observability/
OpenTelemetry GenAI semantic conventions repository: https://github.com/open-telemetry/semantic-conventions-genai
GenAI span conventions (inference, embeddings, retrieval, memory, tool): https://github.com/open-telemetry/semantic-conventions-genai/blob/main/docs/gen-ai/gen-ai-spans.md
GenAI metrics conventions: https://github.com/open-telemetry/semantic-conventions-genai/blob/main/docs/gen-ai/gen-ai-metrics.md
GenAI event conventions: https://github.com/open-telemetry/semantic-conventions-genai/blob/main/docs/gen-ai/gen-ai-events.md
GenAI agent and framework span conventions: https://github.com/open-telemetry/semantic-conventions-genai/blob/main/docs/gen-ai/gen-ai-agent-spans.md
OpenTelemetry Python contrib (GenAI instrumentation packages): https://github.com/open-telemetry/opentelemetry-python-contrib

OpenTelemetry blog (Newton-King, 2026-05-14): GenAI semconv standardizes model, token counts, and opt-in content of prompts, completions, tool calls, and results; VS Code Copilot emits traces/metrics/events (settings github.copilot.chat.otel.enabled, .captureContent, .otlpEndpoint, default http://localhost:4318); Codex exports structured log events and OTel metrics, Claude Code metrics and log events with traces in beta; default capture is metadata-only; span tree invoke_agent with child chat and execute_tool spans; attributes shown include gen_ai.request.model, gen_ai.usage.input_tokens/output_tokens, gen_ai.response.finish_reasons, and content attributes gen_ai.system_instructions, gen_ai.input.messages, gen_ai.output.messages; metrics gen_ai.client.operation.duration and gen_ai.client.token.usage filtered by gen_ai.token.type; Aspire Dashboard walkthrough; conventions in use and under active development via the GenAI SemConv and Instrumentation SIG. ↩↩↩↩↩↩↩↩↩↩↩↩
semantic-conventions-genai repo (Development status throughout, verified 2026-07): inference span named {gen_ai.operation.name} {gen_ai.request.model}; Required gen_ai.operation.name (well-known values chat, generate_content, text_completion, embeddings, create_agent, invoke_agent, execute_tool) and gen_ai.provider.name; conditionally required gen_ai.request.model, error.type, gen_ai.conversation.id, gen_ai.request.stream, gen_ai.output.type; recommended usage attributes including gen_ai.usage.reasoning.output_tokens and cache read/creation input tokens; Opt-In gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions, gen_ai.tool.definitions in role/parts JSON; external-storage upload and streaming-chunk capture sections; events gen_ai.client.inference.operation.details and gen_ai.evaluation.result; the old main-repo gen-ai docs are marked moved. ↩↩↩↩
GenAI metrics conventions: gen_ai.client.operation.duration (required, histogram, seconds, boundaries 0.01 to 81.92), gen_ai.client.token.usage (recommended, histogram, boundaries 1 to 67,108,864, required attribute gen_ai.token.type in {input, output}; report billable tokens when both used and billable are known; do not report usage that cannot be measured efficiently), streaming client metrics (time_to_first_chunk, time_per_output_chunk), server metrics (gen_ai.server.request.duration, time_per_output_token, time_to_first_token), and workflow/agent/tool durations. ↩↩↩↩