OpenHands Agent SDK: building production software agents
Scope: the OpenHands Software Agent SDK (arXiv 2511.03690, MLSys 2026), the complete architectural redesign that turned the OpenHands application into a four-package developer toolkit for production software agents. This page covers the V0-to-V1 design principles, the event-sourced state model and its measured overhead, the typed tool and LLM abstractions, the security and secrets layers, local-to-remote deployment, and the reliability and benchmark evidence. The original platform paper (event stream, CodeActAgent, the 15-benchmark harness) is covered in OpenHands agent platform; this page is about the SDK that replaced that generation's internals. The generic anatomy of the layer it implements is agent harness architecture.
Numbers are from the paper (v2, April 2026: production rollout, overhead measurements, benchmark tables) and were not reproduced here; the project moves fast, so verify current APIs on the repo (as of 2026-07). SDK snippets are reference templates quoted from the paper, unexecuted. The Python persistence example is executed and asserted (stdlib only).
flowchart TB
subgraph PKGS["Four decoupled packages"]
SDK["openhands.sdk<br/>Agent, Conversation, LLM, Tool, MCP, events"]
TOOLS["openhands.tools<br/>concrete tool implementations"]
WS["openhands.workspace<br/>Docker / hosted execution environments"]
SRV["openhands.agent_server<br/>REST + WebSocket server"]
end
APP["Application (CLI, GUI, GitHub app, custom)"] --> SDK
TOOLS --> SDK
WS --> SDK
WS -->|"spawns + connects"| SRV
CONV{"Conversation(agent, workspace)"} -->|"LocalWorkspace: in-process"| LOCAL["LocalConversation"]
CONV -->|"RemoteWorkspace: HTTP/WS"| REMOTE["RemoteConversation to agent server in container<br/>(VS Code Web, VNC, Chromium bundled)"]
SDK --> CONV
What it is
The OpenHands Software Agent SDK is the foundation of OpenHands V1: a rebuild of the agent core of the OpenHands framework (64k+ GitHub stars over 18 months) as a standalone, MIT-licensed toolkit rather than a monolithic application.1 The paper frames this as a complete architectural redesign, motivated by V0's structure: V0 combined agent logic, evaluation, and applications in one codebase, which produced rigid sandboxing assumptions, configuration sprawl (140+ fields, 15 classes, 2.8K lines of configuration code), and tight research-production coupling.
V1 distills the lessons into four principles: optional isolation (the agent runs in-process by default and containerizes transparently when needed, aligning with MCP's local-first assumptions); stateless by default, one source of truth (agents, tools, and LLMs are immutable, serializable Pydantic models validated at construction; all mutation lives in one ConversationState); strict separation of concerns (applications consume the SDK as a library); and two-layer composability (four deployment packages compose, and typed components extend the SDK without touching its core).1 The default-case API is a few lines (reference template):
# Reference template (paper Fig. 2; pin the release, as of 2026-07).
from openhands.sdk import LLM, Conversation
from openhands.tools.preset.default import get_default_agent
llm = LLM(model="openhands/claude-sonnet-4-5-20250929", api_key="...")
agent = get_default_agent(llm=llm)
conversation = Conversation(agent=agent, workspace="/path/to/project")
conversation.send_message("Write 3 facts about this project into FACTS.txt.")
conversation.run()
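The "immutable, serializable, validated at construction" principle can be illustrated without the SDK. The sketch below is a stdlib stand-in using frozen dataclasses in place of Pydantic models; `LLMConfig`, its fields, and the `example/model` identifier are hypothetical names chosen for illustration, not SDK classes.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass, replace


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical stand-in for an immutable, serializable component spec."""
    model: str
    temperature: float = 0.0

    def __post_init__(self) -> None:
        # Validation happens once, at construction; invalid specs never exist.
        if not self.model:
            raise ValueError("model must be non-empty")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature out of range")


cfg = LLMConfig(model="example/model")
wire = json.dumps(asdict(cfg))               # the spec is plain data
restored = LLMConfig(**json.loads(wire))     # and round-trips exactly
hotter = replace(cfg, temperature=0.7)       # "changes" produce a new object

try:
    cfg.temperature = 1.0                    # in-place mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

print(restored == cfg, hotter.temperature, mutated)
```

Because every variation is a new object, two runs built from the same serialized spec start from the same configuration, which is the property the single mutable `ConversationState` relies on.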
Why use it
- Measured reliability, not claimed. In a 15-day parallel production rollout, V1 cut system-attributable failures from 78.0 to 30.0 per 1k conversations (61%). The eliminated class was V0's inter-pod plumbing: HTTP 401 failures between pods (43.0/1k), runtime-readiness races (18.8/1k), and connection timeouts (3.1/1k) all go to zero under V1's co-located execution.2
- Event sourcing is effectively free. Replaying 39,870 events from 433 SWE-Bench Verified conversations through the production store: median per-event persist 0.20 ms, full state replay 4.1 ms median, crash recovery under 20 ms typical (32.1 ms at the longest 358-event conversation), 380 KB median storage per conversation, all negligible against 1-30 s LLM round trips.3
- Capability preserved, then improved. On SWE-Bench Verified with matched models, V0 and V1 tie at 68.0% (Claude Sonnet 4); with Sonnet 4.5, V1 gains +8.2 points (64.6% to 72.8%), attributed to extended-thinking support that the event-sourced design integrates naturally.4
- A feature set the provider SDKs do not combine. Against OpenAI Agents SDK, Claude Agent SDK, Google ADK, and LangChain/LangGraph (versions as of October 2025), the paper's comparison finds the OpenHands SDK unique in combining native sandboxed remote execution, an LLM-powered per-action security analyzer, model-agnostic multi-LLM routing with support for non-function-calling models, secrets auto-masking, stuck detection, and built-in academic benchmark evaluation.5
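The headline figures in the list above can be re-derived from the values quoted on this page. The sketch below is plain arithmetic over those quoted numbers; the variable names are this page's, and the per-conversation persist estimate is a rough illustration that mixes a mean event count with a median latency.

```python
# Re-derive the headline figures from the values quoted in the bullets above.
v0_failures, v1_failures = 78.0, 30.0     # system-attributable, per 1k conversations
reduction = (v0_failures - v1_failures) / v0_failures

# V0 inter-pod failure classes that go to zero under V1 (per 1k conversations).
inter_pod = {"http_401": 43.0, "runtime_not_ready": 18.8, "connection_timeout": 3.1}
inter_pod_total = sum(inter_pod.values())

sonnet45_gain = 72.8 - 64.6               # SWE-Bench Verified, percentage points

events, conversations = 39_870, 433
mean_events = events / conversations
# Rough estimate only: mean events per conversation x median per-event persist.
persist_ms = mean_events * 0.20

print(f"failure reduction: {reduction:.1%}")
print(f"itemized inter-pod failures: {inter_pod_total:.1f} per 1k")
print(f"Sonnet 4.5 gain: {sonnet45_gain:+.1f} points")
print(f"mean events per conversation: {mean_events:.1f}")
print(f"approx. persist time per conversation: {persist_ms:.1f} ms")
```

The three itemized inter-pod classes account for 64.9 of the 78.0 V0 failures per 1k; this page does not itemize the remainder.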
When to use it (and when not)
- Use it to embed a software agent in a product: the same code runs in-process for prototyping and against a containerized agent server in production, switched by the workspace argument.
- Use it as a research harness when experiments need deterministic replay and resumable long runs; the event log plus immutable configuration is exactly the reproducibility substrate evaluation needs (evaluating agents).
- Use it when the model mix is heterogeneous: LiteLLM-backed support for 100+ providers, a Responses-API path for reasoning models, text-prompt tool calling for models without native function calling, and RouterLLM for per-request model selection.
- Do not expect multi-agent orchestration from the core. The current implementation focuses on single-agent conversations; delegation exists as a blocking parallel tool, and richer coordination is explicitly future design.1
- Do not treat the security analyzer as a guarantee. The paper's own limitations flag LLM-based risk classification as subject to adversarial prompts and inconsistency; it is a control layer, not a sandbox substitute (sandboxing and isolation).
Architecture
Event-sourced state. Every interaction is an immutable event in a typed hierarchy: LLMConvertibleEvent subclasses (MessageEvent, ActionEvent, SystemPromptEvent, CondensationSummaryEvent, observation events including UserRejectObservation and AgentErrorEvent) carry to_llm_message() and form the model-visible history, while internal events (state updates, condensation requests and results, PauseEvent) stay out of the LLM's view. ConversationState is the single mutable object: metadata fields (agent status, stats, confirmation policy) plus an append-only EventLog, guarded by a FIFO lock with two update paths (state-only versus event-append).1
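The split between model-visible and internal events can be sketched in a few lines. This is a stdlib model of the idea, not the SDK's classes: the names `LLMConvertibleEvent`, `MessageEvent`, `PauseEvent`, and `to_llm_message()` come from the text above, while the fields (`seq`, `role`, `text`) and the `llm_view` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Base class: every event is immutable once created."""
    seq: int


@dataclass(frozen=True)
class LLMConvertibleEvent(Event):
    """Events that belong to the model-visible history."""
    role: str
    text: str

    def to_llm_message(self) -> dict[str, str]:
        return {"role": self.role, "content": self.text}


@dataclass(frozen=True)
class MessageEvent(LLMConvertibleEvent):
    pass


@dataclass(frozen=True)
class PauseEvent(Event):
    """Internal event: recorded in the log, never shown to the model."""


def llm_view(log: list[Event]) -> list[dict[str, str]]:
    """Project the full log onto the history the LLM is allowed to see."""
    return [e.to_llm_message() for e in log if isinstance(e, LLMConvertibleEvent)]


log = [
    MessageEvent(0, "user", "Create hello.py"),
    PauseEvent(1),
    MessageEvent(2, "assistant", "Done."),
]
view = llm_view(log)
print(len(log), "events in the log,", len(view), "visible to the model")
```

The log keeps all three events for audit and replay; only the two messages reach the prompt.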
Persistence is dual-path. Metadata serializes to one base_state.json on each change; events persist as individual JSON files, so incremental progress never rewrites history. Resume loads the base state, replays events, detects an incomplete conversation (an action without its observation), and continues from the last processed event. The executed example below validates exactly this contract.
Tools are typed contracts. The Action-Execution-Observation pattern validates LLM-proposed JSON against a Pydantic Action schema before execution and converts results through a structured Observation; MCP tools pass through the same abstraction (MCPToolDefinition over FastMCP), making external tools indistinguishable from native ones. Because executors are non-serializable, a registry maps tool names to resolvers, letting tool specifications cross process boundaries as pure JSON and re-instantiate lazily with environment state; this is what makes the local and remote paths uniform.1
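A minimal model of the Action-Execution-Observation contract and the registry idea, assuming plain dataclasses in place of Pydantic schemas. `EchoAction`, `EchoObservation`, `make_echo_executor`, `REGISTRY`, and `call_tool` are hypothetical names for this sketch and are not part of the SDK.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EchoAction:
    """Hypothetical Action schema: the arguments the model may propose."""
    text: str
    repeat: int = 1

    def __post_init__(self) -> None:
        if not isinstance(self.text, str) or not isinstance(self.repeat, int):
            raise TypeError("EchoAction: wrong argument types")
        if not 1 <= self.repeat <= 5:
            raise ValueError("EchoAction.repeat must be between 1 and 5")


@dataclass(frozen=True)
class EchoObservation:
    output: str
    is_error: bool = False


# The tool spec is pure data, so it can cross a process boundary as JSON.
TOOL_SPEC = json.dumps({"name": "echo", "params": {"prefix": "> "}})


def make_echo_executor(params: dict) -> Callable[[EchoAction], EchoObservation]:
    """Resolver: builds the non-serializable executor from serializable params."""
    prefix = params["prefix"]

    def execute(action: EchoAction) -> EchoObservation:
        return EchoObservation(prefix + " ".join([action.text] * action.repeat))

    return execute


# Registry: tool name -> resolver, looked up on whichever side runs the tool.
REGISTRY: dict[str, Callable[[dict], Callable]] = {"echo": make_echo_executor}


def call_tool(spec_json: str, llm_args_json: str) -> EchoObservation:
    spec = json.loads(spec_json)
    executor = REGISTRY[spec["name"]](spec["params"])     # lazy re-instantiation
    try:
        action = EchoAction(**json.loads(llm_args_json))  # validate before executing
    except (TypeError, ValueError) as exc:
        return EchoObservation(f"invalid arguments: {exc}", is_error=True)
    return executor(action)


ok = call_tool(TOOL_SPEC, '{"text": "hi", "repeat": 2}')
bad = call_tool(TOOL_SPEC, '{"text": "hi", "repeat": 9}')
print(ok.output, "|", bad.output)
```

Invalid arguments never reach the executor; they come back as a structured error observation the model can react to, and only `TOOL_SPEC` (a JSON string) would need to travel to a remote process.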
The agent is a stateless event processor. Agent objects are immutable specifications (LLM settings, tool specs, security policy, context) that execute step-by-step, emitting events through callbacks. That single design choice buys security interleaving (risk review before execution), pause/resume, and real-time streaming. AgentContext carries skills (programmatic, or markdown from .openhands/skills/ and compatible formats such as agents.md), either always active or keyword-triggered. The Condenser writes summaries into the log as events and drops forgotten history at prompt-build time; the default LLMSummarizingCondenser is reported to cut API cost up to 2x with no performance degradation.1
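The condensation behaviour described above (summaries are appended as events; forgotten history is dropped only when the prompt is built) can be modelled as follows. This is a sketch under stated assumptions: `Message`, `CondensationSummary`, `condense`, and `build_prompt` are hypothetical names, and the summary text is a placeholder for what an LLM call would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int
    text: str


@dataclass(frozen=True)
class CondensationSummary:
    seq: int
    forgotten: frozenset[int]   # seq numbers hidden from future prompts
    summary: str


def condense(log: list, keep_last: int) -> list:
    """Append a summary event covering all but the last `keep_last` messages."""
    messages = [e for e in log if isinstance(e, Message)]
    old = messages[:-keep_last] if keep_last else messages
    if not old:
        return log
    # Placeholder for the LLM call that would write the real summary.
    summary = f"{len(old)} earlier messages, first: {old[0].text!r}"
    event = CondensationSummary(len(log), frozenset(m.seq for m in old), summary)
    return log + [event]        # append-only: earlier events are untouched


def build_prompt(log: list) -> list[str]:
    """Apply condensations at prompt-build time; the log itself keeps everything."""
    summaries = [e for e in log if isinstance(e, CondensationSummary)]
    forgotten: set[int] = set()
    for s in summaries:
        forgotten |= s.forgotten
    kept = [e.text for e in log if isinstance(e, Message) and e.seq not in forgotten]
    return [f"[summary] {s.summary}" for s in summaries] + kept


log = [Message(i, f"step {i}") for i in range(6)]
condensed = condense(log, keep_last=2)
prompt = build_prompt(condensed)
print(prompt)
```

The log grows from six to seven events while the prompt shrinks from six entries to three, so replay and audit still see the full history.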
Security and secrets. A SecurityAnalyzer rates each tool call low/medium/high/unknown; a ConfirmationPolicy decides whether to pause in a WAITING_FOR_CONFIRMATION state for approval. The built-in pair is LLMSecurityAnalyzer (a security_risk field on tool calls) and ConfirmRisky (blocks above a threshold, default high), both replaceable without touching executors. SecretRegistry binds credentials late, per conversation, masks secret values in tool output (<secret-hidden>), redacts them from serialization, supports callable refreshers, and rotates live mid-conversation.1
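The analyzer, policy, and masking roles can be sketched with stdlib code. Everything here is illustrative: the rule-based `rate` function stands in for an LLM-based analyzer, `needs_confirmation` treats "at or above the threshold" and unknown risk as requiring approval (both are assumptions of this sketch, not documented SDK behaviour), and `mask` is a simple string replacement.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    UNKNOWN = 4   # assumption of this sketch: unknown is handled like high


def rate(command: str) -> Risk:
    """Rule-based stand-in for a security analyzer rating one tool call."""
    if "rm -rf" in command:
        return Risk.HIGH
    if command.startswith(("git push", "pip install")):
        return Risk.MEDIUM
    if command.startswith(("ls", "cat", "git status")):
        return Risk.LOW
    return Risk.UNKNOWN


def needs_confirmation(risk: Risk, threshold: Risk = Risk.HIGH) -> bool:
    """Policy: pause for approval when the rating reaches the threshold."""
    return risk >= threshold


def mask(output: str, secrets: dict[str, str]) -> str:
    """Replace secret values in tool output; longest first to avoid partial hits."""
    for value in sorted(secrets.values(), key=len, reverse=True):
        if value:
            output = output.replace(value, "<secret-hidden>")
    return output


secrets = {"API_TOKEN": "abc123"}
for cmd in ("ls -la", "pip install requests", "rm -rf build", "make deploy"):
    risk = rate(cmd)
    print(f"{cmd!r}: {risk.name}, confirm={needs_confirmation(risk)}")
print(mask("token=abc123 ok", secrets))
```

Keeping the rating and the decision in separate functions mirrors the point made above: either can be swapped without touching the code that executes the tool.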
Local to remote. Conversation(agent, workspace) is a factory: a path or LocalWorkspace yields an in-process LocalConversation; a RemoteWorkspace yields a RemoteConversation that serializes the agent configuration to an agent server (POST /conversations, WebSocket event streaming). Official Docker images bundle the server with VS Code Web, VNC, and Chromium for human inspection of a running agent.1
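The factory behaviour can be modelled as a dispatch on the workspace type. The sketch below performs no network I/O and does not use the SDK: the class names echo the text above, but their fields, the `make_conversation` function, and the shape of the request body are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalWorkspace:
    path: str


@dataclass(frozen=True)
class RemoteWorkspace:
    host: str


class LocalConversation:
    """Runs in-process: holds the agent spec directly."""
    def __init__(self, agent: dict, workspace: LocalWorkspace) -> None:
        self.agent, self.workspace = agent, workspace


class RemoteConversation:
    """Only serialized configuration crosses the boundary (here: a JSON body)."""
    def __init__(self, agent: dict, workspace: RemoteWorkspace) -> None:
        self.endpoint = f"{workspace.host}/conversations"
        self.request_body = json.dumps({"agent": agent})


def make_conversation(agent: dict, workspace):
    """Pick the conversation type from the workspace; callers never name it."""
    if isinstance(workspace, str):          # a bare path means local execution
        workspace = LocalWorkspace(workspace)
    if isinstance(workspace, LocalWorkspace):
        return LocalConversation(agent, workspace)
    if isinstance(workspace, RemoteWorkspace):
        return RemoteConversation(agent, workspace)
    raise TypeError(f"unsupported workspace: {type(workspace).__name__}")


agent = {"llm": "example/model", "tools": ["bash"]}
local = make_conversation(agent, "/path/to/project")
remote = make_conversation(agent, RemoteWorkspace("http://localhost:8000"))
print(type(local).__name__, type(remote).__name__, remote.endpoint)
```

The calling code is identical in both cases, which is why the agent specification has to be serializable in the first place.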
How to use it
The local-to-remote move is the operational core (reference template, paper Fig. 5):
# Reference template: the only change from local to sandboxed remote execution.
from openhands.sdk import LLM, Conversation
from openhands.tools.preset.default import get_default_agent # paper Fig. 5 prints sdk.preset; the live repo ships tools.preset
from openhands.workspace import DockerWorkspace
llm = LLM(model="anthropic/claude-sonnet-4.1", api_key="...")
agent = get_default_agent(llm=llm)
with DockerWorkspace(...) as workspace:  # local run: use Conversation(agent=agent)
    conversation = Conversation(agent=agent, workspace=workspace)
    conversation.send_message("Create hello.py")
    conversation.run()
Multi-LLM routing subclasses RouterLLM and implements select_llm(); the paper's example routes image-bearing messages to a multimodal model and everything else to a cheaper one, with no agent changes.
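The routing rule described above can be modelled without the SDK. In this sketch only the method name `select_llm()` comes from the source; the `Router` and `Message` classes, the route keys, the assumption that `select_llm()` returns a route key, and the `example/...` model identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    image_urls: tuple[str, ...] = ()


class Router:
    """Hypothetical stand-in for a RouterLLM subclass."""

    def __init__(self, llms: dict[str, str]) -> None:
        self.llms = llms   # route key -> model identifier

    def select_llm(self, messages: list[Message]) -> str:
        """Route image-bearing conversations to the multimodal model."""
        has_image = any(m.image_urls for m in messages)
        return "multimodal" if has_image else "cheap"

    def model_for(self, messages: list[Message]) -> str:
        return self.llms[self.select_llm(messages)]


router = Router({"multimodal": "example/vision-model", "cheap": "example/small-model"})
text_only = [Message("Summarize README.md")]
with_image = [Message("What does this screenshot show?", ("file:///tmp/shot.png",))]
print(router.model_for(text_only), router.model_for(with_image))
```

The agent only ever sees one LLM-shaped object, so the routing policy can change without any change to agent code.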
How to develop with it
The contract worth internalizing before extending anything is the state model: metadata in one document, events append-only and write-once, and resume as replay plus unmatched-action detection (the paper documents type-safe serialization via discriminated unions; the versioned-schema rejection below is this page's hardening of that contract, not a paper-documented mechanism). The stdlib-only model below is executed and asserted, including the crash-mid-action case and two rejected inputs (an incompatible snapshot version and a malformed event record):
# conversation_state.py - validated: the SDK's dual-path persistence contract in
# miniature. Metadata writes to one base-state document, events append as one
# record each, resume replays the log and continues from the last unmatched
# action. A model of the contract (stdlib only); it does not run the SDK.
from __future__ import annotations
import json
from dataclasses import asdict, dataclass
SCHEMA_VERSION = 1
EVENT_KINDS = ("MessageEvent", "ActionEvent", "ObservationEvent")

@dataclass(frozen=True)
class Event:
    seq: int
    kind: str
    payload: str

    def __post_init__(self) -> None:
        assert self.kind in EVENT_KINDS, f"unknown event kind: {self.kind}"


@dataclass
class ConversationState:
    agent_status: str = "idle"
    events: tuple[Event, ...] = ()


def persist_event(store: dict[str, str], ev: Event) -> None:
    """EventStore path: one immutable record per event, append-only."""
    key = f"events/{ev.seq:05d}.json"
    assert key not in store, "event files are write-once"
    store[key] = json.dumps({"schema": SCHEMA_VERSION, **asdict(ev)})


def persist_metadata(store: dict[str, str], state: ConversationState) -> None:
    """base_state.json path: rewritten only on metadata change."""
    store["base_state.json"] = json.dumps(
        {"schema": SCHEMA_VERSION, "agent_status": state.agent_status})


def load(store: dict[str, str]) -> ConversationState:
    """Resume: load base state, replay the event log in order, fail fast on
    schema mismatch or malformed records."""
    base = json.loads(store["base_state.json"])
    if base.get("schema") != SCHEMA_VERSION:
        raise ValueError(f"incompatible snapshot schema: {base.get('schema')}")
    events = []
    for key in sorted(k for k in store if k.startswith("events/")):
        rec = json.loads(store[key])
        if rec.pop("schema", None) != SCHEMA_VERSION:
            raise ValueError(f"incompatible event schema in {key}")
        events.append(Event(**rec))
    assert [e.seq for e in events] == list(range(len(events))), "gap in event log"
    return ConversationState(base["agent_status"], tuple(events))


def unmatched_action(state: ConversationState) -> Event | None:
    """Crash detection: an ActionEvent whose ObservationEvent never landed."""
    pending = None
    for ev in state.events:
        if ev.kind == "ActionEvent":
            pending = ev
        elif ev.kind == "ObservationEvent":
            pending = None
    return pending


def run_turn(store: dict[str, str], state: ConversationState, actions: list[str],
             crash_after_action: int | None = None) -> ConversationState:
    """Drive action/observation cycles with per-event persistence; optionally
    crash after persisting the Nth action but before its observation."""
    for i, cmd in enumerate(actions):
        act = Event(len(state.events), "ActionEvent", cmd)
        persist_event(store, act)
        state = ConversationState(state.agent_status, state.events + (act,))
        if crash_after_action == i:
            return state  # process dies here
        obs = Event(len(state.events), "ObservationEvent", f"ok:{cmd}")
        persist_event(store, obs)
        state = ConversationState(state.agent_status, state.events + (obs,))
    return state

# 1) Round-trip identity and incremental persistence (one record per event,
# base_state.json untouched by event appends).
store: dict[str, str] = {}
state = ConversationState()
persist_metadata(store, state)
state = run_turn(store, state, ["ls", "make test"])
n_files = len(store)
assert load(store) == state # exact state reconstruction
assert n_files == 1 + 4 # base_state.json + 4 event records
# 2) Crash recovery: the interrupted run resumes at the exact unfinished action;
# the completed history equals the uninterrupted run, nothing lost or redone.
crash_store: dict[str, str] = {}
persist_metadata(crash_store, ConversationState())
run_turn(crash_store, ConversationState(), ["ls", "make test"], crash_after_action=1)
recovered = load(crash_store)
pending = unmatched_action(recovered)
assert pending is not None and pending.payload == "make test"
obs = Event(len(recovered.events), "ObservationEvent", f"ok:{pending.payload}")
persist_event(crash_store, obs)
recovered = ConversationState(recovered.agent_status, recovered.events + (obs,))
assert [ (e.kind, e.payload) for e in recovered.events ] == \
[ (e.kind, e.payload) for e in state.events ]
assert unmatched_action(recovered) is None
# 3) Fail fast: an incompatible snapshot version and a malformed event record
# are both rejected, never silently loaded.
bad = dict(store)
bad["base_state.json"] = json.dumps({"schema": 0, "agent_status": "idle"})
try:
    load(bad)
    raise SystemExit("incompatible schema must be rejected")
except ValueError:
    pass
tampered = dict(store)
tampered["events/00001.json"] = json.dumps(
    {"schema": SCHEMA_VERSION, "seq": 1, "kind": "EvilEvent", "payload": "x"})
try:
    load(tampered)
    raise SystemExit("malformed event kind must be rejected")
except AssertionError:
    pass
print("files after 2 cycles:", n_files, "(1 base_state.json + 4 event records)")
print("crash recovery resumed at:", pending.payload)
print("all conversation-state assertions passed")
Output:
files after 2 cycles: 5 (1 base_state.json + 4 event records)
crash recovery resumed at: make test
all conversation-state assertions passed
Extension points beyond state: custom tools are an Action/Observation schema pair plus a ToolExecutor; new agents, condensers, security analyzers, and confirmation policies are typed components registered without core changes; delegation-style coordination is implemented entirely as a tool, which the paper offers as the proof that orchestration needs no framework surgery.
How to maintain it
Adopt the paper's own three-tier QA shape: programmatic tests with mocked LLM calls on every commit; LLM-based integration and example tests daily and on demand ($0.5 to $3 and under 5 minutes per run, on real models); benchmark evaluation on demand ($100 to $1000, hours). The one production incident class V1 did suffer is instructive: 29.7/1k SDK errors during the rollout were dominated by a condensation bug triggered when a provider added constraints on the event/message interface for extended thinking (since fixed). Provider-side interface drift is a first-class regression source, so pin model IDs and re-run the LLM-based tier when a provider changes anything about reasoning or message formats.2 Track current per-model results on the continuously updated OpenHands Index rather than freezing the paper's table.
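The first tier (programmatic tests with mocked LLM calls) might look like the sketch below, which uses only `unittest.mock` from the standard library. `propose_action` and the `llm.complete` method are hypothetical stand-ins for a unit under test and its LLM client; they are not SDK APIs.

```python
import json
from unittest.mock import Mock


def propose_action(llm, user_message: str) -> dict:
    """Hypothetical unit under test: ask the LLM for one tool call and parse it."""
    raw = llm.complete(user_message)
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return {"tool": "error", "detail": "unparseable tool call"}
    return {"tool": call["name"], "args": call.get("arguments", {})}


def test_valid_tool_call() -> None:
    llm = Mock()
    llm.complete.return_value = '{"name": "bash", "arguments": {"command": "ls"}}'
    assert propose_action(llm, "list files") == {"tool": "bash", "args": {"command": "ls"}}
    llm.complete.assert_called_once_with("list files")


def test_malformed_reply_becomes_error() -> None:
    llm = Mock()
    llm.complete.return_value = "not json"
    assert propose_action(llm, "list files")["tool"] == "error"


test_valid_tool_call()
test_malformed_reply_becomes_error()
print("mocked-LLM tests passed")
```

Tests of this kind cost nothing per run and are deterministic, which is what makes them suitable for every commit; they cannot catch provider-side drift, which is the job of the LLM-based tier.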
Running it in production
Deploy the agent server in its official container (API server plus VS Code Web, VNC, Chromium), one container per agent instance with dedicated filesystem and resources; that is the paper's SaaS-style multi-tenancy unit, with per-conversation secrets and session-based authentication as the access-control foundation and a full multi-tenant security audit explicitly future work.1 Wire confirmation policy to risk appetite per route (read-only operations can auto-approve; mutating actions above the threshold pause for approval), rotate credentials through the secret registry rather than environment baking, and persist event logs as the audit artifact (identity and access, orchestration and control plane). Benchmark evidence for capability at this layer: best-per-category SDK results of 76.6% SWE-Bench Verified (Opus 4.5), 56.2% Commit0 (GPT-5.4), 44.1% SWE-Bench Multimodal (Gemini 3.1 Pro), 78.8% SWT-Bench (Opus 4.6), and 80.0% GAIA test (Opus 4.6), state of the art on three of five categories with single-model runs.4
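The "three of five" claim can be checked against the published state-of-the-art values given in this page's footnote for paper Section 5.4. The sketch below is arithmetic over those quoted numbers; counting the category with no published comparison (n.a.) as a lead is an assumption made here to reproduce the count.

```python
# Best SDK result vs published SOTA per category (percent), as quoted on this page.
best_sdk = {
    "SWE-Bench Verified": 76.6,
    "Commit0": 56.2,
    "SWE-Bench Multimodal": 44.1,
    "SWT-Bench": 78.8,
    "GAIA": 80.0,
}
published_sota = {
    "SWE-Bench Verified": 79.2,
    "Commit0": 12.5,
    "SWE-Bench Multimodal": None,   # n.a.: no published comparison
    "SWT-Bench": 84.0,
    "GAIA": 74.6,
}

leads, gaps = [], {}
for category, score in best_sdk.items():
    sota = published_sota[category]
    if sota is None or score > sota:    # assumption: n.a. counts as a lead
        leads.append(category)
    else:
        gaps[category] = round(sota - score, 1)

print("leads:", leads)
print("gaps to SOTA (points):", gaps)
```

Under that assumption the SDK leads on Commit0, SWE-Bench Multimodal, and GAIA, and trails by 2.6 and 5.2 points on the other two categories.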
Failure modes
- Security analysis treated as isolation. The LLM risk rater misclassifies under adversarial prompts by the paper's own admission; anything irreversible needs the sandbox boundary and confirmation policy, not just the analyzer.
- Provider interface drift breaking internals. The condensation incident shows agent-side machinery coupled to provider message semantics; upgrades to reasoning formats or tool-call schemas warrant the LLM-based test tier before rollout.2
- Recreating V0's configuration sprawl. The SDK's immutability discipline only helps if applications keep their own configuration equally explicit; layering mutable overrides on top reintroduces the two-identical-runs-diverge bug class the redesign exists to kill.
- Non-serializable state smuggled across boundaries. Executors do not serialize; anything crossing process lines must go through the tool registry's spec-plus-resolver pattern, or remote execution breaks in ways local testing never shows.
- Benchmark numbers read as model-agnostic. Task-model specialization is strong (Claude models lead issue resolution and testing; GPT-5.4 leads greenfield, though the paper's own +12.5-point margin conflicts with its Table 5 runner-up value); pick per task category, and re-check the live index.4
- Single-agent assumptions. Coordination beyond blocking delegation is future work; building multi-agent products on interleaved event logs today means owning the coordination semantics yourself.
References
- Wang et al., The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents (MLSys 2026, arXiv 2511.03690): https://arxiv.org/abs/2511.03690
- Hugging Face paper page: https://huggingface.co/papers/2511.03690
- SDK repository (MIT): https://github.com/OpenHands/software-agent-sdk
- Benchmark harness: https://github.com/OpenHands/benchmarks
- OpenHands Index (live per-model results): https://index.openhands.dev
- OpenHands documentation: https://docs.openhands.dev
- Wang et al., OpenHands: An Open Platform for AI Software Developers as Generalist Agents (the platform paper, arXiv 2407.16741): https://arxiv.org/abs/2407.16741
Related: OpenHands agent platform · Agent harness architecture · The agent loop · Sandboxing and isolation · Orchestration and control plane · Running local coding agents · Identity and access · Context and memory
1. arXiv 2511.03690v2 (MLSys 2026): four packages (sdk, tools, workspace, agent_server); event hierarchy with discriminated-union serialization; ConversationState as sole mutable state (metadata fields + append-only EventLog, FIFO lock, dual-path persistence with base_state.json plus per-event JSON files); LiteLLM-backed LLM layer (100+ providers, Responses API, ThinkingBlock/ReasoningItemModel, NonNativeToolCallingMixin, RouterLLM); Action-Execution-Observation tool contract with MCP via FastMCP and registry-based spec resolution; stateless immutable agents with on_event callbacks; AgentContext skills (.openhands/skills/, agents.md-compatible, keyword triggers); LLMSummarizingCondenser (up to 2x cost reduction, no degradation); SecretRegistry (late-bound, masking, live rotation, cipher-encrypted serialization); SecurityAnalyzer + ConfirmationPolicy (LLMSecurityAnalyzer, ConfirmRisky, WAITING_FOR_CONFIRMATION); Conversation/Workspace factory (LocalConversation in-process; RemoteConversation over REST/WebSocket; DockerWorkspace, APIRemoteWorkspace); V0 config sprawl 140+ fields / 15 classes / 2.8K lines; limitations: single-agent focus, LLM-based security fallibility.
2. Paper Section 5.1 (15-day parallel production rollout, errors per 1k conversations): V0 78.0 total system-attributable (HTTPStatusError 401 at 43.0, AgentRuntimeNotReadyError 18.8, connection/timeout 3.1, from inter-pod communication) vs V1 30.0 (infrastructure 0.0; SDK 29.7, dominated by a condensation bug during extended-thinking rollout, since fixed): a 61% reduction. LLM-provider errors excluded as external to both.
3. Paper Section 5.2, replaying 433 SWE-Bench Verified conversations (39,870 events) through the production LocalFileStore: per-event persist 0.20 ms median / 0.31 ms P95; action-cycle persist 0.40 / 0.56 ms; full replay 4.1 / 9.7 / 18.9 ms (median / P95 / max at 358 events); crash recovery 7.4 / 14.9 / 32.1 ms; storage 380 KB / 1.4 MB / 3.4 MB per conversation.
4. Paper Section 5.4: five categories (SWE-Bench Verified, Commit0, SWE-Bench Multimodal, SWT-Bench, GAIA), 14 models (7 closed, 7 open-weights); V0-vs-V1 matched-model: Sonnet 4 at 68.0% both, Sonnet 4.5 64.6% to 72.8% (+8.2, extended thinking); best SDK per category 76.6% / 56.2% / 44.1% / 78.8% / 80.0% vs published SOTA 79.2% / 12.5% / n.a. / 84.0% / 74.6% (SOTA on 3 of 5, within 2.6 and 5.2 points otherwise, single-model runs); GPT-5.4 leads greenfield at 62.5%, +12.5 over second place.
5. Paper Table 6 (assessed from documentation as of October 2025: OpenAI Agents SDK v0.4.2, Claude Agent SDK v0.1.6, Google ADK v1.17.0, LangChain v1.0.3 / LangGraph v1.0.2, OpenHands v1.0.0): OpenHands uniquely combines native remote execution with sandboxing, per-action LLM security analysis, model-agnostic multi-LLM routing with non-function-calling support, secrets auto-masking, agent stuck detection, and built-in academic benchmark evaluation; LangGraph is the closest open-source alternative on server/deployment features.