OpenHands (OpenDevin): an open platform for software-engineering agents

Scope: the OpenHands platform (formerly OpenDevin) as described in the paper that introduced it: the event-stream architecture that carries all agent state, the agent abstraction (a step function from state to action), the sandboxed runtime that turns actions into observations, the AgentSkills tool library, multi-agent delegation, and the built-in evaluation harness spanning 15 benchmarks. This page covers the platform itself; running open-weight coding agents in general is covered in local coding agents, and the generic anatomy of the layer OpenHands implements is covered in agent harness architecture.

Facts and numbers below are from the OpenHands paper (arXiv 2407.16741, mid-2024 models) and are dated by design; the project has since reorganized around an agent SDK and a browser-based control center, so verify current entry points on the repo before depending on them (as of 2026-07). Install and agent-definition snippets are reference templates, unexecuted. The Python event-stream example is executed and asserted (stdlib only).

flowchart LR
  USER["User (chat UI / CLI)"] --> ES["Event stream:<br/>chronological actions + observations"]
  AGENT["Agent.step(state) returns action<br/>(CodeActAgent, BrowsingAgent, micro-agents)"] --> ES
  ES --> RT["Runtime: per-session Docker sandbox<br/>(REST action-execution API)"]
  subgraph SANDBOX["Sandbox contents"]
    BASH["bash shell"]
    IPY["Jupyter IPython + AgentSkills"]
    BROWSER["Chromium via Playwright<br/>(BrowserGym primitives)"]
  end
  RT --> BASH
  RT --> IPY
  RT --> BROWSER
  RT -->|"observations"| ES
  AGENT -->|"AgentDelegateAction"| SUB["Specialist agent<br/>(e.g. BrowsingAgent)"]
  SUB --> ES

What it is

OpenHands (introduced as OpenDevin, renamed after the paper) is an MIT-licensed, community-driven platform for building agents that act the way a human developer does: by writing code, running commands, and browsing the web.¹ The paper's architecture has three parts. An agent abstraction reduces every agent to a step function that reads the current state and returns one action. An event stream records every action and observation in chronological order and is itself the state: what the agent perceives is a fold over that log, plus auxiliary bookkeeping such as accumulated LLM cost and delegation metadata. A runtime executes each action inside a per-session Docker sandbox and appends the result to the stream as an observation.

The core action space is deliberately small and programming-language-based: IPythonRunCellAction executes Python in a Jupyter server, CmdRunAction runs bash, and BrowseInteractiveAction drives a Chromium browser through the BrowserGym domain-specific language; MessageAction, AgentFinishAction, and AgentDelegateAction handle conversation, termination, and handoff to another agent. File editing and multi-modal reading are not separate action types but functions in the AgentSkills library, a Python package auto-imported into the IPython environment (edit_file, scroll_up/scroll_down, parse_image, parse_pdf). At the time of writing, the paper reported an agent hub of over 10 community agents, more than 2.1K contributions from over 188 contributors, and 32K GitHub stars.
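The split between sandbox-executed actions and control actions can be sketched as a small routing table. This is a minimal illustration, not the OpenHands implementation: the dataclasses reuse the paper's action names but their fields are simplified stand-ins, and the `route` function and the `"controller"` label are hypothetical.

```python
from dataclasses import dataclass


# Simplified stand-ins for the paper's action types; fields are illustrative.
@dataclass(frozen=True)
class CmdRunAction:
    command: str


@dataclass(frozen=True)
class IPythonRunCellAction:
    code: str


@dataclass(frozen=True)
class BrowseInteractiveAction:
    code: str


@dataclass(frozen=True)
class MessageAction:
    content: str


@dataclass(frozen=True)
class AgentFinishAction:
    pass


@dataclass(frozen=True)
class AgentDelegateAction:
    agent: str
    task: str


# Actions the sandbox executes, mapped to the component that would run them.
SANDBOX_TARGETS = {
    CmdRunAction: "bash",
    IPythonRunCellAction: "ipython",
    BrowseInteractiveAction: "browser",
}
# Actions handled outside the sandbox: conversation, termination, handoff.
CONTROL_ACTIONS = (MessageAction, AgentFinishAction, AgentDelegateAction)


def route(action: object) -> str:
    """Return where an action is handled (hypothetical helper)."""
    target = SANDBOX_TARGETS.get(type(action))
    if target is not None:
        return target
    if isinstance(action, CONTROL_ACTIONS):
        return "controller"
    raise TypeError(f"unknown action type: {type(action).__name__}")


# File editing is not its own action type: it is Python sent through the
# IPython action. The call's arguments are elided on purpose.
edit = IPythonRunCellAction(code="edit_file(...)")
print(route(edit))                 # ipython
print(route(CmdRunAction("ls")))   # bash
print(route(AgentFinishAction()))  # controller
```

Keeping the sandbox-facing set at three entries is what lets tools such as file editing grow inside the skills library without widening the action space.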

Since then the project has kept the name but widened the product: as of 2026-07 the repository describes a self-hosted control center for coding agents and automations (running OpenHands and other ACP-compatible agents across local, remote, and cloud backends), with the paper-era application preserved as the V0 reference. The architecture below is the paper's, and it remains the clearest published description of an event-stream agent platform.

Why use it

One platform instead of five components. The paper's framework comparison positions OpenHands as the only surveyed framework with a standardized tool library, built-in sandboxed code execution, a built-in browser, multi-agent delegation, human-in-the-loop chat, an agent hub, an evaluation framework, and agent quality-control tests together.¹
A generalist agent that holds up across categories. The same CodeActAgent, with an unmodified system prompt, scored competitively on software engineering, web, and assistance tasks at once, where most baselines specialize in one category.²
Evaluation is built in, with cost tracking. 15 benchmarks (SWE-bench, HumanEvalFix, BIRD, BioCoder, ML-Bench, APIBench, ToolQA, WebArena, MiniWoB++, GAIA, GPQA, AgentBench, MINT, EDA, ProofWriter) run under one harness that records per-instance dollar cost, which is how the paper can report SWE-bench Lite at 26.0% for $1.10 per instance.
Reproducible runtimes. Runtime images carry a dual tag: a hash tag derived from the build context (same hash, same contents) and a mutable generic tag per OpenHands version and base image. Pinning the hash tag gives byte-identical sandboxes across an evaluation campaign.
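The dual-tag idea in the last item can be illustrated with a content hash. The sketch below is a toy under stated assumptions: the build context is an in-memory mapping of paths to bytes, the hashing scheme (SHA-256 over sorted, length-prefixed entries) is chosen for the example, and both tag formats and the version string are hypothetical. It does not reproduce how OpenHands actually computes or names its tags.

```python
import hashlib


def context_hash(files: dict[str, bytes]) -> str:
    """Hash paths and contents in sorted order so the result is order-independent."""
    digest = hashlib.sha256()
    for path in sorted(files):
        name = path.encode()
        data = files[path]
        # Length-prefix each field so adjacent fields cannot be confused.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()[:16]


def image_tags(version: str, base_image: str, files: dict[str, bytes]) -> tuple[str, str]:
    """Return (hash_tag, generic_tag); both formats are made up for this sketch."""
    safe_base = base_image.replace(":", "_").replace("/", "_")
    return f"oh_{context_hash(files)}", f"oh_v{version}_{safe_base}"


context = {"Dockerfile": b"FROM ubuntu:22.04\n", "server.py": b"print('hi')\n"}
changed = dict(context)
changed["server.py"] = b"print('bye')\n"

hash_a, generic_a = image_tags("0.9", "ubuntu:22.04", context)
hash_b, generic_b = image_tags("0.9", "ubuntu:22.04", changed)

# The content change moves the hash tag but not the generic tag.
print(hash_a != hash_b, generic_a == generic_b)  # True True
```

The generic tag staying put while contents change is exactly why it is unsuitable for like-for-like comparisons, and why the hash tag is the one to pin.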

When to use it (and when not)

Use it as a research or evaluation platform: implementing a new agent against a stable action space, benchmarking it across task categories, or studying multi-agent delegation with the event stream as a complete audit log.
Use it to self-host a capable coding agent with a UI, when you want the sandbox, browser, and tool library maintained by a large community rather than assembled by hand.
Skip it when you only need a terminal coding assistant against a local model; a lighter harness wired to an OpenAI-compatible endpoint is simpler (local coding agents).
Skip the default sandbox as a security boundary for hostile multi-tenant workloads. A Docker container stops accidents, not a motivated adversary; layer on the controls described in sandboxing and isolation for anything internet-exposed.
Do not treat the paper's numbers as current. They were measured on mid-2024 models (gpt-4o, claude-3-5-sonnet); they rank architectures at a point in time, not today's ceiling.

Architecture

The event stream is the load-bearing choice. User interfaces, agents, and the runtime never call each other directly; each reads from and appends to the same chronological log, which makes the UI agent-agnostic and every run replayable by construction. The agent side is a pure mapping: step(state) -> action, where the state wraps the event history plus execution metadata. The runtime side is a client-server split: the OpenHands backend sends each action over a REST API to an action-execution server running inside the session's Docker container, which owns a bash shell, a Jupyter IPython server, and a Playwright-driven Chromium browser, and returns execution results as observations (for browsing: HTML, DOM, accessibility tree, screenshots, open tabs). Arbitrary base images are supported by a build step that layers the action-execution server into the user's image. Delegation is just another action: AgentDelegateAction hands a subtask to a specialist (the generalist CodeActAgent delegates browsing to BrowsingAgent), and the delegate's activity lands in the same stream.

The executed example below validates the contract this architecture depends on: state is a fold over an append-only event log, the step function is pure over history, replay is deterministic, truncation is observable, and unknown event types fail fast.

# openhands_event_stream.py -- validated: the event-stream contract OpenHands is built on.
# Executed and asserted (stdlib only). It validates the architectural core (state as an
# append-only event log, the agent as a pure step function over history, deterministic
# replay); it does not run the OpenHands product.
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CmdRunAction:
    command: str


@dataclass(frozen=True)
class AgentFinishAction:
    pass


@dataclass(frozen=True)
class CmdOutputObservation:
    exit_code: int
    output: str


Event = Union[CmdRunAction, AgentFinishAction, CmdOutputObservation]


def derive_state(events: tuple[Event, ...]) -> dict[str, object]:
    """Fold the event stream into the derived state the agent perceives."""
    state: dict[str, object] = {"steps": 0, "last_exit": None, "done": False}
    for ev in events:
        if isinstance(ev, CmdRunAction):
            state["steps"] = int(state["steps"]) + 1  # type: ignore[call-overload]
        elif isinstance(ev, CmdOutputObservation):
            state["last_exit"] = ev.exit_code
        elif isinstance(ev, AgentFinishAction):
            state["done"] = True
        else:
            raise AssertionError(f"unknown event type: {type(ev).__name__}")
    return state


def step(events: tuple[Event, ...]) -> Event:
    """Agent step: a pure function of the event history, as in the agent abstraction."""
    state = derive_state(events)
    if state["last_exit"] is None:
        return CmdRunAction("make test")
    if state["last_exit"] != 0:
        return CmdRunAction("make test FLAKY_RETRIES=1")
    return AgentFinishAction()


def run(script: dict[str, int]) -> tuple[Event, ...]:
    """Drive the agent against a scripted runtime until AgentFinishAction."""
    events: tuple[Event, ...] = ()
    while True:
        action = step(events)
        events = events + (action,)
        if isinstance(action, AgentFinishAction):
            return events
        assert isinstance(action, CmdRunAction)
        exit_code = script[action.command]  # unknown command fails fast (KeyError)
        events = events + (CmdOutputObservation(exit_code, f"exit={exit_code}"),)


script = {"make test": 1, "make test FLAKY_RETRIES=1": 0}
stream = run(script)

# 1. Deterministic replay: rerunning the loop reconstructs a value-identical event
#    stream and state, and the step function is pure (same history, same action).
assert run(script) == stream
assert derive_state(run(script)) == derive_state(stream)
assert step(stream[:2]) == step(stream[:2])
assert derive_state(stream) == {"steps": 2, "last_exit": 0, "done": True}

# 2. Context loss is observable: truncating the stream changes the derived state,
#    so a lossy history is a different state, never a silent no-op.
truncated = stream[:-2]
assert derive_state(truncated) != derive_state(stream)
assert derive_state(truncated)["done"] is False

# 3. Unknown event types fail fast instead of folding silently.
try:
    derive_state(stream + ("not-an-event",))  # type: ignore[operator]
    raise SystemExit("unknown event type must not fold")
except AssertionError as err:
    assert "unknown event type" in str(err)

print("event stream:", [type(e).__name__ for e in stream])
print("derived state:", derive_state(stream))
print("all event-stream assertions passed")

Output:

event stream: ['CmdRunAction', 'CmdOutputObservation', 'CmdRunAction', 'CmdOutputObservation', 'AgentFinishAction']
derived state: {'steps': 2, 'last_exit': 0, 'done': True}
all event-stream assertions passed
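Delegation, described in the architecture paragraph as just another action whose results land in the same stream, can be sketched in the same spirit. This is an illustrative toy, not OpenHands code: the `Event` record, the scripted `browsing_agent`, and the event kinds are all hypothetical; only the agent names come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # which agent emitted the event
    kind: str     # hypothetical labels: "delegate", "browse", "finish"
    payload: str


def browsing_agent(task: str) -> list[Event]:
    """Scripted specialist: emits its own events, ending with a finish event."""
    return [
        Event("BrowsingAgent", "browse", f"search for: {task}"),
        Event("BrowsingAgent", "finish", "found 3 results"),
    ]


def run_with_delegation(task: str) -> list[Event]:
    """The generalist hands off a subtask; the delegate appends to the same log."""
    stream: list[Event] = [Event("CodeActAgent", "delegate", task)]
    stream.extend(browsing_agent(task))  # delegate activity lands in the same stream
    result = stream[-1].payload
    stream.append(Event("CodeActAgent", "finish", f"delegate reported: {result}"))
    return stream


stream = run_with_delegation("latest release notes")
print([e.source for e in stream])
# ['CodeActAgent', 'BrowsingAgent', 'BrowsingAgent', 'CodeActAgent']
```

Because the specialist's events sit in the same log as the generalist's, one pass over the stream reconstructs who did what, which is what makes the stream usable as an audit trail.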

How to use it

The paper-era workflow: start the application, which spins up a per-session Docker sandbox with a configurable workspace directory mounted into it; interact through the chat UI, which visualizes bash commands, Python cells, and browser activity as they land in the event stream; interrupt and redirect the agent at any point (the UI reads the same stream, so human feedback is just another event). Model access is configured per provider; the paper's evaluations ran the same agents against gpt-4o-mini, gpt-4o, and claude-3-5-sonnet by switching the LLM configuration only. As of 2026-07 the documented entry points are the browser-based control center (Agent Canvas), a CLI, a cloud service, and the software agent SDK, with the paper-era local GUI kept as a legacy reference; take the exact install command from the current docs rather than a snapshot here.
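The point that human feedback is just another event can be made concrete with a toy step function. The sketch below assumes a fixed three-item plan and a hypothetical `Event` record with a `source` field; it illustrates the idea only and does not reflect how OpenHands agents or the UI are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # "agent" or "user"
    content: str


PLAN = ("run tests", "fix failures", "finish")  # hypothetical fixed plan


def step(history: tuple[Event, ...]) -> Event:
    """Choose the next action from history alone; a fresh user event takes priority."""
    if history and history[-1].source == "user":
        # The interruption needs no side channel: it is read off the log.
        return Event("agent", f"acknowledge: {history[-1].content}")
    done = sum(1 for e in history if e.source == "agent" and e.content in PLAN)
    return Event("agent", PLAN[min(done, len(PLAN) - 1)])


history: tuple[Event, ...] = ()
first = step(history)
history += (first,)

# The user interrupts mid-run; the message is appended like any other event.
history += (Event("user", "skip the slow suite"),)
reply = step(history)
history += (reply,)
resumed = step(history)

print(first.content)    # run tests
print(reply.content)    # acknowledge: skip the slow suite
print(resumed.content)  # fix failures
```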

How to develop with it

A new agent is one class implementing reset() and step(state); everything below the step function (execution, sandboxing, observation formatting) is the platform's job. The paper's minimal agent, condensed (reference template, unexecuted):

# Reference template (paper Fig. 3, condensed): an OpenHands agent is a step function.
class MinimalAgent:
    def reset(self) -> None:
        self.system_message = "You are a helpful assistant..."

    def step(self, state: State) -> Action:
        messages = [{"role": "system", "content": self.system_message}]
        for prev_action, obs in state.history:
            messages += [get_action_message(prev_action), get_observation_message(obs)]
        action = self.parse_response(self.llm.do_completion(messages))
        if self.is_bash_command(action):
            return CmdRunAction(command=action.command)
        if self.is_python_code(action):
            return IPythonRunCellAction(code=action.code)
        if self.is_browser_action(action):
            return BrowseInteractiveAction(code=action.code)
        return MessageAction(content=action.message)

Three extension paths keep contributions cheap. Micro-agents reuse a generalist implementation (usually CodeActAgent) with a specialized prompt, so a working prompt for a niche task is shareable as an agent. AgentSkills accepts a new tool only when an LLM cannot readily achieve the effect by writing plain code, or the tool wraps an external model (the library deliberately does not re-wrap pandas, so agents do not have to be re-taught what they already know). Delegation composes specialists without new plumbing, since handoff is an action type.
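The micro-agent path can be sketched as "same class, different prompt". Everything here is hypothetical scaffolding for illustration: `GeneralistAgent` stands in for a generalist implementation, `make_micro_agent` and the `CommitWriter` name are invented for the example, and `echo_llm` is a fake model so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# (system_prompt, task) -> response; a stand-in for a real model client.
LLM = Callable[[str, str], str]


@dataclass(frozen=True)
class GeneralistAgent:
    """Stand-in for a generalist implementation; the model is injected."""
    system_prompt: str
    llm: LLM

    def step(self, task: str) -> str:
        return self.llm(self.system_prompt, task)


def make_micro_agent(name: str, prompt: str, llm: LLM) -> GeneralistAgent:
    """A micro-agent here is the same class with a specialized prompt, nothing more."""
    return GeneralistAgent(system_prompt=f"[{name}] {prompt}", llm=llm)


def echo_llm(system_prompt: str, task: str) -> str:
    """Fake model: echoes its inputs so the example is deterministic."""
    return f"{system_prompt} -> {task}"


commit_writer = make_micro_agent("CommitWriter", "Write a concise commit message.", echo_llm)
print(type(commit_writer).__name__)  # GeneralistAgent
print(commit_writer.step("diff: fix typo"))
# [CommitWriter] Write a concise commit message. -> diff: fix typo
```

No new execution code is written for the specialist, which is why a prompt that works for a niche task can be shared as an agent.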

How to maintain it

Agents regress like any software, and full benchmark runs are too expensive to gate every change (the paper estimates roughly $600 for one SWE-bench Lite evaluation with gpt-4o, and $6.9K for the full 2,294-instance set at a conservative $3 per instance). OpenHands' answer is agent integration tests: end-to-end tasks with gold outputs where every LLM call is intercepted and answered from stored prompt-response pairs keyed on exact prompt match, giving deterministic, near-free CI across platforms and sandbox types; substantial prompt changes regenerate the stored pairs against a real model. For runtime upkeep, prefer the hash-based image tag (content-addressed, reproducible) over the generic per-version tag (mutable) whenever results must be comparable across time, and rebuild images through the documented build path so the action-execution server version matches the backend.
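The replay mechanism behind these integration tests can be sketched in a few lines. This is a minimal illustration of exact-match prompt replay, assuming a hypothetical `ReplayLLM` class and a two-call `tiny_agent`; the stored pairs and the gold output are invented for the example and are not taken from the OpenHands test suite.

```python
class ReplayLLM:
    """Answers completions from stored prompt-response pairs; never calls a model."""

    def __init__(self, pairs: dict[str, str]) -> None:
        self._pairs = dict(pairs)
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        if prompt not in self._pairs:  # exact match only, no fuzzy lookup
            raise LookupError(
                f"no stored response for prompt {prompt[:40]!r}; regenerate the pairs"
            )
        return self._pairs[prompt]


def tiny_agent(llm: ReplayLLM, task: str) -> str:
    """Two-call agent under test: plan first, then act on the plan."""
    plan = llm.complete(f"PLAN: {task}")
    return llm.complete(f"ACT: {plan}")


STORED_PAIRS = {
    "PLAN: add a README": "create README.md",
    "ACT: create README.md": "echo '# Project' > README.md",
}
GOLD_OUTPUT = "echo '# Project' > README.md"

llm = ReplayLLM(STORED_PAIRS)
result = tiny_agent(llm, "add a README")
print(result == GOLD_OUTPUT, llm.calls)  # True 2

# Any prompt change, even whitespace, misses the store and fails loudly.
try:
    tiny_agent(ReplayLLM(STORED_PAIRS), "add a  README")
except LookupError as err:
    print("prompt drift detected:", err)
```

Failing on a miss rather than falling back to a live model is the design choice that keeps the test deterministic and near-free; the cost is that intentional prompt changes require regenerating the pairs.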

Running it in production

Treat every action the agent emits as untrusted code execution, because it is: arbitrary bash and Python run inside the session sandbox by design. The platform's defaults (per-session Docker isolation, a REST-only channel between backend and sandbox, a mounted workspace that scopes what the agent can touch) contain mistakes, not attacks; add the sandboxing and isolation layers (user namespaces or microVMs, egress-filtered networking, resource quotas) before exposing it to untrusted inputs, and put an identity and access boundary on any credentials mounted into the workspace. Operationally, the event stream is the audit log: persist it, alert on cost accumulation (cost lives in the state, so budget enforcement has a natural hook), and cap delegation depth. Keep humans in the loop through the chat UI for anything irreversible; the platform was explicitly designed for interruption and feedback mid-run. Benchmark before and after model or platform upgrades with the built-in harness on a fixed image hash, since agent regressions rarely announce themselves in unit tests (evaluating agents).
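The two operational guards named above, budget enforcement on accumulated cost and a cap on delegation depth, can be sketched as a single pass over per-step records. The `StepRecord` shape, the exception names, and `enforce_limits` are hypothetical; how cost and delegation metadata are actually exposed depends on the OpenHands version in use. The `campaign_cost` helper only reproduces the paper's full-set estimate quoted under maintenance (2,294 instances at $3 each).

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when accumulated LLM cost passes the configured ceiling."""


class DelegationTooDeep(RuntimeError):
    """Raised when a delegate chain nests further than allowed."""


@dataclass(frozen=True)
class StepRecord:
    cost_usd: float        # LLM cost attributed to this step
    delegation_depth: int  # 0 for the top-level agent


def enforce_limits(records: list[StepRecord], max_cost_usd: float, max_depth: int) -> float:
    """Fold over the records, raising at the first step that breaks a limit."""
    total = 0.0
    for index, record in enumerate(records):
        if record.delegation_depth > max_depth:
            raise DelegationTooDeep(
                f"step {index}: depth {record.delegation_depth} > {max_depth}"
            )
        total += record.cost_usd
        if total > max_cost_usd:
            raise BudgetExceeded(f"step {index}: ${total:.2f} > ${max_cost_usd:.2f}")
    return total


def campaign_cost(instances: int, cost_per_instance_usd: float) -> float:
    """Planning estimate for a benchmark campaign."""
    return instances * cost_per_instance_usd


run = [StepRecord(0.25, 0), StepRecord(0.5, 1), StepRecord(0.25, 0)]
print(enforce_limits(run, max_cost_usd=1.0, max_depth=1))  # 1.0
print(campaign_cost(2294, 3.0))  # 6882.0, i.e. the paper's roughly $6.9K
```

Running the check after every appended event, rather than at the end of a session, is what turns it from a report into an enforcement hook.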

Failure modes

Long-file editing. The paper's own limitations section flags that agents struggle when editing long files; the skills library mitigates with line-scoped edit_file plus scrolling, but expect edit failures on large sources.
Stale benchmark readings. The headline numbers (SWE-bench Lite 26.0% with claude-3-5-sonnet, v1.8, no hints; HumanEvalFix 79.3% 0-shot with gpt-4o; WebArena 15.5%; MiniWoB++ 40.8%; GPQA diamond 52.0%; GAIA L1 32.1 via the GPTSwarm agent; AgentBench OS 57.6%; MINT math 77.3%) are mid-2024 snapshots; quoting them as current capability misstates both the platform and the field.
Mutable image tags. Evaluating against the generic runtime tag lets image contents drift under a stable name; comparisons across weeks silently stop being like-for-like. Pin the hash tag.
Delegation without context. AgentDelegateAction hands off a subtask description, not judgment; a generalist delegating to a specialist inherits the specialist's blind spots, and on WebArena delegation scored slightly below the specialist run directly (14.5 vs 14.8 with gpt-4o).
Sandbox overconfidence. Docker-level isolation with a mounted workspace means a prompt-injected agent can still destroy or exfiltrate whatever is mounted; scope the mount and the network, and see prompt-injection defense.
Project drift. The repository's current product surface (control center, SDK, cloud) is a different shape from the paper architecture; code written against V0 interfaces needs the legacy path, and claims about "OpenHands" should say which layer they mean.
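The line-scoped editing mentioned under long-file editing can be illustrated with a small helper. This is a sketch of the general technique only: `replace_lines` is a hypothetical function written for this page, and its signature is not the AgentSkills `edit_file` signature.

```python
def replace_lines(text: str, start: int, end: int, new_content: str) -> str:
    """Replace 1-indexed, inclusive lines start..end of text with new_content."""
    lines = text.splitlines(keepends=True)
    if not (1 <= start <= end <= len(lines)):
        # Fail loudly: a silent no-op on a bad range is how long-file edits go wrong.
        raise ValueError(f"line range {start}-{end} outside 1-{len(lines)}")
    replacement = new_content if new_content.endswith("\n") else new_content + "\n"
    return "".join(lines[: start - 1]) + replacement + "".join(lines[end:])


source = "import os\nx = 1\ny = 2\nprint(x + y)\n"
edited = replace_lines(source, 2, 3, "x, y = 1, 2")
print(edited, end="")
# import os
# x, y = 1, 2
# print(x + y)
```

Scoping the edit to a line range means the model only has to reproduce the changed span, not the whole file, which is the property that helps on large sources.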

References

Wang et al., OpenHands: An Open Platform for AI Software Developers as Generalist Agents (arXiv 2407.16741): https://arxiv.org/abs/2407.16741
Hugging Face paper page (community discussion and links): https://huggingface.co/papers/2407.16741
OpenHands repository (current project state; MIT license): https://github.com/All-Hands-AI/OpenHands
OpenHands documentation (current entry points: Agent Canvas, SDK, CLI, cloud, legacy V0): https://docs.openhands.dev
All Hands AI documentation (paper-era docs domain): https://docs.all-hands.dev
Wang et al., Executable Code Actions Elicit Better LLM Agents (CodeAct, the basis of CodeActAgent, arXiv 2402.01030): https://arxiv.org/abs/2402.01030
Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv 2310.06770): https://arxiv.org/abs/2310.06770

¹ OpenHands paper (arXiv 2407.16741), sections 1-2: event stream as chronological action/observation history plus auxiliary state (LLM cost, delegation metadata); actions IPythonRunCellAction, CmdRunAction, BrowseInteractiveAction; per-session Docker sandbox with a REST action-execution API hosting bash, Jupyter IPython, and Playwright Chromium; MIT license; 2.1K+ contributions from 188+ contributors and 32K stars as of the paper.
² OpenHands paper, section 4 (15 benchmarks, tables 3-6): CodeActAgent v1.8 SWE-bench Lite 26.0% (claude-3-5-sonnet, no hints, $1.10 avg/instance) and 22.0% (gpt-4o); HumanEvalFix 79.3% (v1.5, gpt-4o, 0-shot); BrowsingAgent v1.0 WebArena 15.5% (claude-3-5-sonnet), MiniWoB++ 40.8% (gpt-4o); GPQA diamond 52.0% (claude-3-5-sonnet); GAIA L1 32.1 (GPTSwarm v1.0, gpt-4o); AgentBench OS 57.6% and MINT math 77.3% (v1.5, gpt-4o). All without benchmark-specific prompt engineering; models are mid-2024.