Cookbook: serve Ornith-1.0 with vLLM

Scope: a vLLM reference template for serving deepreinforce-ai/Ornith-1.0-397B, DeepReinforce's self-scaffolding agentic coding flagship: what the Ornith-1.0 family is (open-weight coders trained with RL to author the harness that drives their own rollouts), why the 397B MoE matters (vendor-reported 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-bench Verified, ahead of Claude Opus 4.7 on both), how to launch the 8-GPU tensor-parallel shape with the qwen3_xml/qwen3 parsers, and how to verify reasoning and tool calls.

Reference template. The commands and manifests below were not executed; the sizing arithmetic was (see the fit check below). Ornith-1.0 needs recent runtimes (Transformers >= 5.8.1, vLLM >= 0.19.1, SGLang >= 0.5.9). The model card's quickstart claims "a single 8x80GB GPU node", but the weight math shows that this holds only for the FP8 checkpoint. The 397B card also carries boilerplate from the 9B card ("lightweight member ... single-GPU deployment"), so trust the commands and the config, not the card prose. Pin the exact repo, revision, and vLLM image, and validate on one node before production.

flowchart LR
  AGENT["Coding agent (OpenHands / OpenCode / Claude Code)"] --> API["vLLM OpenAI endpoint, TP=8"]
  API --> HYBRID["60 layers: linear attention, full attention every 4th"]
  HYBRID --> MOE["512-expert MoE, 10 routed + 1 shared per token"]
  MOE --> PARSERS["qwen3 reasoning parser + qwen3_xml tool parser"]
  PARSERS --> OUT["reasoning_content + tool_calls"]

What

Ornith-1.0 is DeepReinforce's first model release (June 2026): an MIT-licensed family of open-weight models post-trained for agentic coding on top of pretrained Gemma 4 and Qwen 3.5. The announced sizes are 9B dense, 31B dense, 35B MoE, and 397B MoE. As of 2026-07-03 the Hugging Face org publishes the 9B, 35B, and 397B checkpoints plus Ornith-1.0-35B-FP8, Ornith-1.0-397B-FP8, and GGUF conversions of the 9B and 35B; there is no 31B repo yet.

The headline is the training method, not the architecture. Standard agentic RL fixes a human-designed harness per task category and optimizes only the policy inside it. Ornith-1.0 treats the scaffold as a learnable object that co-evolves with the policy. Each RL step runs in two stages: conditioned on the task and the scaffold previously used for it, the model proposes a refined scaffold; conditioned on that scaffold and the task, it generates a solution rollout. Reward from the rollout propagates to both stages, so the model is optimized to author the orchestration that elicits better answers, not just the answers. Over training this becomes a mutate-and-select loop in which per-task-category strategies emerge without hand-engineered harness design. This is the trained-in version of the harness-search loop covered in automated harness optimization and self-improving harnesses.

flowchart LR
  TASK["Task + previous scaffold"] --> S1["Stage 1: model refines the scaffold"]
  S1 --> S2["Stage 2: solution rollout under the new scaffold"]
  S2 --> VER["Verifier reward"]
  VER --> MON["Deterministic monitor + frozen LLM judge veto"]
  MON --> UPD["Staleness-weighted token-level GRPO update to both stages"]
  UPD --> TASK

Letting the model author its own scaffold invites reward hacking (reading visible test files and hardcoding expected artifacts, or copying an oracle solution present in the environment), and DeepReinforce documents a three-layer defence: the outer trust boundary (environment, tool surface, test isolation) is immutable, so the model evolves only its inner policy scaffold (memory, error handling, orchestration); a deterministic monitor flags reads of withheld paths, edits to verification scripts, or off-surface actions, assigning those trajectories zero reward and excluding them from the advantage computation; and a frozen LLM judge acts as a veto on top of the verifier rather than as the primary reward. The same failure taxonomy and the boundary-vs-judge split appear in evaluation integrity and the reward-function sandboxing runbook.

The RL systems side is a pipeline-RL (asynchronous) setup: long rollouts continue while the trainer updates, and a staleness weight handles the resulting off-policy tokens. Tokens younger than a threshold K1 keep full weight, tokens between K1 and K2 decay exponentially, and tokens older than K2 are dropped; the weight multiplies a token-level GRPO-style clipped objective with asymmetric clipping bounds. This is the same staleness-control family described in async RL systems and GRPO variants.
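
A rough sketch of the staleness weight and the objective it multiplies, under stated assumptions: age is counted in trainer updates since the token was sampled, the boundaries are inclusive as written, and the decay rate and clipping bounds (`decay`, `eps_low`, `eps_high`) are illustrative values, not figures from the announcement.

```python
import math


def staleness_weight(age: int, k1: int, k2: int, decay: float = 0.5) -> float:
    """Weight for a token sampled `age` trainer updates ago (units assumed)."""
    if age <= k1:
        return 1.0  # fresh enough: full weight
    if age > k2:
        return 0.0  # too stale: dropped from the update
    return math.exp(-decay * (age - k1))  # exponential decay between K1 and K2


def clipped_token_objective(
    ratio: float,
    advantage: float,
    weight: float,
    eps_low: float = 0.2,
    eps_high: float = 0.3,
) -> float:
    """Per-token clipped surrogate with asymmetric bounds, scaled by staleness."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return weight * min(ratio * advantage, clipped * advantage)
```

With `k1=2` and `k2=6`, a token four updates old keeps a weight of `exp(-1)`, roughly 0.37, and a token seven updates old contributes nothing.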

Architecturally, the 397B config identifies as a Qwen 3.5 MoE derivative (qwen3_5_moe): 60 layers with hybrid attention (three linear-attention layers, then one full-attention layer, full_attention_interval: 4), GQA with 32 query heads and 2 KV heads at head_dim 256, partial RoPE (factor 0.25, theta 1e7), 512 routed experts with 10 active per token plus one shared expert, hidden size 4096, vocabulary 248,320, and 262,144-token context. DeepReinforce does not publish an active-parameter count for the 397B. The config also carries the base's vision encoder and a single MTP (multi-token prediction) layer; the card tags the release text-generation, all published evals are coding tasks, and no speculative-decoding recipe is documented, so treat both as unvalidated for this checkpoint. It is a reasoning model: the assistant turn opens with a <think> ... </think> block, which the qwen3 reasoning parser moves into reasoning_content.
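
The layer pattern is easy to enumerate. The sketch below assumes the full-attention layer is the last of each group of four, as the description above implies; the type labels and the `layer_types` helper are illustrative names, not read from the checkpoint's config.

```python
def layer_types(num_layers: int = 60, full_attention_interval: int = 4) -> list[str]:
    """Three linear-attention layers, then one full-attention layer, repeated."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layers = layer_types()
full = [i for i, kind in enumerate(layers) if kind == "full_attention"]

print(layers[:4])           # the first group of four
print(len(full), full[:3])  # how many layers hold KV cache, and where they start

routed_experts, active_routed = 512, 10
print(f"{active_routed / routed_experts:.2%} of routed experts active per token")
```

Only the layers listed in `full` hold a KV cache, which is where the 15-layer figure in the sizing arithmetic comes from.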

Vendor-reported benchmarks for the flagship (higher is better):

| Benchmark | Ornith-1.0-397B | Qwen3.5-397B (base) | GLM-5.2-744B | Claude Opus 4.7 | Claude Opus 4.8 |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 53.5 | 81.0 | 70.3 | 85 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 48.6 | 82.7 | 69.7 | 78.9 |
| SWE-bench Verified | 82.4 | 76.4 | - | 80.8 | 87.6 |
| SWE-bench Pro | 62.2 | 51.6 | 62.1 | 64.3 | 69.2 |
| SWE-bench Multilingual | 78.9 | 69.3 | - | - | - |
| NL2Repo | 48.2 | 36.8 | 48.9 | - | 69.7 |
| ClawEval average | 77.1 | 70.7 | - | 78.2 | - |

Read the table against the harness footnotes: Terminal-Bench 2.1 ran under two harnesses (Harbor/Terminus-2, and Claude Code 2.1.126), temperature 1.0, averaged over 5 runs, with the Qwen chat template modified for train/inference consistency and Harbor patched to consume vLLM's reasoning_content; SWE-bench used the OpenHands harness at 256K context. The cleanest signal is the delta over its own base at the same scale: +24 points on Terminal-Bench 2.1 and +6 on SWE-bench Verified over Qwen3.5-397B. Down the family, Ornith-1.0-35B reports 64.2 on Terminal-Bench 2.1 (above the 397B base model's 53.5) and 75.6 on SWE-bench Verified; the 9B reports 43.1 and 69.4. Two source caveats: the announcement's prose and its tables disagree in places (64.4 vs 64.2 for the 35B on TB 2.1; 66.0/67.9 quoted for MiniMax-M3 and DeepSeek-V4-Pro against 64 in the tables), and this page uses the tables; all numbers are vendor-reported and unreplicated.
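
The deltas quoted above can be recomputed from the table. The snippet copies four rows of the vendor-reported scores; the dictionary layout and the `delta` helper are illustrative.

```python
# Vendor-reported scores copied from the table (unreplicated).
SCORES = {
    "Terminal-Bench 2.1 (Terminus-2)": {"ornith": 77.5, "base": 53.5, "opus_4_7": 70.3},
    "Terminal-Bench 2.1 (Claude Code)": {"ornith": 78.2, "base": 48.6, "opus_4_7": 69.7},
    "SWE-bench Verified": {"ornith": 82.4, "base": 76.4, "opus_4_7": 80.8},
    "SWE-bench Pro": {"ornith": 62.2, "base": 51.6, "opus_4_7": 64.3},
}


def delta(benchmark: str, other: str) -> float:
    """Ornith-1.0-397B minus `other` on one benchmark, in points."""
    row = SCORES[benchmark]
    return round(row["ornith"] - row[other], 1)


for name in SCORES:
    print(f"{name}: {delta(name, 'base'):+.1f} vs base, {delta(name, 'opus_4_7'):+.1f} vs Opus 4.7")
```

Note that the gain over the base depends on the harness: the Terminal-Bench delta is larger under Claude Code than under Terminus-2, and the sign against Claude Opus 4.7 flips on SWE-bench Pro.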

Why

Use Ornith-1.0-397B when the product is a coding agent and you want the strongest open-weight coder that fits a single 8-GPU node in FP8. The MIT license has no use restrictions, the model ships with an OpenAI-compatible tool-calling and reasoning path that standard harnesses consume directly, and the vendor-reported scores put it ahead of its own Qwen 3.5 base on every listed benchmark and ahead of Claude Opus 4.7 on Terminal-Bench 2.1 and SWE-bench Verified (Opus 4.7 still leads on SWE-bench Pro, 64.3 vs 62.2, and on the ClawEval average, 78.2 vs 77.1). The hybrid attention stack keeps the KV cache small (only every fourth layer holds KV: about 30 KiB per token, roughly 7.5 GiB for a full 256K sequence), so long agent sessions are cheap on memory compared to a full-attention model of this scale. The family spans 9B (edge, GGUF available) to 397B on the same post-training recipe, which makes a capability ladder practical: prototype a harness against the 9B or 35B locally, then promote to the 397B for production.

When

Use it when:

  • Agentic coding, terminal tasks, or repository-scale editing dominate and you want an open-weight, MIT-licensed endpoint.
  • You can serve FP8 on an 8x80GB node (or BF16 on 8x H200-class GPUs) and run vLLM 0.19.1+ / SGLang 0.5.9+.
  • Your harness consumes reasoning_content and structured tool_calls (OpenHands, OpenCode, Claude Code-style CLIs).

Avoid it when:

  • You expect the "writes its own harness" property at inference time. Self-scaffolding is a training-time method; the released checkpoints are ordinary chat models and the production harness is still yours to choose, pin, and version.
  • You need reproduced, third-party benchmark numbers; everything above is vendor-reported and recent.
  • The platform is stuck on older runtimes (qwen3_5_moe needs Transformers 5.8.1+ / vLLM 0.19.1+) or you cannot allocate 8 GPUs; use the 35B (FP8/GGUF) or 9B variants instead.
  • The workload is general chat or multimodal; the vision tower is untested in this release and the post-training is coding-specific.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | deepreinforce-ai/Ornith-1.0-397B (BF16) or deepreinforce-ai/Ornith-1.0-397B-FP8 |
| Params | 397B MoE; 10 of 512 experts + 1 shared per token; active count not published |
| Architecture | qwen3_5_moe: 60 layers, full attention every 4th layer, GQA 2 KV heads x head_dim 256, 1 MTP layer |
| Context | 262,144 tokens |
| KV cache | ~30 KiB/token BF16 (15 full-attention layers only); ~7.5 GiB per 256K sequence |
| Precision | BF16 or FP8 (E4M3) |
| License | MIT |
| Reasoning | <think> block on by default; --reasoning-parser qwen3 |
| Tool calls | --tool-call-parser qwen3_xml (vLLM) / qwen3_coder (SGLang) |
| Sampling | card examples: temperature 0.6, top_p 0.95, top_k 20 |
| Starting hardware | FP8: single 8x80GB node (TP=8); BF16: 8x H200 (141GB) or a 16-GPU 80GB shape |
| Runtimes | Transformers >= 5.8.1, vLLM >= 0.19.1, SGLang >= 0.5.9 |
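
A pre-flight gate on the runtime minimums can catch an unsupported qwen3_5_moe stack before any weights are downloaded. This is a sketch: the distribution names in `MINIMUMS` are assumed to match the installed packages, and the parser compares only the leading X.Y.Z integers, so a pre-release such as an rc build is treated as equal to its final release.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimums from the runtimes row above; only one of vllm / sglang is needed.
MINIMUMS = {"transformers": "5.8.1", "vllm": "0.19.1", "sglang": "0.5.9"}


def release_tuple(ver: str) -> tuple[int, ...]:
    """Crude parse: keep the leading X.Y.Z integers, ignore rc/dev/local suffixes."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", ver)
    if match is None:
        raise ValueError(f"unparseable version: {ver!r}")
    return tuple(int(group or 0) for group in match.groups())


def meets_minimum(installed: str, minimum: str) -> bool:
    return release_tuple(installed) >= release_tuple(minimum)


def check_runtimes(minimums: dict[str, str] = MINIMUMS) -> dict[str, str]:
    """Report each runtime as 'not installed', '<ver> (ok)', or '<ver> (needs >= X)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else "needs >= " + minimum
        report[name] = f"{installed} ({status})"
    return report
```

Calling `check_runtimes()` inside the serving image and failing the build on any "needs >=" entry keeps the version pin and the model pin in the same place.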

First-order fit check (executed; ignores activation and CUDA-graph overhead, so treat the margins as the minimum):

GIB = 1024**3

params = 397e9
bf16, fp8 = params * 2, params * 1

# 60 layers, full attention every 4th; only those layers hold KV cache.
# Per token: 2 KV heads x 256 head_dim x (K+V) x 2 bytes (BF16).
kv_layers = 60 // 4
kv_per_token = kv_layers * 2 * 256 * 2 * 2
kv_seq = kv_per_token * 262_144

node_8x80 = 8 * 80 * GIB * 0.90      # the card's quickstart shape
node_8xh200 = 8 * 141 * GIB * 0.90

assert kv_layers == 15 and kv_per_token == 30_720
assert bf16 > node_8x80                # BF16 does not fit 8x80GB
assert bf16 + kv_seq < node_8xh200     # BF16 fits 8x H200
assert fp8 + kv_seq < node_8x80        # FP8 fits 8x80GB, full 256K sequence included

print(f"BF16 {bf16/GIB:.0f} GiB vs 8x80GB budget {node_8x80/GIB:.0f} GiB")
print(f"FP8 {fp8/GIB:.0f} GiB, KV per 256K sequence {kv_seq/GIB:.1f} GiB")

Output: BF16 739 GiB vs 8x80GB budget 576 GiB and FP8 370 GiB, KV per 256K sequence 7.5 GiB. So on 8x80GB deploy the FP8 checkpoint; the BF16 repo wants H200-class HBM or 16 GPUs.

2. Bare-metal vLLM server (FP8, 8 GPUs)

The card's launch command, pointed at the FP8 checkpoint:

vllm serve deepreinforce-ai/Ornith-1.0-397B-FP8 \
    --served-model-name Ornith-1.0-397B \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --trust-remote-code

Keep --enable-prefix-caching: agent turns re-send a growing prefix, and prefix caching is what keeps TTFT flat over a long session. If your vLLM build refuses automatic prefix caching for hybrid (linear-attention) models, drop the flag and re-measure TTFT before accepting the regression. Cap --max-model-len to the route's real need; 262144 is the ceiling, not a default to expose.

SGLang equivalent (note the different tool-parser name):

python -m sglang.launch_server \
    --model-path deepreinforce-ai/Ornith-1.0-397B-FP8 \
    --served-model-name Ornith-1.0-397B \
    --tp 8 \
    --host 0.0.0.0 --port 8000 \
    --context-length 262144 \
    --mem-fraction-static 0.85 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

3. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-ornith-1-0
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-ornith-1-0 }
  template:
    metadata:
      labels: { app: vllm-ornith-1-0 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-0.19.1+-tag>
          args:
            - --model=deepreinforce-ai/Ornith-1.0-397B-FP8
            - --served-model-name=Ornith-1.0-397B
            - --tensor-parallel-size=8
            - --max-model-len=262144
            - --gpu-memory-utilization=0.90
            - --enable-prefix-caching
            - --enable-auto-tool-choice
            - --tool-call-parser=qwen3_xml
            - --reasoning-parser=qwen3
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 240
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-ornith-1-0
  namespace: serving
spec:
  selector: { app: vllm-ornith-1-0 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

4. Reasoning and tool-call smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Ornith-1.0-397B",
    "messages": [{"role": "user", "content": "Use the run_shell tool to count the Python files under src/, then report the number."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
          "type": "object",
          "properties": {"command": {"type": "string"}},
          "required": ["command"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 2048
  }'

Pass criteria:

  • message.reasoning_content carries the <think> trace and message.content holds only the answer; raw <think> tags leaking into content mean the qwen3 reasoning parser is missing.
  • The response contains a structured tool_calls entry, not prose describing a tool call.
  • The served chat template is the repo's chat_template.jinja (DeepReinforce modified the stock Qwen template for train/inference consistency; the benchmark numbers were measured with it).
  • Sampling follows the card examples (temperature 0.6, top_p 0.95, top_k 20) unless you re-tune against your own harness eval.
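
The first two criteria can be checked mechanically against the parsed JSON response. A minimal sketch: `check_smoke_response` is a hypothetical helper, it assumes the standard chat-completions response shape with `reasoning_content` on the message as described above, and it does not cover the chat-template or sampling criteria.

```python
import json


def check_smoke_response(response: dict, expected_tool: str = "run_shell") -> list[str]:
    """Return a list of failed pass criteria; an empty list means the test passed."""
    failures = []
    message = response["choices"][0]["message"]

    # content may be null when the turn is a pure tool call.
    content = message.get("content") or ""
    if "<think>" in content or "</think>" in content:
        failures.append("raw <think> tags leaked into content (reasoning parser missing?)")
    if not message.get("reasoning_content"):
        failures.append("reasoning_content is empty")

    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        failures.append("no structured tool_calls entry")
    for call in tool_calls:
        function = call.get("function", {})
        if function.get("name") != expected_tool:
            failures.append(f"unexpected tool name: {function.get('name')!r}")
        try:
            arguments = json.loads(function.get("arguments", ""))
        except (TypeError, json.JSONDecodeError):
            failures.append("tool arguments are not valid JSON")
            continue
        if not isinstance(arguments, dict) or "command" not in arguments:
            failures.append("tool arguments lack the required 'command' field")
    return failures
```

Save the curl output to a file, load it with `json.load`, and pass the result to the checker; any non-empty list points at the parser flag or client setting to inspect first.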

5. Plug in a coding agent

The endpoint is OpenAI-compatible, so the card's harness integrations are environment variables away, for example OpenHands:

pip install openhands-ai

export LLM_MODEL="openai/Ornith-1.0-397B"   # must match --served-model-name, not the HF repo id
export LLM_BASE_URL="http://<service-url>:8000/v1"
export LLM_API_KEY="EMPTY"

openhands

OpenCode registers the endpoint as an @ai-sdk/openai-compatible provider in ~/.config/opencode/opencode.json; OpenClaw and Hermes read OPENAI_BASE_URL/OPENAI_API_KEY. Pin the harness version alongside the model revision: the vendor's own table shows scores moving between harnesses on the same checkpoint (77.5 under Terminus-2 vs 78.2 under Claude Code for Ornith-1.0-397B, and 53.5 vs 48.6 for the Qwen3.5-397B base), so a silent harness upgrade is a silent eval change (local coding agents, agent evaluation).
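
A mismatch between LLM_MODEL and --served-model-name is an easy mistake to make, and it can be checked against the endpoint's model list before the agent starts. A sketch under assumptions: the helper names are hypothetical, the server is taken to expose the OpenAI-style `/v1/models` listing, and the `openai/` prefix is treated as a client-side routing prefix that is stripped before comparison.

```python
import json
from urllib.request import urlopen


def fetch_models(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/models, e.g. base_url='http://<service-url>:8000/v1'."""
    with urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
        return json.load(resp)


def served_ids(payload: dict) -> list[str]:
    """Model ids from an OpenAI-style model-list payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def llm_model_matches(llm_model: str, payload: dict) -> bool:
    """True if LLM_MODEL, minus a leading 'openai/' prefix, is a served model id."""
    name = llm_model.split("/", 1)[1] if llm_model.startswith("openai/") else llm_model
    return name in served_ids(payload)
```

For example, `llm_model_matches(os.environ["LLM_MODEL"], fetch_models(os.environ["LLM_BASE_URL"]))` (after `import os`) returns False when LLM_MODEL still carries the Hugging Face repo id instead of the served name.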

Failure modes

  • BF16 checkpoint on an 8x80GB node: OOM at load time; the weights alone are ~740 GiB against a 576 GiB budget. Use Ornith-1.0-397B-FP8 on 8x80GB, or move BF16 to 8x H200 / 16 GPUs.
  • qwen3_5_moe architecture unsupported: vLLM older than 0.19.1, SGLang older than 0.5.9, or Transformers older than 5.8.1. Upgrade; there is no fallback path.
  • Tool calls emitted as text: missing --enable-auto-tool-choice or the wrong parser name; it is qwen3_xml on vLLM but qwen3_coder on SGLang.
  • reasoning_content empty or <think> in the answer: missing --reasoning-parser qwen3, or a client that discards the extra field. Harbor needed a patch to read vLLM's reasoning_content; check your harness does too.
  • Benchmark numbers do not reproduce: chat-template drift. Scores were measured with the repo's modified chat_template.jinja, specific harness versions (Claude Code 2.1.126, patched Harbor), temperature 1.0, and 5-run averages. A stock Qwen template or a different harness build measures a different system.
  • Weak multi-step behaviour at greedy decoding: the card's examples sample at temperature 0.6, top_p 0.95, top_k 20, and the vendor's Terminal-Bench runs used temperature 1.0. Do not serve greedy by default.
  • Concurrency bound earlier than KV math suggests: KV is tiny here (7.5 GiB per 256K sequence), so activation memory and --max-num-seqs become the practical limits; profile before raising concurrency, and see KV cache management for the gauge to watch.
  • Expecting self-scaffolding in production: the released weights do not rewrite your harness at inference. If you want scaffold search on your own tasks, that is a training/eval-loop activity (automated harness optimization) with its own reward-hacking controls (evaluation integrity).
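
For the concurrency bullet, the KV-only ceiling is quick to compute, which makes it clear why it is not the binding limit. This sketch reuses the constants from the fit check; the `kv_resident_sequences` helper is illustrative, and it ignores activation memory, CUDA-graph overhead, per-sequence state in the linear-attention layers, and fragmentation, so real limits will be lower.

```python
GIB = 1024**3

# 15 full-attention layers x 2 KV heads x 256 head_dim x (K+V) x 2 bytes (BF16).
KV_BYTES_PER_TOKEN = 15 * 2 * 256 * 2 * 2


def kv_resident_sequences(
    max_model_len: int,
    weight_bytes: float = 397e9,  # FP8 checkpoint, one byte per parameter
    gpus: int = 8,
    gpu_gib: int = 80,
    utilization: float = 0.90,
) -> int:
    """Upper bound on full-length sequences whose KV fits beside the weights."""
    budget = gpus * gpu_gib * GIB * utilization
    free = budget - weight_bytes
    return int(free // (KV_BYTES_PER_TOKEN * max_model_len))


for length in (262_144, 32_768):
    print(f"max_model_len={length}: at most {kv_resident_sequences(length)} sequences by KV alone")
```

On the FP8 8x80GB shape the KV ceiling is 27 full 256K sequences and 220 at a 32K cap; if measured concurrency tops out well below that, the constraint is elsewhere and --max-num-seqs or activation memory is the place to look.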

References

  • Ornith-1.0 announcement (method, staleness-weighted GRPO, full benchmark tables): https://deep-reinforce.com/ornith_1_0.html
  • Ornith-1.0-397B model card: https://huggingface.co/deepreinforce-ai/Ornith-1.0-397B
  • Ornith-1.0-397B-FP8: https://huggingface.co/deepreinforce-ai/Ornith-1.0-397B-FP8
  • DeepReinforce Hugging Face org (all published variants): https://huggingface.co/deepreinforce-ai
  • Modified chat template used for training and evals: https://huggingface.co/deepreinforce-ai/Ornith-1.0-397B/blob/main/chat_template.jinja
  • Simon Willison's notes (35B GGUF hands-on): https://simonwillison.net/2026/Jun/29/ornith/
  • Release walkthrough video ("Ornith 1.0", Prompt Engineering): https://www.youtube.com/watch?v=25j4kMGhKGw
  • vLLM tool calling and reasoning parsers: https://docs.vllm.ai/en/latest/features/tool_calling/

Related: Open-weight serving · Inference serving · Local coding agents · Self-improving harnesses · Automated harness optimization · GRPO variants · Async RL systems · Evaluation integrity · KV cache management · Glossary