Running local coding agents¶
Scope: standing up a coding agent that runs entirely on your own hardware with open-weight models, as an alternative to hosted services. This page covers what a local coding agent is (a local model server, an OpenAI-compatible endpoint, and a coding harness pointed at it), why and when to run one, how to size a model to your memory and wire a harness to it, how to reach a bigger box over SSH, and how to audit tool access before you grant it. It builds on open-weight serving, the consumer-GPU cookbook, DGX Spark, and the agent harness.
Serving commands and harness configs are reference templates on real tools (Ollama, vLLM, Qwen-Code, Codex CLI); pin versions and validate on your machine. The Python example is executed and asserted (stdlib only); it validates the memory-fit math, not any specific model or box.
flowchart LR
DEV["Editor / terminal (laptop)"] -->|"SSH tunnel (optional)"| HARNESS["Coding harness: Qwen-Code / Codex / Claude Code"]
HARNESS -->|"OpenAI-compatible API"| SERVER["Local model server: Ollama or vLLM"]
SERVER --> MODEL["Open-weight coder (fits local memory at chosen quant)"]
HARNESS -->|"shell + file tools (audit and sandbox first)"| REPO["Your repository"]
What it is¶
A local coding agent is three pieces you own end to end: an open-weight model (a code-tuned LLM), a local server that runs it and exposes an OpenAI-compatible /v1/chat/completions endpoint (Ollama for the simplest path, vLLM for throughput), and a coding harness (Qwen-Code, OpenAI's Codex CLI, or Claude Code pointed at a custom base URL) that turns the model into an agent by giving it shell and file-editing tools inside your repository. The harness is the same class of program covered in agent harness architecture; the only change from a hosted setup is that its base URL points at your local server instead of a vendor API. The agent can run on the same machine as your editor or on a bigger box you reach over an SSH tunnel.
Why use it¶
- No data egress. Your code and prompts never leave your machine, which is the deciding factor for proprietary or regulated codebases.
- No per-token cost. Once the hardware is paid for, inference is free, so high-volume or long-running agent loops do not accrue an API bill.
- Offline and under your control. No rate limits, no vendor deprecation of a model you depend on, and you can pin an exact model and harness version.
- Open-weight coders are now capable enough. Code-tuned open models (for example the Qwen3-Coder line) reach a level where, wrapped in a good harness, they handle real edit-run-fix loops rather than just autocomplete.
When to use it (and when not)¶
- Use it when privacy, cost, or offline operation matter and a capable open coder fits your hardware, or when you want a reproducible, pinned agent that a vendor cannot change under you.
- Prefer a hosted frontier model when the task needs capability the local model does not have; below a coding-capability floor no harness rescues it (harness architecture).
- Do not under-provision memory. A model that does not fit at a usable precision will swap or fail; size it first (below).
- Do not grant tool access blindly. A local agent with shell and file tools can delete or exfiltrate; audit and sandbox before you let it run commands (below).
Architecture¶
The request path is editor to harness to local server to model, with the harness also holding the tool channel into your repository. Two deployment shapes matter: on-box, where the model server and the harness run on the same machine (a workstation or a unified-memory Mac); and remote, where the model runs on a bigger box and your laptop reaches it over an SSH tunnel (ssh -L 11434:localhost:11434 user@box) so the harness still talks to localhost. The load-bearing constraint in both is memory: the model's weights plus its KV cache must fit in the device's RAM or VRAM at the precision you choose.
How to use it¶
Start by checking the model fits. Weights memory is params × bits/8, and for a mixture-of-experts model that is the total parameter count (all experts are resident), not the active count. This runnable check is executed and asserted, including the adversarial MoE case that trips people up:
# local_fit.py — validated: will an open-weight coder fit in local memory at a given precision? stdlib only.
def weights_gb(params_b, bits):
return params_b * 1e9 * (bits / 8) / 1e9 # bytes = params * bits/8
def fits(params_b, bits, device_gb, overhead=0.35): # leave headroom for KV cache, runtime, and OS
return weights_gb(params_b, bits) < device_gb * (1 - overhead)
assert round(weights_gb(30, 16), 1) == 60.0 # a 30B model in BF16 is ~60 GB
assert round(weights_gb(30, 4), 1) == 15.0 # the same model 4-bit quantized is ~15 GB
assert fits(30, 4, 32) and not fits(30, 16, 32) # 4-bit fits a 32 GB box; BF16 does not
assert not fits(30, 4, 16) # too big for a 16 GB laptop even at 4-bit
# adversarial: MoE MEMORY is set by TOTAL params, not active params (the common sizing mistake).
assert weights_gb(235, 4) > 100 and not fits(235, 4, 64) # a 235B MoE 4-bit is ~118 GB regardless of active count
print(f"30B@4bit/32GB={fits(30,4,32)} 30B@bf16/32GB={fits(30,16,32)} 235B@4bit/64GB={fits(235,4,64)}")
Once a model fits, serve it and point a harness at it (reference templates; pin versions):
# Ollama: pull a code model and expose the OpenAI-compatible endpoint on localhost:11434
ollama pull qwen3-coder # any open coder that fits your memory budget
ollama serve # serves /v1/chat/completions
# Point a coding harness at the local endpoint (OpenAI-compatible base URL + any placeholder key)
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=local
qwen-code # or: codex (OpenAI Codex CLI). Claude Code speaks the Anthropic Messages API,
# so it needs ANTHROPIC_BASE_URL (Ollama's Anthropic-compatible endpoint) or a proxy, not OPENAI_*
How to develop with it¶
The core tradeoff is capability against memory, mediated by quantization. Pick the largest code model that fits at a precision that does not wreck code quality: 4-bit is the usual sweet spot for fitting a 30B-class coder on a 32 GB machine, but aggressive quantization degrades exactly the long, exact token sequences code needs, so validate on your own tasks rather than a chat benchmark. Prefer a code-tuned model over a general one at the same size. Choose the harness for the tools and ergonomics you want (Qwen-Code and Codex CLI speak the OpenAI-compatible API; Claude Code speaks the Anthropic Messages API, so point it at Ollama's Anthropic-compatible endpoint or a translation proxy), and remember the harness drives more of the agent's reliability than a marginal model-size bump once you are past the capability floor. For throughput or multiple concurrent sessions, move from Ollama to vLLM with the same OpenAI endpoint.
How to maintain it¶
Pin the model tag and the harness version together and re-run a small coding eval after any bump, because a quantization or harness change can silently regress the edit-run-fix loop (eval harness). Watch context length on smaller models: a local coder with a modest window fills up on a large repo, so lean on the harness's context management (compaction, file-backed state) rather than pasting whole files. On a unified-memory box, monitor that the model plus KV cache plus your other applications stay within RAM, since spilling to swap collapses tokens/sec. Keep the served model on the fast local fabric; if you run remote, keep the SSH tunnel and the box's own resource limits in view.
How to run it in production¶
For a single developer, "production" is a reliable daily driver on real hardware. Unified-memory machines (an M-series Mac, or an DGX Spark-class desktop) are attractive because CPU and GPU share one memory pool, so a 4-bit 30B-class coder that would not fit a small discrete GPU runs comfortably. For a bigger model, run it on a workstation or server and reach it over an SSH tunnel so your laptop stays thin. The non-negotiable production step is a security audit before granting tool access: a coding agent with shell and file-write tools can delete files, leak secrets, or be steered by a prompt-injection payload in a repo it reads, so run it against a sandboxed working copy with constrained egress and least-privilege file scope before pointing it at anything real (sandboxing and isolation, prompt-injection defense). Treat the local endpoint as untrusted-adjacent: bind it to localhost, not a public interface.
Failure modes¶
- MoE sized by active params. Sizing a mixture-of-experts model by its active count under-provisions memory by an order of magnitude; memory is set by total params (the adversarial check above).
- Over-aggressive quantization. Pushing a coder to very low bit-widths degrades the exact long token sequences code needs; validate code quality, not chat scores, at your chosen precision.
- Ungated tool access. Granting shell and file-write tools without a sandbox invites data loss or exfiltration, and makes prompt injection in a read repo dangerous; audit and sandbox first.
- Swapping to disk. A model plus KV cache that exceeds RAM spills to swap and throughput collapses; leave headroom.
- Small-context overflow. A modest-window local model loses the thread on a large repo; use harness compaction and file-backed state.
- Exposed endpoint. Binding the local server to a public interface turns it into an open, unauthenticated inference service; keep it on localhost or behind the SSH tunnel.
References¶
- Raschka, "Using Local Coding Agents": https://magazine.sebastianraschka.com/p/using-local-coding-agents
- Ollama (local model server with an OpenAI-compatible API): https://ollama.com
- Qwen-Code (open coding harness): https://github.com/QwenLM/qwen-code
- OpenAI Codex CLI: https://github.com/openai/codex
- vLLM OpenAI-compatible server: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
Related: Serving open-weight models · vLLM on consumer GPUs · DGX Spark · Agent harness architecture · Self-improving harnesses · Agent sandboxing and isolation · Prompt-injection defense · Quantization for inference · Glossary