What Is a Coding Agent?
A coding agent is an AI system that does not just answer questions about code — it takes actions on code. It reads files, writes new ones, edits existing content, runs shell commands, searches a codebase, installs packages, runs tests, and iterates until the task is done. You give it a goal; it figures out the steps.
Commercial coding agents like GitHub Copilot Workspace, Cursor and Devin run on external servers, send your code to third-party APIs, and charge per use. This agent runs entirely on your own hardware, using a locally-served language model, with no data leaving your infrastructure.
The Architecture
The Tool Loop
The agent operates as a loop. At each step, the language model receives the full conversation history — the user’s task, all previous tool calls, and all tool results — and produces a response. If the response contains a tool call, the tool is executed, the result is appended to the history, and the loop continues. If the response contains no tool call, the agent is done and returns its final answer.
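In code, the loop is only a few lines. A minimal sketch, where call_model, parse_tool_call and run_tool are hypothetical helpers standing in for the real implementation:

# Sketch of the agent loop; call_model, parse_tool_call and run_tool are hypothetical helpers.
MAX_ITERATIONS = 20

def run_agent(messages: list[dict]) -> str:
    for _ in range(MAX_ITERATIONS):
        response = call_model(messages)           # full history in, one response out
        messages.append({"role": "assistant", "content": response})
        tool_call = parse_tool_call(response)     # None if no <tool_call> block present
        if tool_call is None:
            return response                       # no tool call: this is the final answer
        name, args = tool_call
        result = run_tool(name, args)             # execute the tool, capture the result
        messages.append({
            "role": "user",
            "content": f"<tool_result>{result}</tool_result>",
        })
    return "Stopped: iteration limit reached."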
Tool calls use a simple XML format that the model is prompted to follow precisely:
<tool_call>
<name>bash</name>
<input>{"command": "pytest tests/"}</input>
</tool_call>
This format is unambiguous to parse, robust to surrounding prose, and works reliably with 7B-class models that may not support native function calling. The parser extracts the tool name and JSON arguments, executes the tool, and feeds the result back to the model as a user message wrapped in a tool_result tag.
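A sketch of that parsing step, assuming the format above; the regex is illustrative rather than the exact one used:

import json
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<name>(.*?)</name>\s*<input>(.*?)</input>\s*</tool_call>",
    re.DOTALL,
)

def parse_tool_call(text: str):
    """Extract the first tool call from a model response, or return None."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    name = match.group(1).strip()
    args = json.loads(match.group(2))   # e.g. {"command": "pytest tests/"}
    return name, args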
The loop runs for up to 20 iterations. In practice, most tasks complete in 3 to 8 steps.
Available Tools
Six tools cover the full surface area of coding work:
- bash: Execute any shell command in the workspace directory. Install packages, run tests, check processes, compile code, call CLIs.
- read: Read a file’s full contents. The model is instructed to always read before editing — a constraint that prevents blind overwrites.
- write: Create or overwrite a file with new content. Creates parent directories automatically.
- edit: Replace one specific occurrence of a string in a file. More precise than write for targeted changes — the model must provide the exact string to find, which forces it to read first (sketched after this list).
- glob: Find files matching a pattern across the workspace. Used to explore project structure and locate relevant files before reading them.
- grep: Search file contents by regex pattern. Finds function definitions, variable references, import statements — anything that requires searching across files rather than reading one at a time.
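To make the edit tool concrete, here is a minimal sketch; rejecting non-unique target strings is an assumption about the implementation, but it matches the precision the tool is meant to enforce:

# Sketch of the edit tool; signature and error messages are illustrative.
from pathlib import Path

def edit(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` in the file at `path`."""
    file = Path("/workspace") / path
    text = file.read_text()
    count = text.count(old)
    if count == 0:
        return f"Error: string not found in {path}. Read the file first."
    if count > 1:
        return f"Error: string occurs {count} times in {path}; provide a unique string."
    file.write_text(text.replace(old, new, 1))
    return f"Edited {path}: replaced 1 occurrence."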
The Workspace
The agent runs in Docker with the target project directory mounted as /workspace. All file operations are scoped to this directory — the agent cannot read or write outside it. This containment is both a safety measure and a clarity constraint: the agent always knows where it is working.
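One plausible way to enforce that scoping inside the file tools (a sketch, not the verbatim code):

from pathlib import Path

WORKSPACE = Path("/workspace")

def resolve_in_workspace(path: str) -> Path:
    """Resolve a path and refuse anything that escapes /workspace."""
    target = (WORKSPACE / path).resolve()
    if not target.is_relative_to(WORKSPACE):   # blocks ../ traversal and absolute paths
        raise ValueError(f"Path escapes the workspace: {path}")
    return target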
The bash tool runs commands with the workspace as the current directory and a 60-second timeout per command. Standard output and standard error are both captured and returned to the model.
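That behaviour maps directly onto the standard library. A sketch assuming subprocess with shell=True:

import subprocess

def bash(command: str) -> str:
    """Run a shell command in the workspace; capture stdout and stderr."""
    try:
        proc = subprocess.run(
            command, shell=True, cwd="/workspace",
            capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return "Error: command timed out after 60 seconds."
    output = proc.stdout + proc.stderr
    return output or f"(no output, exit code {proc.returncode})"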
The Language Model
The brain of the agent is a 7B language model served locally on the Vasari GPU server. The system prompt defines the agent’s role, the tool call format, and a set of operational rules: always read before editing, diagnose errors rather than retrying the same failing call, confirm completion with a summary.
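The full prompt is not reproduced here, but a condensed, hypothetical excerpt of those rules might read:

SYSTEM_PROMPT = """You are a coding agent working in /workspace.
To act, emit exactly one <tool_call> block in the format described below.
Rules:
- Always read a file before editing it.
- If a command fails, diagnose the error; never repeat the same failing call.
- When the task is complete, reply with no tool call and summarise what you did.
"""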
Using a 7B model rather than a frontier model (GPT-4, Claude, Gemini) makes the agent fast, cheap and private. A local 7B model produces a response in a few seconds. It has no per-token cost. And it never sends code outside the network. The trade-off is capability on complex reasoning tasks — but for the practical coding work of reading, editing, running and iterating, a well-prompted 7B model performs reliably.
The API and Streaming Interface
The agent is exposed as a FastAPI HTTP service. The primary endpoint is POST /chat, which accepts a message and an optional session ID. The response is a Server-Sent Events stream — each event in the agent loop is pushed to the client as it happens:
- thinking: The model is generating a response (iteration number)
- tool_call: A tool is about to be executed (tool name and argument preview)
- tool_result: The tool returned a result (tool name and result preview)
- response: The agent has finished and is returning its final answer
- error: Something went wrong
This streaming architecture means the user interface updates in real time as the agent works — you can watch it read files, run commands and iterate, rather than waiting for a single response to appear after a long pause.
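A minimal sketch of the endpoint, assuming FastAPI's StreamingResponse and a hypothetical run_agent_events async generator that yields the events listed above:

import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    session_id: str | None = None

@app.post("/chat")
async def chat(req: ChatRequest):
    async def event_stream():
        # run_agent_events is a hypothetical async generator yielding
        # dicts like {"type": "tool_call", "tool": "bash", "preview": "..."}.
        async for event in run_agent_events(req.message, req.session_id):
            yield f"data: {json.dumps(event)}\n\n"   # SSE framing
    return StreamingResponse(event_stream(), media_type="text/event-stream")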
Session Memory
Sessions are maintained in memory by session ID. Within a session, the agent remembers the full conversation history: every user message, every tool call, every result. This allows multi-turn work: you can say "build this", then "check if the tests pass", then "fix the failing test", and the agent carries full context across all three turns. Session history can be cleared via the DELETE /session endpoint to start fresh.
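In code, this can be as small as a dictionary. A sketch continuing the FastAPI app above; the exact route shape for the DELETE endpoint is an assumption:

sessions: dict[str, list[dict]] = {}    # session_id -> full message history

def get_history(session_id: str) -> list[dict]:
    return sessions.setdefault(session_id, [])

@app.delete("/session/{session_id}")    # route shape is assumed
async def clear_session(session_id: str):
    sessions.pop(session_id, None)      # next message starts a fresh history
    return {"cleared": session_id}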
The Browser UI
A minimal dark-themed chat interface is served at the root URL. The UI renders each event type distinctly — tool calls in amber, tool results in green, errors in red, thinking steps in blue — giving a clear real-time view of what the agent is doing at each step. The session badge shows the active session ID. A new session button clears history and starts fresh.
The interface is functional rather than polished: its purpose is to make the agent usable immediately without requiring any client-side setup. API consumers can also connect directly to the SSE stream and build their own interfaces on top.
Why Self-Hosted Matters for a Coding Agent
A coding agent that runs on external infrastructure has two significant problems for professional use. First, it sends your code — potentially proprietary, potentially containing credentials or sensitive business logic — to a third-party server. Second, it incurs costs that scale with usage: the more you use it, the more you pay, which discourages free, exploratory use.
A self-hosted agent running a local model has neither problem. Code stays on your network. There is no marginal cost per query. The model can be swapped as better open-weight models are released. And the system prompt, tools and loop logic are entirely under your control — you can extend the tool set, adjust the constraints, or specialise the agent for a specific codebase or workflow.
This is the same principle behind our PRAG private RAG system and our self-hosted image generation: tools that handle sensitive or high-volume work belong on infrastructure you control.
Extending the Agent
The tool set is intentionally minimal. Adding a new tool requires two things: a Python function in tools.py, and a description in the system prompt. Common extensions include a web fetch tool for pulling documentation, a git tool for reading commit history, a database tool for querying schemas, or a test runner tool that formats results for easy parsing.
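As an illustration, here is what a hypothetical web fetch tool might look like; the TOOLS registry is an assumption about how tools.py wires names to functions:

import urllib.request

def fetch(url: str) -> str:
    """Fetch a URL and return (at most) the first 100 kB of the body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read(100_000).decode("utf-8", errors="replace")

TOOLS: dict[str, object] = {}   # hypothetical registry; the real one in tools.py may differ
TOOLS["fetch"] = fetch
# Plus one line describing the tool in the system prompt, e.g.:
#   - fetch: Retrieve the contents of a URL. Input: {"url": "..."}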
The agent can also be specialised by modifying the system prompt. The current prompt defines a general coding assistant. A prompt that instructs the agent to always write tests before implementation, or to follow a specific style guide, or to comment in a particular format, produces an agent with those constraints baked in.
If you are interested in deploying a self-hosted coding agent for your team — customised to your stack, your codebase and your workflow — get in touch. We build and deploy custom LLM systems on private infrastructure.