How I Route AI Agents Through a Local Model Proxy

Posted on Tue 21 April 2026 in Thought, AI, Security Research

This is a follow-up to my previous post where I covered reducing token costs in a multi-agent pipeline. That post touched on local model fallback at a high level. This one goes deeper on how the routing layer actually works.

The Pipeline

I have five agents, split across two tiers based on what they actually do:

Mechanical (Haiku):

  • step1-2-analyst - reads commit diffs, classifies bug class, produces structured output
  • step4-explorer - grep-based sibling hunt across the codebase

Reasoning (Sonnet):

  • step3-attacker - adversarial bypass analysis, writes and executes attack probes
  • probe-builder - builds PoCs from confirmed findings
  • v8-intel - V8 commit analysis, attack surface assessment

The split comes down to what each agent actually does. step1-2 reads a diff and fills structured fields; that's pattern matching and formatting, not deep reasoning. step3 needs to read source files, trace execution paths, construct attack scenarios, and write runnable exploit code. Those are not the same job and they shouldn't run on the same model.

LiteLLM as a Transparent Proxy

Claude Code respects the ANTHROPIC_BASE_URL environment variable. Point it at a local LiteLLM server instead of Anthropic directly and all agent requests go through the proxy first:

ANTHROPIC_BASE_URL=http://localhost:4000 claude

LiteLLM runs a local server that speaks the Anthropic API format. It receives a request, looks up the model name in its config, and routes to whatever backend is mapped there. By default everything passes through to Anthropic unchanged and the proxy is transparent.

The config maps model names to backends:

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-haiku-4-5-20251001
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: local-sonnet
    litellm_params:
      model: openai/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
      api_base: http://127.0.0.1:8081/v1
      api_key: sk-dummy

  - model_name: local-haiku
    litellm_params:
      model: openai/gemma-4-E4B-it-Q8_0.gguf
      api_base: http://127.0.0.1:8080/v1
      api_key: sk-dummy

litellm_settings:
  drop_params: true

One thing worth noting: LiteLLM needs its own Anthropic API key to make outbound calls to Anthropic's REST API. This is separate from Claude Code's internal auth. Claude Code authenticates via OAuth but LiteLLM calls the raw REST API which expects a real sk-ant-... key. Get one from console.anthropic.com and pass it when starting LiteLLM:

ANTHROPIC_API_KEY=sk-ant-... litellm --config ~/models/config.yaml --port 4000

Per-Agent Model Assignment

Each agent in Claude Code is a markdown file with a YAML frontmatter block. Model assignment is one line:

---
model: haiku
---

Claude Code reads this when spawning the agent. LiteLLM maps haiku to claude-haiku-4-5-20251001 and routes to Anthropic. Simple.
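For context, a complete agent file is just that frontmatter plus a prompt body. Here's a hypothetical sketch of what a mechanical-tier agent might look like (the name and instructions are illustrative, not my actual agent definition):

```markdown
---
name: step1-2-analyst
model: haiku
---

You receive a commit diff. Classify the bug class and fill the
structured output fields between the pipeline's delimiter markers.
Do not speculate beyond what the diff shows.
```

The `model:` line is the only part the routing layer cares about; everything else is the agent's job description.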

Switching to Local Models

When a workload triggers usage policy flags on hosted models, which happens in security research, I can switch the affected agent to local by changing its frontmatter:

---
model: local-sonnet
---

LiteLLM maps local-sonnet to Qwen3.6-35B running on llama-server at port 8081. The agent doesn't change at all. Tool use still works. The only difference is where the inference actually runs.

I run two llama-server instances, one per tier:

# Gemma-4B — mechanical tier
llama-server -m ~/models/gemma-4-E4B-it-Q8_0.gguf --port 8080 --ctx-size 8192

# Qwen3.6-35B — reasoning tier
llama-server -m ~/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --port 8081 --ctx-size 16384

Same model tier split as the hosted setup, just local. You can also configure automatic fallback in LiteLLM, so that when Anthropic flags a request the pipeline retries locally without any manual intervention, though that only helps if your llama-server instances are already running.
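LiteLLM supports declaring fallbacks in the proxy config itself. A sketch of what that would look like for this setup (the specific hosted-to-local mapping below is my assumption, not something I run in production):

```yaml
litellm_settings:
  drop_params: true
  # When a request to the hosted model fails, LiteLLM retries the
  # same request against the listed fallback model.
  fallbacks:
    - claude-sonnet-4-6: [local-sonnet]
    - claude-haiku-4-5-20251001: [local-haiku]
```

Each tier falls back to its local counterpart, so a flagged reasoning request never gets downgraded to the small model.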

The Tradeoffs

A few things I ran into that are worth noting:

LiteLLM adds a dependency. It's another process to manage, with its own Anthropic API key separate from Claude Code's OAuth auth.

Local models can break output contracts. My pipeline agents produce structured output between delimiter markers. Qwen3 is good at tool use but formats output differently from Sonnet and Opus. After switching to local you need to verify your delimiter formats still hold; otherwise downstream agents can silently drop the output.
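A cheap guard is to grep an agent's output for both markers before handing it downstream. A sketch, with hypothetical delimiter names standing in for whatever your pipeline actually uses:

```shell
# Hypothetical delimiters -- substitute your pipeline's real markers.
output='<<FINDINGS>>
off-by-one in bounds check
<<END_FINDINGS>>'

# Count exact marker lines; anything other than one of each means
# the local model broke the output contract.
start=$(printf '%s\n' "$output" | grep -c '^<<FINDINGS>>$')
end=$(printf '%s\n' "$output" | grep -c '^<<END_FINDINGS>>$')

if [ "$start" -eq 1 ] && [ "$end" -eq 1 ]; then
  echo "delimiters intact"
else
  echo "delimiters broken" >&2
fi
```

Running this as a gate between agents turns a silent drop into a loud failure, which is all you need to catch a formatting regression early.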

drop_params: true is required. Local models reject Anthropic-specific parameters like thinking budgets and cache control. LiteLLM strips them before forwarding, which is what that setting does.

Qwen3.6-35B needs ~25GB RAM at Q4_K_M quantization. If you're on a machine where that's tight, this tier won't work.

Why This Architecture

The goal was resilience without adding complexity to the agents themselves. Normal operation hits Anthropic with no changes. When a specific agent gets flagged, one frontmatter line and a re-spawn switches it to local. The rest of the pipeline keeps running on Anthropic unchanged.

No separate codepaths. No conditional logic inside agent prompts. The proxy handles routing, agents just specify a model name and let the config decide where it goes.

If you are running multi-agent research workflows and hitting policy walls, this setup is worth the one-time configuration cost. Feel free to reach out if you want to compare notes.