What I Learned Running Local Models in My Agent Pipeline
Posted on Sat 25 April 2026 in Thought, AI, Security Research
This is a follow-up to my previous post on routing agents through LiteLLM. That post covered the architecture. This one covers what broke when I actually ran it.
Claude Code Doesn't Pass Through Arbitrary Model Names
The first thing I got wrong: I assumed model: local-sonnet in agent frontmatter would reach LiteLLM as local-sonnet. It doesn't. Claude Code translates frontmatter values to known Anthropic model IDs before the request leaves the client. Since local-sonnet isn't a recognized name, it falls back to claude-sonnet-4-6, and the proxy never sees the custom value.
The fix is to remap a real Anthropic model slot in LiteLLM. I dedicated the opus slot to local Qwen3:
- model_name: claude-opus-4-7
  litellm_params:
    model: anthropic/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
    api_base: http://127.0.0.1:8081
    api_key: sk-dummy
Agents that need to run locally get model: opus in their frontmatter. Claude Code sends claude-opus-4-7; LiteLLM intercepts it and routes the request to Qwen3. I keep a separate agent definition file per role with this frontmatter - same prompt, different model slot.
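Since every role is the same file shape with a different model slot, a small script can stamp the definitions out so the frontmatter stays consistent. A sketch under my own conventions - the role names and prompt text are placeholders, not anything Claude Code prescribes beyond the frontmatter keys:

```python
from pathlib import Path

# Hypothetical roles. Only the model: key matters for routing:
# "opus" hits the remapped slot and lands on local Qwen3;
# "sonnet" is untouched and goes to Anthropic as usual.
AGENTS = {
    "reviewer": "opus",
    "planner": "sonnet",
}

def write_agent(name: str, model: str, out_dir: Path) -> Path:
    """Write a minimal agent definition file with model frontmatter."""
    body = (
        "---\n"
        f"name: {name}\n"
        f"model: {model}\n"
        "---\n"
        f"You are the {name} agent.\n"  # placeholder prompt
    )
    path = out_dir / f"{name}.md"
    path.write_text(body)
    return path

# Example: write_agent("reviewer", "opus", Path(".claude/agents"))
```

Regenerating the files from one table beats hand-editing five copies of the same frontmatter when a slot changes.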
Use the Anthropic Endpoint, Not the OpenAI One
llama-server now supports the Anthropic Messages API natively at POST /v1/messages. In LiteLLM config, using model: anthropic/... with api_base pointing at the server (no /v1 suffix) routes through this endpoint directly.
The OpenAI-compatible endpoint (openai/...) was noticeably slower in testing - LiteLLM has to translate every streaming chunk between formats. The native Anthropic endpoint skips that entirely.
One requirement: --jinja on llama-server. Tool use through the Anthropic endpoint requires Jinja template processing to be enabled, or tool calls don't get structured correctly.
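It's worth hitting the endpoint directly once to confirm the server answers in Anthropic format before putting LiteLLM in front of it. A minimal sketch, assuming the server is on 127.0.0.1:8081 as in my setup; the payload shape is the standard Anthropic Messages API, and llama-server serves whatever model it was launched with regardless of the model field:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8081"  # assumed local llama-server address

def build_messages_request(prompt: str, max_tokens: int = 256) -> dict:
    """Minimal Anthropic Messages API payload."""
    return {
        "model": "local",  # placeholder; llama-server ignores it for routing
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict) -> dict:
    req = urllib.request.Request(
        BASE + "/v1/messages",
        data=json.dumps(payload).encode(),
        headers={
            "content-type": "application/json",
            "anthropic-version": "2023-06-01",  # part of the Anthropic API shape
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# send(build_messages_request("Say hi in five words."))
```

If the response comes back with Anthropic-style content blocks, the native endpoint is working and LiteLLM's anthropic/ route will pass it through untranslated.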
256K Context Will Kill Your M1 Max
This one hit performance hard. I had set -c 262144 on Qwen3.6-35B because that's the model's maximum. On an M1 Max with 64GB of unified memory, Metal reports a recommendedMaxWorkingSetSize of 54GB. The math:
- Model weights at Q4_K_M: ~21GB
- KV cache at 256K context, bf16: ~48GB
- Total: ~69GB
That's 15GB over the GPU working set. macOS starts compressing and evicting, generation slows to a crawl.
Dropping to 65K context brings the KV cache to ~12GB and the total to ~33GB - well within the working set. Everything stays resident on the GPU and generation speed is what it should be.
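The KV cache scales linearly with context length, so you can run this estimate before launching. A sketch using the numbers above - the 48GB figure at 256K is this model's measured footprint, not derived from first principles here:

```python
def kv_cache_gb(ctx: int, kv_at_256k_gb: float = 48.0) -> float:
    """KV cache size scales linearly with context length."""
    return kv_at_256k_gb * ctx / 262144

WEIGHTS_GB = 21.0      # Q4_K_M weights
WORKING_SET_GB = 54.0  # recommendedMaxWorkingSetSize on a 64GB M1 Max

for ctx in (262144, 65536):
    kv = kv_cache_gb(ctx)
    total = WEIGHTS_GB + kv
    fits = total <= WORKING_SET_GB
    print(f"ctx={ctx}: kv={kv:.0f}GB total={total:.0f}GB fits={fits}")
# → ctx=262144: kv=48GB total=69GB fits=False
# → ctx=65536: kv=12GB total=33GB fits=True
```

The same check tells you the crossover: anything up to roughly 180K context would still fit, but 65K leaves comfortable headroom for the compute buffers the estimate ignores.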
llama-server -m ~/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 65536 \
  -ngl 999 \
  --flash-attn on \
  -b 4096 \
  -ub 2048 \
  --fit off \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  -n 32768 \
  --jinja \
  --port 8081
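With a 21GB model, the load takes a while, so I poll readiness before pointing agents at it. llama-server exposes a GET /health endpoint that returns {"status": "ok"} once the model is loaded; a minimal check, assuming the port from the launch command above:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8081"  # port from the llama-server launch command

def is_ready(body: bytes) -> bool:
    """Parse llama-server's GET /health response body."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def check() -> bool:
    try:
        with urllib.request.urlopen(BASE + "/health", timeout=5) as resp:
            return is_ready(resp.read())
    except OSError:
        return False  # not up yet, or still loading the model

# Example (requires a running server):
# print(check())
```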
Model Swaps Need Actual Testing
I briefly tested Qwen3.5-27B as an alternative. Initial response times were slower than the 35B, so I swapped back. I didn't run it long enough to draw real conclusions - I was chasing a tool-calling issue in Qwen3.6 at the time and never gave 3.5 a fair evaluation. Not enough signal to say more than that.