Fixing Concurrent Agent Slowness in llama-server (and Why I Didn't Switch to vLLM)

Posted on Sat 09 May 2026 in Thought, AI, Security Research • Tagged with chronicles, AI agents, claude, LiteLLM, llama-server, local models, routing

This is a follow-up to my previous post on what I learned running local models in my agent pipeline. That post covered context sizing and KV cache memory. This one covers what I got wrong about concurrency.

The Problem: Agents Queuing Up

My pipeline runs up to four agents simultaneously, called step1-2, step3, step4 …


Continue reading

What I Learned Running Local Models in My Agent Pipeline

Posted on Sat 25 April 2026 in Thought, AI, Security Research • Tagged with chronicles, AI agents, claude, LiteLLM, llama-server, local models, routing

This is a follow-up to my previous post on routing agents through LiteLLM. That post covered the architecture. This one covers what broke when I actually ran it.

Claude Code Doesn't Pass Through Arbitrary Model Names

The first thing I got wrong: I assumed model: local-sonnet in agent frontmatter would …


Continue reading

How I Route AI Agents Through a Local Model Proxy

Posted on Tue 21 April 2026 in Thought, AI, Security Research • Tagged with chronicles, AI agents, claude, LiteLLM, llama-server, pipeline, local models, routing

This is a follow-up to my previous post where I covered reducing token costs in a multi-agent pipeline. That post touched on local model fallback at a high level. This one goes deeper on how the routing layer actually works.

The Pipeline

I have five agents, split across two tiers …


Continue reading

How I Cut AI Agent Costs Without Cutting Corners

Posted on Mon 20 April 2026 in Thought, AI, Security Research • Tagged with chronicles, AI agents, claude, pipeline, cost optimization, LLM

Running a multi-agent pipeline for security research gets expensive fast. I have several agents doing sequential analysis work - reading commit diffs, running adversarial bypass analysis, building proof-of-concept exploits. Token costs compound at every step: system prompts, tool schemas, conversation history, and verbose outputs all stack up before a single useful …


Continue reading