Fixing Concurrent Agent Slowness in llama-server (and Why I Didn't Switch to vLLM)
Posted on Sat 09 May 2026 in Thought, AI, Security Research • Tagged with chronicles, AI agents, claude, LiteLLM, llama-server, local models, routing
This is a follow-up to my earlier post on what I learned running local models in my agent pipeline. That post covered context sizing and KV cache memory; this one covers what I got wrong about concurrency.
The Problem: Agents Queuing Up
My pipeline runs up to four agents simultaneously, called step1-2, step3, step4 …