Fixing Concurrent Agent Slowness in llama-server (and Why I Didn't Switch to vLLM)

Posted on Sat 09 May 2026 in Thought, AI, Security Research

This is a follow-up to what I learned running local models in my agent pipeline. That post covered context sizing and KV cache memory. This one covers what I got wrong about concurrency.

The Problem: Agents Queuing Up

My pipeline runs up to four agents simultaneously (step1-2, step3, step4, and probe-builder), all routing through the same llama-server instance for local workloads. I was getting slow responses and assumed it was just the model. It wasn't.

The issue was that I never set --parallel on llama-server. Without it, the default is one concurrent sequence. Every agent request waits in line behind the one before it. With four agents active at the same time, the third and fourth requests could wait minutes before inference even started.

The fix is one flag:

llama-server ... --parallel 4

Match the number to however many agents you actually run concurrently. Setting it higher than needed wastes KV cache slots without benefit.
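A quick sanity check that the slots are live: fire several requests at once and watch the server log for multiple slots processing simultaneously. A minimal version with curl against the OpenAI-compatible endpoint on my port (the prompt and max_tokens are just placeholders):

# four concurrent requests; with --parallel 4 none of them should queue
for i in 1 2 3 4; do
  curl -s http://localhost:8081/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"Summarize CWE-787 in one sentence."}],"max_tokens":64}' &
done
wait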

KV Cache Had to Change to Make Room

Adding four parallel slots has a cost: each slot needs its own KV cache allocation. I was using --cache-type-k bf16 --cache-type-v bf16 from the previous post, which made sense when running a single sequence. With four slots it pushed the memory footprint up enough to start competing with the model weights for GPU working set.

Switching to q8_0 KV cache cuts that in half with no meaningful quality loss on analytical tasks:

--cache-type-k q8_0 --cache-type-v q8_0

The tradeoff is worth it. q8_0 is still high precision for KV, and the memory freed keeps all four parallel slots comfortably within the Metal working set on my M1 Max.
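The back-of-envelope math, assuming the usual Qwen3-30B-A3B attention dimensions (48 layers, 4 KV heads of dimension 128; confirm against your GGUF metadata, since these numbers are the whole calculation):

KV bytes per token = 2 (K and V) × layers × kv_heads × head_dim × bytes per element
bf16: 2 × 48 × 4 × 128 × 2       = 96 KiB/token  -> ~6.0 GiB at -c 65536
q8_0: ~1.06 bytes per element    = 51 KiB/token  -> ~3.2 GiB at -c 65536
      (blocks of 32 int8 values plus an fp16 scale)

Roughly 3 GiB back, which is the headroom that keeps four slots from crowding the model weights.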

Why Not vLLM

The concurrency problem made me look at vLLM as an alternative. The case for it is real: PagedAttention, native request batching, better throughput under concurrent load. I've run multi-agent pipelines that would benefit from that.

The case against it for my setup: vLLM is built around CUDA and has no Metal backend. My inference runs on macOS with an M1 Max GPU, and --ngl 999 offloads everything to Metal. vLLM in CPU mode is too slow to be useful.

The other issue is model format. llama-server uses GGUF, which gives fine-grained quantization control and lets me run larger models within a tight memory budget. vLLM uses HuggingFace checkpoints and depends on pre-quantized variants for memory reduction. Less flexibility.

If I were on a box with an NVIDIA GPU (hopefully a DGX or an ASUS Ascent soon) running six or more agents concurrently, the calculation changes. For my current setup (four agents, Mac, Metal), fixing --parallel in llama-server closes the gap without any of the migration cost.

Testing the Model With a Real Task

After the config changes, I wanted to measure quality, not just speed. Running curl timing tests tells you tokens per second. It doesn't tell you whether the model is useful for what you actually need it to do.

I picked a WebAssembly security bug from my research notes: a missing isMemory64() guard in the OMG JIT's bulk-memory fast paths (addMemoryFill and addMemoryCopy) in WasmOMGIRGenerator.cpp. The bug causes address truncation after tier-up: a ZExt32 applied to an Int64 pointer discards the high 32 bits before the bounds check, so an OOB i64 address with in-bounds low bits passes and the write executes at the wrong offset. I gave the model the bug description and asked it to write a jsc proof-of-concept.

It took 196 seconds and 5726 tokens to produce a response. Generation speed was around 42 tok/s, but Qwen3's thinking mode burns 150–500 reasoning tokens before producing any content, so a significant portion of that time is the model reasoning before it writes a single line of output.

What it got right: the overall structure was correct. It used a WAT module, set the right function signatures for memory64 (param i64 i32 i64 for fill, param i64 i64 i64 for copy), used the right attack address (0x100000001n to test high-bits truncation), set the warm-up loop to 20K iterations, and wrapped the call in try/catch with a sentinel check.
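In WAT form, those signatures look like this (a sketch consistent with what it emitted; export names are illustrative):

(func (export "fill") (param i64 i32 i64)  ;; dest, value, len
  (memory.fill (local.get 0) (local.get 1) (local.get 2)))
(func (export "copy") (param i64 i64 i64)  ;; dest, src, len
  (memory.copy (local.get 0) (local.get 1) (local.get 2)))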

What it got wrong: three things that would prevent it from running.

First, the memory declaration. It wrote (memory (export "mem") 1 1) which declares a 32-bit memory. The bug only triggers in memory64 modules. It needed (memory i64 1). Without this the entire exploit attempt is testing the wrong thing.
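The one-line difference, side by side (the i64 index type is what makes it a memory64 declaration):

;; what the model wrote: a 32-bit memory, so the memory64 paths never execute
(memory (export "mem") 1 1)
;; what the bug needs: an i64-indexed memory
(memory (export "mem") i64 1)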

Second, WAT compilation. It passed the WAT text through new TextEncoder().encode(wat) and handed that to WebAssembly.Module. That doesn't work: WebAssembly.Module takes binary bytecode, not WAT text.

Third, the API. It used console.log instead of print. In jsc, console.log doesn't exist.
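Both environment fixes are mechanical once you know them. A minimal jsc-side sketch, assuming the WAT was compiled offline with wat2wasm --enable-memory64, and assuming your jsc build has the readFile shell builtin with a binary mode (mine does; verify yours):

// run with: jsc poc.js
const bytes = readFile("module.wasm", "binary");  // Uint8Array of binary bytecode, not WAT text
const mod = new WebAssembly.Module(bytes);
const inst = new WebAssembly.Instance(mod);
print("exports: " + Object.keys(inst.exports));   // print, not console.log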

It also only tested address truncation and missed the count truncation case, which is a separate exploitable variant of the same bug.

The memory declaration mistake is conceptual: the model understood it was targeting a memory64 bug but didn't translate that understanding into the WAT declaration. The other two are environment gaps: not knowing that jsc lacks a WAT compiler and uses print instead of console.log. I should also note that the frontier model had the WASM specs available as a source, which the local models did not receive in this testing.

A researcher could fix all three quickly. For a step3 task where I'm validating a hypothesis about whether a bug is exploitable, this is usable output. For a probe-builder task producing something I'm going to run directly, I'd need to catch the errors in review before executing.

Thinking Mode and Claude Code Timeouts

One issue I hadn't accounted for: Qwen3 in thinking mode produces output in two fields. The reasoning goes into reasoning_content, not content. A short max_tokens budget gets consumed by reasoning before any content is generated, and the response comes back with an empty content field and a finish_reason: length stop.
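Abridged, the failure looks like this in the raw response (field names are what llama-server's OpenAI-compatible endpoint returns for me; the reasoning text is illustrative):

{
  "choices": [{
    "finish_reason": "length",
    "message": {
      "content": "",
      "reasoning_content": "Okay, the user wants a jsc proof-of-concept for..."
    }
  }]
}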

My LiteLLM config already had max_tokens: 32768, which is high enough to avoid this in practice. But the related timeout issue was real. Claude Code streams requests, and a thinking model that burns seconds of reasoning before the first content token can look like a stalled connection. I added two settings to the LiteLLM config to handle this:

litellm_settings:
  drop_params: true
  request_timeout: 300
  stream_timeout: 120

request_timeout covers the full request lifecycle. stream_timeout handles the case where llama-server goes quiet mid-stream, which can happen when all parallel slots are full and a new slot opens but takes a moment to schedule.
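To see the quiet period stream_timeout guards against, hit the server directly with an unbuffered streamed request and watch the gap before the first content delta (curl's -N disables output buffering; the body is standard OpenAI-compatible JSON):

curl -N -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Think step by step: what is 17 * 23?"}],"stream":true,"max_tokens":512}'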

The Full Qwen3 Command

For reference, the current command I'm running:

llama-server -m ~/models/Qwen3-30B-A3B-UD-Q4_K_M.gguf \
  -c 65536 \
  -ngl 999 \
  --flash-attn on \
  -b 4096 \
  -ub 2048 \
  --fit off \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -n 32768 \
  --jinja \
  --parallel 4 \
  --mlock \
  --port 8081

The flags that changed from the previous post: --cache-type-k/--cache-type-v moved from bf16 to q8_0, and --parallel 4 was added. Everything else is the same. --mlock pins the weights in RAM to prevent swap under Metal memory pressure when multiple agents are competing for GPU working set.