How I Cut AI Agent Costs Without Cutting Corners
Posted on Mon 20 April 2026 in Thought, AI, Security Research
Running a multi-agent pipeline for security research gets expensive fast. I have several agents doing sequential analysis work - reading commit diffs, running adversarial bypass analysis, building proof-of-concept exploits. Token costs compound at every step: system prompts, tool schemas, conversation history, and verbose outputs all stack up before a single useful result comes back.
I went through my pipeline and found three areas worth fixing.
Right Model for the Right Job
Not every agent needs the most capable model. Some of my agents do purely mechanical work: reading diffs, classifying bug patterns, producing structured output. Others do the hard reasoning: adversarial analysis, finding what a security fix missed, writing actual attack probes.
The fix was straightforward. I assigned smaller, faster models (Haiku) to the mechanical roles and kept the stronger models (Sonnet) for the reasoning-heavy stages. In Claude Code, this is a one-line change in each agent's definition file:
---
model: haiku
---
No code changes. Immediate cost reduction on every run. The agents doing diff classification and grep-based sibling searches don't need Sonnet or Opus level reasoning - they need to fill structured fields accurately and quickly. Haiku handles that fine.
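For context, a full agent definition file is just markdown with YAML frontmatter. A minimal sketch of one of the mechanical agents might look like this (the `diff-classifier` name, description, and prompt body are illustrative, not my actual files; the frontmatter fields follow Claude Code's subagent format):

```markdown
---
name: diff-classifier
description: Classifies commit diffs into bug-pattern categories
model: haiku
---

You classify commit diffs. Read the diff, identify the bug pattern,
and fill the structured output fields. Output only the structured
block between the delimiters. No narrative.
```

The `model: haiku` line is the entire cost optimization for this agent; everything else stays untouched.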
Structured Handoffs, Not Essays
The second problem was output format. My agents would complete their analysis and produce a mix of prose narrative and structured data. Sometimes the useful structured block was buried under paragraphs of explanation that the next agent didn't need and never read.
Two things fixed this:
Strict output rules. I added an explicit rule to each agent's prompt: output only the structured block between the defined delimiters. No headings, no narrative, no summaries outside the block. All analysis goes inside the structured fields where it belongs.
Tighter prompts. I audited each agent's system prompt and removed anything that wasn't load-bearing. Things I cut: 25 lines of regex pattern examples the model already knows, multi-paragraph field descriptions that could be one line, redundant guidance that duplicated things said elsewhere. Brought one prompt from 215 lines down to 185 without losing any output quality.
The handoff between agents is now a clean structured block: exactly what the next stage needs, nothing it doesn't.
Local Model Fallback
Security research workflows can trigger usage policy flags on hosted models. Rather than getting blocked mid-pipeline, I set up a local fallback using models I already run locally via llama-server and LiteLLM.
LiteLLM exposes an Anthropic-compatible API endpoint, and Claude Code respects the ANTHROPIC_BASE_URL environment variable. Point it at LiteLLM and requests route to local models transparently. The agents don't change. Tool use still works. The only difference is where the inference runs.
ANTHROPIC_BASE_URL=http://localhost:4000 claude
I run two local models: Qwen3.6-35B for the reasoning-heavy agents (same tier as Sonnet / Opus), and Gemma-4B for the mechanical ones. Same split as the hosted setup, just local. And since LiteLLM supports automatic fallback configuration, when Anthropic flags a request the pipeline retries against the local model without any manual intervention.
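A rough sketch of what that LiteLLM proxy config looks like. The model names, port, and exact fallback syntax here are illustrative placeholders based on LiteLLM's proxy config format, so check the current LiteLLM docs before copying:

```yaml
model_list:
  - model_name: claude-sonnet        # what Claude Code requests
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
  - model_name: local-reasoning      # llama-server, OpenAI-compatible endpoint
    litellm_params:
      model: openai/local-model
      api_base: http://localhost:8080/v1
      api_key: none

litellm_settings:
  fallbacks:
    - claude-sonnet: ["local-reasoning"]
```

With this in place, a policy refusal from the hosted model triggers the retry against the local endpoint automatically.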
The Result
- Mechanical analysis stages running on Haiku instead of Sonnet — meaningful per-run savings
- Cleaner agent handoffs: structured blocks only, no prose overhead between pipeline stages
- No single point of failure: local fallback when hosted models push back on security research content
The pipeline produces the same quality output. It just costs less to run and handles edge cases more gracefully. If you are building multi-agent pipelines and haven't audited your model assignments and output formats yet, that's probably your lowest-effort win. Don't let the hosted model providers block your research :).
If you have questions or want to swap notes on building research pipelines, feel free to reach out.