57% Cost Cut: Model Routing for Multi-Agent Systems

One line of YAML. That was it.
```yaml
model: sonnet
```
This single line in an agent's frontmatter reduced our per-run costs for one of the most-used agents in our multi-agent architecture by ~57%. Not by removing features. Not by accepting worse workflow compliance. By recognizing that not every agent needs the most powerful model. (With a trade-off in research breadth that we break down below.)
I'm writing this because I made the mistake that probably every team working with AI agents makes: running everything on the best available model because you can. And because you never bothered to measure whether you need to.
We measured. Here are the numbers.
At Infralovers, we run a multi-agent architecture on Claude Code. Nine specialized agents covering different tasks -- from technology research and content creation to knowledge base queries and channel routing. Each agent has a clear responsibility, its own tool permissions, and a defined skill set. It's the microservices pattern, applied to AI agents.
The system runs on Claude Code -- Anthropic's agentic coding tool that runs as a CLI in the terminal. What makes it elegant: agents are defined as YAML frontmatter in Markdown files. Name, description, allowed tools, loaded skills -- and in current Claude Code versions, the model. One line in the frontmatter determines which Claude model the agent uses:
```yaml
---
name: technology-researcher
description: Research open source projects and technology vendors
tools: [Read, Write, Bash, Glob, Grep, WebSearch, WebFetch]
model: sonnet # <-- this one line
skills: [research, knowledge-base]
---
```
This isn't a feature that gets much attention. It's in the docs, it works, but most teams don't use it. The default depends on your account type (e.g., Pro → Sonnet 4.6, Max → Opus 4.6) and can automatically fall back to Sonnet when you hit usage limits. And why would you use less when you can have more?
Because it costs money. And because "more" doesn't always mean "better."
On February 17, 2026, Anthropic released Sonnet 4.6. What caught my attention: the gap between Sonnet and Opus has never been this small.
Some benchmarks for context (numbers from the Anthropic system cards and GDPval via Artificial Analysis): on several agentic benchmarks, Sonnet 4.6 matches or beats Opus.
This is remarkable. For the first time, a mid-tier model beats a prior-generation flagship in multiple benchmarks. And not in trivial tasks -- in agentic work. Exactly what our agents do.
The question wasn't "Is Sonnet good enough?" but "For which agents is Opus actually necessary?"
I'll be honest: reading benchmarks is nice, but I only trust numbers from my own system. So we set up a clean A/B test.
The setup: our Technology Researcher agent, the same task profile, once on the default model and once with the `model:` field set to `model: sonnet`. No synthetic benchmarks. No simplified tasks. The agent did exactly what it always does: search websites, evaluate sources, persist structured data, write reports. Each evaluation is a separate run with its own task input; the 3 parallel agents serve only to reduce wall-clock time, not as an independent treatment variable.
Opus 4.6 (n=9)
| Metric | Value |
|---|---|
| Avg Total Tokens | 7,501,012 |
| Avg API Calls | 120.7 |
| Avg Tool Uses | 94.3 |
| Avg Duration | 462s |
| Avg Projects Evaluated | 9.0 |
Sonnet 4.6 (n=9, excl. outlier n=8)
| Metric | All (n=9) | Excl. Outlier (n=8) |
|---|---|---|
| Avg Total Tokens | 5,351,844 | 4,843,462 |
| Avg API Calls | 96.0 | 90.0 |
| Avg Tool Uses | 77.7 | 72.3 |
| Avg Duration | 476s | 408s |
| Avg Projects Evaluated | 6.3 | 6.3 |
One Sonnet run hit infrastructure problems: Bash permission errors triggered retries that inflated that run to 9.4 million tokens and 1,018 seconds. That's an infrastructure artifact, not a model problem, so we report both n=9 and n=8 and use the cleaned n=8 figures for the deltas.
The deltas (cleaned, excluding outlier):
| Metric | Delta | Delta % |
|---|---|---|
| Total Tokens | -2,657,550 | -35.4% |
| API Calls | -30.7 | -25.4% |
| Tool Uses | -22.1 | -23.4% |
| Duration | -53s | -11.6% |
| Projects Evaluated | -2.8 | -30.5% |
Sonnet uses a third fewer tokens, a quarter fewer API calls, and is 12% faster. But -- and this matters -- it also finds less: 6.3 evaluated projects per run instead of 9.0.
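If you want to reproduce this kind of aggregation, a few lines of Python are enough. The sketch below is a minimal version, assuming a hypothetical `runs.jsonl` file with one JSON object per benchmark run; the field names (`total_tokens`, `infra_error`, ...) are illustrative placeholders, not Claude Code's actual log format.

```python
import json
from statistics import mean

# Hypothetical per-run log: one JSON object per benchmark run, e.g.
# {"model": "sonnet", "total_tokens": 4812345, "api_calls": 91,
#  "tool_uses": 70, "duration_s": 402, "projects_evaluated": 6,
#  "infra_error": false}
# Adapt the field names to whatever your own session logs contain.
def load_runs(path="runs.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def summarize(runs, model, exclude_infra_errors=False):
    rows = [r for r in runs if r["model"] == model]
    if exclude_infra_errors:
        # Drop runs whose cost was inflated by infrastructure retries,
        # mirroring the n=8 "cleaned" analysis above.
        rows = [r for r in rows if not r.get("infra_error")]
    metrics = ("total_tokens", "api_calls", "tool_uses",
               "duration_s", "projects_evaluated")
    return {m: mean(r[m] for r in rows) for m in metrics}, len(rows)

if __name__ == "__main__":
    runs = load_runs()
    opus, n_opus = summarize(runs, "opus")
    sonnet, n_sonnet = summarize(runs, "sonnet", exclude_infra_errors=True)
    print(f"Opus (n={n_opus}) vs Sonnet cleaned (n={n_sonnet})")
    for metric in opus:
        delta = sonnet[metric] - opus[metric]
        print(f"{metric:>20}: {delta:+,.1f} ({delta / opus[metric]:+.1%})")
```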
By "costs" I mean LLM token costs exclusively (input, output, cache read, cache create) -- not infrastructure or tool provider costs.
The raw token savings of 35% are only half the story. The other half is the pricing difference.
| Component | Opus 4.6 | Sonnet 4.6 | Factor |
|---|---|---|---|
| Input Tokens | $5.00/MTok | $3.00/MTok | 0.60x |
| Output Tokens | $25.00/MTok | $15.00/MTok | 0.60x |
| Cache Read (5min) | $0.50/MTok | $0.30/MTok | 0.60x |
| Cache Create | $6.25/MTok | $3.75/MTok | 0.60x |
Sonnet costs a uniform 60% of the Opus price across all token types. That's unusually clean -- normally the ratios vary by token type. (Cache prices follow a fixed formula: Cache Read = Input price x 0.1; Cache Create = Input price x 1.25.)
Now the compound math: in our setup, agent sessions consist (per our logs) of roughly 90% cache reads (the agent reloads its system prompt and conversation context with every API call). The relevant price is primarily the cache-read price.
Sonnet factor: 0.60 (price per token) x 0.65 (volume, 35% fewer tokens) = 0.39
Every Sonnet run costs 39% of what an Opus run costs. Or put differently: 61% cost reduction per agent run.
In practice, it's conservatively around 57%, because the token composition (input vs. output vs. cache) isn't exactly 90/10 and the outlier slightly shifts the averages. (Throughout this post, I use ~57% as the conservative real-world figure; the model calculation yields 61% under our cache assumptions.)
Important: this is a cost reduction per run, not automatically per evaluated project. Since Sonnet covers fewer projects per run on average, actual "cost per project" depends on whether you count breadth as a quality criterion.
Rule of thumb: if your goal is "maximum coverage," evaluate cost per project as well. If your goal is "top findings with low noise," cost per run plus compliance is often sufficient as your primary metric.
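The compound math is easy to sanity-check. The sketch below is only a model calculation: the price table comes from above, the token-mix shares (~90% cache reads, the rest split across the other types) are an assumption based on our logs, and the per-run averages are the benchmark numbers from the tables. Swap in your own mix and averages.

```python
# Price table (USD per MTok), from the pricing table above.
PRICES = {
    "opus":   {"input": 5.00, "output": 25.00, "cache_read": 0.50, "cache_create": 6.25},
    "sonnet": {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_create": 3.75},
}

# Assumed token mix per run: ~90% cache reads, rest split across the other
# types. These shares are an illustrative assumption, not measured values.
TOKEN_MIX = {"cache_read": 0.90, "cache_create": 0.04, "input": 0.02, "output": 0.04}

def blended_price(model: str) -> float:
    """Weighted price per MTok for the assumed token mix."""
    return sum(PRICES[model][kind] * share for kind, share in TOKEN_MIX.items())

# Because every token type is priced at 0.6x, the blended price ratio is 0.60
# regardless of the mix; only the volume difference changes the final factor.
opus_tokens, sonnet_tokens = 7_501_012, 4_843_462      # measured averages
opus_projects, sonnet_projects = 9.0, 6.3               # measured averages

opus_cost = blended_price("opus") * opus_tokens / 1_000_000
sonnet_cost = blended_price("sonnet") * sonnet_tokens / 1_000_000

print(f"cost per run:     Sonnet/Opus = {sonnet_cost / opus_cost:.2f}")
print(f"cost per project: Sonnet/Opus = {(sonnet_cost / sonnet_projects) / (opus_cost / opus_projects):.2f}")
```

Under these assumptions the per-run factor lands at roughly 0.39 (the 61% model figure), while cost per evaluated project moves less in Sonnet's favor because it covers fewer projects per run -- which is exactly why the rule of thumb above matters.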
For a team running dozens of agent runs daily, that's not a rounding error. That's a structural cost advantage.
"57% cheaper sounds great, but what do you lose?" -- the only question that matters.
Both models followed all agent instructions completely, and the workflow was identical across models. Not a single run skipped a step or ignored instructions: across all 18 runs, all 7 workflow steps were executed every time.
| Dimension | Opus 4.6 | Sonnet 4.6 |
|---|---|---|
| Evaluation depth | Deep, multi-source, often finds edge cases | Solid, same sources, occasionally narrower |
| Projects per run | 9.0 (wide net, more secondary findings) | 6.3 (more selective, higher confidence) |
| Validation rigor | Lists more, validates broadly | Strict rejection, fewer but more reliable |
| Report depth | Longer reports, more strategic commentary | Concise, actionable |
| Source citation | Usually cites | More consistent with URLs |
| API persistence | Always complete | Always complete |
The key insight: Opus casts a wider net, Sonnet is more selective. Opus finds more, including secondary and tangential results. Sonnet finds less, but with a higher signal-to-noise ratio.
Which approach is better depends on your use case. For exploratory research where you want as many starting points as possible, Opus has an edge. For focused evaluations where precision matters more than breadth, Sonnet is at least equivalent.
For our use case -- technology evaluations for a technology radar -- Sonnet's selectivity is actually an advantage. We want the 5-7 most relevant findings, not 10 with noise in between.
If you've worked with Kubernetes, you know the pattern: you set resource requests to "generous" because you don't want to worry about it. Every pod gets 2 CPU cores and 4GB RAM, even though most run fine on 0.5 cores and 512MB. Then you look at the cloud bill and wonder what happened.
Model routing for AI agents is exactly the same pattern. You have a large model (Opus = the oversized pod) and an efficient model (Sonnet = the right-sized pod). The question isn't "What's the best model?" but "What's the right model for this specific agent?"
This isn't a new concept; the published research on model routing and cascades is solid.
What we did is the simplest variant: static routing. No ML classifier, no routing framework, no additional infrastructure. One decision, made once, encoded in one line of YAML. Deliberately boring -- and deliberately effective.
Here's something interesting: Anthropic practices this internally. Claude Code's own Explore agent runs on Haiku by default -- the smallest model in the Claude family. If Anthropic uses model tiering for its own agents, that should be a strong signal.
After the benchmark, we explicitly assigned models to all 9 agents in our architecture:
| Model | Agents | Tasks |
|---|---|---|
| Sonnet | 7 | Research, knowledge base queries, content creation, report drafting, channel routing, report generation, evaluations |
| Haiku | 2 | Data refresh (simple lookups), transcript extraction |
The logic:
Sonnet for everything that needs structured reasoning but no novel-reasoning breakthrough. That covers the majority of agentic work: following instructions, coordinating tools, collecting and structuring data, writing reports. Exactly the tasks where Sonnet 4.6 matches or beats Opus in benchmarks.
Haiku for everything that's essentially transformation: moving data from A to B, converting one format to another, extracting a transcript. Tasks where even the smallest model is reliable because the cognitive demand is low.
Opus stays available for the orchestrator -- the main Claude Code process that launches and coordinates the agents. Where strategic decisions are made, where context is complex, and where novel reasoning actually makes a difference.
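For completeness, here's what the Haiku end of that split looks like in frontmatter. The agent name, description, and skills below are placeholders for illustration, not our exact definitions; only the `model:` line is the point.

```yaml
---
# Placeholder agent for illustration; only the model: line is the point.
name: transcript-extractor
description: Extract and clean up transcripts from recorded sessions
tools: [Read, Write, Bash]
model: haiku # pure transformation, lowest cognitive demand
skills: [transcripts]
---
```

The Sonnet agents follow the technology-researcher example from the beginning; the orchestrator is the main Claude Code session rather than an agent file, so it simply keeps the session's default model.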
A thought I can't shake: we've been measuring software delivery performance with DORA metrics for years. Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Recovery -- and in recent DORA reports, Reliability as a fifth metric. The research shows that high-performing teams are better across all dimensions, not just one.
Why aren't we measuring AI agent performance the same way?
The analogy maps almost 1:1:
| DORA Metric | AI Agent Equivalent |
|---|---|
| Deployment Frequency | Agent invocation frequency |
| Lead Time for Changes | Time per agent task |
| Change Failure Rate | Agent error rate / compliance rate |
| Mean Time to Recovery | Recovery after failed runs |
| Reliability | Consistency of output quality |
Our benchmark essentially did exactly that: systematically compared throughput (tokens, duration), error behavior (outlier analysis), and output quality (evaluation depth, compliance). The DORA 2025 insight that "AI amplifies team dysfunction as often as capability" fits perfectly: a poorly configured multi-agent system doesn't get better with a more powerful model -- it gets more expensive.
I'll be honest: I should have run this benchmark sooner. We ran every agent on Opus for weeks because it was the default and because the results were good. But "good" doesn't mean "optimal." And "default" doesn't mean "correctly configured."
What surprised me wasn't the cost savings (those were expected) but how little quality we lost. Or more precisely: how different the quality is. Sonnet doesn't produce "worse" results -- it produces "different" ones. More selective, more concise, better signal-to-noise. For most of our use cases, that's an upgrade, not a downgrade.
What gives me pause: more and more teams are running multiple model families in parallel. But very few make deliberate decisions about which model handles which task. That's like giving every Kubernetes pod the same resource limits because you're too busy to measure actual usage. It works -- until the bill arrives.
Multi-agent architectures are becoming the standard in enterprise AI. And all those deployments will reach this inflection point. The question isn't whether, it's when. Teams that start measuring and optimizing early will have a structural advantage -- not just on costs, but on the quality of their outputs.
The market is moving fast. But according to industry surveys, roughly half of deployed agents operate in silos, without coordinated management. Model routing is a first, simple step out of that silo: when you start thinking deliberately about model assignment, you automatically start thinking about agent architecture. And that's the real win.
These results apply primarily to our Technology Researcher (tool-heavy, web research, knowledge base writes) and our specific prompt and cache mix. Other agents -- more output-heavy, more planning, less tool usage -- may show different cost and quality profiles. Our sample size (n=9 per group) is at the lower end of statistical reliability; larger n would narrow confidence intervals.
Three concrete recommendations:
1. Measure before you decide. No benchmark, no decision. We needed 18 full runs to get reliable data; fewer than 10 runs per variant won't give you statistically meaningful results, and even our 9 per group are on the low end. Track at minimum: total tokens, API calls, duration, and a meaningful quality metric for your use case.
2. Start with static routing. Dynamic routing and cascade architectures sound appealing, but they add complexity that's unnecessary in most cases. Static routing -- assigning a model per agent based on its task profile -- is the 80/20 solution. Zero overhead, zero infrastructure, one line of config. The rule of thumb that worked for us is the split above: Sonnet for structured agentic work, Haiku for pure transformation, Opus only where novel reasoning earns its price.
3. Review model assignments regularly. Models improve. Sonnet 4.6 isn't Sonnet 4.0. What needs Opus today might run on Sonnet in six months, and what needs Sonnet today might run on Haiku soon. A quarterly review of model assignments belongs in your team routine, just like reviewing Kubernetes resource requests.
If you're using Claude Code with agents: open the definition file of the agent that consumes the most tokens, add `model: sonnet`, run 10 iterations, and compare tokens, duration, and output quality. Done.
If the quality holds, you just cut your costs in half. If it doesn't, you now have valuable data about exactly where Opus makes the difference -- and can invest deliberately instead of blanket-spending.
That's the AI equivalent of right-sizing in Kubernetes: not the biggest model for every task, but the right model for the right task. One line of YAML, ~57% lower costs per agent run -- with identical workflow compliance and a deliberate trade-off in research breadth.
Not bad for one line of config.