57% Cost Cut: Model Routing for Multi-Agent Systems



One line of YAML. That was it.

```yaml
model: sonnet
```

This single line in an agent's frontmatter reduced our per-run costs for one of the most-used agents in our multi-agent architecture by ~57%. Not by removing features. Not by accepting worse workflow compliance. By recognizing that not every agent needs the most powerful model. (With a trade-off in research breadth that we break down below.)

I'm writing this because I made the mistake that probably every team working with AI agents makes: running everything on the best available model because you can, and never bothering to measure whether you need to.

We measured. Here are the numbers.

The Context: 9 Agents, One Orchestra

At Infralovers, we run a multi-agent architecture on Claude Code. Nine specialized agents covering different tasks -- from technology research and content creation to knowledge base queries and channel routing. Each agent has a clear responsibility, its own tool permissions, and a defined skill set. It's the microservices pattern, applied to AI agents.

The system runs on Claude Code -- Anthropic's agentic coding tool that runs as a CLI in the terminal. What makes it elegant: agents are defined as YAML frontmatter in Markdown files. Name, description, allowed tools, loaded skills -- and in current Claude Code versions, the model. One line in the frontmatter determines which Claude model the agent uses:

```yaml
---
name: technology-researcher
description: Research open source projects and technology vendors
tools: [Read, Write, Bash, Glob, Grep, WebSearch, WebFetch]
model: sonnet  # <-- this one line
skills: [research, knowledge-base]
---
```

This isn't a feature that gets much attention. It's in the docs, it works, but most teams don't use it. The default depends on your account type (e.g., Pro → Sonnet 4.6, Max → Opus 4.6) and can automatically fall back to Sonnet when you hit usage limits. And why would you use less when you can have more?

Because it costs money. And because "more" doesn't always mean "better."

Why Sonnet 4.6 Was Worth Testing

On February 17, 2026, Anthropic released Sonnet 4.6. What caught my attention: the gap between Sonnet and Opus has never been this small.

Some benchmarks for context:

  • MCP-Atlas (Tool Use): Sonnet 4.6 61.3% (max effort) vs. Opus 4.6 59.5% -- Sonnet leads (historical top score: Opus 4.5 at 62.3%)
  • SWE-bench Verified (Coding): 79.6% vs. 80.8% -- marginal Opus advantage
  • Finance Agent / Vals AI: Sonnet beats Opus (63.3% vs. 60.1%)
  • GDPval-AA (Knowledge Work): Sonnet beats Opus (1633 vs. 1606 Elo) -- difference within 95% confidence interval
  • ARC-AGI-2 (Novel Reasoning): Opus 68.8% vs. Sonnet 58.3% -- this is where the real difference shows

(Numbers from Anthropic System Cards and GDPval from Artificial Analysis.)

This is remarkable. For the first time, a mid-tier model beats the flagship in multiple benchmarks. And not in trivial tasks -- in agentic work. Exactly what our agents do.

The question wasn't "Is Sonnet good enough?" but "For which agents is Opus actually necessary?"

The Experiment: 18 Runs, Real Data

I'll be honest: reading benchmarks is nice, but I only trust numbers from my own system. So we set up a clean A/B test.

Setup

  • Agent: Technology Researcher -- our agent for technology evaluations. Researches open-source projects and technology vendors, analyzes tech stacks, assesses maturity and community health, persists structured findings to our knowledge base via API calls, and generates evaluation reports.
  • Baseline (Group A): Technology Researcher on default (in our setup = Opus 4.6) -- no model: field set
  • Treatment (Group B): Technology Researcher explicitly set to model: sonnet
  • Scale: 3 rounds per treatment, 3 parallel agents per round, 9 evaluations per group
  • Conditions: Full realistic runs with all integrations active -- WebFetch, WebSearch, API persistence, report generation

No synthetic benchmarks. No simplified tasks. The agent did exactly what it always does: search websites, evaluate sources, persist structured data, write reports. Each evaluation is a separate run with its own task input; the 3 parallel agents serve only to reduce wall-clock time, not as an independent treatment variable.
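
For transparency on how group averages like the ones below come together, here is a minimal sketch of aggregating per-run metrics with and without an outlier cut. The record fields, numbers, and duration threshold are illustrative placeholders, not our actual logging format:

```python
from statistics import mean

# Illustrative per-run records -- fields and numbers are placeholders, not our real logs.
runs = [
    {"group": "opus",   "total_tokens": 7_400_000, "api_calls": 118, "duration_s": 450,  "projects": 9},
    {"group": "sonnet", "total_tokens": 4_900_000, "api_calls": 91,  "duration_s": 410,  "projects": 6},
    {"group": "sonnet", "total_tokens": 9_400_000, "api_calls": 160, "duration_s": 1018, "projects": 6},  # infra outlier
    # ...one record per evaluation run
]

def summarize(records, metric, outlier_cut_s=1000):
    """Per-group averages for one metric, with and without long-running outlier runs."""
    summary = {}
    for group in sorted({r["group"] for r in records}):
        grp = [r for r in records if r["group"] == group]
        clean = [r for r in grp if r["duration_s"] < outlier_cut_s]
        summary[group] = {
            "all": mean(r[metric] for r in grp),
            "excl_outliers": mean(r[metric] for r in clean) if clean else None,
        }
    return summary

for metric in ("total_tokens", "api_calls", "duration_s", "projects"):
    print(metric, summarize(runs, metric))
```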

The Raw Data

Opus 4.6 (n=9)

| Metric | Value |
| --- | --- |
| Avg Total Tokens | 7,501,012 |
| Avg API Calls | 120.7 |
| Avg Tool Uses | 94.3 |
| Avg Duration | 462 s |
| Avg Projects Evaluated | 9.0 |

Sonnet 4.6 (n=9, excl. outlier n=8)

| Metric | All (n=9) | Excl. Outlier (n=8) |
| --- | --- | --- |
| Avg Total Tokens | 5,351,844 | 4,843,462 |
| Avg API Calls | 96.0 | 90.0 |
| Avg Tool Uses | 77.7 | 72.3 |
| Avg Duration | 476 s | 408 s |
| Avg Projects Evaluated | 6.3 | 6.3 |

One Sonnet run hit infrastructure problems -- Bash permission errors led to retries that inflated that single run to 9.4 million tokens and 1,018 seconds. That's an infrastructure artifact, not a model problem. We therefore report both views (n=9 and n=8) and use the cleaned n=8 figures for the deltas, since the outlier is clearly attributable to infrastructure retries.

What the Numbers Say

The deltas (cleaned, excluding outlier):

| Metric | Delta | Delta % |
| --- | --- | --- |
| Total Tokens | -2,657,550 | -35.4% |
| API Calls | -30.7 | -25.4% |
| Tool Uses | -22.1 | -23.4% |
| Duration | -53 s | -11.6% |
| Projects Evaluated | -2.8 | -30.5% |

Sonnet uses a third fewer tokens, a quarter fewer API calls, and is 12% faster. But -- and this matters -- it also finds less: 6.3 evaluated projects per run instead of 9.0.

The Cost Math: Why 57%

By "costs" I mean LLM token costs exclusively (input, output, cache read, cache create) -- not infrastructure or tool provider costs.

The raw token savings of 35% are only half the story. The other half is the pricing difference.

| Component | Opus 4.6 | Sonnet 4.6 | Factor |
| --- | --- | --- | --- |
| Input Tokens | $5.00/MTok | $3.00/MTok | 0.60x |
| Output Tokens | $25.00/MTok | $15.00/MTok | 0.60x |
| Cache Read (5 min) | $0.50/MTok | $0.30/MTok | 0.60x |
| Cache Create | $6.25/MTok | $3.75/MTok | 0.60x |

Sonnet costs a uniform 60% of the Opus price across all token types. That's unusually clean -- normally the ratios vary by token type. (Cache prices follow a fixed formula: Cache Read = Input price x 0.1; Cache Create = Input price x 1.25.)

Now the compound math: per our logs, roughly 90% of an agent session's tokens are cache reads, because the agent reloads its system prompt and conversation context with every API call. The relevant price is therefore primarily the cache-read price.

Sonnet factor: 0.60 (price per token) x 0.65 (volume, 35% fewer tokens) = 0.39

Every Sonnet run costs 39% of what an Opus run costs. Or put differently: 61% cost reduction per agent run.

In practice, it's conservatively around 57%, because the token composition (input vs. output vs. cache) isn't exactly 90/10 and the outlier slightly shifts the averages. (Throughout this post, I use ~57% as the conservative real-world figure; the model calculation yields 61% under our cache assumptions.)
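
To make that model calculation concrete, here is a small worked example. The prices are the published per-MTok rates from the table above; the per-type token split is an assumption that matches our average totals (7.5M vs. 4.84M tokens per run, roughly 90% cache reads):

```python
# Published per-MTok prices (see the pricing table above).
PRICES = {
    "opus":   {"input": 5.00, "output": 25.00, "cache_read": 0.50, "cache_create": 6.25},
    "sonnet": {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_create": 3.75},
}

# Assumed per-run token volumes in MTok -- the totals match our benchmark averages,
# the split by type is an estimate (roughly 90% cache reads).
VOLUMES = {
    "opus":   {"input": 0.15, "output": 0.15, "cache_read": 6.80, "cache_create": 0.40},  # 7.50 MTok
    "sonnet": {"input": 0.12, "output": 0.10, "cache_read": 4.35, "cache_create": 0.27},  # 4.84 MTok
}

def run_cost(model):
    """LLM token cost of one agent run in USD (input + output + cache read + cache create)."""
    return sum(VOLUMES[model][kind] * PRICES[model][kind] for kind in PRICES[model])

factor = run_cost("sonnet") / run_cost("opus")
print(f"Opus run: ${run_cost('opus'):.2f}, Sonnet run: ${run_cost('sonnet'):.2f}")
print(f"Sonnet factor: {factor:.2f} ({1 - factor:.0%} cheaper per run)")
```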

Important: this is a cost reduction per run, not automatically per evaluated project. Since Sonnet covers fewer projects per run on average, actual "cost per project" depends on whether you count breadth as a quality criterion.

Rule of thumb: if your goal is "maximum coverage," evaluate cost per project as well. If your goal is "top findings with low noise," cost per run plus compliance is often sufficient as your primary metric.
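
If you do want the per-project view, a rough calculation from the same benchmark numbers (treating every evaluated project as equally valuable, which the trade-off above argues against) looks like this:

```python
PROJECTS_PER_RUN = {"opus": 9.0, "sonnet": 6.3}  # benchmark averages
RUN_COST_FACTOR = 0.43                           # Sonnet run at ~43% of Opus cost (~57% cheaper)

coverage_ratio = PROJECTS_PER_RUN["sonnet"] / PROJECTS_PER_RUN["opus"]
per_project_factor = RUN_COST_FACTOR / coverage_ratio
print(f"Sonnet cost per evaluated project: {per_project_factor:.2f}x Opus "
      f"({1 - per_project_factor:.0%} cheaper per project)")
```

Even on the strictest per-project accounting, Sonnet stays meaningfully cheaper in our setup -- just not by 57%.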

For a team running dozens of agent runs daily, that's not a rounding error. That's a structural cost advantage.

The Quality Question: Less Isn't Worse

"57% cheaper sounds great, but what do you lose?" -- the only question that matters.

Both models followed all agent instructions completely. The workflow was identical:

  1. Resolve target in the knowledge base
  2. Run enrichment check
  3. Research websites via WebFetch
  4. Find relevant sources via WebSearch
  5. Persist structured data via API calls
  6. Generate evaluation report
  7. Open documentation pages for manual review

Not a single run skipped a step or ignored an instruction: across all 18 runs, all 7 workflow steps were executed every time.

Where the Models Differ (Qualitative Observations from 18 Runs)

| Dimension | Opus 4.6 | Sonnet 4.6 |
| --- | --- | --- |
| Evaluation depth | Deep, multi-source, often finds edge cases | Solid, same sources, occasionally narrower |
| Projects per run | 9.0 (wide net, more secondary findings) | 6.3 (more selective, higher confidence) |
| Validation rigor | Lists more, validates broadly | Strict rejection, fewer but more reliable |
| Report depth | Longer reports, more strategic commentary | Concise, actionable |
| Source citation | Usually cites | More consistent, with URLs |
| API persistence | Always complete | Always complete |

The key insight: Opus casts a wider net, Sonnet is more selective. Opus finds more, including secondary and tangential results. Sonnet finds less, but with a higher signal-to-noise ratio.

Which approach is better depends on your use case. For exploratory research where you want as many starting points as possible, Opus has an edge. For focused evaluations where precision matters more than breadth, Sonnet is at least equivalent.

For our use case -- technology evaluations for a technology radar -- Sonnet's selectivity is actually an advantage. We want the 5-7 most relevant findings, not 10 with noise in between.

The Kubernetes Analogy: Right-Sizing for AI Agents

If you've worked with Kubernetes, you know the pattern: you set resource requests to "generous" because you don't want to worry about it. Every pod gets 2 CPU cores and 4GB RAM, even though most run fine on 0.5 cores and 512MB. Then you look at the cloud bill and wonder what happened.

Model routing for AI agents is exactly the same pattern. You have a large model (Opus = the oversized pod) and an efficient model (Sonnet = the right-sized pod). The question isn't "What's the best model?" but "What's the right model for this specific agent?"

This isn't a new concept. The research is solid:

  • FrugalGPT (Stanford) showed that intelligent model routing can deliver 50-98% cost reduction, depending on the task
  • Static Routing (what we did): models are assigned at design time. No overhead, no classifier, no routing layer. One line of YAML.
  • Dynamic Routing: a classifier decides per query which model responds. More complex, but more adaptive.
  • Cascade: starts cheap, escalates on low confidence. Smart, but requires confidence metrics (see the sketch below).
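
For contrast, here is a minimal sketch of the cascade pattern -- the variant we deliberately did not build. `call_model` and the confidence score are hypothetical placeholders, not Claude Code or Anthropic APIs:

```python
from typing import Callable, Tuple

TIERS = ["haiku", "sonnet", "opus"]  # cheapest first
THRESHOLD = 0.8                      # minimum self-assessed confidence before we stop escalating

def cascade(task: str, call_model: Callable[[str, str], Tuple[str, float]]) -> str:
    """Try cheap models first and escalate while confidence stays below the threshold.
    `call_model(model, task)` is a placeholder you would wire to your own client and scorer."""
    answer = ""
    for model in TIERS:
        answer, confidence = call_model(model, task)
        if confidence >= THRESHOLD:
            break
    return answer

# Toy usage with a fake client that only "trusts" the largest model:
print(cascade("evaluate project X", lambda m, t: (f"{m} answer", 0.9 if m == "opus" else 0.5)))
```

Every line of that is routing infrastructure you have to own and operate.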

What we did is the simplest variant: static routing. No ML classifier, no routing framework, no additional infrastructure. One decision, made once, encoded in one line of YAML. Deliberately boring -- and deliberately effective.

Here's something interesting: Anthropic practices this internally. Claude Code's own Explore agent runs on Haiku by default -- the smallest model in the Claude family. If Anthropic uses model tiering for its own agents, that should be a strong signal.

Our Model Assignment: The Result

After the benchmark, we explicitly assigned models to all 9 agents in our architecture:

| Model | Agents | Tasks |
| --- | --- | --- |
| Sonnet | 7 | Research, knowledge base queries, content creation, report drafting, channel routing, report generation, evaluations |
| Haiku | 2 | Data refresh (simple lookups), transcript extraction |

The logic:

Sonnet for everything that needs structured reasoning but no novel-reasoning breakthrough. That covers the majority of agentic work: following instructions, coordinating tools, collecting and structuring data, writing reports. Exactly the tasks where Sonnet 4.6 matches or beats Opus in benchmarks.

Haiku for everything that's essentially transformation: moving data from A to B, converting one format to another, extracting a transcript. Tasks where even the smallest model is reliable because the cognitive demand is low.

Opus stays available for the orchestrator -- the main Claude Code process that launches and coordinates the agents. Where strategic decisions are made, where context is complex, and where novel reasoning actually makes a difference.
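
If you want to audit where you stand before reassigning anything, a small script like this lists which agent definitions pin a model and which fall back to the default. The `.claude/agents` directory is an assumption -- point it at wherever your agent Markdown files live:

```python
import re
from pathlib import Path

AGENT_DIR = Path(".claude/agents")  # assumption: adjust to where your agent .md files live

def declared_model(md_file: Path) -> str | None:
    """Return the value of a `model:` line in the YAML frontmatter, or None if absent."""
    text = md_file.read_text(encoding="utf-8")
    frontmatter = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not frontmatter:
        return None
    for line in frontmatter.group(1).splitlines():
        if line.strip().startswith("model:"):
            return line.split(":", 1)[1].split("#")[0].strip()
    return None

for md_file in sorted(AGENT_DIR.glob("*.md")):
    print(f"{md_file.stem:30} {declared_model(md_file) or '(default)'}")
```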

The DORA Perspective: Measuring AI Agent Performance

A thought I can't shake: we've been measuring software delivery performance with DORA metrics for years. Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Recovery -- and in recent DORA reports, Reliability as a fifth metric. The research shows that high-performing teams are better across all dimensions, not just one.

Why aren't we measuring AI agent performance the same way?

The analogy maps almost 1:1:

| DORA Metric | AI Agent Equivalent |
| --- | --- |
| Deployment Frequency | Agent invocation frequency |
| Lead Time for Changes | Time per agent task |
| Change Failure Rate | Agent error rate / compliance rate |
| Mean Time to Recovery | Recovery after failed runs |
| Reliability | Consistency of output quality |

Our benchmark essentially did exactly that: systematically compared throughput (tokens, duration), error behavior (outlier analysis), and output quality (evaluation depth, compliance). The DORA 2025 insight that "AI amplifies team dysfunction as often as capability" fits perfectly: a poorly configured multi-agent system doesn't get better with a more powerful model -- it gets more expensive.
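
If you want to operationalize that mapping, the tracking doesn't need to be fancy: one record per agent run with a handful of fields covers the table above. The field names below are ours, not any standard:

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    """DORA-inspired metrics for one agent run; field names are illustrative, not a standard."""
    agent: str
    model: str
    duration_s: float      # ~ Lead Time for Changes
    total_tokens: int
    compliant: bool        # all workflow steps executed -> feeds Change Failure Rate
    retried: bool          # run had to be repeated      -> feeds Time to Recovery
    quality_score: float   # your own rubric             -> feeds Reliability

def change_failure_rate(runs: list[AgentRun]) -> float:
    return sum(not r.compliant for r in runs) / len(runs)
```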

My Take

I'll be honest: I should have run this benchmark sooner. We ran every agent on Opus for weeks because it was the default and because the results were good. But "good" doesn't mean "optimal." And "default" doesn't mean "correctly configured."

What surprised me wasn't the cost savings (those were expected) but how little quality we lost. Or more precisely: how different the quality is. Sonnet doesn't produce "worse" results -- it produces "different" ones. More selective, more concise, better signal-to-noise. For most of our use cases, that's an upgrade, not a downgrade.

What gives me pause: more and more teams are running multiple model families in parallel. But very few make deliberate decisions about which model handles which task. That's like giving every Kubernetes pod the same resource limits because you're too busy to measure actual usage. It works -- until the bill arrives.

Multi-agent architectures are becoming the standard in enterprise AI. And all those deployments will reach this inflection point. The question isn't whether, it's when. Teams that start measuring and optimizing early will have a structural advantage -- not just on costs, but on the quality of their outputs.

The market is moving fast. But according to industry surveys, roughly half of deployed agents operate in silos, without coordinated management. Model routing is a first, simple step out of that silo: when you start thinking deliberately about model assignment, you automatically start thinking about agent architecture. And that's the real win.

Limitations

These results apply primarily to our Technology Researcher (tool-heavy, web research, knowledge base writes) and our specific prompt and cache mix. Other agents -- more output-heavy, more planning, less tool usage -- may show different cost and quality profiles. Our sample size (n=9 per group) is at the lower end of statistical reliability; larger n would narrow confidence intervals.

What This Means for Engineering Teams

Three concrete recommendations:

1. Measure Before You Optimize

No benchmark, no decision. We needed 18 full runs to get reliable data, and even our 9 per group are at the low end of what's statistically meaningful -- plan for at least 10 runs per variant. Track at minimum: total tokens, API calls, duration, and a meaningful quality metric for your use case.

2. Start with Static Routing

Dynamic routing and cascade architectures sound appealing, but they add complexity that's unnecessary in most cases. Static routing -- assigning a model per agent based on the task profile -- is the 80/20 solution. Zero overhead, zero infrastructure, one line of config.

The rule of thumb that worked for us:

  • Haiku: Transformation, extraction, simple lookups
  • Sonnet: Structured reasoning, tool coordination, report generation
  • Opus: Novel reasoning, strategic decisions, complex multi-step planning

3. Review Quarterly

Models improve. Sonnet 4.6 isn't Sonnet 4.0. What needs Opus today might run on Sonnet in six months. And what needs Sonnet today might run on Haiku soon. A quarterly review of model assignments belongs in your team routine, just like reviewing Kubernetes resource requests.

The First Step

If you're using Claude Code with agents: open the agent file of the agent that consumes the most tokens. Add model: sonnet. Run 10 iterations. Compare tokens, duration, and output quality. Done.

If the quality holds, you just cut your costs in half. If it doesn't, you now have valuable data about exactly where Opus makes the difference -- and can invest deliberately instead of blanket-spending.

That's the AI equivalent of right-sizing in Kubernetes: not the biggest model for every task, but the right model for the right task. One line of YAML, ~57% lower costs per agent run -- with identical workflow compliance and a deliberate trade-off in research breadth.

Not bad for one line of config.


References

Anthropic Models and Claude Code

Model Routing and Cost Optimization

Multi-Agent Architectures

