57% Cost Cut: Model Routing for Multi-Agent Systems



One line of YAML. That was it.

```yaml
model: sonnet
```

This single line in an agent's frontmatter reduced our per-run costs for one of the most-used agents in our multi-agent architecture by ~57%. Not by removing features. Not by accepting worse workflow compliance. By recognizing that not every agent needs the most powerful model. (With a trade-off in research breadth that we break down below.)

I'm writing this because I made the mistake that probably every team working with AI agents makes: running everything on the best available model because you can, and never bothering to measure whether you need to.

We measured. Here are the numbers.

The Context: 9 Agents, One Orchestra

At Infralovers, we run a multi-agent architecture on Claude Code. Nine specialized agents covering different tasks -- from technology research and content creation to knowledge base queries and channel routing. Each agent has a clear responsibility, its own tool permissions, and a defined skill set. It's the microservices pattern, applied to AI agents.

The system runs on Claude Code -- Anthropic's agentic coding tool that runs as a CLI in the terminal. What makes it elegant: agents are defined as YAML frontmatter in Markdown files. Name, description, allowed tools, loaded skills -- and in current Claude Code versions, the model. One line in the frontmatter determines which Claude model the agent uses:

```yaml
---
name: technology-researcher
description: Research open source projects and technology vendors
tools: [Read, Write, Bash, Glob, Grep, WebSearch, WebFetch]
model: sonnet  # <-- this one line
skills: [research, knowledge-base]
---
```

This isn't a feature that gets much attention. It's in the docs, it works, but most teams don't use it. The default depends on your account type (e.g., Pro → Sonnet 4.6, Max → Opus 4.6) and can automatically fall back to Sonnet when you hit usage limits. And why would you use less when you can have more?

Because it costs money. And because "more" doesn't always mean "better."

Why Sonnet 4.6 Was Worth Testing

On February 17, 2026, Anthropic released Sonnet 4.6. What caught my attention: the gap between Sonnet and Opus has never been this small.

Some benchmarks for context:

  • MCP-Atlas (Tool Use): Sonnet 4.6 61.3% (max effort) vs. Opus 4.6 59.5% -- Sonnet leads (historical top score: Opus 4.5 at 62.3%)
  • SWE-bench Verified (Coding): 79.6% vs. 80.8% -- marginal Opus advantage
  • Finance Agent / Vals AI: Sonnet beats Opus (63.3% vs. 60.1%)
  • GDPval-AA (Knowledge Work): Sonnet beats Opus (1633 vs. 1606 Elo) -- difference within 95% confidence interval
  • ARC-AGI-2 (Novel Reasoning): Opus 68.8% vs. Sonnet 58.3% -- this is where the real difference shows

(Numbers from Anthropic System Cards and GDPval from Artificial Analysis.)

This is remarkable. For the first time, a mid-tier model beats the flagship in multiple benchmarks. And not in trivial tasks -- in agentic work. Exactly what our agents do.

The question wasn't "Is Sonnet good enough?" but "For which agents is Opus actually necessary?"

The Experiment: 18 Runs, Real Data

I'll be honest: reading benchmarks is nice, but I only trust numbers from my own system. So we set up a clean A/B test.

Setup

  • Agent: Technology Researcher -- our agent for technology evaluations. Researches open-source projects and technology vendors, analyzes tech stacks, assesses maturity and community health, persists structured findings to our knowledge base via API calls, and generates evaluation reports.
  • Baseline (Group A): Technology Researcher on default (in our setup = Opus 4.6) -- no model: field set
  • Treatment (Group B): Technology Researcher explicitly set to model: sonnet
  • Scale: 3 rounds per treatment, 3 parallel agents per round, 9 evaluations per group
  • Conditions: Full realistic runs with all integrations active -- WebFetch, WebSearch, API persistence, report generation

No synthetic benchmarks. No simplified tasks. The agent did exactly what it always does: search websites, evaluate sources, persist structured data, write reports. Each evaluation is a separate run with its own task input; the 3 parallel agents serve only to reduce wall-clock time, not as an independent treatment variable.
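
For transparency on how group averages like the ones below come together, here is a minimal sketch of aggregating per-run metrics with and without an outlier cut. The record fields, numbers, and duration threshold are illustrative placeholders, not our actual logging format:

```python
from statistics import mean

# Illustrative per-run records -- fields and numbers are placeholders, not our real logs.
runs = [
    {"group": "opus",   "total_tokens": 7_400_000, "api_calls": 118, "duration_s": 450,  "projects": 9},
    {"group": "sonnet", "total_tokens": 4_900_000, "api_calls": 91,  "duration_s": 410,  "projects": 6},
    {"group": "sonnet", "total_tokens": 9_400_000, "api_calls": 160, "duration_s": 1018, "projects": 6},  # infra outlier
    # ...one record per evaluation run
]

def summarize(records, metric, outlier_cut_s=1000):
    """Per-group averages for one metric, with and without long-running outlier runs."""
    summary = {}
    for group in sorted({r["group"] for r in records}):
        grp = [r for r in records if r["group"] == group]
        clean = [r for r in grp if r["duration_s"] < outlier_cut_s]
        summary[group] = {
            "all": mean(r[metric] for r in grp),
            "excl_outliers": mean(r[metric] for r in clean) if clean else None,
        }
    return summary

for metric in ("total_tokens", "api_calls", "duration_s", "projects"):
    print(metric, summarize(runs, metric))
```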

The Raw Data

Opus 4.6 (n=9)

| Metric | Value |
| --- | --- |
| Avg Total Tokens | 7,501,012 |
| Avg API Calls | 120.7 |
| Avg Tool Uses | 94.3 |
| Avg Duration | 462 s |
| Avg Projects Evaluated | 9.0 |

Sonnet 4.6 (n=9, excl. outlier n=8)

| Metric | All (n=9) | Excl. Outlier (n=8) |
| --- | --- | --- |
| Avg Total Tokens | 5,351,844 | 4,843,462 |
| Avg API Calls | 96.0 | 90.0 |
| Avg Tool Uses | 77.7 | 72.3 |
| Avg Duration | 476 s | 408 s |
| Avg Projects Evaluated | 6.3 | 6.3 |

One Sonnet run hit infrastructure problems -- Bash permission errors led to retries that inflated that single run to 9.4 million tokens and 1,018 seconds. That's an infrastructure artifact, not a model problem. We therefore report both views (n=9 and n=8) and use the cleaned n=8 figures for the deltas, since the outlier is clearly attributable to infrastructure retries.

What the Numbers Say

The deltas (cleaned, excluding outlier):

| Metric | Delta | Delta % |
| --- | --- | --- |
| Total Tokens | -2,657,550 | -35.4% |
| API Calls | -30.7 | -25.4% |
| Tool Uses | -22.1 | -23.4% |
| Duration | -53 s | -11.6% |
| Projects Evaluated | -2.8 | -30.5% |

Sonnet uses a third fewer tokens, a quarter fewer API calls, and is 12% faster. But -- and this matters -- it also finds less: 6.3 evaluated projects per run instead of 9.0.

The Cost Math: Why 57%

By "costs" I mean LLM token costs exclusively (input, output, cache read, cache create) -- not infrastructure or tool provider costs.

The raw token savings of 35% are only half the story. The other half is the pricing difference.

| Component | Opus 4.6 | Sonnet 4.6 | Factor |
| --- | --- | --- | --- |
| Input Tokens | $5.00/MTok | $3.00/MTok | 0.60x |
| Output Tokens | $25.00/MTok | $15.00/MTok | 0.60x |
| Cache Read (5 min) | $0.50/MTok | $0.30/MTok | 0.60x |
| Cache Create | $6.25/MTok | $3.75/MTok | 0.60x |

Sonnet costs a uniform 60% of the Opus price across all token types. That's unusually clean -- normally the ratios vary by token type. (Cache prices follow a fixed formula: Cache Read = Input price x 0.1; Cache Create = Input price x 1.25.)

Now the compound math: per our logs, roughly 90% of an agent session's tokens are cache reads, because the agent reloads its system prompt and conversation context with every API call. The relevant price is therefore primarily the cache-read price.

Sonnet factor: 0.60 (price per token) x 0.65 (volume, 35% fewer tokens) = 0.39

Every Sonnet run costs 39% of what an Opus run costs. Or put differently: 61% cost reduction per agent run.

In practice, it's conservatively around 57%, because the token composition (input vs. output vs. cache) isn't exactly 90/10 and the outlier slightly shifts the averages. (Throughout this post, I use ~57% as the conservative real-world figure; the model calculation yields 61% under our cache assumptions.)
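
To make that model calculation concrete, here is a small worked example. The prices are the published per-MTok rates from the table above; the per-type token split is an assumption that matches our average totals (7.5M vs. 4.84M tokens per run, roughly 90% cache reads):

```python
# Published per-MTok prices (see the pricing table above).
PRICES = {
    "opus":   {"input": 5.00, "output": 25.00, "cache_read": 0.50, "cache_create": 6.25},
    "sonnet": {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_create": 3.75},
}

# Assumed per-run token volumes in MTok -- the totals match our benchmark averages,
# the split by type is an estimate (roughly 90% cache reads).
VOLUMES = {
    "opus":   {"input": 0.15, "output": 0.15, "cache_read": 6.80, "cache_create": 0.40},  # 7.50 MTok
    "sonnet": {"input": 0.12, "output": 0.10, "cache_read": 4.35, "cache_create": 0.27},  # 4.84 MTok
}

def run_cost(model):
    """LLM token cost of one agent run in USD (input + output + cache read + cache create)."""
    return sum(VOLUMES[model][kind] * PRICES[model][kind] for kind in PRICES[model])

factor = run_cost("sonnet") / run_cost("opus")
print(f"Opus run: ${run_cost('opus'):.2f}, Sonnet run: ${run_cost('sonnet'):.2f}")
print(f"Sonnet factor: {factor:.2f} ({1 - factor:.0%} cheaper per run)")
```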

Important: this is a cost reduction per run, not automatically per evaluated project. Since Sonnet covers fewer projects per run on average, actual "cost per project" depends on whether you count breadth as a quality criterion.

Rule of thumb: if your goal is "maximum coverage," evaluate cost per project as well. If your goal is "top findings with low noise," cost per run plus compliance is often sufficient as your primary metric.
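
If you do want the per-project view, a rough calculation from the same benchmark numbers (treating every evaluated project as equally valuable, which the trade-off above argues against) looks like this:

```python
PROJECTS_PER_RUN = {"opus": 9.0, "sonnet": 6.3}  # benchmark averages
RUN_COST_FACTOR = 0.43                           # Sonnet run at ~43% of Opus cost (~57% cheaper)

coverage_ratio = PROJECTS_PER_RUN["sonnet"] / PROJECTS_PER_RUN["opus"]
per_project_factor = RUN_COST_FACTOR / coverage_ratio
print(f"Sonnet cost per evaluated project: {per_project_factor:.2f}x Opus "
      f"({1 - per_project_factor:.0%} cheaper per project)")
```

Even on the strictest per-project accounting, Sonnet stays meaningfully cheaper in our setup -- just not by 57%.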

For a team running dozens of agent runs daily, that's not a rounding error. That's a structural cost advantage.

The Quality Question: Less Isn't Worse

"57% cheaper sounds great, but what do you lose?" -- the only question that matters.

Both models followed all agent instructions completely. The workflow was identical:

  1. Resolve target in the knowledge base
  2. Run enrichment check
  3. Research websites via WebFetch
  4. Find relevant sources via WebSearch
  5. Persist structured data via API calls
  6. Generate evaluation report
  7. Open documentation pages for manual review

Not a single run skipped a step or ignored an instruction: across all 18 runs, all 7 workflow steps were executed every time.

Where the Models Differ (Qualitative Observations from 18 Runs)

| Dimension | Opus 4.6 | Sonnet 4.6 |
| --- | --- | --- |
| Evaluation depth | Deep, multi-source, often finds edge cases | Solid, same sources, occasionally narrower |
| Projects per run | 9.0 (wide net, more secondary findings) | 6.3 (more selective, higher confidence) |
| Validation rigor | Lists more, validates broadly | Strict rejection, fewer but more reliable |
| Report depth | Longer reports, more strategic commentary | Concise, actionable |
| Source citation | Usually cites | More consistent, with URLs |
| API persistence | Always complete | Always complete |

The key insight: Opus casts a wider net, Sonnet is more selective. Opus finds more, including secondary and tangential results. Sonnet finds less, but with a higher signal-to-noise ratio.

Which approach is better depends on your use case. For exploratory research where you want as many starting points as possible, Opus has an edge. For focused evaluations where precision matters more than breadth, Sonnet is at least equivalent.

For our use case -- technology evaluations for a technology radar -- Sonnet's selectivity is actually an advantage. We want the 5-7 most relevant findings, not 10 with noise in between.

The Kubernetes Analogy: Right-Sizing for AI Agents

If you've worked with Kubernetes, you know the pattern: you set resource requests to "generous" because you don't want to worry about it. Every pod gets 2 CPU cores and 4GB RAM, even though most run fine on 0.5 cores and 512MB. Then you look at the cloud bill and wonder what happened.

Model routing for AI agents is exactly the same pattern. You have a large model (Opus = the oversized pod) and an efficient model (Sonnet = the right-sized pod). The question isn't "What's the best model?" but "What's the right model for this specific agent?"

This isn't a new concept. The research is solid:

  • FrugalGPT (Stanford) showed that intelligent model routing can deliver 50-98% cost reduction, depending on the task
  • Static Routing (what we did): models are assigned at design time. No overhead, no classifier, no routing layer. One line of YAML.
  • Dynamic Routing: a classifier decides per query which model responds. More complex, but more adaptive.
  • Cascade: starts cheap, escalates on low confidence. Smart, but requires confidence metrics (see the sketch below).
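
For contrast, here is a minimal sketch of the cascade pattern -- the variant we deliberately did not build. `call_model` and the confidence score are hypothetical placeholders, not Claude Code or Anthropic APIs:

```python
from typing import Callable, Tuple

TIERS = ["haiku", "sonnet", "opus"]  # cheapest first
THRESHOLD = 0.8                      # minimum self-assessed confidence before we stop escalating

def cascade(task: str, call_model: Callable[[str, str], Tuple[str, float]]) -> str:
    """Try cheap models first and escalate while confidence stays below the threshold.
    `call_model(model, task)` is a placeholder you would wire to your own client and scorer."""
    answer = ""
    for model in TIERS:
        answer, confidence = call_model(model, task)
        if confidence >= THRESHOLD:
            break
    return answer

# Toy usage with a fake client that only "trusts" the largest model:
print(cascade("evaluate project X", lambda m, t: (f"{m} answer", 0.9 if m == "opus" else 0.5)))
```

Every line of that is routing infrastructure you have to own and operate.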

What we did is the simplest variant: static routing. No ML classifier, no routing framework, no additional infrastructure. One decision, made once, encoded in one line of YAML. Deliberately boring -- and deliberately effective.

Here's something interesting: Anthropic practices this internally. Claude Code's own Explore agent runs on Haiku by default -- the smallest model in the Claude family. If Anthropic uses model tiering for its own agents, that should be a strong signal.

Our Model Assignment: The Result

After the benchmark, we explicitly assigned models to all 9 agents in our architecture:

| Model | Agents | Tasks |
| --- | --- | --- |
| Sonnet | 7 | Research, knowledge base queries, content creation, report drafting, channel routing, report generation, evaluations |
| Haiku | 2 | Data refresh (simple lookups), transcript extraction |

The logic:

Sonnet for everything that needs structured reasoning but no novel-reasoning breakthrough. That covers the majority of agentic work: following instructions, coordinating tools, collecting and structuring data, writing reports. Exactly the tasks where Sonnet 4.6 matches or beats Opus in benchmarks.

Haiku for everything that's essentially transformation: moving data from A to B, converting one format to another, extracting a transcript. Tasks where even the smallest model is reliable because the cognitive demand is low.

Opus stays available for the orchestrator -- the main Claude Code process that launches and coordinates the agents. Where strategic decisions are made, where context is complex, and where novel reasoning actually makes a difference.
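
If you want to audit where you stand before reassigning anything, a small script like this lists which agent definitions pin a model and which fall back to the default. The `.claude/agents` directory is an assumption -- point it at wherever your agent Markdown files live:

```python
import re
from pathlib import Path

AGENT_DIR = Path(".claude/agents")  # assumption: adjust to where your agent .md files live

def declared_model(md_file: Path) -> str | None:
    """Return the value of a `model:` line in the YAML frontmatter, or None if absent."""
    text = md_file.read_text(encoding="utf-8")
    frontmatter = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not frontmatter:
        return None
    for line in frontmatter.group(1).splitlines():
        if line.strip().startswith("model:"):
            return line.split(":", 1)[1].split("#")[0].strip()
    return None

for md_file in sorted(AGENT_DIR.glob("*.md")):
    print(f"{md_file.stem:30} {declared_model(md_file) or '(default)'}")
```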

The DORA Perspective: Measuring AI Agent Performance

A thought I can't shake: we've been measuring software delivery performance with DORA metrics for years. Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Recovery -- and in recent DORA reports, Reliability as a fifth metric. The research shows that high-performing teams are better across all dimensions, not just one.

Why aren't we measuring AI agent performance the same way?

The analogy maps almost 1:1:

| DORA Metric | AI Agent Equivalent |
| --- | --- |
| Deployment Frequency | Agent invocation frequency |
| Lead Time for Changes | Time per agent task |
| Change Failure Rate | Agent error rate / compliance rate |
| Mean Time to Recovery | Recovery after failed runs |
| Reliability | Consistency of output quality |

Our benchmark essentially did exactly that: systematically compared throughput (tokens, duration), error behavior (outlier analysis), and output quality (evaluation depth, compliance). The DORA 2025 insight that "AI amplifies team dysfunction as often as capability" fits perfectly: a poorly configured multi-agent system doesn't get better with a more powerful model -- it gets more expensive.
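
If you want to operationalize that mapping, the tracking doesn't need to be fancy: one record per agent run with a handful of fields covers the table above. The field names below are ours, not any standard:

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    """DORA-inspired metrics for one agent run; field names are illustrative, not a standard."""
    agent: str
    model: str
    duration_s: float      # ~ Lead Time for Changes
    total_tokens: int
    compliant: bool        # all workflow steps executed -> feeds Change Failure Rate
    retried: bool          # run had to be repeated      -> feeds Time to Recovery
    quality_score: float   # your own rubric             -> feeds Reliability

def change_failure_rate(runs: list[AgentRun]) -> float:
    return sum(not r.compliant for r in runs) / len(runs)
```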

My Take

I'll be honest: I should have run this benchmark sooner. We ran every agent on Opus for weeks because it was the default and because the results were good. But "good" doesn't mean "optimal." And "default" doesn't mean "correctly configured."

What surprised me wasn't the cost savings (those were expected) but how little quality we lost. Or more precisely: how different the quality is. Sonnet doesn't produce "worse" results -- it produces "different" ones. More selective, more concise, better signal-to-noise. For most of our use cases, that's an upgrade, not a downgrade.

What gives me pause: more and more teams are running multiple model families in parallel. But very few make deliberate decisions about which model handles which task. That's like giving every Kubernetes pod the same resource limits because you're too busy to measure actual usage. It works -- until the bill arrives.

Multi-agent architectures are becoming the standard in enterprise AI. And all those deployments will reach this inflection point. The question isn't whether, it's when. Teams that start measuring and optimizing early will have a structural advantage -- not just on costs, but on the quality of their outputs.

The market is moving fast. But according to industry surveys, roughly half of deployed agents operate in silos, without coordinated management. Model routing is a first, simple step out of that silo: when you start thinking deliberately about model assignment, you automatically start thinking about agent architecture. And that's the real win.

Limitations

These results apply primarily to our Technology Researcher (tool-heavy, web research, knowledge base writes) and our specific prompt and cache mix. Other agents -- more output-heavy, more planning, less tool usage -- may show different cost and quality profiles. Our sample size (n=9 per group) is at the lower end of statistical reliability; larger n would narrow confidence intervals.

What This Means for Engineering Teams

Three concrete recommendations:

1. Measure Before You Optimize

No benchmark, no decision. We needed 18 full runs to get reliable data, and even our 9 per group are at the low end of what's statistically meaningful -- plan for at least 10 runs per variant. Track at minimum: total tokens, API calls, duration, and a meaningful quality metric for your use case.

2. Start with Static Routing

Dynamic routing and cascade architectures sound appealing, but they add complexity that's unnecessary in most cases. Static routing -- assigning a model per agent based on the task profile -- is the 80/20 solution. Zero overhead, zero infrastructure, one line of config.

The rule of thumb that worked for us:

  • Haiku: Transformation, extraction, simple lookups
  • Sonnet: Structured reasoning, tool coordination, report generation
  • Opus: Novel reasoning, strategic decisions, complex multi-step planning

3. Review Quarterly

Models improve. Sonnet 4.6 isn't Sonnet 4.0. What needs Opus today might run on Sonnet in six months. And what needs Sonnet today might run on Haiku soon. A quarterly review of model assignments belongs in your team routine, just like reviewing Kubernetes resource requests.

The First Step

If you're using Claude Code with agents: open the agent file of the agent that consumes the most tokens. Add model: sonnet. Run 10 iterations. Compare tokens, duration, and output quality. Done.

If the quality holds, you just cut your costs in half. If it doesn't, you now have valuable data about exactly where Opus makes the difference -- and can invest deliberately instead of blanket-spending.

That's the AI equivalent of right-sizing in Kubernetes: not the biggest model for every task, but the right model for the right task. One line of YAML, ~57% lower costs per agent run -- with identical workflow compliance and a deliberate trade-off in research breadth.

Not bad for one line of config.


References

Anthropic Models and Claude Code

Model Routing and Cost Optimization

Multi-Agent Architectures

