
If you've used Claude Code for more than a day, you know the drill. Every Bash command, every file write outside the working directory, every network call -- "Allow this? Allow once? Allow always?" You start carefully curating your permission rules. This npm install is fine. That docker build is fine. This curl to an API -- probably fine? You're spending mental energy on access control instead of on the problem you're solving.
It works. It's safe. And it's slow.
At some point you realize you're the bottleneck. Claude Code is waiting for you to approve commands faster than you can read them. So you start loosening the rules, allowing broader patterns, maybe even reaching for --dangerously-skip-permissions on your own machine, for your own projects. That's a personal risk calculation, and for solo work it can be fine.
But there's a deeper problem. The permission-prompt model assumes you're watching. It assumes interactive, iterative prompting where you and the agent take turns. That works for "fix this bug" or "refactor this function." It breaks down the moment you want Claude Code to actually run -- spec-driven workflows where you hand off an implementation task and walk away. Background agents building and testing in parallel. Agents that need Docker to spin up test environments, a web browser for verification, MCP servers for external data. The kind of agentic work where the whole point is that you're not approving every step.
That's where I found myself. Moving from interactive Claude Code to agentic workflows for the Infralovers team -- trainers, consultants, people running real customer workloads. And I realized: the answer isn't better permission rules. It's isolation. Real isolation, at the infrastructure level, so the agent can run freely inside a boundary that protects everything outside it.
So I went down the rabbit hole. Weeks of reading, testing, comparing. Here's the progression I landed on.
Let's start with what Anthropic already ships. Native sandboxing in Claude Code -- type /sandbox and it enables OS-level isolation. Under the hood, this relies on Anthropic's open-source sandbox runtime: on macOS it uses sandbox-exec (Apple's Seatbelt framework), on Linux it uses bubblewrap. Both restrict filesystem writes to the current working directory and filter network access through a proxy allowlist.
One thing worth noting: Apple marks sandbox-exec as deprecated in its man page. It still works today, and Anthropic's runtime depends on it, but it's a long-term uncertainty for any tooling built on top of it.
Anthropic reports roughly 84% fewer permission prompts in internal usage after enabling sandboxing.
For the classic iterative workflow -- you prompt, Claude edits, you review, repeat -- the built-in sandbox is a genuine upgrade. Most of the annoying "may I write this file?" prompts disappear. You stay in flow. If all you need is a smart pair programmer that edits code and runs tests, /sandbox might be all you need. Stop reading here, you're done.
Still reading? Then you've probably hit the same wall I did.
The moment you move to agentic workflows, the built-in sandbox hits hard limits:
- Commands the sandbox can't handle -- Docker above all -- have to be routed around it via `excludedCommands`, which undermines the point of having a sandbox.
- On macOS, the whole mechanism rests on the deprecated `sandbox-exec`.
- Fully unattended runs still need either `--dangerously-skip-permissions` or a sandbox that makes prompts unnecessary.

For us at Infralovers, that's most of what we do. Our agents build Docker images, call CRM APIs via MCP, run integration tests in containers. The built-in sandbox covers maybe 40% of our workflows. The interesting 60% needs something else.
The real question became: What's the right isolation level for a team environment where Claude Code needs Docker, MCP, and network access -- and runs without permission prompts?
Three approaches stand out: Docker Sandboxes (microVM, best DX, requires Docker Desktop license), Lima (open source, CNCF, full control, more setup), and Tart (most secure defaults, ideal for CI). Each has different trade-offs -- here are the details.
Once you accept that the agent needs to operate without permission prompts, the security model flips. Instead of "approve each action" you need "contain all actions." The VM boundary becomes your security boundary.
I evaluated everything I could find. Here's what actually works.
Before diving into the individual tools, this diagram shows the key architectural difference between the approaches:
Docker Sandboxes creates a microVM per sandbox with its own private Docker daemon. Lima gives you a single VM where Claude runs inside. Colima wraps Lima to give you a Docker-Desktop-like experience on the host -- it's a different use case.
A common misconception: Docker Sandboxes is not "Claude Code running in a Docker container with Docker-in-Docker." That would be a container running another Docker daemon inside the same kernel -- fragile and insecure. Docker Sandboxes is fundamentally different.
Docker has iterated on Sandboxes over several releases; the current microVM-based architecture requires Docker Desktop 4.58+ and is still marked Experimental. Earlier "legacy" sandbox approaches existed, but the microVM architecture is the real security boundary shift -- and the most polished solution I've found.
Each sandbox runs in a dedicated microVM -- not a container, a full virtual machine with its own Linux kernel and its own Docker daemon. That's the key distinction. When Claude Code runs inside a Docker Sandbox, it can use Docker freely because it has its own isolated Docker environment. It can build images, run docker-compose, spin up test databases -- everything an agentic workflow needs. But it can't reach the host's Docker daemon, other sandboxes, host localhost services, or files outside the synced workspace.
Workspace directories sync bidirectionally between host and sandbox. Network traffic routes through an HTTP/HTTPS proxy where you can set domain allow/deny lists. Claude Code launches with --dangerously-skip-permissions by default -- because the whole point is that the sandbox provides the safety boundary, not the permission prompts.
Sandboxes don't show up in docker ps -- because each sandbox runs its own Docker daemon that the host daemon doesn't know about. They're managed via docker sandbox ls.
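For orientation, here is roughly what the day-to-day CLI looks like. The `ls` subcommand is the one mentioned above; the `run` invocation is my assumption of the launch command and may differ between Docker Desktop versions while the feature is Experimental -- check `docker sandbox --help`.

```bash
# Assumed flow for Docker Sandboxes (Experimental; verify subcommands with --help)
docker sandbox run claude .   # launch Claude Code in a microVM sandbox for this workspace (assumed syntax)
docker sandbox ls             # list sandboxes -- they won't show up in `docker ps`
```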
Matt Pocock called it "the best DX of any local AI coding sandbox". The team at Arcade.dev tested it and said "we forgot we were working inside a sandbox." That tracks with my experience -- the friction is minimal.
The sharp edges:
- It requires Docker Desktop 4.58+, is still marked Experimental, and the Docker Desktop license applies.
- Common build tools like `make` aren't pre-installed, so environment parity needs attention.

For us at Infralovers, the licensing isn't a blocker. We're a small team. But I wanted to understand the alternatives anyway.
If Docker Desktop licensing is a problem, or if you just prefer open-source tooling, Lima (~20k stars as of Feb 2026, CNCF project) is the answer.
But first, a distinction that trips up almost everyone:
- Lima runs a Linux VM; you (or Claude Code) work inside the guest.
- Colima wraps Lima to expose a Docker daemon to the host, so you can run `docker build` from your Mac terminal.

For running Claude Code inside a VM, you don't need Colima. The pattern is simpler: start a Lima VM, install Claude Code inside it, and let it work directly in the VM -- a minimal version is sketched below, and Lima ships an AI agents example that demonstrates exactly this. Colima only becomes relevant if you also want a convenient Docker-on-Mac setup alongside the Lima VM -- a different use case.
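Here is a minimal sketch of that pattern, assuming the stock Ubuntu template and the npm distribution of Claude Code (the instance name is arbitrary; Lima's AI agents example is the more complete reference):

```bash
# Create and enter a Lima VM, then install and run Claude Code entirely inside it.
limactl start --name=claude template://default
limactl shell claude

# Inside the VM:
sudo apt-get update && sudo apt-get install -y nodejs npm
sudo npm install -g @anthropic-ai/claude-code
claude    # everything it touches now lives inside the VM
```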
Lima launches Linux VMs using Apple's Virtualization.framework (vmType: vz, the default since Lima v1.0 on macOS 13.5+). CPU performance is near-native (Apple's Virtualization.framework avoids the overhead layer of traditional emulators). Boot takes around 30 seconds.
One feature worth highlighting: limactl shell --sync (Lima 2.1+). Instead of live-syncing changes to your host filesystem, it stages them. When the agent finishes, you get an "Accept changes?" prompt with a diff -- review, accept, or discard. Think of it as a filesystem-level code review for agent output. This is a genuinely different safety model from Docker Sandboxes' bidirectional sync.
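If your Lima version has that flag, a one-shot agent run with staged sync might look like this sketch (the prompt is illustrative; verify the flag with `limactl shell --help`):

```bash
# Staged sync: the agent's changes are held in the VM until you accept the diff.
limactl shell --sync claude -- claude -p "run the test suite and fix any failures"
```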
Trail of Bits maintains a security-hardened devcontainer configuration with explicit Colima support, recommending vz + virtiofs + rosetta. This is a solid starting point.
But here's what I learned the hard way about Lima's defaults: Lima's default instance template mounts host paths -- often from $HOME -- as read-only. Which paths are mounted depends on the template, but the out-of-the-box experience often means SSH keys, .env files, cloud credentials, git configs -- potentially readable by code running inside the VM. For normal development work, that's a convenience feature. For Claude Code with --dangerously-skip-permissions, it's a security hole.
A hardened configuration for Claude Code should:

- mount no host directories by default -- share only a dedicated, empty workspace,
- keep credentials (SSH keys, cloud configs, `.env` files, git tokens) out of every shared path,
- clone repositories inside the VM instead of bind-mounting them from the host.
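As a Lima template, that might look like the sketch below. Field names follow Lima's template schema, the `images:` list is omitted (copy it from Lima's default template for your architecture), and this is a starting point, not a vetted security profile. Start it with `limactl start ./claude-hardened.yaml`, then clone your repos inside the VM.

```yaml
# claude-hardened.yaml -- hardened Lima template sketch
vmType: "vz"
mounts:
  # No $HOME mount. Share only a dedicated, empty workspace; SSH keys,
  # cloud credentials, and .env files never enter the guest.
  - location: "~/claude-workspace"
    writable: true
```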
The filesystem performance story is the main trade-off here. Virtiofs bind mounts run noticeably slower than native for metadata-heavy operations (see the benchmark section below). The workaround is exactly what the hardened config does: clone repos inside the VM where you get native ext4 speed. If you need host-side file access (editing in VS Code on the host), the hybrid approach -- bind mount for source code, Docker volume for node_modules -- benchmarks at roughly 1.1-1.3x native.
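As a concrete illustration of that hybrid layout (image and paths are placeholders):

```bash
# Source stays editable from the host via bind mount; node_modules lives in a
# named volume on the VM's native filesystem, dodging the virtiofs penalty.
docker volume create app_node_modules
docker run --rm -it \
  -v "$PWD":/app \
  -v app_node_modules:/app/node_modules \
  -w /app \
  node:22 npm install
```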
I'll be honest, I spent a couple of days designing a Colima + Incus architecture. Incus system containers inside a Lima VM. Elegant layering. Defense in depth. I was excited about it.
Then I actually thought it through.
Incus containers share the VM's kernel. They use the same namespace and cgroup primitives as Docker. The Incus layer adds UID remapping and nice operational features like snapshots and cloning, but the meaningful isolation boundary is still the Lima VM itself. As security researchers have pointed out, Incus containers won't add meaningful security beyond Docker since they use the same underlying mechanisms, leaving kernel exploits equally viable.
What about running Incus VMs (not containers) for true hypervisor-level isolation? That requires nested virtualization, which needs M3+ Apple Silicon. M1 and M2 Macs can't do it. That kills team-wide deployment right there -- I'm not going to mandate a hardware upgrade for a sandboxing strategy.
The conclusion: drop the Incus layer and use Docker directly inside the Lima VM. Same security boundary, dramatically less complexity. Nobody in the community is running the Colima + Incus stack for Claude Code, and now I understand why.
I have to mention OrbStack because the performance numbers are ridiculous. 2-second boot (vs. 30s for Lima). 40% less RAM than Docker Desktop. 75-95% of native filesystem performance for operations like pnpm install (12.2s vs. 10.9s native). It achieves this through a custom VirtioFS implementation with dynamic caching.
But OrbStack runs all Linux machines on a shared kernel (similar to WSL2). Isolation is container-level, not VM-level. File sharing is bidirectional and cannot currently be disabled per-machine (open feature request #169). For untrusted code isolation, you'd need containers inside an OrbStack machine, adding a layer. And it's closed-source with commercial licensing.
For trusted development work: OrbStack is phenomenal. For sandboxing an AI agent running --dangerously-skip-permissions: the isolation model isn't strong enough.
Tart from Cirrus Labs deserves mention because its defaults are the most security-conscious of any option. No filesystem mounts by default. A userspace packet filter called Softnet that restricts VM networking with configurable CIDR allow-lists. VMs distributable as OCI images (push/pull from container registries), which makes environment standardization straightforward.
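The basic flow, assuming one of Cirrus Labs' published base images (the image reference and VM name are illustrative):

```bash
# Pull an OCI-distributed VM image and boot it. No mounts, no port forwards by default.
tart clone ghcr.io/cirruslabs/ubuntu:latest claude-runner
tart run claude-runner
```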
The trade-off is convenience: no automatic port forwarding, no automatic file sharing, and a CI-focused design that lacks interactive development ergonomics. For automated Claude Code workflows in CI/CD pipelines, Tart might actually be the best choice. For daily interactive development, it's too much friction.
Beyond the major VM/container options, several focused tools have emerged:
- One uses `sandbox-exec` natively, with Docker as a fallback. Zero-config, one command. Clever naming, useful tool.
- Another restricts reads from `$HOME`, not just writes. This matters for preventing prompt-injection-based data exfiltration. Most sandboxes focus on write restrictions but forget that reading your SSH keys is already game over.

I want to be direct about this because most sandbox comparisons focus on the theoretical VM escape scenario. The attack surface is relatively small (only virtio devices, no legacy emulation), but you should assume hypervisors can have bugs and keep shared surface area minimal. Apple patches sandbox and isolation bypasses in macOS regularly; it would be reckless to treat any isolation layer as unbreakable.
The practical risks are entirely about what gets shared into the sandbox:
Write restrictions are nice; read restrictions are often more important. A read of ~/.ssh plus outbound network access equals instant exfiltration. Most sandboxes focus on preventing writes but forget that reading your SSH keys, cloud credentials, or git tokens is already game over.
Filesystem sharing is the primary attack vector. Lima's default $HOME mount, OrbStack's bidirectional sharing that can't be disabled, any bind mount that includes credential directories. SSH keys, cloud credentials, .env files, git tokens. An agent doesn't need to escape the VM if you've already handed it your secrets.
Network access is the secondary vector. Anthropic's own documentation states it plainly: "Without network isolation, a compromised agent could exfiltrate sensitive files. Without filesystem isolation, a compromised agent could backdoor system resources." Docker Sandboxes handles this with HTTP proxy domain filtering. Tart has Softnet for packet-level filtering. Lima and OrbStack? You need to configure iptables inside the guest yourself.
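For Lima or OrbStack guests, a minimal default-deny egress policy might start like this -- a sketch, not a complete ruleset; the allowed endpoint is just an example, and hostname-based rules resolve to IPs at insert time:

```bash
# Run as root inside the guest: default-deny egress, explicit allow rules on top.
iptables -P OUTPUT DROP
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT                          # DNS
iptables -A OUTPUT -p tcp -d api.anthropic.com --dport 443 -j ACCEPT    # Claude API (example)
```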
Container escapes are real. A critical August 2025 vulnerability illustrates this: CVE-2025-9074 (CVSS 9.3) exposed Docker Desktop's internal Engine API to any container at 192.168.65.7:2375 without authentication. On macOS this meant escape from container to the Docker Desktop VM and control of its daemon -- not direct host access, since the hypervisor layer still sat between the VM and macOS. But it underscores why microVM isolation (Docker Sandboxes, Lima, Tart) provides a fundamentally stronger boundary than container-only approaches: even when a container escape happens, the blast radius stays inside the VM.
Every macOS VM solution hits the same wall: crossing the hypervisor boundary for I/O costs roughly 3x performance on metadata-heavy workloads.
The best independent benchmark I found is from Paolo Mainardi (January 2025), measuring npm install for a React app on an M4 Pro with bind mounts:
| Platform | Time | Notes |
|---|---|---|
| Docker Desktop (VZ + sync, paid feature) | 3.88s | Fastest, but requires paid subscription |
| OrbStack 1.9.2 | 4.22s | Fastest free option |
| Linux Docker 27.3.1 (bare metal ThreadRipper) | 5.29s | Reference: native Linux |
| Docker Desktop (VMM, beta) | 8.47s | |
| Lima 1.0.3 | 8.99s | |
| Docker Desktop (VZ, standard) | 9.53s | Default config, ~3x slower than sync |
OrbStack's own benchmarks claim 75-95% of native performance for pnpm install (12.2s vs. 10.9s native) -- impressive but note the vendor source. The DDEV community benchmarks (Randy Fay, Nov 2023) confirmed OrbStack as the fastest provider with Mutagen disabled.
The implication for Claude Code: clone repos directly inside the VM and push changes via git. When data lives inside the VM, you get native ext4 speed -- the ~3x penalty only hits bind mounts crossing the hypervisor boundary. For workflows where you need host-side file access (editing in your native IDE), OrbStack's optimized mounts or Docker's file sync feature are acceptable trade-offs.
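In practice that workflow is short (repo URL, branch name, and prompt are placeholders):

```bash
# Inside the VM: clone onto native ext4, let the agent work, publish via git.
git clone https://github.com/your-org/your-repo.git && cd your-repo
git switch -c agent/spec-run
claude -p "implement the spec in docs/spec.md" --dangerously-skip-permissions
git push -u origin agent/spec-run   # review the diff from the host, as a PR
```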
On the virtiofs improvement arc: Docker's switch from gRPC-FUSE to virtiofs delivered up to 90% speedups on heavy workloads (confirmed independently by Jeff Geerling at ~3.7x faster). Bind mounts went from ~5-6x slower than native to ~3x -- better, but native parity remains elusive.
After weeks of testing and research, here's how I'd score these options across the dimensions that matter for our use case. The weights reflect our specific priorities: unattended Docker + networked MCP in a small team setting. Your weights will differ -- adjust accordingly.
| Criterion (weight) | Docker Sandboxes | Lima/Colima hardened | OrbStack + Docker | Tart VM | Colima + Incus | DevContainer |
|---|---|---|---|---|---|---|
| Isolation strength (25%) | 9 | 8 | 5 | 9 | 7 | 6 |
| Docker-in-sandbox (20%) | 10 | 9 | 8 | 9 | 8 | 6 |
| Filesystem performance (15%) | 7 | 6/9 | 9 | 6/9 | 6/9 | 7 |
| Setup complexity (15%) | 9 | 5 | 8 | 5 | 3 | 7 |
| Team standardization (10%) | 8 | 7 | 7 | 8 | 5 | 8 |
| Open source / licensing (10%) | 4 | 10 | 4 | 6 | 10 | 9 |
| Community adoption (5%) | 8 | 7 | 7 | 4 | 2 | 8 |
| Weighted total | 8.1 | 7.3 | 6.7 | 7.2 | 5.9 | 6.9 |
Notes on the scores: Incus containers get a 7 for isolation (shared kernel), Incus VMs would get a 9 but require M3+ chips. Docker-in-Docker is possible in DevContainers but requires privileged mode. Filesystem scores split between bind mounts (6) and cloning inside the VM (9).
Docker Sandboxes is the best developer experience. But "best DX" and "what we can recommend to customers" aren't the same thing.
For our own team at Infralovers, Docker Sandboxes is what I'm currently evaluating. The microVM isolation model is exactly right: own Docker daemon, network policy enforcement, workspace syncing, one command to spin up. The licensing works for our team size.
But we're a training and consulting company. We don't just pick tools for ourselves -- we need to understand and recommend solutions that our customers can actually adopt. And the real world out there is complicated.
The licensing barrier is real. Docker Desktop is free only for organizations with fewer than 250 employees AND less than $10M revenue. Many of our enterprise customers -- banking, telco, manufacturing -- blow past both thresholds. At ~$21/user/month for Docker Business, a 250-person engineering org is looking at roughly $63K/year just for the container runtime. That's not a technical argument, it's a procurement conversation. And in regulated industries, procurement conversations can take months.
Customers already have other tools. Walk into a RHEL shop and they're running Podman Desktop, not Docker Desktop. Podman is the default container runtime in RHEL 8+ and comes free with the subscription they're already paying for. It's rootless by default (no daemon running as root), daemonless (smaller attack surface), and ships FIPS 140-2 compliant on RHEL. For organizations with SOC 2, ISO 27001, or NIS2 requirements, those aren't nice-to-haves -- they're checkboxes.
Podman on macOS uses the same Apple Virtualization.framework as Lima, running a Fedora CoreOS VM (default provider: applehv, alternatively libkrun or qemu). The isolation model is comparable. Community projects like claude-podman and claudeman already run Claude Code in Podman, and the textcortex/claude-code-sandbox project auto-detects Podman sockets. It works -- but it's not as polished as Docker Sandboxes, and there's no equivalent to Docker's microVM-per-sandbox architecture yet.
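The Podman version of "VM plus agent container" is assembled by hand, but it's short. Resource sizes, image, and flags below are illustrative; Claude Code reads its key from `ANTHROPIC_API_KEY`.

```bash
# Podman's macOS VM (applehv by default), then a rootless container for the agent.
podman machine init --cpus 4 --memory 8192
podman machine start
podman run --rm -it \
  -e ANTHROPIC_API_KEY \
  -v "$PWD":/workspace:Z \
  -w /workspace \
  node:22 bash -c "npm install -g @anthropic-ai/claude-code && claude"
```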
So where does that leave us?
The honest answer is: there's no single "right" tool. There's the right tool for a specific team's existing stack, licensing constraints, and security requirements. As a consultant, I find that answer unsatisfying. As someone who's been in enough enterprise environments to know better, I find it accurate.
The progression is clearer in hindsight than it was while living it:
1. Permission prompts and curated allow rules -- fine for interactive pair programming, a bottleneck for anything unattended.
2. /sandbox removes most of the friction for iterative coding. Genuine improvement, no infrastructure needed.
3. VM-level isolation for agentic workflows -- Docker, MCP, and network access without permission prompts, contained by the VM boundary.

The critical insight is embarrassingly simple: the VM boundary is the security boundary. Everything else -- Incus containers, Docker namespaces, devcontainer configs, macOS Seatbelt profiles -- provides defense-in-depth, but not a fundamentally different isolation tier.
Apple's Virtualization.framework makes this VM boundary lightweight enough (30-second boot, near-native CPU, 75-95% native filesystem with the right approach) that the old trade-off between isolation and performance has largely dissolved. The remaining friction is filesystem sharing, and cloning inside the VM eliminates that entirely.
Where you land on this progression depends on how you use Claude Code. If it's your pair programmer -- /sandbox and you're done. If it's your autonomous build system that needs Docker, MCP servers, and network access -- you need Level 3 isolation. Which Level 3 tool? That depends on what's already in your stack and who signs off on new software.
I spent too long designing elaborate container-in-VM-in-VM architectures when the answer was much simpler. And then I spent more time learning that "simpler" still means "different for every customer." But that's just how it goes -- sometimes you have to take the complicated road to find the simple one. And then it turns out there are several simple ones.
Has anyone else gone through this progression? I'd especially love to hear from teams running agentic workflows in enterprise environments -- what's your container runtime, and did your sandboxing choice survive first contact with procurement?