Sandboxing Claude Code on macOS: What I Actually Found


If you've used Claude Code for more than a day, you know the drill. Every Bash command, every file write outside the working directory, every network call -- "Allow this? Allow once? Allow always?" You start carefully curating your permission rules. This npm install is fine. That docker build is fine. This curl to an API -- probably fine? You're spending mental energy on access control instead of on the problem you're solving.

It works. It's safe. And it's slow.

At some point you realize you're the bottleneck. Claude Code is waiting for you to approve commands faster than you can read them. So you start loosening the rules, allowing broader patterns, maybe even reaching for --dangerously-skip-permissions on your own machine, for your own projects. That's a personal risk calculation, and for solo work it can be fine.

But there's a deeper problem. The permission-prompt model assumes you're watching. It assumes interactive, iterative prompting where you and the agent take turns. That works for "fix this bug" or "refactor this function." It breaks down the moment you want Claude Code to actually run -- spec-driven workflows where you hand off an implementation task and walk away. Background agents building and testing in parallel. Agents that need Docker to spin up test environments, a web browser for verification, MCP servers for external data. The kind of agentic work where the whole point is that you're not approving every step.

That's where I found myself. Moving from interactive Claude Code to agentic workflows for the Infralovers team -- trainers, consultants, people running real customer workloads. And I realized: the answer isn't better permission rules. It's isolation. Real isolation, at the infrastructure level, so the agent can run freely inside a boundary that protects everything outside it.

So I went down the rabbit hole. Weeks of reading, testing, comparing. Here's the progression I landed on.

Level 1: The built-in sandbox -- and why it's genuinely good

Let's start with what Anthropic already ships. Native sandboxing in Claude Code -- type /sandbox and it enables OS-level isolation. Under the hood, this relies on Anthropic's open-source sandbox runtime: on macOS it uses sandbox-exec (Apple's Seatbelt framework), on Linux it uses bubblewrap. Both restrict filesystem writes to the current working directory and filter network access through a proxy allowlist.
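
To make the mechanism concrete, here is what Seatbelt enforcement looks like when you drive sandbox-exec by hand. This is a toy profile I wrote for illustration -- the profile the Claude Code runtime actually generates is far more detailed and also routes network traffic through its proxy:

```bash
# Toy Seatbelt (SBPL) profile: deny writes everywhere except the current directory.
# Illustration only -- not the profile Claude Code's sandbox runtime generates.
cat > /tmp/demo.sb <<'EOF'
(version 1)
(allow default)
(deny file-write*)
(allow file-write* (subpath (param "CWD")))
EOF

sandbox-exec -D CWD="$PWD" -f /tmp/demo.sb touch ./allowed.txt    # succeeds
sandbox-exec -D CWD="$PWD" -f /tmp/demo.sb touch /tmp/denied.txt  # fails: Operation not permitted
```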

One thing worth noting: Apple marks sandbox-exec as deprecated in its man page. It still works today, and Anthropic's runtime depends on it, but it's a long-term uncertainty for any tooling built on top of it.

Anthropic reports roughly 84% fewer permission prompts in their internal usage after enabling sandboxing. The underlying runtime is open-sourced.

For the classic iterative workflow -- you prompt, Claude edits, you review, repeat -- the built-in sandbox is a genuine upgrade. Most of the annoying "may I write this file?" prompts disappear. You stay in flow. If all you need is a smart pair programmer that edits code and runs tests, /sandbox might be all you need. Stop reading here, you're done.

Still reading? Then you've probably hit the same wall I did.

Level 2: Where the built-in sandbox stops

The moment you move to agentic workflows, the built-in sandbox hits hard limits:

  • Docker workflows force you to punch holes. Docker often requires host interactions and privileged operations that don't play nicely with the sandbox constraints, so you end up excluding Docker commands via excludedCommands -- which undermines the point of having a sandbox (see the sketch after this list).
  • No MCP server access. Agents that query CRM data, call external APIs, or draft emails need network access to services that, in my experience, can't be allowlisted granularly enough through the sandbox.
  • No browser automation. Verification workflows that open a browser, check a deployment, screenshot a result -- severely limited under sandbox-exec.
  • Background agents can't prompt. If you're running agents in the background (the whole point of spec-driven workflows), there's nobody to click "Allow." You need either --dangerously-skip-permissions or a sandbox that makes prompts unnecessary.
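
For the Docker point above, the hole-punching looks roughly like this. I'm assuming the excludedCommands key lives under a sandbox block in .claude/settings.json -- the exact schema varies between Claude Code versions, so treat the key names as placeholders and check the settings reference:

```bash
# Sketch of the escape hatch: every command listed here runs OUTSIDE the sandbox.
# Key names are assumptions based on the documented excludedCommands setting.
cat > .claude/settings.json <<'EOF'
{
  "sandbox": {
    "enabled": true,
    "excludedCommands": ["docker"]
  }
}
EOF
```

Every exclusion you add here is a command Claude can run with no isolation at all -- which is exactly the trade-off that pushed me toward VM-level sandboxing.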

For us at Infralovers, that's most of what we do. Our agents build Docker images, call CRM APIs via MCP, run integration tests in containers. The built-in sandbox covers maybe 40% of our workflows. The interesting 60% needs something else.

The real question became: What's the right isolation level for a team environment where Claude Code needs Docker, MCP, and network access -- and runs without permission prompts?

Level 3: Real isolation -- letting agents run free

Three approaches stand out: Docker Sandboxes (microVM, best DX, requires Docker Desktop license), Lima (open source, CNCF, full control, more setup), and Tart (most secure defaults, ideal for CI). Each has different trade-offs -- here are the details.

Once you accept that the agent needs to operate without permission prompts, the security model flips. Instead of "approve each action" you need "contain all actions." The VM boundary becomes your security boundary.

I evaluated everything I could find. Here's what actually works.

Before diving into the individual tools, this diagram shows the key architectural difference between the approaches:

```mermaid
flowchart LR
    subgraph macOS["macOS Host"]
        DEV["Developer / VS Code"]
    end
    subgraph DockerSandboxes["Docker Sandboxes"]
        S1["microVM Sandbox 1\nprivate Docker daemon"]
        S2["microVM Sandbox 2\nprivate Docker daemon"]
    end
    subgraph Lima["Lima (direct)"]
        VM1["Linux VM\nClaude + Docker inside"]
    end
    subgraph Colima["Colima"]
        COL["CLI wrapper\n'Containers on Lima'"]
        VM2["Lima VM\nDocker socket exposed"]
    end
    DEV -->|"docker sandbox run"| S1
    DEV -->|"docker sandbox run"| S2
    DEV -->|"limactl start"| VM1
    COL -->|starts + configures| VM2
    DEV -->|"colima start"| COL
```

Docker Sandboxes creates a microVM per sandbox with its own private Docker daemon. Lima gives you a single VM where Claude runs inside. Colima wraps Lima to give you a Docker-Desktop-like experience on the host -- it's a different use case.

Docker Sandboxes: the closest thing to "just works"

A common misconception: Docker Sandboxes is not "Claude Code running in a Docker container with Docker-in-Docker." That would be a container running another Docker daemon inside the same kernel -- fragile and insecure. Docker Sandboxes is fundamentally different.

Docker has iterated on Sandboxes over several releases; the current microVM-based architecture requires Docker Desktop 4.58+ and is still marked Experimental. Earlier "legacy" sandbox approaches existed, but the microVM architecture is the real security boundary shift -- and the most polished solution I've found.

Each sandbox runs in a dedicated microVM -- not a container, a full virtual machine with its own Linux kernel and its own Docker daemon. That's the key distinction. When Claude Code runs inside a Docker Sandbox, it can use Docker freely because it has its own isolated Docker environment. It can build images, run docker-compose, spin up test databases -- everything an agentic workflow needs. But it can't reach the host's Docker daemon, other sandboxes, host localhost services, or files outside the synced workspace.

Workspace directories sync bidirectionally between host and sandbox. Network traffic routes through an HTTP/HTTPS proxy where you can set domain allow/deny lists. Claude Code launches with --dangerously-skip-permissions by default -- because the whole point is that the sandbox provides the safety boundary, not the permission prompts.

Sandboxes don't show up in docker ps -- because each sandbox runs its own Docker daemon that the host daemon doesn't know about. They're managed via docker sandbox ls.
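
Day to day it's a couple of commands. These are the docker sandbox subcommands referenced above; the feature is still Experimental, so verify the exact arguments against docker sandbox --help for your Docker Desktop version:

```bash
# Start Claude Code in a fresh microVM sandbox, with the current workspace synced in.
docker sandbox run claude

# Sandboxes have their own lifecycle commands -- they won't show up in `docker ps`.
docker sandbox ls
```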

Matt Pocock called it "the best DX of any local AI coding sandbox". The team at Arcade.dev tested it and said "we forgot we were working inside a sandbox." That tracks with my experience -- the friction is minimal.

The sharp edges:

  • Requires Docker Desktop, which is free only (under current terms) for organizations with fewer than 250 employees AND less than $10M annual revenue -- everyone else needs a paid subscription
  • macOS or Windows only (Linux uses a weaker container-based approach)
  • Not available on OrbStack (confirmed open issue #2295)
  • Missing tools like make aren't pre-installed, so environment parity needs attention
  • Credential management takes thought -- you're explicitly deciding what the sandbox can access
  • Disk footprint: Each microVM brings its own Linux kernel and Docker daemon. Running multiple sandboxes in parallel adds up quickly

For us at Infralovers, the licensing isn't a blocker. We're a small team. But I wanted to understand the alternatives anyway.

The open-source path: Lima and Colima

If Docker Desktop licensing is a problem, or if you just prefer open-source tooling, Lima (~20k stars as of Feb 2026, CNCF project) is the answer.

But first, a distinction that trips up almost everyone:

  • Lima is the base tool. It starts and manages Linux VMs on macOS. Period.
  • Colima ("Containers on Lima") is a wrapper that gives you a Docker-Desktop-like experience -- it creates a Lima VM, installs Docker inside it, and exposes the Docker socket to your host so you can run docker build from your Mac terminal.

For running Claude Code inside a VM, you don't need Colima. The pattern is simpler: start a Lima VM, install Claude Code inside it, and let it work directly in the VM. Lima has an AI agents example that demonstrates exactly this. Colima only becomes relevant if you also want a convenient Docker-on-Mac setup alongside the Lima VM -- a different use case.

Lima launches Linux VMs using Apple's Virtualization.framework (vmType: vz, the default since Lima v1.0 on macOS 13.5+). CPU performance is near-native (Apple's Virtualization.framework avoids the overhead layer of traditional emulators). Boot takes around 30 seconds.

One feature worth highlighting: limactl shell --sync (Lima 2.1+). Instead of live-syncing changes to your host filesystem, it stages them. When the agent finishes, you get an "Accept changes?" prompt with a diff -- review, accept, or discard. Think of it as a filesystem-level code review for agent output. This is a genuinely different safety model from Docker Sandboxes' bidirectional sync.
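
A sketch of how I'd expect that to look, assuming the flag composes with limactl shell's usual INSTANCE [COMMAND] form -- the instance name and prompt are placeholders, and since the flag is new in Lima 2.1, check limactl shell --help before relying on it:

```bash
# Run an agent task in the VM with staged (not live) writeback to the host.
limactl shell --sync claude-sandbox -- claude -p "implement the spec in SPEC.md"
# On exit, Lima shows the staged diff and asks whether to accept or discard it.
```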

Trail of Bits maintains a security-hardened devcontainer configuration with explicit Colima support, recommending vz + virtiofs + rosetta. This is a solid starting point.

But here's what I learned the hard way about Lima's defaults: Lima's default instance template mounts host paths -- often from $HOME -- as read-only. Which paths are mounted depends on the template, but the out-of-the-box experience often means SSH keys, .env files, cloud credentials, git configs -- potentially readable by code running inside the VM. For normal development work, that's a convenience feature. For Claude Code with --dangerously-skip-permissions, it's a security hole.

A hardened configuration for Claude Code should do four things (a minimal sketch follows the list):

  • Remove all home directory mounts entirely
  • Install Docker inside the VM
  • Clone repos directly in the VM (not via bind mounts)
  • Add iptables rules for outbound network filtering
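
A minimal sketch of that setup, assuming Lima's bundled docker template and the --set override flag from recent Lima releases (if your version lacks --set, copy the template and edit the mounts list by hand). Outbound filtering is sketched in the security section further down:

```bash
# Start a VM from Lima's docker template, but drop every host mount --
# no $HOME, no SSH keys, no .env files visible inside the VM.
limactl start template://docker --name=claude-sandbox --set '.mounts = []'

# Work entirely inside the VM: clone over HTTPS (no host SSH keys to borrow),
# install Claude Code there, and let it run on native ext4.
limactl shell claude-sandbox
# inside the VM (repo URL is a placeholder):
#   git clone https://github.com/your-org/your-repo.git
#   npm install -g @anthropic-ai/claude-code
#   cd your-repo && claude
```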

The filesystem performance story is the main trade-off here. Virtiofs bind mounts run noticeably slower than native for metadata-heavy operations (see the benchmark section below). The workaround is exactly what the hardened config does: clone repos inside the VM where you get native ext4 speed. If you need host-side file access (editing in VS Code on the host), the hybrid approach -- bind mount for source code, Docker volume for node_modules -- benchmarks at roughly 1.1-1.3x native.
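
The hybrid pattern is easy to sketch with plain docker run: bind-mount the source tree, but shadow the metadata-heavy directory with a named volume that lives inside the VM. Image and paths here are illustrative:

```bash
# Source code comes from the host (so you can edit it in your IDE), but node_modules
# is a named volume on the VM's own filesystem -- npm's metadata-heavy I/O never
# crosses the hypervisor boundary.
docker volume create demo_node_modules
docker run --rm -it \
  -v "$PWD":/app \
  -v demo_node_modules:/app/node_modules \
  -w /app node:22 npm install
```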

The proposal I had to talk myself out of: Colima + Incus

I'll be honest, I spent a couple of days designing a Colima + Incus architecture. Incus system containers inside a Lima VM. Elegant layering. Defense in depth. I was excited about it.

Then I actually thought it through.

Incus containers share the VM's kernel. They use the same namespace and cgroup primitives as Docker. The Incus layer adds UID remapping and nice operational features like snapshots and cloning, but the meaningful isolation boundary is still the Lima VM itself. As security researchers have pointed out, Incus containers won't add meaningful security beyond Docker since they use the same underlying mechanisms, leaving kernel exploits equally viable.

What about running Incus VMs (not containers) for true hypervisor-level isolation? That requires nested virtualization, which needs M3+ Apple Silicon. M1 and M2 Macs can't do it. That kills team-wide deployment right there -- I'm not going to mandate a hardware upgrade for a sandboxing strategy.

The conclusion: drop the Incus layer and use Docker directly inside the Lima VM. Same security boundary, dramatically less complexity. Nobody in the community is running the Colima + Incus stack for Claude Code, and now I understand why.

OrbStack: the speed demon with a caveat

I have to mention OrbStack because the performance numbers are ridiculous. 2-second boot (vs. 30s for Lima). 40% less RAM than Docker Desktop. 75-95% of native filesystem performance for operations like pnpm install (12.2s vs. 10.9s native). It achieves this through a custom VirtioFS implementation with dynamic caching.

But OrbStack runs all Linux machines on a shared kernel (similar to WSL2). Isolation is container-level, not VM-level. File sharing is bidirectional and cannot currently be disabled per-machine (open feature request #169). For untrusted code isolation, you'd need containers inside an OrbStack machine, adding a layer. And it's closed-source with commercial licensing.

For trusted development work: OrbStack is phenomenal. For sandboxing an AI agent running --dangerously-skip-permissions: the isolation model isn't strong enough.

Tart: the one I keep coming back to for CI

Tart from Cirrus Labs deserves mention because its defaults are the most security-conscious of any option. No filesystem mounts by default. A userspace packet filter called Softnet that restricts VM networking with configurable CIDR allow-lists. VMs distributable as OCI images (push/pull from container registries), which makes environment standardization straightforward.
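
The basic flow looks like this (standard Tart CLI; the image reference and the Softnet flag are worth double-checking against the Tart docs for your version):

```bash
# Pull a base Linux VM image from an OCI registry and give it a local name.
tart clone ghcr.io/cirruslabs/ubuntu:latest claude-ci

# Run it headless; --net-softnet enables the userspace packet filter
# (Softnet is installed separately).
tart run --no-graphics --net-softnet claude-ci &

# No automatic file sharing or port forwarding -- you SSH in.
# Default credentials depend on the image (admin/admin for Cirrus Labs images).
ssh admin@"$(tart ip claude-ci)"
```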

The trade-off is convenience: no automatic port forwarding, no automatic file sharing, and a CI-focused design that lacks interactive development ergonomics. For automated Claude Code workflows in CI/CD pipelines, Tart might actually be the best choice. For daily interactive development, it's too much friction.

The community landscape

Beyond the major VM/container options, several focused tools have emerged:

  • nikvdp/cco ("Claude Condom") -- auto-detects the best available sandbox backend. On macOS it uses sandbox-exec natively, Docker as fallback. Zero-config, one command. Clever naming, useful tool.
  • textcortex/claude-code-sandbox -- web UI with auto-push and multi-container management
  • RchGrav/claudebox -- 15+ language-specific development profiles for Docker-based sandboxing
  • neko-kai/claude-code-sandbox -- uniquely blocks read access to $HOME, not just writes. This matters for preventing prompt-injection-based data exfiltration. Most sandboxes focus on write restrictions but forget that reading your SSH keys is already game over.

Where the real security risks hide

I want to be direct about this, because most sandbox comparisons fixate on the theoretical VM-escape scenario. Virtualization.framework's hypervisor attack surface is relatively small (only virtio devices, no legacy device emulation), but you should still assume hypervisors have bugs and keep the shared surface area minimal. Apple patches sandbox and isolation bypasses in macOS regularly; it would be reckless to treat any isolation layer as unbreakable.

The practical risks are entirely about what gets shared into the sandbox:

Write restrictions are nice; read restrictions are often more important. A read of ~/.ssh plus outbound network access equals instant exfiltration. Most sandboxes focus on preventing writes but forget that reading your SSH keys, cloud credentials, or git tokens is already game over.

Filesystem sharing is the primary attack vector. Lima's default $HOME mount, OrbStack's bidirectional sharing that can't be disabled, any bind mount that includes credential directories. SSH keys, cloud credentials, .env files, git tokens. An agent doesn't need to escape the VM if you've already handed it your secrets.

Network access is the secondary vector. Anthropic's own documentation states it plainly: "Without network isolation, a compromised agent could exfiltrate sensitive files. Without filesystem isolation, a compromised agent could backdoor system resources." Docker Sandboxes handles this with HTTP proxy domain filtering. Tart has Softnet for packet-level filtering. Lima and OrbStack? You need to configure iptables inside the guest yourself.
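
For the do-it-yourself cases, a default-deny egress policy inside the guest is the minimum. A rough sketch, to be run inside the VM rather than on macOS -- note that hostnames in -d rules resolve once at insertion time, so a production setup would pin IPs or put a filtering proxy in front instead:

```bash
# Allowlist first, then flip the default policy -- this order avoids cutting off
# your own shell session halfway through the setup.
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT                        # DNS lookups
iptables -A OUTPUT -p tcp --dport 443 -d api.anthropic.com -j ACCEPT  # Claude API
iptables -A OUTPUT -p tcp --dport 443 -d github.com -j ACCEPT         # git over HTTPS
iptables -P OUTPUT DROP                                               # default-deny everything else
```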

Container escapes are real. A critical August 2025 vulnerability illustrates this: CVE-2025-9074 (CVSS 9.3) exposed Docker Desktop's internal Engine API to any container at 192.168.65.7:2375 without authentication. On macOS this meant escape from container to the Docker Desktop VM and control of its daemon -- not direct host access, since the hypervisor layer still sat between the VM and macOS. But it underscores why microVM isolation (Docker Sandboxes, Lima, Tart) provides a fundamentally stronger boundary than container-only approaches: even when a container escape happens, the blast radius stays inside the VM.

The filesystem performance reality

Every macOS VM solution hits the same wall: crossing the hypervisor boundary for I/O costs roughly 3x performance on metadata-heavy workloads.

The best independent benchmark I found is from Paolo Mainardi (January 2025), measuring npm install for a React app on an M4 Pro with bind mounts:

| Platform | Time | Notes |
|---|---|---|
| Docker Desktop (VZ + sync, paid feature) | 3.88s | Fastest, but requires paid subscription |
| OrbStack 1.9.2 | 4.22s | Fastest free option |
| Linux Docker 27.3.1 (bare metal ThreadRipper) | 5.29s | Reference: native Linux |
| Docker Desktop (VMM, beta) | 8.47s | |
| Lima 1.0.3 | 8.99s | |
| Docker Desktop (VZ, standard) | 9.53s | Default config, ~3x slower than sync |

OrbStack's own benchmarks claim 75-95% of native performance for pnpm install (12.2s vs. 10.9s native) -- impressive but note the vendor source. The DDEV community benchmarks (Randy Fay, Nov 2023) confirmed OrbStack as the fastest provider with Mutagen disabled.

The implication for Claude Code: clone repos directly inside the VM and push changes via git. When data lives inside the VM, you get native ext4 speed -- the ~3x penalty only hits bind mounts crossing the hypervisor boundary. For workflows where you need host-side file access (editing in your native IDE), OrbStack's optimized mounts or Docker's file sync feature are acceptable trade-offs.

On the virtiofs improvement arc: Docker's switch from gRPC-FUSE to virtiofs delivered up to 90% speedups on heavy workloads (confirmed independently by Jeff Geerling at ~3.7x faster). Bind mounts went from ~5-6x slower than native to ~3x -- better, but native parity remains elusive.

The scoring matrix

After weeks of testing and research, here's how I'd score these options across the dimensions that matter for our use case. The weights reflect our specific priorities: unattended Docker + networked MCP in a small team setting. Your weights will differ -- adjust accordingly.

| Criterion (weight) | Docker Sandboxes | Lima/Colima hardened | OrbStack + Docker | Tart VM | Colima + Incus | DevContainer |
|---|---|---|---|---|---|---|
| Isolation strength (25%) | 9 | 8 | 5 | 9 | 7 | 6 |
| Docker-in-sandbox (20%) | 10 | 9 | 8 | 9 | 8 | 6 |
| Filesystem performance (15%) | 7 | 6/9 | 9 | 6/9 | 6/9 | 7 |
| Setup complexity (15%) | 9 | 5 | 8 | 5 | 3 | 7 |
| Team standardization (10%) | 8 | 7 | 7 | 8 | 5 | 8 |
| Open source / licensing (10%) | 4 | 10 | 4 | 6 | 10 | 9 |
| Community adoption (5%) | 8 | 7 | 7 | 4 | 2 | 8 |
| Weighted total | 8.1 | 7.3 | 6.7 | 7.2 | 5.9 | 6.9 |

Notes on the scores: Incus containers get a 7 for isolation (shared kernel), Incus VMs would get a 9 but require M3+ chips. Docker-in-Docker is possible in DevContainers but requires privileged mode. Filesystem scores split between bind mounts (6) and cloning inside the VM (9).

What I'm evaluating -- and why it's not simple

Docker Sandboxes is the best developer experience. But "best DX" and "what we can recommend to customers" aren't the same thing.

For our own team at Infralovers, Docker Sandboxes is what I'm currently evaluating. The microVM isolation model is exactly right: own Docker daemon, network policy enforcement, workspace syncing, one command to spin up. The licensing works for our team size.

But we're a training and consulting company. We don't just pick tools for ourselves -- we need to understand and recommend solutions that our customers can actually adopt. And the real world out there is complicated.

The licensing barrier is real. Docker Desktop is free only for organizations with fewer than 250 employees AND less than $10M revenue. Many of our enterprise customers -- banking, telco, manufacturing -- blow past both thresholds. At ~$21/user/month for Docker Business, a 250-person engineering org is looking at roughly $63K/year just for the container runtime. That's not a technical argument, it's a procurement conversation. And in regulated industries, procurement conversations can take months.

Customers already have other tools. Walk into a RHEL shop and they're running Podman Desktop, not Docker Desktop. Podman is the default container runtime in RHEL 8+ and comes free with the subscription they're already paying for. It's rootless by default (no daemon running as root), daemonless (smaller attack surface), and ships FIPS 140-2 compliant on RHEL. For organizations with SOC 2, ISO 27001, or NIS2 requirements, those aren't nice-to-haves -- they're checkboxes.

Podman on macOS uses the same Apple Virtualization.framework as Lima, running a Fedora CoreOS VM (default provider: applehv, alternatively libkrun or qemu). The isolation model is comparable. Community projects like claude-podman and claudeman already run Claude Code in Podman, and the textcortex/claude-code-sandbox project auto-detects Podman sockets. It works -- but it's not as polished as Docker Sandboxes, and there's no equivalent to Docker's microVM-per-sandbox architecture yet.
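
For comparison, the Podman-on-macOS flow mirrors the Lima pattern -- a VM is created once, then containers run inside it (resource flags and image are illustrative):

```bash
# Create and start the Fedora CoreOS VM that backs Podman on macOS.
podman machine init --cpus 4 --memory 8192 claude-sandbox
podman machine start claude-sandbox

# Containers then run inside that VM -- rootless and daemonless by default.
podman run --rm -it node:22 bash
```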

So where does that leave us?

  • For our own team: Docker Sandboxes is the current frontrunner. Best DX, strong isolation, acceptable licensing.
  • For customer recommendations: It depends on what they already have. Podman shop? Lima/Colima hardened with the Trail of Bits devcontainer config is the open-source path. Docker Desktop already in place? Docker Sandboxes is the obvious upgrade.
  • For the future: Apple announced native Containerization at WWDC 2025 -- a dedicated lightweight VM per container, sub-second startup, minimal attack surface. If that ships with macOS 26, it could become the best sandboxing option for any runtime. But it's not here yet.

The honest answer is: there's no single "right" tool. There's the right tool for a specific team's existing stack, licensing constraints, and security requirements. As a consultant, I find that answer unsatisfying. As someone who's been in enough enterprise environments to know better, I find it accurate.

The insight that took me too long to reach

The progression is clearer in hindsight than it was while living it:

  1. Permission prompts work when you're watching. They're the seatbelt for interactive use.
  2. Built-in /sandbox removes most of the friction for iterative coding. Genuine improvement, no infrastructure needed.
  3. VM isolation is what you need the moment "agentic" stops being a buzzword and starts being how you actually work -- agents running in the background, building containers, calling external APIs, implementing specs while you review something else.

The critical insight is embarrassingly simple: the VM boundary is the security boundary. Everything else -- Incus containers, Docker namespaces, devcontainer configs, macOS Seatbelt profiles -- provides defense-in-depth, but not a fundamentally different isolation tier.

Apple's Virtualization.framework makes this VM boundary lightweight enough (30-second boot, near-native CPU, 75-95% native filesystem with the right approach) that the old trade-off between isolation and performance has largely dissolved. The remaining friction is filesystem sharing, and cloning inside the VM eliminates that entirely.

Where you land on this progression depends on how you use Claude Code. If it's your pair programmer -- /sandbox and you're done. If it's your autonomous build system that needs Docker, MCP servers, and network access -- you need Level 3 isolation. Which Level 3 tool? That depends on what's already in your stack and who signs off on new software.

I spent too long designing elaborate container-in-VM-in-VM architectures when the answer was much simpler. And then I spent more time learning that "simpler" still means "different for every customer." But that's just how it is -- sometimes you have to take the complicated path to find the simple one. And then it turns out there are several simple ones.

Has anyone else gone through this progression? I'd especially love to hear from teams running agentic workflows in enterprise environments -- what's your container runtime, and did your sandboxing choice survive first contact with procurement?

References

Official documentation

VM and container runtimes

  • Lima -- CNCF project for Linux VMs on macOS
  • Colima -- Lima wrapper with Docker/containerd/Incus support
  • OrbStack -- fast Linux VM and container management for macOS
  • Tart -- Apple Virtualization.framework VMs for CI
  • Podman Desktop -- open-source, rootless, daemonless container management
  • How Podman runs on Macs -- Red Hat's architecture overview (applehv, libkrun)

Community sandbox projects

Benchmarks and performance

Security references

Evaluations and reviews
