Local LLM Inference with the Mac Mini: Our Evaluation and Where It Fits



Every AI-assisted tool your team uses - coding agents, chatbots, workflow automations - sends data to someone else's server. Every prompt, every code snippet, every internal document. For many companies, that's fine. For others, it's a dealbreaker.

We've looked at a range of approaches - from pure cloud APIs to hybrid setups to fully self-hosted inference. What we found: the right answer depends heavily on the use case. This post describes one option we evaluated as a proof of concept: a local LLM endpoint running on Apple Silicon hardware. It's a good starting point for small teams, individual developers, or anyone who wants to experiment without a big upfront commitment. If your team is growing, agentic coding is becoming a core workflow, or several developers need to hammer the endpoint simultaneously all day, this setup likely isn't your final answer.

The Problem with Cloud-Only AI

Cloud AI APIs are convenient. They're also:

  • A privacy concern: your code, prompts, and internal data leave your network with every request
  • Unpredictably expensive: costs scale linearly with usage, and once a team adopts AI tooling, usage grows fast
  • A compliance question: depending on your industry, sending data to third-party APIs may require legal review or may not be allowed at all
  • An external dependency: large cloud providers are generally very reliable - but availability, rate limits, and API changes are outside your control. With self-hosted inference, you own those decisions.

None of this means cloud AI is bad. It means relying on it exclusively creates risks that are worth understanding.

Mac Mini as an Evaluation Option: What Makes Apple Silicon Interesting

When evaluating local inference hardware, most teams look at GPU servers first - understandably, since that's been the standard for a long time. But there are interesting alternatives worth considering. One of them is Apple Silicon, for an architectural reason that's relevant to LLM inference.

The Mac Mini with Apple Silicon (M-series) is interesting for this use case because of Apple's unified memory architecture: the CPU and GPU share the same RAM pool. For LLM inference, this matters - the model weights need to be accessible to the GPU, and on traditional hardware that means dedicated VRAM. On Apple Silicon, your 32GB of unified memory is your VRAM.

Models and performance

On our Mac Mini M4 with 32GB RAM running Ollama:

Model                 Size     Use Case
qwen2.5-coder:7b      4.7 GB   Primary coding model - strong code generation and understanding
mistral:latest (7B)   4.4 GB   General purpose - Q&A, summarization, reasoning
gemma3:latest (4B)    3.3 GB   General purpose - fast and versatile
qwen2.5-coder:1.5b    986 MB   Lightweight coding - fast completions, low memory

All four models combined use under 14GB of disk space and fit comfortably in 32GB of unified memory. Even the larger 7B models produce output faster than you can read it.

You can also go bigger. The Mac Mini M4 with 32GB comfortably handles models with up to around 20B parameters (and even larger ones, such as Qwen3 Coder 30B), as long as they are quantized for inference and fit within the memory constraints.
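As a rough sanity check, here's a back-of-the-envelope memory estimate. The 4-bit quantization and ~20% runtime overhead figures are our assumptions; real usage varies with the quantization format and context length:

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: int = 4,
                             overhead: float = 1.2) -> float:
    """Rule of thumb: weights take params * bits/8 bytes, plus an assumed
    ~20% on top for KV cache and runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead

print(round(estimate_model_memory_gb(20), 1))  # 20B at 4-bit -> ~12.0 GB
print(round(estimate_model_memory_gb(30), 1))  # 30B at 4-bit -> ~18.0 GB
```

By this estimate, even a 4-bit 30B model leaves headroom within 32GB of unified memory - consistent with what we saw in practice.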

Ollama as the AI Engine: One Option Worth Evaluating

For serving models locally, there are several options - llama.cpp for low-level control, LM Studio for a GUI-driven experience, or running models directly via frameworks like vLLM. We evaluated a few of these and landed on Ollama for our setup - mainly because of how little friction it adds. If you want a deeper look at Ollama's latest features, we covered that in our Ollama 2025 updates post.

The key feature for this setup: Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that speaks the OpenAI API format - coding agents, IDE extensions, workflow automation platforms, custom scripts - can connect to your Mac Mini without modification.
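For illustration, a minimal client against that endpoint needs nothing beyond the standard library. The model name and prompt below are just examples:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # adjust host for your setup

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(model: str, prompt: str) -> str:
    """Send a single prompt to the local endpoint and return the reply text."""
    data = json.dumps(build_chat_request(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running Ollama instance:
# print(ask("qwen2.5-coder:7b", "Write a function that reverses a string."))
```

Any tool that lets you override the OpenAI base URL works the same way - point it at the endpoint above and keep the rest of its configuration unchanged.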

Setting it up

The setup is straightforward: we install Ollama and pull a few models:

# Install Ollama (macOS - via Homebrew, or download the app from ollama.com)
brew install ollama

# Pull models for different use cases
ollama pull qwen2.5-coder:7b    # Primary coding model
ollama pull mistral:latest      # General purpose
ollama pull gemma3:latest       # General purpose, fast
ollama pull qwen2.5-coder:1.5b  # Lightweight, fast coding

To serve on the network (not just localhost), we set the host binding:

# In ~/.zshrc or as a launchd environment variable
export OLLAMA_HOST=0.0.0.0

Then we restart Ollama, and it now listens on all interfaces at port 11434.

Making It Reachable: Headscale VPN

Setting OLLAMA_HOST=0.0.0.0 makes Ollama available on your local network. But you don't want it on the public internet - there's no authentication built in, and exposing an unauthenticated API endpoint is a security risk.

The approach we chose: a mesh VPN as a trusted perimeter. If a device is on the network, it has already authenticated. Specifically, Headscale - the open-source, self-hosted implementation of the Tailscale control server.

Security note: The VPN-as-perimeter approach works well for restricted environments and first steps. For setups closer to production, consider adding RBAC and TLS at the API level (e.g. via nginx or an API gateway) and granular access control lists (ACLs) on the Tailscale/Headscale side.

How it works

Tailscale creates a peer-to-peer VPN (a "tailnet") using WireGuard encryption. Devices on the tailnet can reach each other directly, without port forwarding, firewall rules, or exposing services to the internet.

Headscale replaces Tailscale's cloud-based coordination server with one you host yourself. The coordination server handles device registration and key exchange - it never sees your actual traffic (that flows directly between devices via WireGuard).

The setup:

  1. Run Headscale on a small server (a VPS with a public IP works well)
  2. Install the Tailscale client on the Mac Mini and each team member's machine
  3. Point clients at your Headscale server instead of Tailscale's cloud
  4. The Mac Mini gets a stable Tailscale IP (e.g., 100.64.0.10)

Now every team member can reach Ollama at http://100.64.0.10:11434 - encrypted, authenticated, and invisible to the public internet.
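As a quick sanity check from any machine on the tailnet, you can list the models the Mac Mini has pulled via Ollama's model-listing endpoint, /api/tags. The IP below is the example address from above; substitute your own:

```python
import json
from urllib import request

OLLAMA_HOST = "100.64.0.10"  # example tailnet IP from above

def tags_url(host: str, port: int = 11434) -> str:
    """Build the URL for Ollama's model-listing endpoint."""
    return f"http://{host}:{port}/api/tags"

def list_models(host: str) -> list[str]:
    """Return the names of all models the Ollama instance has pulled."""
    with request.urlopen(tags_url(host), timeout=5) as resp:
        return [m["name"] for m in json.load(resp)["models"]]

# Requires an active tailnet connection:
# print(list_models(OLLAMA_HOST))
```

If this returns your model list, the endpoint is reachable and any OpenAI-compatible tool can be pointed at the same address.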

Tailscale vs Headscale

If self-hosting the coordination server sounds like more work than you want, Tailscale's cloud service is a perfectly valid alternative. It's free for personal use and reasonably priced for teams. The trade-off: device metadata (not traffic) passes through Tailscale's servers.

We use Headscale at Infralovers because we prefer full control over the coordination layer. But for smaller teams or quick setups, Tailscale's hosted option gets you running in minutes.

What This Setup Solves for Us - and What It Doesn't

For our specific requirements, this setup brings the following advantages:

  • Data privacy: prompts and code never leave your network. No third-party sees your data.
  • Cost predictability: one-time hardware cost. No per-token billing, no surprise invoices.
  • Low latency: local network round-trips are measured in single-digit milliseconds.
  • Compliance-friendly: easier to satisfy data residency requirements when data stays on-premise.
  • Always available: no dependency on external API uptime or rate limits.

Where This Setup Has Limits

This isn't a silver bullet. Be clear-eyed about what you're getting:

  • Accuracy: locally runnable open-source models are not as capable as the latest GPT or Claude Opus models. For complex reasoning, novel architecture decisions, or tasks that benefit from frontier-scale models, cloud APIs are still better.
  • Concurrent users: with 32GB RAM and multiple models available, there's decent headroom - but heavy simultaneous usage can still cause slowdowns or model unloading.
  • Operational responsibility: you own the hardware, the updates, and the uptime. When Ollama releases a new version or a model needs updating, that's on you.
  • Context window: Ollama applies a conservative default context window per model. You can increase it, but larger contexts use more memory.
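When a task does need a larger window, Ollama's native /api/chat endpoint accepts per-request options, including num_ctx. A sketch of the payload - the model name and context size here are illustrative:

```python
import json

def chat_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a request for Ollama's native /api/chat endpoint.
    num_ctx raises the context limit for this request; memory use grows with it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }

print(json.dumps(chat_payload("qwen2.5-coder:7b", "Summarize this diff.",
                              num_ctx=16384), indent=2))
```

The trade-off is direct: a bigger num_ctx means a bigger KV cache, which eats into the same unified memory pool the model weights live in.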

Conclusion

For us, the takeaway is clear: it's not "local vs. cloud" but hybrid - local for routine, high-volume tasks where privacy and cost matter, cloud for cases where frontier model quality makes a real difference.

Whether this approach fits your team depends on your specific requirements - team size, usage patterns, compliance needs, and how much operational responsibility you want to take on. We see this setup as one option in the toolbox, not a universal solution.
