Local LLM Inference with the Mac Mini: Our Evaluation and Where It Fits



Every AI-assisted tool your team uses - coding agents, chatbots, workflow automations - sends data to someone else's server. Every prompt, every code snippet, every internal document. For many companies, that's fine. For others, it's a dealbreaker.

We've looked at a range of approaches - from pure cloud APIs to hybrid setups to fully self-hosted inference. What we found: the right answer depends heavily on the use case. This post describes one option we evaluated as a proof of concept: a local LLM endpoint running on Apple Silicon hardware. It's a good starting point for small teams, individual developers, or anyone who wants to experiment without a big upfront commitment. If your team is growing, agentic coding is becoming a core workflow, or several developers need to hammer the endpoint simultaneously all day, this setup likely isn't your final answer.

The Problem with Cloud-Only AI

Cloud AI APIs are convenient. They're also:

  • A privacy concern: your code, prompts, and internal data leave your network with every request
  • Unpredictably expensive: costs scale linearly with usage, and once a team adopts AI tooling, usage grows fast
  • A compliance question: depending on your industry, sending data to third-party APIs may require legal review or may not be allowed at all
  • An external dependency: large cloud providers are generally very reliable - but availability, rate limits, and API changes are outside your control. With self-hosted inference, you own those decisions.

None of this means cloud AI is bad. It means relying on it exclusively creates risks that are worth understanding.

Mac Mini as an Evaluation Option: What Makes Apple Silicon Interesting

When evaluating local inference hardware, most teams look at GPU servers first - understandably, since that's been the standard for a long time. But there are interesting alternatives worth considering. One of them is Apple Silicon, for an architectural reason that's relevant to LLM inference.

The Mac Mini with Apple Silicon (M-series) is interesting for this use case because of Apple's unified memory architecture: the CPU and GPU share the same RAM pool. For LLM inference, this matters - the model weights need to be accessible to the GPU, and on traditional hardware that means dedicated VRAM. On Apple Silicon, your 32GB of unified memory is your VRAM.

Models and performance

On our Mac Mini M4 with 32GB RAM running Ollama:

Model                 Size     Use Case
qwen2.5-coder:7b      4.7 GB   Primary coding model - strong code generation and understanding
mistral:latest (7B)   4.4 GB   General purpose - Q&A, summarization, reasoning
gemma3:latest (4B)    3.3 GB   General purpose - fast and versatile
qwen2.5-coder:1.5b    986 MB   Lightweight coding - fast completions, low memory

All four models combined use under 14GB of disk space and fit comfortably in 32GB of unified memory. Even the larger 7B models produce output faster than you can read it.

You can also go bigger. The Mac Mini M4 with 32GB comfortably handles models with up to around 20B parameters (and even larger ones, such as Qwen3 Coder 30B), as long as they are quantized for inference and fit within the memory constraints.
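As a rough sanity check, here's a back-of-the-envelope memory estimate. The 4-bit quantization and ~20% runtime overhead figures are our assumptions; real usage varies with the quantization format and context length:

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: int = 4,
                             overhead: float = 1.2) -> float:
    """Rule of thumb: weights take params * bits/8 bytes, plus an assumed
    ~20% on top for KV cache and runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead

print(round(estimate_model_memory_gb(20), 1))  # 20B at 4-bit -> ~12.0 GB
print(round(estimate_model_memory_gb(30), 1))  # 30B at 4-bit -> ~18.0 GB
```

By this estimate, even a 4-bit 30B model leaves headroom within 32GB of unified memory - consistent with what we saw in practice.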

Ollama as the AI Engine: One Option Worth Evaluating

For serving models locally, there are several options - llama.cpp for low-level control, LM Studio for a GUI-driven experience, or running models directly via frameworks like vLLM. We evaluated a few of these and landed on Ollama for our setup - mainly because of how little friction it adds. If you want a deeper look at Ollama's latest features, we covered that in our Ollama 2025 updates post.

The key feature for this setup: Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that speaks the OpenAI API format - coding agents, IDE extensions, workflow automation platforms, custom scripts - can connect to your Mac Mini without modification.
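For illustration, a minimal client against that endpoint needs nothing beyond the standard library. The model name and prompt below are just examples:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # adjust host for your setup

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(model: str, prompt: str) -> str:
    """Send a single prompt to the local endpoint and return the reply text."""
    data = json.dumps(build_chat_request(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running Ollama instance:
# print(ask("qwen2.5-coder:7b", "Write a function that reverses a string."))
```

Any tool that lets you override the OpenAI base URL works the same way - point it at the endpoint above and keep the rest of its configuration unchanged.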

Setting it up

The setup is straightforward: we install Ollama and pull a few models:

# Install Ollama (macOS - via Homebrew, or download the app from ollama.com)
brew install ollama

# Pull models for different use cases
ollama pull qwen2.5-coder:7b    # Primary coding model
ollama pull mistral:latest      # General purpose
ollama pull gemma3:latest       # General purpose, fast
ollama pull qwen2.5-coder:1.5b  # Lightweight, fast coding

To serve on the network (not just localhost), we set the host binding:

# In ~/.zshrc or as a launchd environment variable
export OLLAMA_HOST=0.0.0.0

Then we restart Ollama, and it now listens on all interfaces at port 11434.

Making It Reachable: Headscale VPN

Setting OLLAMA_HOST=0.0.0.0 makes Ollama available on your local network. But you don't want it on the public internet - there's no authentication built in, and exposing an unauthenticated API endpoint is a security risk.

The approach we chose: a mesh VPN as a trusted perimeter. If a device is on the network, it has already authenticated. Specifically, Headscale - the open-source, self-hosted implementation of the Tailscale control server.

Security note: The VPN-as-perimeter approach works well for restricted environments and first steps. For setups closer to production, consider adding RBAC and TLS at the API level (e.g. via nginx or an API gateway) and granular access control lists (ACLs) on the Tailscale/Headscale side.

How it works

Tailscale creates a peer-to-peer VPN (a "tailnet") using WireGuard encryption. Devices on the tailnet can reach each other directly, without port forwarding, firewall rules, or exposing services to the internet.

Headscale replaces Tailscale's cloud-based coordination server with one you host yourself. The coordination server handles device registration and key exchange - it never sees your actual traffic (that flows directly between devices via WireGuard).

The setup:

  1. Run Headscale on a small server (a VPS with a public IP works well)
  2. Install the Tailscale client on the Mac Mini and each team member's machine
  3. Point clients at your Headscale server instead of Tailscale's cloud
  4. The Mac Mini gets a stable Tailscale IP (e.g., 100.64.0.10)

Now every team member can reach Ollama at http://100.64.0.10:11434 - encrypted, authenticated, and invisible to the public internet.
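As a quick sanity check from any machine on the tailnet, you can list the models the Mac Mini has pulled via Ollama's model-listing endpoint, /api/tags. The IP below is the example address from above; substitute your own:

```python
import json
from urllib import request

OLLAMA_HOST = "100.64.0.10"  # example tailnet IP from above

def tags_url(host: str, port: int = 11434) -> str:
    """Build the URL for Ollama's model-listing endpoint."""
    return f"http://{host}:{port}/api/tags"

def list_models(host: str) -> list[str]:
    """Return the names of all models the Ollama instance has pulled."""
    with request.urlopen(tags_url(host), timeout=5) as resp:
        return [m["name"] for m in json.load(resp)["models"]]

# Requires an active tailnet connection:
# print(list_models(OLLAMA_HOST))
```

If this returns your model list, the endpoint is reachable and any OpenAI-compatible tool can be pointed at the same address.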

Tailscale vs Headscale

If self-hosting the coordination server sounds like more work than you want, Tailscale's cloud service is a perfectly valid alternative. It's free for personal use and reasonably priced for teams. The trade-off: device metadata (not traffic) passes through Tailscale's servers.

We use Headscale at Infralovers because we prefer full control over the coordination layer. But for smaller teams or quick setups, Tailscale's hosted option gets you running in minutes.

What This Setup Solves for Us - and What It Doesn't

For our specific requirements, this setup brings the following advantages:

  • Data privacy: prompts and code never leave your network. No third-party sees your data.
  • Cost predictability: one-time hardware cost. No per-token billing, no surprise invoices.
  • Low latency: local network round-trips are measured in single-digit milliseconds.
  • Compliance-friendly: easier to satisfy data residency requirements when data stays on-premise.
  • Always available: no dependency on external API uptime or rate limits.

Where This Setup Has Limits

This isn't a silver bullet. Be clear-eyed about what you're getting:

  • Accuracy: locally runnable open-source models are not as capable as the latest GPT or Claude Opus models. For complex reasoning, novel architecture decisions, or tasks that benefit from frontier-scale models, cloud APIs are still better.
  • Concurrent users: with 32GB RAM and multiple models available, there's decent headroom - but heavy simultaneous usage can still cause slowdowns or model unloading.
  • Operational responsibility: you own the hardware, the updates, and the uptime. When Ollama releases a new version or a model needs updating, that's on you.
  • Context window: Ollama applies a conservative default context window per model. You can increase it, but larger contexts use more memory.
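When a task does need a larger window, Ollama's native /api/chat endpoint accepts per-request options, including num_ctx. A sketch of the payload - the model name and context size here are illustrative:

```python
import json

def chat_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a request for Ollama's native /api/chat endpoint.
    num_ctx raises the context limit for this request; memory use grows with it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }

print(json.dumps(chat_payload("qwen2.5-coder:7b", "Summarize this diff.",
                              num_ctx=16384), indent=2))
```

The trade-off is direct: a bigger num_ctx means a bigger KV cache, which eats into the same unified memory pool the model weights live in.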

Conclusion

For us, the takeaway is clear: it's not "local vs. cloud" but hybrid - local for routine, high-volume tasks where privacy and cost matter, cloud for cases where frontier model quality makes a real difference.

Whether this approach fits your team depends on your specific requirements - team size, usage patterns, compliance needs, and how much operational responsibility you want to take on. We see this setup as one option in the toolbox, not a universal solution.
