Evaluating a Company-Internal AI Stack: Mac Mini + OpenCode + Headscale



In the previous posts in this series, we looked at setting up a Mac Mini M4 with Ollama behind a Headscale VPN as a local LLM endpoint and OpenCode as a CLI coding agent with multiple providers.

This post puts the pieces together - not as a blueprint, but as a proof-of-concept evaluation: we built and tested this combination at Infralovers to find out where the limits are and which use cases it's actually suited for. Spoiler: it works well as a starting point, less so once your team grows or agentic coding becomes a heavy daily workflow.

One key insight upfront: we don't run everything locally. OpenCode is configured with two providers - the local Ollama endpoint for everyday work and Anthropic's API for tasks that require deeper reasoning. That hybrid approach is the core of the setup.

The Test Setup

Here's what we built and tested:

  • Mac Mini M4 (32GB) in our office, running Ollama with several models
  • Headscale (self-hosted) as our VPN coordination server
  • Tailscale clients on every developer's machine
  • OpenCode on each developer's laptop, configured to use the shared Ollama endpoint

The Mac Mini is our test device for exactly this question: what's actually possible with a small, affordable Apple Silicon machine - and is the investment worth it?

Headscale + Mac Mini + OpenCode Setup

Each developer can choose their own client tool. OpenCode, VS Code with Continue, JetBrains with AI plugins, or even curl - it doesn't matter. The Ollama endpoint speaks the OpenAI API format, so anything that can talk to OpenAI can talk to our Mac Mini.

Step-by-Step: Connecting OpenCode to the Company Endpoint

1. Ensure the Mac Mini is on the tailnet

The Mac Mini runs Ollama with OLLAMA_HOST=0.0.0.0 and is connected to our Headscale VPN. Its Tailscale IP is stable (e.g., 100.64.0.10), so it is always reachable. Alternatively, Tailscale's MagicDNS lets you use a hostname like mac-mini.your-tailnet.ts.net instead of a fixed IP.
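For reference, here is a minimal sketch of how the endpoint is exposed. The launchctl step is the documented way to set environment variables for the Ollama macOS app; if you run the server manually, an inline variable works too. Adapt to how you run Ollama:

```shell
# Make Ollama listen on all interfaces so tailnet peers can reach it.
# For the macOS menu-bar app, set the variable via launchctl, then restart Ollama:
launchctl setenv OLLAMA_HOST "0.0.0.0"

# When running the server manually instead:
OLLAMA_HOST=0.0.0.0 ollama serve

# Find the Mac Mini's stable tailnet address to use in client configs:
tailscale ip -4
```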

2. Configure OpenCode

On each developer's machine, we create a file called ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "company-ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Infralovers LLM",
      "options": {
        "baseURL": "http://100.64.0.10:11434/v1"
      },
      "models": {
        "qwen3-coder:30b": {
          "name": "Qwen3 Coder 30B (Company)"
        },
        "llama3.1:8b": {
          "name": "Llama 3.1 8B (Company)"
        }
      }
    },
    "anthropic": {
      "name": "Anthropic",
      "models": {
        "claude-sonnet-4-5-20250929": {
          "name": "Claude 4.5 Sonnet"
        }
      }
    }
  }
}

Notice: we define two providers in the same config. The company Ollama endpoint for daily work, and Anthropic's API for when we need frontier-model reasoning. Developers switch between them with /models in OpenCode.

3. Verify the connection

All you need to get started is a working connection through your Tailscale client. Run tailscale status to check connectivity, then start OpenCode and select the company model:

# Make sure you're connected to the tailnet
tailscale status

# Test the Ollama endpoint
curl http://100.64.0.10:11434/v1/models | jq

# Start OpenCode and select the company model
opencode
> /models company-ollama/qwen3-coder:30b
> Hello, can you see my project?

That's the baseline. No complex setup, no API keys for the local endpoint, no billing configuration. Connect to the VPN, start coding - and see where it works and where the limits show up.

The Hybrid Strategy

We don't use local models for everything - and testing quickly showed that wouldn't make sense anyway. As a rough working framework, we've been distinguishing where local inference holds up and where it runs into limits. This is a working hypothesis, not a measured split:

Local endpoint (routine tasks - tends to be the majority)

  • Code completion and generation: writing boilerplate, generating tests, implementing straightforward features
  • Refactoring: renaming, restructuring, extracting functions
  • Documentation: generating docstrings, README sections, inline comments
  • Debugging: reading error logs, suggesting fixes for common issues
  • Code review assistance: checking for obvious issues, style consistency

Cloud API (when local models hit their limits)

  • Architecture decisions: designing systems, evaluating trade-offs
  • Complex debugging: multi-file issues, subtle logic errors
  • Security review: analyzing authentication flows, checking for vulnerabilities
  • Novel problem-solving: tasks that require broad knowledge or creative approaches
  • Long-context tasks: analyzing large codebases that exceed local model context limits

How the work actually splits depends heavily on the developer, the task, and the models in use. We don't have reliable measurements yet - this is the state of our ongoing evaluation. What we have concretely observed: when 3+ developers hit the same 14B model simultaneously, response times increase noticeably. With 32GB RAM, the headroom for parallel inference with larger models is limited.
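To make the split above concrete, here is a toy sketch of the routing heuristic as code. This is not tooling we run; the model names match our config, but the token budget and keyword lists are illustrative assumptions:

```python
# Hypothetical sketch of the local-vs-cloud routing decision described above.
# Thresholds and keywords are illustrative, not measured.

LOCAL_MODEL = "company-ollama/qwen3-coder:30b"
CLOUD_MODEL = "anthropic/claude-sonnet-4-5-20250929"

# Rough context budget for the local model, in tokens (assumed value).
LOCAL_CONTEXT_BUDGET = 8192

# Keywords that hint at reasoning-heavy work worth escalating (assumed list).
ESCALATION_KEYWORDS = {"architecture", "security", "vulnerability", "design"}


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English and code.
    return len(text) // 4


def choose_model(prompt: str, attached_context: str = "") -> str:
    """Pick the local or cloud model based on task hints and context size."""
    total_tokens = estimate_tokens(prompt) + estimate_tokens(attached_context)
    words = set(prompt.lower().split())
    if total_tokens > LOCAL_CONTEXT_BUDGET:
        return CLOUD_MODEL  # long-context task: local model would truncate
    if words & ESCALATION_KEYWORDS:
        return CLOUD_MODEL  # reasoning-heavy task
    return LOCAL_MODEL      # default: routine work stays local
```

In practice we make this call by hand with /models, but writing it down shows why the boundary is fuzzy: both signals (context size, task type) are estimates.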

What Other Tools Can Connect

The beauty of running an OpenAI-compatible endpoint is that it's not limited to OpenCode. Here's what else we (and you) can connect:

Continue (VS Code / JetBrains)

Continue is an open-source AI code assistant that runs as an IDE extension. We can also point it at the Ollama endpoint:

{
  "models": [{
    "title": "Company LLM",
    "provider": "ollama",
    "model": "qwen3-coder:30b",
    "apiBase": "http://100.64.0.10:11434"
  }]
}

n8n Workflow Automation

n8n can use the Ollama endpoint for AI-powered workflows: automated code review, documentation generation, ticket summarization, and more. The AI Agent node connects to any OpenAI-compatible endpoint.

Custom Scripts

Any script using the OpenAI Python or JavaScript SDK works by changing the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://100.64.0.10:11434/v1",
    api_key="not-needed"  # Ollama doesn't require a key
)

response = client.chat.completions.create(
    model="qwen3-coder:30b",
    messages=[{"role": "user", "content": "Review this code: ..."}]
)

MCP Servers

With MCP support in both OpenCode and other tools, you can extend the AI's capabilities with custom tools - database access, internal APIs, documentation search - all routed through your private endpoint.
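As a sketch of what that looks like in OpenCode's config: MCP servers are declared under an mcp key alongside the providers. The server name and command below are hypothetical placeholders; check the current schema at opencode.ai before relying on the exact shape:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "internal-docs": {
      "type": "local",
      "command": ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/srv/docs"]
    }
  }
}
```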

Lessons Learned

After running this setup for several months, here's what we've learned:

What works well

  • The hybrid approach matters: Trying to do everything locally is frustrating. Accepting that local models handle routine work well and that cloud APIs are an option for the rest gives a better developer experience - but where exactly that line falls depends on the use case.
  • Headscale is rock-solid: once configured, it just works. We haven't had a single VPN-related issue.
  • Developer adoption is fast: because the tools (OpenCode, Continue) are familiar and the endpoint "just works" behind the VPN, there's no learning curve beyond choosing a model.

What to watch out for

  • Context window limits: Ollama's default context length is modest and can be too small for many coding tasks. You can raise it, but larger contexts consume proportionally more memory.
  • Model updates require attention: when updated model weights are published, someone needs to pull them. We run a simple cron job: ollama pull qwen3-coder:14b weekly.
  • Not all models support tool calling: for OpenCode's agentic features (file editing, command execution), the model must support tool calling. Qwen3-Coder and DeepSeek-Coder handle this well. Some smaller models don't.
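For the context-window point above, Ollama lets you bake a larger context into a model variant via a Modelfile (the num_ctx parameter is documented in the Ollama docs). The 16384 value here is illustrative; size it against your available RAM:

```shell
# Build a variant of the model with a larger context window.
# Assumes the base model is already pulled; 16384 tokens is an example value.
cat > Modelfile <<'EOF'
FROM qwen3-coder:30b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-coder-16k -f Modelfile
```

Clients then select qwen3-coder-16k instead of the base model when they need the larger window.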

When local isn't enough

Local inference has real limits. We reach for cloud APIs when:

  • A task requires reasoning over large amounts of code
  • The problem is genuinely novel and requires broad knowledge
  • Accuracy matters more than speed (security reviews, production deployments)
  • A local model gives obviously wrong answers after 2-3 attempts

The strength of this setup isn't that local replaces cloud. It's that local handles the volume, and cloud handles the complexity.

Evaluation Summary

Building a stack like this doesn't require a server room, a dedicated ops team, or a six-figure budget. For a proof of concept or a small team, the bar to entry is genuinely low. Here's what we found:

  • Privacy: code never leaves your network for routine tasks
  • Cost efficiency: one-time hardware investment replaces ongoing API bills
  • Flexibility: any tool, any model, any provider - your choice
  • Team-wide access: the VPN makes the endpoint available to everyone, everywhere

That said: if your team scales up, or agentic coding (longer autonomous runs, parallel agents, large context tasks) becomes a central part of your workflow, you'll outgrow a single Mac Mini fairly quickly. At that point, the conversation shifts - either toward more capable local hardware, or toward cloud APIs as the primary inference layer.

This setup works for us at Infralovers - but we also continuously evaluate frontier approaches for ourselves and our clients. Claude Code, Codex CLI, Bob, and others. Not because local isn't good enough, but because every team has individual requirements and preferences. What works well for one team may not be an option for another - an existing vendor relationship, compliance constraints, or simply different priorities. That variety is normal, and it's exactly why knowing multiple options is worth the effort.
