LiteLLM: Flexible and Secure LLM Access for Organizations
Introduction
As organizations increasingly adopt AI-powered solutions, providing secure and flexible access to large language models (LLMs) becomes a critical challenge. LiteLLM is an open-source tool designed to simplify and standardize LLM access for companies, teams, and developers. It acts as a unified gateway, enabling organizations to manage, monitor, and optimize LLM usage across cloud and on-premise environments.
LiteLLM offers a single API endpoint compatible with the OpenAI API, allowing seamless integration with a wide range of LLM providers. This means you can switch between or combine models from OpenAI, Azure, Anthropic, Google, Cohere, and more—without changing your application code.
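To make that claim concrete, here is a minimal sketch using the LiteLLM Python SDK (the same routing layer the proxy builds on). The model identifiers are illustrative, and the provider API keys are assumed to be set as environment variables.

```python
# Minimal sketch: one completion() call, provider chosen by the model-name prefix.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are exported in the environment.
from litellm import completion

messages = [{"role": "user", "content": "Say hello in one sentence."}]

# Cloud provider A (OpenAI)
resp = completion(model="openai/gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)

# Cloud provider B (Anthropic): same call, only the model string changes
resp = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
print(resp.choices[0].message.content)
```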
Security and data privacy are top concerns for many organizations. LiteLLM supports integration with local LLMs, such as those served by Ollama, enabling you to run models entirely on your own infrastructure. This is ideal for regulated industries, confidential internal data, and any workload that must not leave your network.
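An illustrative sketch of the local path, assuming an Ollama server on localhost:11434 with llama3 already pulled:

```python
# Same LiteLLM completion() call, but routed to a locally served Ollama model,
# so prompts and outputs never leave your own infrastructure.
from litellm import completion

resp = completion(
    model="ollama/llama3",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "Summarize our internal logging policy in one sentence."}],
)
print(resp.choices[0].message.content)
```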
LiteLLM provides built-in features for rate limiting, quota management, and usage tracking. Organizations can set per-key or per-team limits, assign budgets, and monitor token consumption from a single place, for example by issuing scoped keys as sketched below.
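This is a sketch only, assuming the proxy's key-management endpoint (/key/generate) on the gateway configured later in this post; exact field names can differ between LiteLLM versions.

```python
# Sketch: mint a scoped "virtual key" limited to two logical models and a budget.
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",             # the gateway set up below
    headers={"Authorization": "Bearer sk-internal-master-key"},
    json={
        "models": ["gpt-enterprise", "summarizer"],    # models this key may call
        "max_budget": 20,                              # spend cap
        "duration": "30d",                             # key expiry
        "metadata": {"team": "marketing"},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["key"])  # hand this key to the team instead of the master key
```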
Because LiteLLM is API-compatible with OpenAI, existing applications and tools that work with OpenAI can be redirected to LiteLLM with minimal changes, often nothing more than a different base URL and API key, as the following sketch shows.
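Recent versions of the official openai Python SDK read OPENAI_BASE_URL and OPENAI_API_KEY from the environment, so an unmodified application can be pointed at the gateway through configuration alone; the URL and key below match the example deployment later in this post.

```python
import os

# Redirect an existing OpenAI-SDK application purely via environment variables.
os.environ["OPENAI_BASE_URL"] = "http://localhost:4000/v1"   # LiteLLM gateway
os.environ["OPENAI_API_KEY"] = "sk-internal-master-key"      # or a scoped key

from openai import OpenAI

client = OpenAI()  # no constructor arguments; the env vars above are picked up
resp = client.chat.completions.create(
    model="gpt-enterprise",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```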
Below is a realistic minimal setup you can adapt to provide a unified internal LLM endpoint that: (1) serves OpenAI models for high-quality reasoning, (2) serves local Ollama models for privacy / cost-sensitive workloads, (3) applies routing + fallback, and (4) exposes a single OpenAI-compatible API to all internal teams.
Internal clients simply point their base_url at the gateway (e.g. http://llm-gateway.internal/v1).

The example project consists of three files:

llm-gateway/
├── .env
├── docker-compose.yml
└── litellm_config.yaml
.env (example)

OPENAI_API_KEY=sk-live-openai-xxxxxxxxxxxxxxxx
LITELLM_MASTER_KEY=sk-internal-master-key # master token for admin / service calls
LITELLM_PORT=4000
ENABLE_METRICS=true # exposes /metrics (Prometheus)
# Optional: per-user keys you mint & store elsewhere (DB / Vault)
litellm_config.yaml
# Core list of logical model names your org will use
model_list:
  # High quality reasoning (cloud)
  - model_name: gpt-enterprise
    litellm_params:
      model: openai/gpt-4o
  # Cost-optimized summarization (cloud, with caching)
  - model_name: summarizer
    litellm_params:
      model: openai/gpt-4o-mini
      caching: true
  # Private secure processing (local via Ollama)
  - model_name: secure-private
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434  # service name inside compose
      api_key: null  # Ollama usually does not require a key
  # Lightweight classification (local)
  - model_name: classification
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama:11434

# Optional routing + fallback strategies
router_settings:
  routing_strategy: usage_based_routing  # or round_robin / least_latency
  fallback_strategy:
    - openai/gpt-4o -> openai/gpt-4o-mini -> ollama/llama3

# (Early governance concept) - illustrative only; real budgets often stored in DB
team_config:
  - team_id: marketing
    max_budget: 20  # (units defined by your accounting process)
    models: [gpt-enterprise, summarizer]
  - team_id: engineering
    max_budget: 50
    models: [secure-private, gpt-enterprise, classification]

# Enable structured logging / metrics
general_settings:
  enable_langfuse: false
  enable_otel: false
docker-compose.yml
version: '3.9'
services:
  litellm:
    image: ghcr.io/berriai/litellm:latest
    ports:
      - "4000:4000"
    env_file: .env
    volumes:
      - ./litellm_config.yaml:/app/litellm_config.yaml:ro
    command: ["--config", "/app/litellm_config.yaml", "--port", "4000"]
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: unless-stopped
    # (Optional) Pre-pull models on container start
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama serve &
        sleep 4
        ollama pull llama3
        ollama pull mistral
        wait
volumes:
  ollama: {}
Start it:
docker compose up -d
Your gateway is now available at http://localhost:4000/v1 (OpenAI-compatible).
Cloud model:
curl -s -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-internal-master-key" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-enterprise",
"messages": [{"role": "user", "content": "Give me a one-sentence update on AI trends."}],
"max_tokens": 150
}' | jq '.choices[0].message.content'
Local private model:
curl -s -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-internal-master-key" \
-H "Content-Type: application/json" \
-d '{
"model": "secure-private",
"messages": [{"role": "user", "content": "Summarize this internal log policy in 3 bullets: <REDACTED TEXT>"}],
"temperature": 0.2
}' | jq '.choices[0].message.content'
from openai import OpenAI

# Point the OpenAI SDK to the LiteLLM gateway
client = OpenAI(
    api_key="sk-internal-master-key",  # or a per-user scoped key you issue
    base_url="http://localhost:4000/v1"
)

resp = client.chat.completions.create(
    model="gpt-enterprise",
    messages=[{"role": "user", "content": "Draft a short release note about our new AI gateway."}],
    max_tokens=200
)
print(resp.choices[0].message.content)

# Switch to the local secure model (no code changes besides the model name)
secure = client.chat.completions.create(
    model="secure-private",
    messages=[{"role": "user", "content": "Summarize internal doc: <CONFIDENTIAL TEXT>"}],
    temperature=0.3
)
print(secure.choices[0].message.content)
If openai/gpt-4o is temporarily rate-limited, LiteLLM transparently attempts the next fallback (gpt-4o-mini) and finally a local model, preserving availability while controlling cost exposure.
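Gateway-level fallback is the primary mechanism; if you also want a belt-and-braces safety net on the client side, a minimal sketch using the same logical model names as above could look like this:

```python
# Client-side safety net, complementary to the gateway's declarative fallbacks:
# if the cloud-backed logical model is unreachable, retry against the local one.
from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(api_key="sk-internal-master-key", base_url="http://localhost:4000/v1")
messages = [{"role": "user", "content": "One-sentence status summary, please."}]

try:
    resp = client.chat.completions.create(model="gpt-enterprise", messages=messages)
except (APIConnectionError, APIStatusError):
    # e.g. upstream outages that exhaust even the gateway-side fallback chain
    resp = client.chat.completions.create(model="secure-private", messages=messages)

print(resp.choices[0].message.content)
```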
Scrape the gateway's /metrics endpoint with Prometheus and build a Grafana dashboard (token counts, latency, fallback rates).

Running both cloud and local models behind a unified gateway lets you intentionally choose models per use case (latency, cost, privacy) without forcing downstream developers to re-integrate each time the strategy changes.
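A quick sanity check of the metrics endpoint mentioned above, assuming ENABLE_METRICS in the .env actually enables Prometheus export in your LiteLLM version:

```python
# Fetch the Prometheus exposition text from the gateway and list LiteLLM series.
import requests

text = requests.get("http://localhost:4000/metrics", timeout=10).text
for line in text.splitlines():
    if line.startswith("# HELP") and "litellm" in line:
        print(line)
```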
Tip: Start with just two logical model names (e.g. standard + secure) and expand once adoption stabilizes.
Concern | Without Gateway | With LiteLLM |
---|---|---|
API Proliferation | Each provider SDK | Single OpenAI-compatible endpoint |
Model Switching | Code changes | Config / routing change |
Local vs Cloud | Separate integration paths | Unified abstraction |
Fallback | Manual error handling | Declarative strategy |
Governance | Ad hoc scripts | Centralized middleware |
Observability | Fragmented logs | Unified metrics/log stream |
This example can be productionized by adding TLS (reverse proxy), persistent storage for usage data, and secret management (Vault / AWS KMS / Azure Key Vault) for API keys.
If you prefer not to operate your own gateway service, a hosted multi-model broker like OpenRouter can be attractive. Here's the distilled decision:
If You Need | Choose | Rationale |
---|---|---|
Local / private models (Ollama, air‑gapped) | LiteLLM | Only LiteLLM lets you route to self-hosted runtimes. |
Zero infra / fastest start | OpenRouter | Fully managed; just one API key + URL. |
Full control over logs, retention, network | LiteLLM | You own the deployment + observability stack. |
Single consolidated billing for many providers | OpenRouter | Aggregated pricing + unified invoice. |
Custom routing / policy enforcement (PII segregation, team budgets) | LiteLLM | Extend config / middleware; run on your compliance boundary. |
Experiment across many hosted foundation models immediately | OpenRouter | Large catalog without per-provider setup. |
Pragmatic hybrid: Run LiteLLM and add OpenRouter as one upstream provider for experimentation while still routing sensitive traffic to local or directly contracted providers.
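In SDK terms, the hybrid amounts to treating OpenRouter as one more provider prefix (on the proxy, the equivalent is an extra model_list entry). A sketch under those assumptions, with an illustrative model identifier and OPENROUTER_API_KEY expected in the environment:

```python
# Sketch: OpenRouter as just another upstream alongside OpenAI and Ollama.
from litellm import completion

resp = completion(
    model="openrouter/anthropic/claude-3.5-sonnet",   # illustrative model id
    messages=[{"role": "user", "content": "Name three hosted LLM families."}],
)
print(resp.choices[0].message.content)
```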
Implementing LiteLLM in EU/EEA contexts can support GDPR compliance when architected carefully. Key areas to address are prompt redaction, data residency (for example, routing personal data only to the local secure-private model), access control, log retention, and vendor review:

Control | Implement via |
---|---|
Prompt redaction | Pre-middleware (regex + entity detection); see the sketch below |
Residency | Route to secure-private (local) model |
Access control | Per-key team mapping in config / DB |
Retention | Log processor with TTL (e.g. Loki + retention policy) |
Vendor review | Central register + DPA archive |
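As a deliberately simple illustration of the "Prompt redaction" control, the sketch below strips obvious PII patterns before a prompt is forwarded to any cloud model; a production setup would add proper entity detection and audit logging.

```python
# Illustrative regex-based redaction pre-middleware (not a complete PII solution).
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s/-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace obvious personal data with typed placeholders before routing."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Mail jane.doe@example.com or call +49 170 1234567, IBAN DE89370400440532013000"))
```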
Disclaimer: This section is informational and not legal advice. Coordinate with your Data Protection Officer / legal counsel for authoritative interpretations.
LiteLLM shifts LLM adoption from ad hoc provider integration toward a governed internal platform. Combined with selective use of OpenRouter or direct provider APIs, it lets you match model choice to data sensitivity, latency and cost—without repeated re-integration effort.
You can approach LiteLLM adoption as an iterative capability build rather than a time-boxed project. First establish a lean foundation: deploy the gateway with exactly one premium cloud model and one local/privacy model, expose only two logical model names (for example standard and secure), issue scoped API keys, and capture just the essentials (tokens, latency, success/error counts). Add a lightweight prompt redaction middleware so early experimentation does not leak obvious PII, then validate value with a sharply defined internal use case such as documentation Q&A or structured summarization.
Once the initial path works reliably, expand horizontally instead of prematurely optimizing. Introduce declarative routing and fallback so quality workloads gracefully degrade to lower cost tiers or local models. Layer in budgeting, anomaly alerts, and richer observability (dashboards, sampled request payload metadata). Grow the model catalog deliberately (e.g. add embeddings, classification, summarization) only when a concrete consumer need appears. In parallel, formalize GDPR / data flow documentation and ensure provider metadata (regions, retention, training usage) is centrally registered.
After the platform is stable and governance primitives are embedded, evolve toward strategic differentiation: fine‑tune or adapt local models for domain tasks, add retrieval augmentation with caching and guardrails, experiment with adaptive routing driven by real latency/cost/performance telemetry, and deliver a self‑service developer portal exposing key issuance, quota visibility, model status, and usage analytics. This progressive path keeps risk low while compounding value—each layer builds directly on validated demand rather than speculative infrastructure.
If you’re at the stage of moving from scattered AI experiments to a sustainable internal AI platform, LiteLLM offers the right balance of flexibility and control—while keeping the door open to hosted aggregators for rapid exploration.