Summary
vLLM is an open-source library for serving large language models with high throughput and efficient memory use. Originally developed at UC Berkeley, it is widely used to self-host open-weight models in production, and it exposes an OpenAI-compatible HTTP API so existing applications can point at it with minimal changes.
What is vLLM?
vLLM's core innovation is PagedAttention, a technique that manages the model's attention key/value cache much like virtual memory manages RAM—reducing waste and allowing many requests to be served concurrently. Combined with features such as continuous batching, speculative decoding, and prefix caching, this gives vLLM high throughput for production inference workloads.
It runs across a wide range of hardware (NVIDIA, AMD, Intel, Arm, and dedicated accelerators) and supports distributed inference across multiple GPUs and nodes, which is useful for sovereign LLM serving that is not tied to a single hyperscaler. Because it speaks an OpenAI-compatible API, an application originally written against a cloud LLM can often switch to a self-hosted vLLM backend by changing little more than the base URL. Where Ollama targets easy local experimentation, vLLM targets high-throughput, production-grade serving.
Why is vLLM relevant?
- High throughput: PagedAttention and continuous batching serve many concurrent requests efficiently
- Self-hosted and sovereign: Run open-weight models on your own hardware, free of hyperscaler dependence
- Drop-in API: An OpenAI-compatible endpoint lets existing apps switch backends with minimal change
- Hardware flexibility: Broad accelerator support and distributed inference for scaling