

Jan 31, 2026
vLLM Docker Deployment: Production-Ready Setup Guide (2026)
Inference Research
Introduction
Getting large language models into production is tricky. You need good performance, you can't waste GPU memory, and the whole thing has to be manageable by your ops team. vLLM has become the default choice for teams serious about LLM serving because it delivers up to 24x higher throughput than HuggingFace Transformers through its PagedAttention algorithm. Docker makes deployment practical by giving you reproducibility, isolation from host dependencies, and easy integration with orchestration platforms.
vLLM Docker deployment is the process of containerizing the vLLM inference engine for production LLM serving. Using the official vllm/vllm-openai Docker image, organizations can deploy high-throughput language model inference with PagedAttention memory optimization, OpenAI-compatible APIs, and GPU acceleration via NVIDIA Container Toolkit.
This guide walks you through the complete vLLM deployment journey. We'll start with a single-GPU quickstart, move through Docker Compose configurations for development and production, explore multi-GPU setups with tensor parallelism, and finish with Kubernetes deployment patterns and monitoring integration. Whether you're running experiments locally or preparing for production scale, this guide covers the configurations you need.
What is vLLM and Why Docker?
Before diving into deployment specifics, it helps to understand vLLM's architecture and why containerization makes sense for these workloads.
vLLM Overview
vLLM is an open-source inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. The core innovation is PagedAttention, an attention algorithm inspired by operating system virtual memory concepts. Traditional inference engines allocate contiguous memory blocks for key-value caches, wasting 60-80% of allocated memory due to fragmentation. PagedAttention partitions the KV cache into non-contiguous blocks, reducing memory waste to under 4%.
This memory efficiency translates directly to throughput gains. vLLM achieves 14-24x higher throughput than HuggingFace Transformers and 2-4x higher throughput than Text Generation Inference (TGI). The engine supports continuous batching, CUDA graph optimization, and multiple quantization formats including AWQ, GPTQ, and FP8.
Benefits of Containerized Deployment
Docker containers give you several advantages for LLM inference workloads:
- Environment consistency: The same container runs identically across development, staging, and production environments
- Dependency isolation: CUDA libraries, Python packages, and system dependencies are bundled together, eliminating version conflicts
- Reproducible builds: Pin image tags to keep deployments consistent over time
- Orchestration ready: Containers integrate natively with Kubernetes, Docker Swarm, and other orchestration platforms
- Resource management: Docker's cgroup integration enables fine-grained CPU, memory, and GPU allocation
The official vllm/vllm-openai image includes everything needed for production inference: the vLLM engine, an OpenAI-compatible API server, and optimized CUDA kernels. This eliminates hours of manual dependency management and gives you a tested, working configuration from the start.
Prerequisites
Before deploying vLLM in Docker, make sure your environment meets the hardware and software requirements. Running LLM inference requires significant GPU resources, and proper driver configuration is essential for container GPU access.
System Requirements
vLLM requires an NVIDIA GPU with compute capability 7.0 or higher. This includes V100, T4, A10, A100, and H100 GPUs. For production workloads, we recommend GPUs with at least 24GB of VRAM to handle models like Llama-3.1-8B comfortably. Larger models like Llama-2-70B require multiple GPUs or quantization.
Your host system needs a Linux kernel 5.4 or later. Ubuntu 22.04 LTS is the recommended distribution for production deployments. Windows users can experiment with WSL2, but production deployments should target native Linux.
| Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| NVIDIA GPU | Compute 7.0+ (V100) | A100/H100 | Check with nvidia-smi |
| GPU Memory | 8GB | 24GB+ | Per GPU |
| Docker Engine | 20.10 | 24.0+ | docker --version |
| NVIDIA Driver | 525+ | Latest stable | nvidia-smi |
| NVIDIA Container Toolkit | 1.13+ | Latest | nvidia-ctk --version |
| OS | Linux (kernel 5.4+) | Ubuntu 22.04 | Windows WSL2 experimental |
| Disk Space | 50GB | 200GB+ | For model caching |
| RAM | 16GB | 64GB+ | For model loading overhead |
Installing NVIDIA Container Toolkit
The NVIDIA Container Toolkit enables Docker containers to access host GPUs. Install it using your distribution's package manager:
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access works inside containers:
docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi
You should see your GPU listed in the output. If this command fails, check your NVIDIA driver installation and make sure the container toolkit is properly configured.
Quick Start: Single-GPU Setup
Getting vLLM running in Docker takes just a few commands. This section covers pulling the official image, running the server, and verifying your deployment works correctly.
Pulling the Official Image
The vllm/vllm-openai image provides a production-ready OpenAI-compatible API server. Pull the latest version:
# Pull the official vLLM OpenAI-compatible server image
docker pull vllm/vllm-openai:latest
# Verify the image was downloaded
docker images | grep vllm
The image is approximately 9GB and includes all dependencies for GPU inference. For production deployments, pin to a specific version tag rather than latest to get reproducible builds.
Running vLLM with Docker
The following command starts vLLM with a Llama model using the vllm serve functionality exposed through the container:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
Each flag serves a specific purpose in enabling GPU inference:
| Flag | Purpose | Required | Example |
|---|---|---|---|
| --runtime nvidia | Use NVIDIA container runtime | Yes | --runtime nvidia |
| --gpus all | Enable access to all GPUs | Yes | --gpus all |
| --gpus '"device=0,1"' | Enable specific GPUs | Alternative | --gpus '"device=0,1"' |
| --ipc=host | Share host IPC namespace for multiprocessing | Yes | --ipc=host |
| -v ~/.cache/huggingface:/root/.cache/huggingface | Persist downloaded models | Recommended | See example |
| -p 8000:8000 | Expose vLLM API port | Yes | -p 8000:8000 |
| -e HF_TOKEN=xxx | HuggingFace token for gated models | Conditional | -e HF_TOKEN=hf_xxx |
| --shm-size=1g | Set shared memory size (alternative to --ipc=host) | Optional | --shm-size=1g |
| --name vllm | Assign container name | Recommended | --name vllm-server |
| -d | Run in detached mode | Optional | -d |
| --restart unless-stopped | Auto-restart policy | Production | --restart unless-stopped |
The --ipc=host flag is particularly important. vLLM uses shared memory for communication between processes during inference. Without this flag, you'll hit cryptic IPC errors during model loading or inference.
Warning: The --ipc=host flag shares the host's IPC namespace with the container. For environments with strict security requirements, use --shm-size=1g as an alternative, though this may limit performance with very large batch sizes.
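If you need to avoid sharing the host IPC namespace, the same quickstart can be run with --shm-size substituted for --ipc=host. This is a sketch rather than a drop-in recommendation, and the 1g value may need tuning for large batch sizes:
docker run --runtime nvidia --gpus all \
  --shm-size=1g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000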
Verifying the Deployment
Wait for the model to load. First startup downloads the model weights from HuggingFace, which can take several minutes depending on your connection speed. Subsequent startups load from the cached volume.
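If you started the container detached with a name (for example, the -d and --name vllm-server flags from the table above), you can watch loading progress in the container logs:
# Follow the vLLM server logs while the model downloads and loads
docker logs -f vllm-server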
Follow these steps to deploy vLLM in Docker:
- Install Docker Engine (20.10+) on your Linux host
- Install NVIDIA Container Toolkit for GPU support
- Pull the official image: docker pull vllm/vllm-openai:latest
- Run with GPU access: docker run --runtime nvidia --gpus all
- Add required flags: --ipc=host for shared memory
- Mount HuggingFace cache for model persistence
- Expose port 8000 for API access
- Verify with health check: curl http://localhost:8000/health
Check the health endpoint:
curl http://localhost:8000/health
A healthy server returns a 200 status code. For more detail, use the readiness endpoint:
curl http://localhost:8000/health/ready
Test the API with a chat completion request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 100
}'
You should receive a JSON response with the model's completion. The API is fully OpenAI-compatible, so existing OpenAI client libraries work without modification.
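You can also confirm the exact model name the server is serving, which is the value clients must send in the model field:
# List models exposed by the OpenAI-compatible server
curl http://localhost:8000/v1/models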
Docker Compose for Development
Managing Docker run commands with many flags gets unwieldy during development. Docker Compose provides a cleaner way to define your vLLM configuration and makes iterating on settings much faster.
Basic docker-compose.yml
Create a docker-compose.yml file for your development environment:
version: "3.8"
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ports:
- "8000:8000"
volumes:
- huggingface-cache:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
ipc: host
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--gpu-memory-utilization 0.9
volumes:
huggingface-cache:
Start the service with:
docker compose up
The Compose file centralizes your configuration, making it easy to track changes in version control and share setups across your team.
Environment Variables Configuration
Sensitive values like HuggingFace tokens should not appear in your Compose file. Create a .env file in the same directory:
# .env file for Docker Compose
HF_TOKEN=hf_your_huggingface_token_here
VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
Docker Compose automatically loads variables from .env. Add this file to your .gitignore to prevent committing secrets.
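The VLLM_MODEL variable only takes effect if the Compose file references it. Compose interpolates ${VAR} anywhere in the YAML, so the command block from the example above could be written as this sketch:
command: >
  --model ${VLLM_MODEL}
  --host 0.0.0.0
  --port 8000
  --gpu-memory-utilization 0.9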
For gated models like Llama, you must set HF_TOKEN to a valid HuggingFace access token. Generate one at https://huggingface.co/settings/tokens and make sure you've accepted the model's license agreement.
Volume Mounting for Model Persistence
The named volume huggingface-cache persists downloaded models between container restarts. Without this volume, Docker downloads model weights on every startup, wasting bandwidth and time.
Named volumes have advantages over bind mounts for model storage. Docker manages the volume location, permissions work correctly without manual configuration, and the volume persists even if you remove the container.
For development, you might prefer a bind mount to share a local cache:
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
This approach lets you use models downloaded outside Docker and saves disk space by avoiding duplicate caches.
Multi-GPU Configurations
Single-GPU deployments hit limits quickly with larger models. A Llama-2-70B model requires approximately 140GB of memory in FP16, far exceeding any single GPU. Multi-GPU configurations with tensor parallelism distribute the model across devices, letting you deploy models that would otherwise be impossible.
Understanding Tensor Parallelism
Tensor parallelism splits model layers across multiple GPUs within a single node. Each GPU holds a portion of each layer's weights and computes its share of the output. The GPUs communicate intermediate results through high-bandwidth interconnects like NVLink.
This approach works well when your model exceeds single-GPU memory but fits across available devices. For a 70B parameter model, tensor parallelism across 4 A100-80GB GPUs puts roughly 35GB of weights on each device, leaving headroom for KV cache and activations.
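The sizing arithmetic is worth keeping in mind when choosing a tensor-parallel degree (weights only; KV cache and activations need headroom on top):
# FP16 stores 2 bytes per parameter
70B parameters x 2 bytes   ~ 140 GB of weights in total
140 GB / 4 GPUs            ~ 35 GB of weights per GPU with --tensor-parallel-size 4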
The trade-off is communication overhead. Every forward pass requires synchronization between GPUs, adding latency. NVLink reduces this overhead significantly compared to PCIe, so make sure your hardware configuration supports direct GPU-to-GPU links.
vLLM also supports pipeline parallelism for distributing layers sequentially across GPUs and expert parallelism for Mixture-of-Experts models like Mixtral. But tensor parallelism remains the most common choice for single-node multi-GPU deployments because it balances simplicity and performance well.
Docker with Multiple GPUs
Enable tensor parallelism by adding the --tensor-parallel-size flag:
# Run Llama-2-70B with tensor parallelism across 4 GPUs
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0
vLLM automatically distributes the model across the specified number of GPUs. The --gpus all flag makes all host GPUs visible to the container.
To select specific GPUs, use NVIDIA's device specification syntax:
# Use only GPUs 0 and 1
docker run --runtime nvidia --gpus '"device=0,1"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-13b-hf \
--tensor-parallel-size 2
The quote escaping is intentional: the inner double quotes keep the shell and Docker from splitting the comma-separated device list into separate options.
Memory Optimization Strategies
Even with multiple GPUs, memory pressure remains a concern. Several strategies help maximize available headroom.
Quantization reduces model precision to save memory. AWQ and GPTQ provide 4-bit quantization with minimal quality loss:
# Deploy a 4-bit quantized model (AWQ)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95
Quantized models fit on fewer GPUs and leave more memory for KV cache, improving batch sizes and throughput.
Limiting context length with --max-model-len reduces KV cache requirements. If your application only needs 4096-token contexts, setting this limit prevents vLLM from allocating cache for the model's full context window.
Tuning --gpu-memory-utilization controls how much GPU memory vLLM reserves. The default of 0.9 leaves 10% headroom for CUDA kernels and fragmentation. Increase to 0.95 for memory-constrained setups, but watch for out-of-memory errors under load.
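Putting these levers together, a memory-constrained single-GPU run might look like the following sketch; the model and values are illustrative rather than a recommendation:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0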
Production Docker Compose Stack
Development configurations lack the resilience and observability needed for production. A production stack adds monitoring, proper health checks, restart policies, and resource constraints.
vLLM with Prometheus and Grafana
The following Docker Compose file deploys vLLM alongside Prometheus for metrics collection and Grafana for visualization:
version: "3.8"
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ports:
- "8000:8000"
volumes:
- huggingface-cache:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
ipc: host
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
limits:
memory: 32G
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--tensor-parallel-size 2
--gpu-memory-utilization 0.9
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 300s
restart: unless-stopped
networks:
- vllm-network
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
restart: unless-stopped
networks:
- vllm-network
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
restart: unless-stopped
networks:
- vllm-network
depends_on:
- prometheus
networks:
vllm-network:
driver: bridge
volumes:
huggingface-cache:
prometheus-data:
grafana-data:
Create the Prometheus configuration file prometheus.yml:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'vllm'
static_configs:
- targets: ['vllm:8000']
metrics_path: /metrics
scrape_interval: 10s
Health Checks and Restart Policies
The health check configuration deserves special attention. LLM models take minutes to load, so the start_period must accommodate this startup time. A 300-second start period prevents Docker from killing the container before the model finishes loading.
The restart: unless-stopped policy means the service recovers from crashes but allows manual stops for maintenance. For mission-critical deployments, consider restart: always with proper alerting on restart events.
Resource Limits and Reservations
The deploy.resources section controls container resource allocation. GPU reservations give vLLM exclusive access to required devices. Memory limits prevent runaway processes from affecting other services on the host.
Set memory limits based on your model's requirements plus overhead for the Python runtime and HTTP server. For an 8B parameter model, 32GB provides comfortable headroom. Monitor actual usage and adjust accordingly.
Tip: The count: all GPU reservation claims all available GPUs. For multi-tenant hosts, specify an explicit count or use device_ids to assign specific GPUs to each container.
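To pin a Compose service to particular GPUs instead of claiming all of them, the device reservation accepts device_ids in place of count; a sketch:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0", "1"]
          capabilities: [gpu]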
Kubernetes Deployment
Kubernetes provides solid orchestration for production vLLM deployments. GPU scheduling, rolling updates, and integration with cluster-wide monitoring make Kubernetes the preferred platform for teams already invested in the ecosystem.
Basic Deployment YAML
The following manifest deploys vLLM with GPU resources, proper probes, and secrets management:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm
labels:
app: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--host"
- "0.0.0.0"
- "--port"
- "8000"
- "--gpu-memory-utilization"
- "0.9"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
memory: "32Gi"
requests:
nvidia.com/gpu: 1
memory: "16Gi"
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
volumeMounts:
- name: cache
mountPath: /root/.cache/huggingface
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8000
initialDelaySeconds: 300
periodSeconds: 10
timeoutSeconds: 5
volumes:
- name: cache
persistentVolumeClaim:
claimName: hf-cache-pvc
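The volumes section above references a PersistentVolumeClaim named hf-cache-pvc, which must exist before the pod can start. A minimal sketch, with the size (and the implicit default storage class) as assumptions for your cluster:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi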
Create the HuggingFace token secret before applying the deployment:
kubectl create secret generic hf-secret --from-literal=token=hf_your_token_here
Service and Ingress Configuration
Expose the deployment with a Service:
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
For external access, create an Ingress or use a LoadBalancer service type. Consider adding authentication at the Ingress layer since vLLM does not provide built-in authentication.
The ClusterIP type keeps the service internal to the cluster, which is appropriate when you route traffic through an API gateway or Ingress controller. For direct external access during development, change the type to LoadBalancer, though this exposes the API without authentication.
Production deployments should implement authentication through an API gateway like Kong or Ambassador, or use an Ingress controller with OAuth2 proxy integration. vLLM accepts an --api-key flag for basic token authentication, but real authentication belongs at the infrastructure layer.
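As one example, a minimal Ingress sketch for an ingress-nginx controller; the hostname, ingress class, and absence of TLS are assumptions you would adapt (and pair with authentication as discussed above):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000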
Horizontal Pod Autoscaling
Scaling LLM inference horizontally requires careful thought. Unlike stateless web services, each vLLM replica loads the full model into GPU memory. Scaling adds replicas rather than distributing load across existing resources.
Standard CPU and memory metrics don't capture LLM serving load effectively. Instead, scale on custom metrics like queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm
minReplicas: 1
maxReplicas: 4
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_waiting
target:
type: AverageValue
averageValue: "50"This requires the Prometheus Adapter to expose vLLM metrics as Kubernetes custom metrics. Configure carefully to avoid over-scaling, which consumes expensive GPU resources.
Note: The vLLM project provides Helm charts in examples/online_serving/chart-helm/ for more sophisticated deployments including StatefulSets and PodDisruptionBudgets.
Monitoring and Observability
vLLM exposes Prometheus metrics that reveal inference performance, resource utilization, and potential bottlenecks. Good monitoring catches problems before they impact users.
Key Prometheus Metrics
vLLM exposes metrics at the /metrics endpoint in Prometheus format. The most important metrics for operational visibility:
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| vllm:num_requests_running | Gauge | Currently processing requests | N/A (info) |
| vllm:num_requests_waiting | Gauge | Requests in queue | > 100 |
| vllm:num_requests_swapped | Gauge | Requests swapped to CPU | > 0 |
| vllm:gpu_cache_usage_perc | Gauge | KV cache utilization | > 95% |
| vllm:cpu_cache_usage_perc | Gauge | CPU swap cache utilization | > 50% |
| vllm:time_to_first_token_seconds | Histogram | TTFT latency | p99 > 5s |
| vllm:time_per_output_token_seconds | Histogram | TPOT latency | p99 > 0.1s |
| vllm:e2e_request_latency_seconds | Histogram | Total request latency | p99 > 30s |
| vllm:num_preemptions_total | Counter | Total request preemptions | Increasing trend |
| vllm:prompt_tokens_total | Counter | Total input tokens processed | N/A (throughput) |
| vllm:generation_tokens_total | Counter | Total output tokens generated | N/A (throughput) |
Request metrics show workload patterns. num_requests_running indicates active processing, while num_requests_waiting reveals queue buildup. A steadily growing waiting count signals insufficient capacity.
Latency histograms expose user-facing performance. time_to_first_token_seconds measures how quickly users see initial output. e2e_request_latency_seconds captures total request time including generation.
Cache metrics indicate memory pressure. gpu_cache_usage_perc approaching 100% means the KV cache is full, forcing request rejection or swapping.
Grafana Dashboard Setup
vLLM provides pre-built Grafana dashboards in the repository under examples/online_serving/dashboards/. Import these into your Grafana instance for immediate visibility.
Configure Prometheus as a data source in Grafana, then import the dashboard JSON. The dashboards include panels for request rates, latency percentiles, cache utilization, and GPU metrics.
For custom dashboards, start with these essential panels:
- Request rate (requests/second)
- p50/p95/p99 time-to-first-token latency
- p50/p95/p99 end-to-end latency
- Queue depth over time
- GPU cache utilization percentage
Time-to-first-token (TTFT) latency deserves particular attention for interactive applications. Users perceive responsiveness through how quickly the first token appears, not total generation time. A dashboard panel showing TTFT percentiles helps identify prefill bottlenecks that affect user experience.
For batch processing workloads, focus on throughput metrics instead. Track tokens generated per second and requests completed per minute. These metrics reveal whether your deployment efficiently utilizes GPU resources during sustained load.
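As a starting point, PromQL along these lines (using the metric names from the table above) backs the latency and throughput panels:
# p99 time-to-first-token over the last 5 minutes
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
# Generation throughput in tokens per second
sum(rate(vllm:generation_tokens_total[5m]))
# Completed request rate, derived from the end-to-end latency histogram's count
sum(rate(vllm:e2e_request_latency_seconds_count[5m]))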
Alerting Best Practices
Configure alerts for conditions requiring operator attention:
# alerts.yml
groups:
- name: vllm-alerts
rules:
- alert: VLLMHighQueueDepth
expr: vllm:num_requests_waiting > 100
for: 5m
labels:
severity: warning
annotations:
summary: "High request queue depth"
description: "vLLM has {{ $value }} requests waiting"
- alert: VLLMHighCacheUsage
expr: vllm:gpu_cache_usage_perc > 0.95
for: 10m
labels:
severity: critical
annotations:
summary: "GPU cache near capacity"
description: "KV cache utilization at {{ $value | humanizePercentage }}"
- alert: VLLMHighLatency
expr: histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le)) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "High p99 latency"
description: "p99 latency is {{ $value }}s"Set thresholds based on your SLOs. The examples above provide reasonable starting points, but adjust based on observed baseline performance and business requirements.
Performance Tuning
Default vLLM configurations work well for evaluation but leave performance on the table. Tuning memory utilization, enabling optimized backends, and right-sizing batch parameters can significantly improve throughput and latency.
GPU Memory Utilization
The --gpu-memory-utilization flag controls what fraction of GPU memory vLLM reserves for the KV cache and model weights. Higher values allow more concurrent sequences but risk out-of-memory errors under load.
Start with the default 0.9 and increase incrementally while monitoring for OOM errors. For stable production deployments:
--gpu-memory-utilization 0.85 # Conservative, very stable
--gpu-memory-utilization 0.90 # Default, good balance
--gpu-memory-utilization 0.95 # Aggressive, maximum throughput
Higher utilization benefits batch workloads with predictable request patterns. Lower utilization provides safety margins for bursty traffic with variable sequence lengths.
Quantization in Containers
Quantization reduces model precision from 16-bit to 8-bit or 4-bit representations. This cuts memory requirements substantially, letting you run larger models or higher batch sizes on the same hardware.
vLLM supports several quantization methods:
- AWQ (Activation-aware Weight Quantization): 4-bit, minimal quality loss, fastest inference
- GPTQ: 4-bit, good quality, widely available models
- FP8: 8-bit float, excellent quality, requires Hopper/Ada GPUs
Deploy with quantization:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--gpu-memory-utilization 0.95
Quantized models must be pre-quantized using the specified method. Check HuggingFace for AWQ or GPTQ versions of popular models.
Batch Size Optimization
vLLM uses continuous batching to process multiple requests simultaneously. The --max-num-seqs parameter limits concurrent sequences:
# Performance-optimized deployment for high throughput
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.92 \
--max-model-len 8192 \
--max-num-seqs 256 \
--disable-log-requests
Higher --max-num-seqs improves throughput at the cost of per-request latency. The optimal value depends on your traffic pattern. Batch-oriented offline processing benefits from high values, while interactive applications need lower settings for consistent response times.
The --disable-log-requests flag reduces logging overhead in high-throughput scenarios. Keep request-level logging on while debugging, and add this flag in production unless you actually need per-request logs.
| Argument | Default | Description | Use Case |
|---|---|---|---|
| --model | Required | HuggingFace model ID or local path | All deployments |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism | Multi-GPU setups |
| --pipeline-parallel-size | 1 | Number of GPUs for pipeline parallelism | Multi-node setups |
| --gpu-memory-utilization | 0.9 | Target GPU memory usage (0.0-1.0) | Memory tuning |
| --max-model-len | Model default | Maximum context length | Memory saving |
| --max-num-seqs | 256 | Maximum concurrent sequences | Throughput tuning |
| --max-num-batched-tokens | Auto | Maximum tokens per batch | Memory/latency trade-off |
| --quantization | None | Quantization method (awq, gptq, fp8, squeezellm) | Memory reduction |
| --dtype | auto | Model data type (auto, half, float16, bfloat16) | Precision control |
| --host | localhost | Bind address | Container deployment |
| --port | 8000 | API port | Custom ports |
| --api-key | None | API authentication key | Basic security |
| --served-model-name | Model name | Custom model name in API | API customization |
| --disable-log-requests | False | Disable request logging | Production performance |
| --enable-prefix-caching | False | Enable automatic prefix caching | Repeated prompts |
For production deployments handling hundreds of requests per second, also consider the VLLM_ATTENTION_BACKEND environment variable. The FlashInfer backend offers improved performance on newer GPUs and should be tested against the default FlashAttention backend for your specific hardware and model combination.
Troubleshooting Common Issues
vLLM Docker deployments encounter predictable failure modes. This section covers the most common errors and their solutions.
GPU Not Detected
Symptom: Container starts but fails with "No CUDA-capable device detected" or similar GPU errors.
Causes and Solutions:
- Missing NVIDIA Container Toolkit: Reinstall following the prerequisites section. Verify with docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi.
- Missing --gpus flag: Make sure your Docker run command includes --gpus all or --gpus '"device=0"'.
- Driver mismatch: The container CUDA version must be compatible with your host NVIDIA driver. Check versions with nvidia-smi on the host and compare against the vLLM image requirements.
- Docker Compose syntax: For Compose files, use the deploy.resources.reservations syntax rather than --gpus flags.
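If the cause isn't obvious from the list above, these host-side checks usually narrow it down:
# Driver and GPUs visible on the host?
nvidia-smi
# Is the nvidia runtime registered with Docker?
docker info | grep -i runtimes
# Can a container see the GPUs?
docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi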
CUDA Out of Memory
Symptom: Model fails to load or crashes during inference with "CUDA out of memory" errors.
Solutions:
- Reduce --gpu-memory-utilization from 0.9 to 0.8 or lower
- Enable quantization with --quantization awq or --quantization gptq
- Increase --tensor-parallel-size to spread the model across more GPUs
- Reduce --max-model-len to lower KV cache requirements
- Use a smaller model if hardware is genuinely insufficient
Model Loading Failures
Symptom: Container fails during model download with 401 errors or "model not found" messages.
Solutions:
- Set HF_TOKEN for gated models like Llama. Make sure the token has read permissions.
- Accept the model license on HuggingFace before attempting to download.
- Check network connectivity from within the container. Some corporate networks block HuggingFace.
- Verify the model name is spelled correctly and exists on HuggingFace.
| Error Message | Cause | Solution |
|---|---|---|
| CUDA out of memory | Model too large for GPU memory | Use quantization, increase --tensor-parallel-size, or reduce --max-model-len |
| GPU not detected / No CUDA-capable device | Missing NVIDIA runtime | Install nvidia-container-toolkit, use --gpus all flag |
| Model not found / 401 Unauthorized | Gated model without token | Set HF_TOKEN environment variable |
| KeyboardInterrupt: terminated | K8s probe killed container | Increase probe initialDelaySeconds to 300+ |
| IPC shared memory error | Missing IPC flag | Add --ipc=host or --shm-size=1g |
| Connection refused on port 8000 | Server not ready | Wait for model loading (check logs), verify port mapping |
| torch.cuda.OutOfMemoryError | Insufficient VRAM during inference | Lower --gpu-memory-utilization to 0.8 |
| ValueError: Cannot use FlashAttention | Incompatible GPU architecture | Use --enforce-eager or upgrade GPU |
| RuntimeError: NCCL error | Multi-GPU communication failure | Check NVLink status, try --disable-custom-all-reduce |
| HTTPError: 403 Forbidden | Model license not accepted | Accept license on HuggingFace model page |
Kubernetes Probe Timeouts
Symptom: Pod repeatedly restarts with "KeyboardInterrupt: terminated" in logs.
Cause: Kubernetes kills the container because health probes fail during model loading.
Solution: Increase initialDelaySeconds to 300 or higher on both liveness and readiness probes. Large models like Llama-2-70B can take 5+ minutes to load even with fast storage.
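A startupProbe is also worth considering: it gives the container one long window to finish loading, after which liveness checks can run on a tighter schedule. A sketch with illustrative values:
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60   # up to 60 x 10s = 10 minutes for model loading
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 10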
Slow Inference Performance
Symptom: Model responds but latency is higher than expected based on benchmarks.
Causes and Solutions:
- Warm-up after startup: vLLM compiles CUDA graphs and warms its kernels around engine startup, so the first requests after startup are slower than steady state. This is normal.
- Suboptimal batch configuration: Very low --max-num-seqs limits concurrent processing. Increase to allow better GPU utilization.
- Insufficient GPU memory: When --gpu-memory-utilization is too low, the KV cache cannot hold enough sequences for efficient batching. Increase utilization if memory allows.
- Network overhead: For remote model storage or slow HuggingFace downloads, subsequent inference is fast but initial requests hit network latency. Use local model caching to eliminate this delay.
Conclusion
Deploying vLLM with Docker turns complex LLM infrastructure into manageable, reproducible configurations. From single-GPU development setups to multi-GPU production stacks with monitoring, containerization provides the consistency and portability modern teams expect.
The key takeaways from this guide:
- Start with the official vllm/vllm-openai image and essential flags (--gpus all, --ipc=host, volume mounts)
- Use Docker Compose to manage configuration complexity and enable rapid iteration
- Scale to multi-GPU with tensor parallelism for models exceeding single-GPU memory
- Build production stacks with Prometheus and Grafana for operational visibility
- Deploy to Kubernetes with extended probe timeouts to accommodate model loading
For teams who want the power of vLLM without managing Docker infrastructure, there's an easier path.
Skip the Docker complexity. Deploy vLLM on Inference.net in minutes.
Inference.net offers fully managed vLLM deployments with automatic scaling, built-in monitoring, and enterprise-grade reliability. Focus on building your application while Inference.net handles the infrastructure.
Own your model. Scale with confidence.
Schedule a call with our research team to learn more about custom training. We'll propose a plan that beats your current SLA and unit cost.





