

Jan 31, 2026
vLLM Docker Deployment: Production-Ready Setup Guide (2026)
Inference Research
Introduction
Getting large language models into production is tricky. You need good performance, you can't waste GPU memory, and the whole thing has to be manageable by your ops team. vLLM has become the default choice for teams serious about LLM serving because it delivers up to 24x higher throughput than HuggingFace Transformers through its PagedAttention algorithm. Docker makes deployment practical by giving you reproducibility, isolation from host dependencies, and easy integration with orchestration platforms.
vLLM Docker deployment is the process of containerizing the vLLM inference engine for production LLM serving. Using the official vllm/vllm-openai Docker image, organizations can deploy high-throughput language model inference with PagedAttention memory optimization, OpenAI-compatible APIs, and GPU acceleration via NVIDIA Container Toolkit.
This guide walks you through the complete vLLM deployment journey. We'll start with a single-GPU quickstart, move through Docker Compose configurations for development and production, explore multi-GPU setups with tensor parallelism, and finish with Kubernetes deployment patterns and monitoring integration. Whether you're running experiments locally or preparing for production scale, this guide covers the configurations you need.
What is vLLM and Why Docker?
Before diving into deployment specifics, it helps to understand vLLM's architecture and why containerization makes sense for these workloads.
vLLM Overview
vLLM is an open-source inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. The core innovation is PagedAttention, an attention algorithm inspired by operating system virtual memory concepts. Traditional inference engines allocate contiguous memory blocks for key-value caches, wasting 60-80% of allocated memory due to fragmentation. PagedAttention partitions the KV cache into non-contiguous blocks, reducing memory waste to under 4%.
This memory efficiency translates directly to throughput gains. vLLM achieves 14-24x higher throughput than HuggingFace Transformers and 2-4x higher throughput than Text Generation Inference (TGI). The engine supports continuous batching, CUDA graph optimization, and multiple quantization formats including AWQ, GPTQ, and FP8.
Benefits of Containerized Deployment
Docker containers give you several advantages for LLM inference workloads:
- Environment consistency: The same container runs identically across development, staging, and production environments
- Dependency isolation: CUDA libraries, Python packages, and system dependencies are bundled together, eliminating version conflicts
- Reproducible builds: Pin image tags to keep deployments consistent over time
- Orchestration ready: Containers integrate natively with Kubernetes, Docker Swarm, and other orchestration platforms
- Resource management: Docker's cgroup integration enables fine-grained CPU, memory, and GPU allocation
The official vllm/vllm-openai image includes everything needed for production inference: the vLLM engine, an OpenAI-compatible API server, and optimized CUDA kernels. This eliminates hours of manual dependency management and gives you a tested, working configuration from the start.
Prerequisites
Before deploying vLLM in Docker, make sure your environment meets the hardware and software requirements. Running LLM inference requires significant GPU resources, and proper driver configuration is essential for container GPU access.
System Requirements
vLLM requires an NVIDIA GPU with compute capability 7.0 or higher. This includes V100, T4, A10, A100, and H100 GPUs. For production workloads, we recommend GPUs with at least 24GB of VRAM to handle models like Llama-3.1-8B comfortably. Larger models like Llama-2-70B require multiple GPUs or quantization.
Your host system needs a Linux kernel 5.4 or later. Ubuntu 22.04 LTS is the recommended distribution for production deployments. Windows users can experiment with WSL2, but production deployments should target native Linux.
| Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| NVIDIA GPU | Compute 7.0+ (V100) | A100/H100 | Check with nvidia-smi |
| GPU Memory | 8GB | 24GB+ | Per GPU |
| Docker Engine | 20.10 | 24.0+ | docker --version |
| NVIDIA Driver | 525+ | Latest stable | nvidia-smi |
| NVIDIA Container Toolkit | 1.13+ | Latest | nvidia-ctk --version |
| OS | Linux (kernel 5.4+) | Ubuntu 22.04 | Windows WSL2 experimental |
| Disk Space | 50GB | 200GB+ | For model caching |
| RAM | 16GB | 64GB+ | For model loading overhead |
Installing NVIDIA Container Toolkit
The NVIDIA Container Toolkit enables Docker containers to access host GPUs. Install it using your distribution's package manager:
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access works inside containers:
docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi
You should see your GPU listed in the output. If this command fails, check your NVIDIA driver installation and make sure the container toolkit is properly configured.
Quick Start: Single-GPU Setup
Getting vLLM running in Docker takes just a few commands. This section covers pulling the official image, running the server, and verifying your deployment works correctly.
Pulling the Official Image
The vllm/vllm-openai image provides a production-ready OpenAI-compatible API server. Pull the latest version:
# Pull the official vLLM OpenAI-compatible server image
docker pull vllm/vllm-openai:latest
# Verify the image was downloaded
docker images | grep vllm
The image is approximately 9GB and includes all dependencies for GPU inference. For production deployments, pin to a specific version tag rather than latest to get reproducible builds.
Running vLLM with Docker
The following command starts vLLM with a Llama model using the vllm serve functionality exposed through the container:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
Each flag serves a specific purpose in enabling GPU inference:
| Flag | Purpose | Required | Example |
|---|---|---|---|
| --runtime nvidia | Use NVIDIA container runtime | Yes | --runtime nvidia |
| --gpus all | Enable access to all GPUs | Yes | --gpus all |
| --gpus '"device=0,1"' | Enable specific GPUs | Alternative | --gpus '"device=0,1"' |
| --ipc=host | Share host IPC namespace for multiprocessing | Yes | --ipc=host |
| -v ~/.cache/huggingface:/root/.cache/huggingface | Persist downloaded models | Recommended | See example |
| -p 8000:8000 | Expose vLLM API port | Yes | -p 8000:8000 |
| -e HF_TOKEN=xxx | HuggingFace token for gated models | Conditional | -e HF_TOKEN=hf_xxx |
| --shm-size=1g | Set shared memory size (alternative to --ipc=host) | Optional | --shm-size=1g |
| --name vllm | Assign container name | Recommended | --name vllm-server |
| -d | Run in detached mode | Optional | -d |
| --restart unless-stopped | Auto-restart policy | Production | --restart unless-stopped |
The --ipc=host flag is particularly important. vLLM uses shared memory for communication between processes during inference. Without this flag, you'll hit cryptic IPC errors during model loading or inference.
Warning: The --ipc=host flag shares the host's IPC namespace with the container. For environments with strict security requirements, use --shm-size=1g as an alternative, though this may limit performance with very large batch sizes.
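If you need to avoid sharing the host IPC namespace, the same quickstart can be run with --shm-size substituted for --ipc=host. This is a sketch rather than a drop-in recommendation, and the 1g value may need tuning for large batch sizes:
docker run --runtime nvidia --gpus all \
  --shm-size=1g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000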
Verifying the Deployment
Wait for the model to load. First startup downloads the model weights from HuggingFace, which can take several minutes depending on your connection speed. Subsequent startups load from the cached volume.
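If you started the container detached with a name (for example, the -d and --name vllm-server flags from the table above), you can watch loading progress in the container logs:
# Follow the vLLM server logs while the model downloads and loads
docker logs -f vllm-server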
Follow these steps to deploy vLLM in Docker:
- Install Docker Engine (20.10+) on your Linux host
- Install NVIDIA Container Toolkit for GPU support
- Pull the official image: docker pull vllm/vllm-openai:latest
- Run with GPU access: docker run --runtime nvidia --gpus all
- Add required flags: --ipc=host for shared memory
- Mount HuggingFace cache for model persistence
- Expose port 8000 for API access
- Verify with health check: curl http://localhost:8000/health
Check the health endpoint:
curl http://localhost:8000/health
A healthy server returns a 200 status code. For more detail, use the readiness endpoint:
curl http://localhost:8000/health/ready
Test the API with a chat completion request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 100
}'
You should receive a JSON response with the model's completion. The API is fully OpenAI-compatible, so existing OpenAI client libraries work without modification.
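You can also confirm the exact model name the server is serving, which is the value clients must send in the model field:
# List models exposed by the OpenAI-compatible server
curl http://localhost:8000/v1/models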
Docker Compose for Development
Managing Docker run commands with many flags gets unwieldy during development. Docker Compose provides a cleaner way to define your vLLM configuration and makes iterating on settings much faster.
Basic docker-compose.yml
Create a docker-compose.yml file for your development environment:
version: "3.8"
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ports:
- "8000:8000"
volumes:
- huggingface-cache:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
ipc: host
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--gpu-memory-utilization 0.9
volumes:
huggingface-cache:
Start the service with:
docker compose up
The Compose file centralizes your configuration, making it easy to track changes in version control and share setups across your team.
Environment Variables Configuration
Sensitive values like HuggingFace tokens should not appear in your Compose file. Create a .env file in the same directory:
# .env file for Docker Compose
HF_TOKEN=hf_your_huggingface_token_here
VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
Docker Compose automatically loads variables from .env. Add this file to your .gitignore to prevent committing secrets.
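The VLLM_MODEL variable only takes effect if the Compose file references it. Compose interpolates ${VAR} anywhere in the YAML, so the command block from the example above could be written as this sketch:
command: >
  --model ${VLLM_MODEL}
  --host 0.0.0.0
  --port 8000
  --gpu-memory-utilization 0.9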
For gated models like Llama, you must set HF_TOKEN to a valid HuggingFace access token. Generate one at https://huggingface.co/settings/tokens and make sure you've accepted the model's license agreement.
Volume Mounting for Model Persistence
The named volume huggingface-cache persists downloaded models between container restarts. Without this volume, Docker downloads model weights on every startup, wasting bandwidth and time.
Named volumes have advantages over bind mounts for model storage. Docker manages the volume location, permissions work correctly without manual configuration, and the volume persists even if you remove the container.
For development, you might prefer a bind mount to share a local cache:
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
This approach lets you use models downloaded outside Docker and saves disk space by avoiding duplicate caches.
Multi-GPU Configurations
Single-GPU deployments hit limits quickly with larger models. A Llama-2-70B model requires approximately 140GB of memory in FP16, far exceeding any single GPU. Multi-GPU configurations with tensor parallelism distribute the model across devices, letting you deploy models that would otherwise be impossible.
Understanding Tensor Parallelism
Tensor parallelism splits model layers across multiple GPUs within a single node. Each GPU holds a portion of each layer's weights and computes its share of the output. The GPUs communicate intermediate results through high-bandwidth interconnects like NVLink.
This approach works well when your model exceeds single-GPU memory but fits across available devices. For a 70B parameter model, tensor parallelism across 4 A100-80GB GPUs puts roughly 35GB of weights on each device, leaving headroom for KV cache and activations.
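The sizing arithmetic is worth keeping in mind when choosing a tensor-parallel degree (weights only; KV cache and activations need headroom on top):
# FP16 stores 2 bytes per parameter
70B parameters x 2 bytes   ~ 140 GB of weights in total
140 GB / 4 GPUs            ~ 35 GB of weights per GPU with --tensor-parallel-size 4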
The trade-off is communication overhead. Every forward pass requires synchronization between GPUs, adding latency. NVLink reduces this overhead significantly compared to PCIe, so make sure your hardware configuration supports direct GPU-to-GPU links.
vLLM also supports pipeline parallelism for distributing layers sequentially across GPUs and expert parallelism for Mixture-of-Experts models like Mixtral. But tensor parallelism remains the most common choice for single-node multi-GPU deployments because it balances simplicity and performance well.
Docker with Multiple GPUs
Enable tensor parallelism by adding the --tensor-parallel-size flag:
# Run Llama-2-70B with tensor parallelism across 4 GPUs
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0
vLLM automatically distributes the model across the specified number of GPUs. The --gpus all flag makes all host GPUs visible to the container.
To select specific GPUs, use NVIDIA's device specification syntax:
# Use only GPUs 0 and 1
docker run --runtime nvidia --gpus '"device=0,1"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-13b-hf \
--tensor-parallel-size 2
The quote escaping is intentional: the inner double quotes keep the shell and Docker from splitting the comma-separated device list into separate options.
Memory Optimization Strategies
Even with multiple GPUs, memory pressure remains a concern. Several strategies help maximize available headroom.
Quantization reduces model precision to save memory. AWQ and GPTQ provide 4-bit quantization with minimal quality loss:
# Deploy a 4-bit quantized model (AWQ)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95
Quantized models fit on fewer GPUs and leave more memory for KV cache, improving batch sizes and throughput.
Limiting context length with --max-model-len reduces KV cache requirements. If your application only needs 4096-token contexts, setting this limit prevents vLLM from allocating cache for the model's full context window.
Tuning --gpu-memory-utilization controls how much GPU memory vLLM reserves. The default of 0.9 leaves 10% headroom for CUDA kernels and fragmentation. Increase to 0.95 for memory-constrained setups, but watch for out-of-memory errors under load.
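Putting these levers together, a memory-constrained single-GPU run might look like the following sketch; the model and values are illustrative rather than a recommendation:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0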
Production Docker Compose Stack
Development configurations lack the resilience and observability needed for production. A production stack adds monitoring, proper health checks, restart policies, and resource constraints.
vLLM with Prometheus and Grafana
The following Docker Compose file deploys vLLM alongside Prometheus for metrics collection and Grafana for visualization:
version: "3.8"
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ports:
- "8000:8000"
volumes:
- huggingface-cache:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
ipc: host
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
limits:
memory: 32G
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--tensor-parallel-size 2
--gpu-memory-utilization 0.9
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 300s
restart: unless-stopped
networks:
- vllm-network
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
restart: unless-stopped
networks:
- vllm-network
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
restart: unless-stopped
networks:
- vllm-network
depends_on:
- prometheus
networks:
vllm-network:
driver: bridge
volumes:
huggingface-cache:
prometheus-data:
grafana-data:
Create the Prometheus configuration file prometheus.yml:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'vllm'
static_configs:
- targets: ['vllm:8000']
metrics_path: /metrics
scrape_interval: 10s
Health Checks and Restart Policies
The health check configuration deserves special attention. LLM models take minutes to load, so the start_period must accommodate this startup time. A 300-second start period prevents Docker from killing the container before the model finishes loading.
The restart: unless-stopped policy means the service recovers from crashes but allows manual stops for maintenance. For mission-critical deployments, consider restart: always with proper alerting on restart events.
Resource Limits and Reservations
The deploy.resources section controls container resource allocation. GPU reservations give vLLM exclusive access to required devices. Memory limits prevent runaway processes from affecting other services on the host.
Set memory limits based on your model's requirements plus overhead for the Python runtime and HTTP server. For an 8B parameter model, 32GB provides comfortable headroom. Monitor actual usage and adjust accordingly.
Tip: The count: all GPU reservation claims all available GPUs. For multi-tenant hosts, specify an explicit count or use device_ids to assign specific GPUs to each container.
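To pin a Compose service to particular GPUs instead of claiming all of them, the device reservation accepts device_ids in place of count; a sketch:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0", "1"]
          capabilities: [gpu]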
Kubernetes Deployment
Kubernetes provides solid orchestration for production vLLM deployments. GPU scheduling, rolling updates, and integration with cluster-wide monitoring make Kubernetes the preferred platform for teams already invested in the ecosystem.
Basic Deployment YAML
The following manifest deploys vLLM with GPU resources, proper probes, and secrets management:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm
labels:
app: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--host"
- "0.0.0.0"
- "--port"
- "8000"
- "--gpu-memory-utilization"
- "0.9"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
memory: "32Gi"
requests:
nvidia.com/gpu: 1
memory: "16Gi"
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
volumeMounts:
- name: cache
mountPath: /root/.cache/huggingface
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8000
initialDelaySeconds: 300
periodSeconds: 10
timeoutSeconds: 5
volumes:
- name: cache
persistentVolumeClaim:
claimName: hf-cache-pvc
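The volumes section above references a PersistentVolumeClaim named hf-cache-pvc, which must exist before the pod can start. A minimal sketch, with the size (and the implicit default storage class) as assumptions for your cluster:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi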
Create the HuggingFace token secret before applying the deployment:
kubectl create secret generic hf-secret --from-literal=token=hf_your_token_here
Service and Ingress Configuration
Expose the deployment with a Service:
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
For external access, create an Ingress or use a LoadBalancer service type. Consider adding authentication at the Ingress layer since vLLM does not provide built-in authentication.
The ClusterIP type keeps the service internal to the cluster, which is appropriate when you route traffic through an API gateway or Ingress controller. For direct external access during development, change the type to LoadBalancer, though this exposes the API without authentication.
Production deployments should implement authentication through an API gateway like Kong or Ambassador, or use an Ingress controller with OAuth2 proxy integration. vLLM accepts an --api-key flag for basic token authentication, but real authentication belongs at the infrastructure layer.
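As one example, a minimal Ingress sketch for an ingress-nginx controller; the hostname, ingress class, and absence of TLS are assumptions you would adapt (and pair with authentication as discussed above):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000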
Horizontal Pod Autoscaling
Scaling LLM inference horizontally requires careful thought. Unlike stateless web services, each vLLM replica loads the full model into GPU memory. Scaling adds replicas rather than distributing load across existing resources.
Standard CPU and memory metrics don't capture LLM serving load effectively. Instead, scale on custom metrics like queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm
minReplicas: 1
maxReplicas: 4
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_waiting
target:
type: AverageValue
averageValue: "50"This requires the Prometheus Adapter to expose vLLM metrics as Kubernetes custom metrics. Configure carefully to avoid over-scaling, which consumes expensive GPU resources.
Note: The vLLM project provides Helm charts in examples/online_serving/chart-helm/ for more sophisticated deployments including StatefulSets and PodDisruptionBudgets.
Monitoring and Observability
vLLM exposes Prometheus metrics that reveal inference performance, resource utilization, and potential bottlenecks. Good monitoring catches problems before they impact users.
Key Prometheus Metrics
vLLM exposes metrics at the /metrics endpoint in Prometheus format. The most important metrics for operational visibility:
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| vllm:num_requests_running | Gauge | Currently processing requests | N/A (info) |
| vllm:num_requests_waiting | Gauge | Requests in queue | > 100 |
| vllm:num_requests_swapped | Gauge | Requests swapped to CPU | > 0 |
| vllm:gpu_cache_usage_perc | Gauge | KV cache utilization | > 95% |
| vllm:cpu_cache_usage_perc | Gauge | CPU swap cache utilization | > 50% |
| vllm:time_to_first_token_seconds | Histogram | TTFT latency | p99 > 5s |
| vllm:time_per_output_token_seconds | Histogram | TPOT latency | p99 > 0.1s |
| vllm:e2e_request_latency_seconds | Histogram | Total request latency | p99 > 30s |
| vllm:num_preemptions_total | Counter | Total request preemptions | Increasing trend |
| vllm:prompt_tokens_total | Counter | Total input tokens processed | N/A (throughput) |
| vllm:generation_tokens_total | Counter | Total output tokens generated | N/A (throughput) |
Request metrics show workload patterns. num_requests_running indicates active processing, while num_requests_waiting reveals queue buildup. A steadily growing waiting count signals insufficient capacity.
Latency histograms expose user-facing performance. time_to_first_token_seconds measures how quickly users see initial output. e2e_request_latency_seconds captures total request time including generation.
Cache metrics indicate memory pressure. gpu_cache_usage_perc approaching 100% means the KV cache is full, forcing request rejection or swapping.
Grafana Dashboard Setup
vLLM provides pre-built Grafana dashboards in the repository under examples/online_serving/dashboards/. Import these into your Grafana instance for immediate visibility.
Configure Prometheus as a data source in Grafana, then import the dashboard JSON. The dashboards include panels for request rates, latency percentiles, cache utilization, and GPU metrics.
For custom dashboards, start with these essential panels:
- Request rate (requests/second)
- p50/p95/p99 time-to-first-token latency
- p50/p95/p99 end-to-end latency
- Queue depth over time
- GPU cache utilization percentage
Time-to-first-token (TTFT) latency deserves particular attention for interactive applications. Users perceive responsiveness through how quickly the first token appears, not total generation time. A dashboard panel showing TTFT percentiles helps identify prefill bottlenecks that affect user experience.
For batch processing workloads, focus on throughput metrics instead. Track tokens generated per second and requests completed per minute. These metrics reveal whether your deployment efficiently utilizes GPU resources during sustained load.
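As a starting point, PromQL along these lines (using the metric names from the table above) backs the latency and throughput panels:
# p99 time-to-first-token over the last 5 minutes
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
# Generation throughput in tokens per second
sum(rate(vllm:generation_tokens_total[5m]))
# Completed request rate, derived from the end-to-end latency histogram's count
sum(rate(vllm:e2e_request_latency_seconds_count[5m]))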
Alerting Best Practices
Configure alerts for conditions requiring operator attention:
# alerts.yml
groups:
- name: vllm-alerts
rules:
- alert: VLLMHighQueueDepth
expr: vllm:num_requests_waiting > 100
for: 5m
labels:
severity: warning
annotations:
summary: "High request queue depth"
description: "vLLM has {{ $value }} requests waiting"
- alert: VLLMHighCacheUsage
expr: vllm:gpu_cache_usage_perc > 0.95
for: 10m
labels:
severity: critical
annotations:
summary: "GPU cache near capacity"
description: "KV cache utilization at {{ $value | humanizePercentage }}"
- alert: VLLMHighLatency
expr: histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le)) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "High p99 latency"
description: "p99 latency is {{ $value }}s"Set thresholds based on your SLOs. The examples above provide reasonable starting points, but adjust based on observed baseline performance and business requirements.
Performance Tuning
Default vLLM configurations work well for evaluation but leave performance on the table. Tuning memory utilization, enabling optimized backends, and right-sizing batch parameters can significantly improve throughput and latency.
GPU Memory Utilization
The --gpu-memory-utilization flag controls what fraction of GPU memory vLLM reserves for the KV cache and model weights. Higher values allow more concurrent sequences but risk out-of-memory errors under load.
Start with the default 0.9 and increase incrementally while monitoring for OOM errors. For stable production deployments:
--gpu-memory-utilization 0.85 # Conservative, very stable
--gpu-memory-utilization 0.90 # Default, good balance
--gpu-memory-utilization 0.95 # Aggressive, maximum throughput
Higher utilization benefits batch workloads with predictable request patterns. Lower utilization provides safety margins for bursty traffic with variable sequence lengths.
Quantization in Containers
Quantization reduces model precision from 16-bit to 8-bit or 4-bit representations. This cuts memory requirements substantially, letting you run larger models or higher batch sizes on the same hardware.
vLLM supports several quantization methods:
- AWQ (Activation-aware Weight Quantization): 4-bit, minimal quality loss, fastest inference
- GPTQ: 4-bit, good quality, widely available models
- FP8: 8-bit float, excellent quality, requires Hopper/Ada GPUs
Deploy with quantization:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--gpu-memory-utilization 0.95
Quantized models must be pre-quantized using the specified method. Check HuggingFace for AWQ or GPTQ versions of popular models.
Batch Size Optimization
vLLM uses continuous batching to process multiple requests simultaneously. The --max-num-seqs parameter limits concurrent sequences:
# Performance-optimized deployment for high throughput
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.92 \
--max-model-len 8192 \
--max-num-seqs 256 \
--disable-log-requests
Higher --max-num-seqs improves throughput at the cost of per-request latency. The optimal value depends on your traffic pattern. Batch-oriented offline processing benefits from high values, while interactive applications need lower settings for consistent response times.
The --disable-log-requests flag reduces logging overhead in high-throughput scenarios. Keep request-level logging on while debugging, and add this flag in production unless you actually need per-request logs.
| Argument | Default | Description | Use Case |
|---|---|---|---|
| --model | Required | HuggingFace model ID or local path | All deployments |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism | Multi-GPU setups |
| --pipeline-parallel-size | 1 | Number of GPUs for pipeline parallelism | Multi-node setups |
| --gpu-memory-utilization | 0.9 | Target GPU memory usage (0.0-1.0) | Memory tuning |
| --max-model-len | Model default | Maximum context length | Memory saving |
| --max-num-seqs | 256 | Maximum concurrent sequences | Throughput tuning |
| --max-num-batched-tokens | Auto | Maximum tokens per batch | Memory/latency trade-off |
| --quantization | None | Quantization method (awq, gptq, fp8, squeezellm) | Memory reduction |
| --dtype | auto | Model data type (auto, half, float16, bfloat16) | Precision control |
| --host | localhost | Bind address | Container deployment |
| --port | 8000 | API port | Custom ports |
| --api-key | None | API authentication key | Basic security |
| --served-model-name | Model name | Custom model name in API | API customization |
| --disable-log-requests | False | Disable request logging | Production performance |
| --enable-prefix-caching | False | Enable automatic prefix caching | Repeated prompts |
For production deployments handling hundreds of requests per second, also consider the VLLM_ATTENTION_BACKEND environment variable. The FlashInfer backend offers improved performance on newer GPUs and should be tested against the default FlashAttention backend for your specific hardware and model combination.
Troubleshooting Common Issues
vLLM Docker deployments encounter predictable failure modes. This section covers the most common errors and their solutions.
GPU Not Detected
Symptom: Container starts but fails with "No CUDA-capable device detected" or similar GPU errors.
Causes and Solutions:
- Missing NVIDIA Container Toolkit: Reinstall following the prerequisites section. Verify with docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi.
- Missing --gpus flag: Make sure your Docker run command includes --gpus all or --gpus '"device=0"'.
- Driver mismatch: The container CUDA version must be compatible with your host NVIDIA driver. Check versions with nvidia-smi on the host and compare against the vLLM image requirements.
- Docker Compose syntax: For Compose files, use the deploy.resources.reservations syntax rather than --gpus flags.
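If the cause isn't obvious from the list above, these host-side checks usually narrow it down:
# Driver and GPUs visible on the host?
nvidia-smi
# Is the nvidia runtime registered with Docker?
docker info | grep -i runtimes
# Can a container see the GPUs?
docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi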
CUDA Out of Memory
Symptom: Model fails to load or crashes during inference with "CUDA out of memory" errors.
Solutions:
- Reduce --gpu-memory-utilization from 0.9 to 0.8 or lower
- Enable quantization with --quantization awq or --quantization gptq
- Increase --tensor-parallel-size to spread the model across more GPUs
- Reduce --max-model-len to lower KV cache requirements
- Use a smaller model if hardware is genuinely insufficient
Model Loading Failures
Symptom: Container fails during model download with 401 errors or "model not found" messages.
Solutions:
- Set HF_TOKEN for gated models like Llama. Make sure the token has read permissions.
- Accept the model license on HuggingFace before attempting to download.
- Check network connectivity from within the container. Some corporate networks block HuggingFace.
- Verify the model name is spelled correctly and exists on HuggingFace.
| Error Message | Cause | Solution |
|---|---|---|
| CUDA out of memory | Model too large for GPU memory | Use quantization, increase --tensor-parallel-size, or reduce --max-model-len |
| GPU not detected / No CUDA-capable device | Missing NVIDIA runtime | Install nvidia-container-toolkit, use --gpus all flag |
| Model not found / 401 Unauthorized | Gated model without token | Set HF_TOKEN environment variable |
| KeyboardInterrupt: terminated | K8s probe killed container | Increase probe initialDelaySeconds to 300+ |
| IPC shared memory error | Missing IPC flag | Add --ipc=host or --shm-size=1g |
| Connection refused on port 8000 | Server not ready | Wait for model loading (check logs), verify port mapping |
| torch.cuda.OutOfMemoryError | Insufficient VRAM during inference | Lower --gpu-memory-utilization to 0.8 |
| ValueError: Cannot use FlashAttention | Incompatible GPU architecture | Use --enforce-eager or upgrade GPU |
| RuntimeError: NCCL error | Multi-GPU communication failure | Check NVLink status, try --disable-custom-all-reduce |
| HTTPError: 403 Forbidden | Model license not accepted | Accept license on HuggingFace model page |
Kubernetes Probe Timeouts
Symptom: Pod repeatedly restarts with "KeyboardInterrupt: terminated" in logs.
Cause: Kubernetes kills the container because health probes fail during model loading.
Solution: Increase initialDelaySeconds to 300 or higher on both liveness and readiness probes. Large models like Llama-2-70B can take 5+ minutes to load even with fast storage.
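A startupProbe is also worth considering: it gives the container one long window to finish loading, after which liveness checks can run on a tighter schedule. A sketch with illustrative values:
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60   # up to 60 x 10s = 10 minutes for model loading
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 10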
Slow Inference Performance
Symptom: Model responds but latency is higher than expected based on benchmarks.
Causes and Solutions:
- Warm-up after startup: vLLM compiles CUDA graphs and warms its kernels around engine startup, so the first requests after startup are slower than steady state. This is normal.
- Suboptimal batch configuration: Very low --max-num-seqs limits concurrent processing. Increase to allow better GPU utilization.
- Insufficient GPU memory: When --gpu-memory-utilization is too low, the KV cache cannot hold enough sequences for efficient batching. Increase utilization if memory allows.
- Network overhead: For remote model storage or slow HuggingFace downloads, subsequent inference is fast but initial requests hit network latency. Use local model caching to eliminate this delay.
Conclusion
Deploying vLLM with Docker turns complex LLM infrastructure into manageable, reproducible configurations. From single-GPU development setups to multi-GPU production stacks with monitoring, containerization provides the consistency and portability modern teams expect.
The key takeaways from this guide:
- Start with the official vllm/vllm-openai image and essential flags (--gpus all, --ipc=host, volume mounts)
- Use Docker Compose to manage configuration complexity and enable rapid iteration
- Scale to multi-GPU with tensor parallelism for models exceeding single-GPU memory
- Build production stacks with Prometheus and Grafana for operational visibility
- Deploy to Kubernetes with extended probe timeouts to accommodate model loading
For teams who want the power of vLLM without managing Docker infrastructure, there's an easier path.
Skip the Docker complexity. Deploy vLLM on Inference.net in minutes.
Inference.net offers fully managed vLLM deployments with automatic scaling, built-in monitoring, and enterprise-grade reliability. Focus on building your application while Inference.net handles the infrastructure.
Own your model. Scale with confidence.
Schedule a call with our research team to learn more about custom training. We'll propose a plan that beats your current SLA and unit cost.





