    Jan 31, 2026

    vLLM Docker Deployment: Production-Ready Setup Guide (2026)

    Inference Research

    Introduction

    Getting large language models into production is tricky. You need good performance, you can't waste GPU memory, and the whole thing has to be manageable by your ops team. vLLM has become the default choice for teams serious about LLM serving because it delivers up to 24x higher throughput than HuggingFace Transformers through its PagedAttention algorithm. Docker makes deployment practical by giving you reproducibility, isolation from host dependencies, and easy integration with orchestration platforms.

    vLLM Docker deployment is the process of containerizing the vLLM inference engine for production LLM serving. Using the official vllm/vllm-openai Docker image, organizations can deploy high-throughput language model inference with PagedAttention memory optimization, OpenAI-compatible APIs, and GPU acceleration via NVIDIA Container Toolkit.

    This guide walks you through the complete vLLM deployment journey. We'll start with a single-GPU quickstart, move through Docker Compose configurations for development and production, explore multi-GPU setups with tensor parallelism, and finish with Kubernetes deployment patterns and monitoring integration. Whether you're running experiments locally or preparing for production scale, this guide covers the configurations you need.

    What is vLLM and Why Docker?

    Before diving into deployment specifics, it helps to understand vLLM's architecture and why containerization makes sense for these workloads.

    vLLM Overview

    vLLM is an open-source inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. The core innovation is PagedAttention, an attention algorithm inspired by operating system virtual memory concepts. Traditional inference engines allocate contiguous memory blocks for key-value caches, wasting 60-80% of allocated memory due to fragmentation. PagedAttention partitions the KV cache into non-contiguous blocks, reducing memory waste to under 4%.

    This memory efficiency translates directly to throughput gains. vLLM achieves 14-24x higher throughput than HuggingFace Transformers and 2-4x higher throughput than Text Generation Inference (TGI). The engine supports continuous batching, CUDA graph optimization, and multiple quantization formats including AWQ, GPTQ, and FP8.

    Benefits of Containerized Deployment

    Docker containers give you several advantages for LLM inference workloads:

    • Environment consistency: The same container runs identically across development, staging, and production environments
    • Dependency isolation: CUDA libraries, Python packages, and system dependencies are bundled together, eliminating version conflicts
    • Reproducible builds: Pin image tags to keep deployments consistent over time
    • Orchestration ready: Containers integrate natively with Kubernetes, Docker Swarm, and other orchestration platforms
    • Resource management: Docker's cgroup integration enables fine-grained CPU, memory, and GPU allocation

    The official vllm/vllm-openai image includes everything needed for production inference: the vLLM engine, an OpenAI-compatible API server, and optimized CUDA kernels. This eliminates hours of manual dependency management and gives you a tested, working configuration from the start.

    Prerequisites

    Before deploying vLLM in Docker, make sure your environment meets the hardware and software requirements. Running LLM inference requires significant GPU resources, and proper driver configuration is essential for container GPU access.

    System Requirements

    vLLM requires an NVIDIA GPU with compute capability 7.0 or higher. This includes V100, T4, A10, A100, and H100 GPUs. For production workloads, we recommend GPUs with at least 24GB of VRAM to handle models like Llama-3.1-8B comfortably. Larger models like Llama-2-70B require multiple GPUs or quantization.

    Your host system needs Linux with kernel 5.4 or later. Ubuntu 22.04 LTS is the recommended distribution for production deployments. Windows users can experiment with WSL2, but production deployments should target native Linux.
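
    You can confirm your GPU meets the compute capability and memory requirements before installing anything else. The query below is a sketch; the compute_cap field needs a reasonably recent NVIDIA driver:

    # Report GPU model, total VRAM, and compute capability
    nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv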

    | Requirement | Minimum | Recommended | Notes |
    |---|---|---|---|
    | NVIDIA GPU | Compute 7.0+ (V100) | A100/H100 | Check with nvidia-smi |
    | GPU Memory | 8GB | 24GB+ | Per GPU |
    | Docker Engine | 20.10 | 24.0+ | docker --version |
    | NVIDIA Driver | 525+ | Latest stable | nvidia-smi |
    | NVIDIA Container Toolkit | 1.13+ | Latest | nvidia-ctk --version |
    | OS | Linux (kernel 5.4+) | Ubuntu 22.04 | Windows WSL2 experimental |
    | Disk Space | 50GB | 200GB+ | For model caching |
    | RAM | 16GB | 64GB+ | For model loading overhead |

    Installing NVIDIA Container Toolkit

    The NVIDIA Container Toolkit enables Docker containers to access host GPUs. Install it using your distribution's package manager:

    # Add NVIDIA package repository
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    
    # Install the toolkit
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    
    # Configure Docker runtime
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker

    Verify GPU access works inside containers:

    docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

    You should see your GPU listed in the output. If this command fails, check your NVIDIA driver installation and make sure the container toolkit is properly configured.

    Quick Start: Single-GPU Setup

    Getting vLLM running in Docker takes just a few commands. This section covers pulling the official image, running the server, and verifying your deployment works correctly.

    Pulling the Official Image

    The vllm/vllm-openai image provides a production-ready OpenAI-compatible API server. Pull the latest version:

    # Pull the official vLLM OpenAI-compatible server image
    docker pull vllm/vllm-openai:latest
    
    # Verify the image was downloaded
    docker images | grep vllm

    The image is approximately 9GB and includes all dependencies for GPU inference. For production deployments, pin to a specific version tag rather than latest to get reproducible builds.
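
    For example, pulling a pinned release instead of latest (the tag below is illustrative; check Docker Hub for the current release list):

    # Pin a specific vLLM release for reproducible deployments
    docker pull vllm/vllm-openai:v0.6.3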

    Running vLLM with Docker

    The following command starts vLLM serving a Llama model. Everything after the image name is passed as arguments to the OpenAI-compatible API server inside the container:

    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --host 0.0.0.0 \
        --port 8000

    Each flag serves a specific purpose in enabling GPU inference:

    | Flag | Purpose | Required | Example |
    |---|---|---|---|
    | --runtime nvidia | Use NVIDIA container runtime | Yes | --runtime nvidia |
    | --gpus all | Enable access to all GPUs | Yes | --gpus all |
    | --gpus '"device=0,1"' | Enable specific GPUs | Alternative | --gpus '"device=0,1"' |
    | --ipc=host | Share host IPC namespace for multiprocessing | Yes | --ipc=host |
    | -v ~/.cache/huggingface:/root/.cache/huggingface | Persist downloaded models | Recommended | See example |
    | -p 8000:8000 | Expose vLLM API port | Yes | -p 8000:8000 |
    | -e HF_TOKEN=xxx | HuggingFace token for gated models | Conditional | -e HF_TOKEN=hf_xxx |
    | --shm-size=1g | Set shared memory size (alternative to --ipc=host) | Optional | --shm-size=1g |
    | --name vllm | Assign container name | Recommended | --name vllm-server |
    | -d | Run in detached mode | Optional | -d |
    | --restart unless-stopped | Auto-restart policy | Production | --restart unless-stopped |

    The --ipc=host flag is particularly important. vLLM uses shared memory for communication between processes during inference. Without this flag, you'll hit cryptic IPC errors during model loading or inference.

    Warning: The --ipc=host flag shares the host's IPC namespace with the container. For environments with strict security requirements, use --shm-size=1g as an alternative, though this may limit performance with very large batch sizes.
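
    Combining the recommended flags from the table, a more complete command for a long-lived server might look like this (the token value is a placeholder):

    docker run -d --name vllm-server \
        --runtime nvidia --gpus all \
        --ipc=host \
        --restart unless-stopped \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        -e HF_TOKEN=hf_your_token_here \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --host 0.0.0.0 \
        --port 8000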

    Verifying the Deployment

    Wait for the model to load. First startup downloads the model weights from HuggingFace, which can take several minutes depending on your connection speed. Subsequent startups load from the cached volume.
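
    If the container is running in detached mode, follow the logs to watch loading progress (assuming the container is named vllm-server):

    # Follow server logs until the model finishes loading
    docker logs -f vllm-server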

    To recap, the end-to-end single-GPU deployment flow:

    1. Install Docker Engine (20.10+) on your Linux host
    2. Install NVIDIA Container Toolkit for GPU support
    3. Pull the official image: docker pull vllm/vllm-openai:latest
    4. Run with GPU access: docker run --runtime nvidia --gpus all
    5. Add required flags: --ipc=host for shared memory
    6. Mount HuggingFace cache for model persistence
    7. Expose port 8000 for API access
    8. Verify with health check: curl http://localhost:8000/health

    Check the health endpoint:

    curl http://localhost:8000/health

    A healthy server returns a 200 status code. For more detail, use the readiness endpoint:

    curl http://localhost:8000/health/ready

    Test the API with a chat completion request:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "user", "content": "Hello, how are you?"}
            ],
            "max_tokens": 100
        }'

    You should receive a JSON response with the model's completion. The API is fully OpenAI-compatible, so existing OpenAI client libraries work without modification.
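
    You can also list the models the server exposes, which is a quick way to confirm the exact model name to use in requests:

    curl http://localhost:8000/v1/models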

    Docker Compose for Development

    Managing Docker run commands with many flags gets unwieldy during development. Docker Compose provides a cleaner way to define your vLLM configuration and makes iterating on settings much faster.

    Basic docker-compose.yml

    Create a docker-compose.yml file for your development environment:

    version: "3.8"
    
    services:
      vllm:
        image: vllm/vllm-openai:latest
        runtime: nvidia
        ports:
          - "8000:8000"
        volumes:
          - huggingface-cache:/root/.cache/huggingface
        environment:
          - HF_TOKEN=${HF_TOKEN}
        ipc: host
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
        command: >
          --model meta-llama/Llama-3.1-8B-Instruct
          --host 0.0.0.0
          --port 8000
          --gpu-memory-utilization 0.9
    
    volumes:
      huggingface-cache:

    Start the service with:

    docker compose up

    The Compose file centralizes your configuration, making it easy to track changes in version control and share setups across your team.

    Environment Variables Configuration

    Sensitive values like HuggingFace tokens should not appear in your Compose file. Create a .env file in the same directory:

    # .env file for Docker Compose
    HF_TOKEN=hf_your_huggingface_token_here
    VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct

    Docker Compose automatically loads variables from .env. Add this file to your .gitignore to prevent committing secrets.

    For gated models like Llama, you must set HF_TOKEN to a valid HuggingFace access token. Generate one at https://huggingface.co/settings/tokens and make sure you've accepted the model's license agreement.

    Volume Mounting for Model Persistence

    The named volume huggingface-cache persists downloaded models between container restarts. Without this volume, Docker downloads model weights on every startup, wasting bandwidth and time.

    Named volumes have advantages over bind mounts for model storage. Docker manages the volume location, permissions work correctly without manual configuration, and the volume persists even if you remove the container.

    For development, you might prefer a bind mount to share a local cache:

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface

    This approach lets you use models downloaded outside Docker and saves disk space by avoiding duplicate caches.

    Multi-GPU Configurations

    Single-GPU deployments hit limits quickly with larger models. A Llama-2-70B model requires approximately 140GB of memory in FP16, far more than a single 80GB A100 or H100 provides. Multi-GPU configurations with tensor parallelism distribute the model across devices, letting you deploy models that would not otherwise fit.

    Understanding Tensor Parallelism

    Tensor parallelism splits model layers across multiple GPUs within a single node. Each GPU holds a portion of each layer's weights and computes its share of the output. The GPUs communicate intermediate results through high-bandwidth interconnects like NVLink.


    This approach works well when your model exceeds single-GPU memory but fits across available devices. For a 70B parameter model, tensor parallelism across 4 A100-80GB GPUs puts roughly 35GB of weights on each device, leaving headroom for KV cache and activations.

    The trade-off is communication overhead. Every forward pass requires synchronization between GPUs, adding latency. NVLink reduces this overhead significantly compared to PCIe, so make sure your hardware configuration supports direct GPU-to-GPU links.

    vLLM also supports pipeline parallelism for distributing layers sequentially across GPUs and expert parallelism for Mixture-of-Experts models like Mixtral. But tensor parallelism remains the most common choice for single-node multi-GPU deployments because it balances simplicity and performance well.

    Docker with Multiple GPUs

    Enable tensor parallelism by adding the --tensor-parallel-size flag:

    # Run Llama-2-70B with tensor parallelism across 4 GPUs
    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-2-70b-hf \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.9 \
        --host 0.0.0.0

    vLLM automatically distributes the model across the specified number of GPUs. The --gpus all flag makes all host GPUs visible to the container.

    To select specific GPUs, use NVIDIA's device specification syntax:

    # Use only GPUs 0 and 1
    docker run --runtime nvidia --gpus '"device=0,1"' \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-2-13b-hf \
        --tensor-parallel-size 2

    The quote escaping is intentional: the outer single quotes stop the shell from stripping the inner double quotes, which Docker needs in order to treat the comma-separated device list as a single value.

    Memory Optimization Strategies

    Even with multiple GPUs, memory pressure remains a concern. Several strategies help maximize available headroom.

    Quantization reduces model precision to save memory. AWQ and GPTQ provide 4-bit quantization with minimal quality loss:

    # Deploy a 4-bit quantized model (AWQ)
    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model TheBloke/Llama-2-70B-AWQ \
        --quantization awq \
        --tensor-parallel-size 2 \
        --gpu-memory-utilization 0.95

    Quantized models fit on fewer GPUs and leave more memory for KV cache, improving batch sizes and throughput.

    Limiting context length with --max-model-len reduces KV cache requirements. If your application only needs 4096-token contexts, setting this limit prevents vLLM from allocating cache for the model's full context window.

    Tuning --gpu-memory-utilization controls how much GPU memory vLLM reserves. The default of 0.9 leaves 10% headroom for CUDA kernels and fragmentation. Increase to 0.95 for memory-constrained setups, but watch for out-of-memory errors under load.
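
    As a sketch, both settings combined for a deployment that only needs 4K-token contexts (values are illustrative; tune them against your own traffic):

    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --max-model-len 4096 \
        --gpu-memory-utilization 0.95 \
        --host 0.0.0.0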

    Production Docker Compose Stack

    Development configurations lack the resilience and observability needed for production. A production stack adds monitoring, proper health checks, restart policies, and resource constraints.

    vLLM with Prometheus and Grafana

    The following Docker Compose file deploys vLLM alongside Prometheus for metrics collection and Grafana for visualization:

    version: "3.8"
    
    services:
      vllm:
        image: vllm/vllm-openai:latest
        runtime: nvidia
        ports:
          - "8000:8000"
        volumes:
          - huggingface-cache:/root/.cache/huggingface
        environment:
          - HF_TOKEN=${HF_TOKEN}
        ipc: host
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
            limits:
              memory: 32G
        command: >
          --model meta-llama/Llama-3.1-8B-Instruct
          --host 0.0.0.0
          --port 8000
          --tensor-parallel-size 2
          --gpu-memory-utilization 0.9
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 300s
        restart: unless-stopped
        networks:
          - vllm-network
    
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
        restart: unless-stopped
        networks:
          - vllm-network
    
      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        volumes:
          - grafana-data:/var/lib/grafana
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
        restart: unless-stopped
        networks:
          - vllm-network
        depends_on:
          - prometheus
    
    networks:
      vllm-network:
        driver: bridge
    
    volumes:
      huggingface-cache:
      prometheus-data:
      grafana-data:

    Create the Prometheus configuration file prometheus.yml:

    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: 'vllm'
        static_configs:
          - targets: ['vllm:8000']
        metrics_path: /metrics
        scrape_interval: 10s

    Health Checks and Restart Policies

    The health check configuration deserves special attention. LLM models take minutes to load, so the start_period must accommodate this startup time. A 300-second start period prevents Docker from killing the container before the model finishes loading.

    The restart: unless-stopped policy means the service recovers from crashes but allows manual stops for maintenance. For mission-critical deployments, consider restart: always with proper alerting on restart events.

    Resource Limits and Reservations

    The deploy.resources section controls container resource allocation. GPU reservations give vLLM exclusive access to required devices. Memory limits prevent runaway processes from affecting other services on the host.

    Set memory limits based on your model's requirements plus overhead for the Python runtime and HTTP server. For an 8B parameter model, 32GB provides comfortable headroom. Monitor actual usage and adjust accordingly.

    Tip: The count: all GPU reservation claims all available GPUs. For multi-tenant hosts, specify an explicit count or use device_ids to assign specific GPUs to each container.
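
    For example, a reservation that pins the container to two specific GPUs (a sketch of the Compose device_ids syntax; note that device_ids and count are mutually exclusive):

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]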

    Kubernetes Deployment

    Kubernetes provides solid orchestration for production vLLM deployments. GPU scheduling, rolling updates, and integration with cluster-wide monitoring make Kubernetes the preferred platform for teams already invested in the ecosystem.

    Basic Deployment YAML

    The following manifest deploys vLLM with GPU resources, proper probes, and secrets management:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm
      labels:
        app: vllm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:latest
              args:
                - "--model"
                - "meta-llama/Llama-3.1-8B-Instruct"
                - "--host"
                - "0.0.0.0"
                - "--port"
                - "8000"
                - "--gpu-memory-utilization"
                - "0.9"
              ports:
                - containerPort: 8000
              resources:
                limits:
                  nvidia.com/gpu: 1
                  memory: "32Gi"
                requests:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: token
              volumeMounts:
                - name: cache
                  mountPath: /root/.cache/huggingface
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 300
                periodSeconds: 30
                timeoutSeconds: 10
              readinessProbe:
                httpGet:
                  path: /health/ready
                  port: 8000
                initialDelaySeconds: 300
                periodSeconds: 10
                timeoutSeconds: 5
          volumes:
            - name: cache
              persistentVolumeClaim:
                claimName: hf-cache-pvc

    Create the HuggingFace token secret before applying the deployment:

    kubectl create secret generic hf-secret --from-literal=token=hf_your_token_here
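
    Then apply the manifest and watch the rollout (assuming the Deployment is saved as vllm-deployment.yaml):

    kubectl apply -f vllm-deployment.yaml
    kubectl rollout status deployment/vllm
    kubectl logs -f deployment/vllm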

    Service and Ingress Configuration

    Expose the deployment with a Service:

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-service
    spec:
      selector:
        app: vllm
      ports:
        - port: 8000
          targetPort: 8000
      type: ClusterIP

    For external access, create an Ingress or use a LoadBalancer service type. Consider adding authentication at the Ingress layer since vLLM does not provide built-in authentication.

    The ClusterIP type keeps the service internal to the cluster, which is appropriate when you route traffic through an API gateway or Ingress controller. For direct external access during development, change the type to LoadBalancer, though this exposes the API without authentication.

    Production deployments should implement authentication through an API gateway like Kong or Ambassador, or use an Ingress controller with OAuth2 proxy integration. vLLM accepts an --api-key flag for basic token authentication, but real authentication belongs at the infrastructure layer.
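
    If you do enable --api-key, clients must present the key as a bearer token (the key below is a placeholder):

    curl http://vllm-service:8000/v1/models \
        -H "Authorization: Bearer your-secret-key"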

    Horizontal Pod Autoscaling

    Scaling LLM inference horizontally requires careful thought. Unlike stateless web services, each vLLM replica loads the full model into GPU memory. Scaling adds replicas rather than distributing load across existing resources.

    Standard CPU and memory metrics don't capture LLM serving load effectively. Instead, scale on custom metrics like queue depth:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: vllm-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: vllm
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: vllm_num_requests_waiting
            target:
              type: AverageValue
              averageValue: "50"

    This requires the Prometheus Adapter to expose vLLM metrics as Kubernetes custom metrics. Configure carefully to avoid over-scaling, which consumes expensive GPU resources.

    Note: The vLLM project provides Helm charts in examples/online_serving/chart-helm/ for more sophisticated deployments including StatefulSets and PodDisruptionBudgets.

    Monitoring and Observability

    vLLM exposes Prometheus metrics that reveal inference performance, resource utilization, and potential bottlenecks. Good monitoring catches problems before they impact users.

    Key Prometheus Metrics

    vLLM exposes metrics at the /metrics endpoint in Prometheus format. The most important metrics for operational visibility:

    | Metric | Type | Description | Alert Threshold |
    |---|---|---|---|
    | vllm:num_requests_running | Gauge | Currently processing requests | N/A (info) |
    | vllm:num_requests_waiting | Gauge | Requests in queue | > 100 |
    | vllm:num_requests_swapped | Gauge | Requests swapped to CPU | > 0 |
    | vllm:gpu_cache_usage_perc | Gauge | KV cache utilization | > 95% |
    | vllm:cpu_cache_usage_perc | Gauge | CPU swap cache utilization | > 50% |
    | vllm:time_to_first_token_seconds | Histogram | TTFT latency | p99 > 5s |
    | vllm:time_per_output_token_seconds | Histogram | TPOT latency | p99 > 0.1s |
    | vllm:e2e_request_latency_seconds | Histogram | Total request latency | p99 > 30s |
    | vllm:num_preemptions_total | Counter | Total request preemptions | Increasing trend |
    | vllm:prompt_tokens_total | Counter | Total input tokens processed | N/A (throughput) |
    | vllm:generation_tokens_total | Counter | Total output tokens generated | N/A (throughput) |

    Request metrics show workload patterns. num_requests_running indicates active processing, while num_requests_waiting reveals queue buildup. A steadily growing waiting count signals insufficient capacity.

    Latency histograms expose user-facing performance. time_to_first_token_seconds measures how quickly users see initial output. e2e_request_latency_seconds captures total request time including generation.

    Cache metrics indicate memory pressure. gpu_cache_usage_perc approaching 100% means the KV cache is full, forcing request rejection or swapping.
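
    A quick spot check from the host confirms the endpoint is live and shows the current queue state:

    curl -s http://localhost:8000/metrics | grep vllm:num_requests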

    Grafana Dashboard Setup

    vLLM provides pre-built Grafana dashboards in the repository under examples/online_serving/dashboards/. Import these into your Grafana instance for immediate visibility.

    Configure Prometheus as a data source in Grafana, then import the dashboard JSON. The dashboards include panels for request rates, latency percentiles, cache utilization, and GPU metrics.

    For custom dashboards, start with these essential panels:

    • Request rate (requests/second)
    • p50/p95/p99 time-to-first-token latency
    • p50/p95/p99 end-to-end latency
    • Queue depth over time
    • GPU cache utilization percentage

    Time-to-first-token (TTFT) latency deserves particular attention for interactive applications. Users perceive responsiveness through how quickly the first token appears, not total generation time. A dashboard panel showing TTFT percentiles helps identify prefill bottlenecks that affect user experience.

    For batch processing workloads, focus on throughput metrics instead. Track tokens generated per second and requests completed per minute. These metrics reveal whether your deployment efficiently utilizes GPU resources during sustained load.
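
    As a starting point, the latency and throughput panels can be built from the exported metrics with standard PromQL (a sketch; adjust the rate window to your scrape interval):

    # p99 time-to-first-token over the last 5 minutes
    histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

    # Output tokens generated per second
    rate(vllm:generation_tokens_total[5m])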

    Alerting Best Practices

    Configure alerts for conditions requiring operator attention:

    # alerts.yml
    groups:
      - name: vllm-alerts
        rules:
          - alert: VLLMHighQueueDepth
            expr: vllm:num_requests_waiting > 100
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High request queue depth"
              description: "vLLM has {{ $value }} requests waiting"
    
          - alert: VLLMHighCacheUsage
            expr: vllm:gpu_cache_usage_perc > 0.95
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "GPU cache near capacity"
              description: "KV cache utilization at {{ $value | humanizePercentage }}"
    
          - alert: VLLMHighLatency
            expr: histogram_quantile(0.99, vllm:e2e_request_latency_seconds_bucket) > 30
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High p99 latency"
              description: "p99 latency is {{ $value }}s"

    Set thresholds based on your SLOs. The examples above provide reasonable starting points, but adjust based on observed baseline performance and business requirements.

    Performance Tuning

    Default vLLM configurations work well for evaluation but leave performance on the table. Tuning memory utilization, enabling optimized backends, and right-sizing batch parameters can significantly improve throughput and latency.

    GPU Memory Utilization

    The --gpu-memory-utilization flag controls what fraction of GPU memory vLLM reserves for the KV cache and model weights. Higher values allow more concurrent sequences but risk out-of-memory errors under load.

    Start with the default 0.9 and increase incrementally while monitoring for OOM errors. For stable production deployments:

    --gpu-memory-utilization 0.85  # Conservative, very stable
    --gpu-memory-utilization 0.90  # Default, good balance
    --gpu-memory-utilization 0.95  # Aggressive, maximum throughput

    Higher utilization benefits batch workloads with predictable request patterns. Lower utilization provides safety margins for bursty traffic with variable sequence lengths.

    Quantization in Containers

    Quantization reduces model precision from 16-bit to 8-bit or 4-bit representations. This cuts memory requirements substantially, letting you run larger models or higher batch sizes on the same hardware.

    vLLM supports several quantization methods:

    • AWQ (Activation-aware Weight Quantization): 4-bit, minimal quality loss, fastest inference
    • GPTQ: 4-bit, good quality, widely available models
    • FP8: 8-bit float, excellent quality, requires Hopper/Ada GPUs

    Deploy with quantization:

    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model TheBloke/Llama-2-70B-AWQ \
        --quantization awq \
        --gpu-memory-utilization 0.95

    Quantized models must be pre-quantized using the specified method. Check HuggingFace for AWQ or GPTQ versions of popular models.

    Batch Size Optimization

    vLLM uses continuous batching to process multiple requests simultaneously. The --max-num-seqs parameter limits concurrent sequences:

    # Performance-optimized deployment for high throughput
    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -p 8000:8000 \
        --ipc=host \
        -e VLLM_ATTENTION_BACKEND=FLASHINFER \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 2 \
        --gpu-memory-utilization 0.92 \
        --max-model-len 8192 \
        --max-num-seqs 256 \
        --disable-log-requests

    Higher --max-num-seqs improves throughput at the cost of per-request latency. The optimal value depends on your traffic pattern. Batch-oriented offline processing benefits from high values, while interactive applications need lower settings for consistent response times.

    The --disable-log-requests flag turns off per-request logging to reduce overhead in high-throughput scenarios. Leave request logging on while debugging, and pass this flag in production unless you need request-level logs.

    | Argument | Default | Description | Use Case |
    |---|---|---|---|
    | --model | Required | HuggingFace model ID or local path | All deployments |
    | --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism | Multi-GPU setups |
    | --pipeline-parallel-size | 1 | Number of GPUs for pipeline parallelism | Multi-node setups |
    | --gpu-memory-utilization | 0.9 | Target GPU memory usage (0.0-1.0) | Memory tuning |
    | --max-model-len | Model default | Maximum context length | Memory saving |
    | --max-num-seqs | 256 | Maximum concurrent sequences | Throughput tuning |
    | --max-num-batched-tokens | Auto | Maximum tokens per batch | Memory/latency trade-off |
    | --quantization | None | Quantization method (awq, gptq, fp8, squeezellm) | Memory reduction |
    | --dtype | auto | Model data type (auto, half, float16, bfloat16) | Precision control |
    | --host | localhost | Bind address | Container deployment |
    | --port | 8000 | API port | Custom ports |
    | --api-key | None | API authentication key | Basic security |
    | --served-model-name | Model name | Custom model name in API | API customization |
    | --disable-log-requests | False | Disable request logging | Production performance |
    | --enable-prefix-caching | False | Enable automatic prefix caching | Repeated prompts |

    For production deployments handling hundreds of requests per second, also consider the VLLM_ATTENTION_BACKEND environment variable. The FlashInfer backend offers improved performance on newer GPUs and should be tested against the default FlashAttention backend for your specific hardware and model combination.

    Troubleshooting Common Issues

    vLLM Docker deployments encounter predictable failure modes. This section covers the most common errors and their solutions.

    GPU Not Detected

    Symptom: Container starts but fails with "No CUDA-capable device detected" or similar GPU errors.

    Causes and Solutions:

    1. Missing NVIDIA Container Toolkit: Reinstall following the prerequisites section. Verify with docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi.
    2. Missing --gpus flag: Make sure your Docker run command includes --gpus all or --gpus '"device=0"'.
    3. Driver mismatch: The container CUDA version must be compatible with your host NVIDIA driver. Check versions with nvidia-smi on the host and compare against the vLLM image requirements.
    4. Docker Compose syntax: For Compose files, use the deploy.resources.reservations syntax rather than --gpus flags.
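
    To check that the NVIDIA runtime is actually registered with the Docker daemon, inspect the daemon info (a quick sketch):

    # "nvidia" should appear in the Runtimes list
    docker info | grep -i runtime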

    CUDA Out of Memory

    Symptom: Model fails to load or crashes during inference with "CUDA out of memory" errors.

    Solutions:

    1. Reduce --gpu-memory-utilization from 0.9 to 0.8 or lower
    2. Enable quantization with --quantization awq or --quantization gptq
    3. Increase --tensor-parallel-size to spread the model across more GPUs
    4. Reduce --max-model-len to lower KV cache requirements
    5. Use a smaller model if hardware is genuinely insufficient

    Model Loading Failures

    Symptom: Container fails during model download with 401 errors or "model not found" messages.

    Solutions:

    1. Set HF_TOKEN for gated models like Llama. Make sure the token has read permissions.
    2. Accept the model license on HuggingFace before attempting to download.
    3. Check network connectivity from within the container. Some corporate networks block HuggingFace.
    4. Verify the model name is spelled correctly and exists on HuggingFace.

    | Error Message | Cause | Solution |
    |---|---|---|
    | CUDA out of memory | Model too large for GPU memory | Use quantization, increase --tensor-parallel-size, or reduce --max-model-len |
    | GPU not detected / No CUDA-capable device | Missing NVIDIA runtime | Install nvidia-container-toolkit, use --gpus all flag |
    | Model not found / 401 Unauthorized | Gated model without token | Set HF_TOKEN environment variable |
    | KeyboardInterrupt: terminated | K8s probe killed container | Increase probe initialDelaySeconds to 300+ |
    | IPC shared memory error | Missing IPC flag | Add --ipc=host or --shm-size=1g |
    | Connection refused on port 8000 | Server not ready | Wait for model loading (check logs), verify port mapping |
    | torch.cuda.OutOfMemoryError | Insufficient VRAM during inference | Lower --gpu-memory-utilization to 0.8 |
    | ValueError: Cannot use FlashAttention | Incompatible GPU architecture | Use --enforce-eager or upgrade GPU |
    | RuntimeError: NCCL error | Multi-GPU communication failure | Check NVLink status, try --disable-custom-all-reduce |
    | HTTPError: 403 Forbidden | Model license not accepted | Accept license on HuggingFace model page |

    Kubernetes Probe Timeouts

    Symptom: Pod repeatedly restarts with "KeyboardInterrupt: terminated" in logs.

    Cause: Kubernetes kills the container because health probes fail during model loading.

    Solution: Increase initialDelaySeconds to 300 or higher on both liveness and readiness probes. Large models like Llama-2-70B can take 5+ minutes to load even with fast storage.
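
    On recent Kubernetes versions, a startupProbe is a cleaner fix than a very long initialDelaySeconds, because it tolerates slow startup without delaying later liveness checks. A sketch that allows up to 10 minutes for model loading:

    startupProbe:
      httpGet:
        path: /health
        port: 8000
      failureThreshold: 60
      periodSeconds: 10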

    Slow Inference Performance

    Symptom: Model responds but latency is higher than expected based on benchmarks.

    Causes and Solutions:

    1. Missing CUDA graphs: By default, vLLM compiles CUDA graphs for faster execution. The first few requests after startup are slow while graphs compile. This is normal.
    2. Suboptimal batch configuration: Very low --max-num-seqs limits concurrent processing. Increase to allow better GPU utilization.
    3. Insufficient GPU memory: When --gpu-memory-utilization is too low, the KV cache cannot hold enough sequences for efficient batching. Increase utilization if memory allows.
    4. Network overhead: For remote model storage or slow HuggingFace downloads, subsequent inference is fast but initial requests hit network latency. Use local model caching to eliminate this delay.

    Conclusion

    Deploying vLLM with Docker turns complex LLM infrastructure into manageable, reproducible configurations. From single-GPU development setups to multi-GPU production stacks with monitoring, containerization provides the consistency and portability modern teams expect.

    The key takeaways from this guide:

    • Start with the official vllm/vllm-openai image and essential flags (--gpus all, --ipc=host, volume mounts)
    • Use Docker Compose to manage configuration complexity and enable rapid iteration
    • Scale to multi-GPU with tensor parallelism for models exceeding single-GPU memory
    • Build production stacks with Prometheus and Grafana for operational visibility
    • Deploy to Kubernetes with extended probe timeouts to accommodate model loading

    For teams who want the power of vLLM without managing Docker infrastructure, there's an easier path.

    Skip the Docker complexity. Deploy vLLM on Inference.net in minutes.

    Inference.net offers fully managed vLLM deployments with automatic scaling, built-in monitoring, and enterprise-grade reliability. Focus on building your application while Inference.net handles the infrastructure.

    Own your model. Scale with confidence.

    Schedule a call with our research team to learn more about custom training. We'll propose a plan that beats your current SLA and unit cost.