
vLLM: High-Throughput LLM Inference and Serving Setup

Complete setup guide for vLLM - a high-throughput and memory-efficient inference and serving engine for Large Language Models. Features PagedAttention for optimized memory management, continuous batching for maximum GPU utilization, and OpenAI-compatible API server. Supports 200+ model architectures with quantization, distributed inference, and multi-GPU deployment.

Step 1
System Prerequisites
vLLM is optimized for high-performance LLM inference on GPU-accelerated systems. Before installation, verify your environment meets the requirements. vLLM works best on Linux with NVIDIA GPUs, though experimental support exists for AMD GPUs, TPUs, and other accelerators.
```
# Check Python version (3.9-3.12 supported, 3.12 recommended)
python --version

# Check CUDA availability and version
nvidia-smi

# Check GPU compute capability (7.0+ recommended)
nvidia-smi --query-gpu=compute_cap --format=csv

# Verify system architecture
uname -m  # x86_64 recommended
```
⚠ Heads up: vLLM requires an NVIDIA GPU with CUDA for optimal performance. CPU-only mode is experimental and significantly slower. For production workloads, use GPUs with at least 16GB VRAM.
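
The compute-capability check above can also be scripted. The following is a minimal sketch, assuming `nvidia-smi` prints a `compute_cap` header followed by one `major.minor` value per GPU; the helper names are hypothetical and not part of vLLM.

```python
import subprocess

# Threshold taken from the guide above (7.0+ recommended).
MIN_COMPUTE_CAPABILITY = (7, 0)


def parse_compute_caps(csv_text):
    """Parse the CSV printed by `nvidia-smi --query-gpu=compute_cap --format=csv`."""
    caps = []
    for line in csv_text.strip().splitlines():
        line = line.strip()
        if not line or line.lower().startswith("compute_cap"):
            continue  # skip blank lines and the CSV header
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def gpus_meet_requirement(caps):
    """True only if at least one GPU was found and every GPU meets the threshold."""
    return bool(caps) and all(cap >= MIN_COMPUTE_CAPABILITY for cap in caps)


def query_compute_caps():
    """Run nvidia-smi; return an empty list if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_compute_caps(result.stdout)
```

Calling `gpus_meet_requirement(query_compute_caps())` returns `False` on machines without an NVIDIA driver, which makes it usable as a pre-install guard.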

Step 2

Installation via pip (Recommended)

The simplest installation method is via pip, which automatically handles PyTorch and CUDA dependencies. vLLM ships with pre-compiled binaries for CUDA 12.1 by default, with CUDA 11.8 binaries also available.

# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM with CUDA 12.1 (default)
pip install vllm

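# Or install a CUDA 11.8 build
# (wheel filenames vary between releases; confirm the exact name on the GitHub release page)
export VLLM_VERSION=0.6.5
export PYTHON_VERSION=312
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl \
  --extra-index-url https://download.pytorch.org/whl/cu118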

# Verify installation
python -c "import vllm; print(vllm.__version__)"

Step 3
Installation via Docker (Production Recommended)
Docker provides the most reliable deployment method, especially for production environments. The official vLLM Docker images come pre-configured with all dependencies, the CUDA runtime libraries, and optimized kernels; only the NVIDIA driver must be installed on the host.
```
# Pull the latest vLLM Docker image
docker pull vllm/vllm-openai:latest

# Run a simple test to verify GPU access
# (the image's entrypoint is the API server, so override it to run Python directly)
docker run --rm --gpus all \
  --entrypoint python3 \
  vllm/vllm-openai:latest \
  -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Run an interactive container for testing
docker run --rm --gpus all -it \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint /bin/bash \
  vllm/vllm-openai:latest
```
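⚠ Heads up: Ensure the NVIDIA Container Toolkit is installed on your host system. After adding NVIDIA's package repository, install it with: sudo apt-get install -y nvidia-container-toolkit. Then register the runtime and restart Docker: sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker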

Step 4

Installation from Source (For Development)

Building from source gives you access to the latest features and allows custom compilation flags. This method is recommended for contributors, researchers, or when you need bleeding-edge features.

# Install build dependencies
pip install --upgrade pip
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install wheel packaging ninja "setuptools-scm>=8"

# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Install in editable mode (for development)
pip install -e .

# Or build and install normally
pip install .

# Verify the installation
vllm --version

Step 5

Basic Offline Inference

The simplest way to use vLLM is through its Python API for offline batch inference. This mode is ideal for processing large datasets, benchmarking, or when you don't need a server.

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Generate outputs
prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to reverse a string:",
]

outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated_text}\n")

Step 6

Launch OpenAI-Compatible API Server

vLLM's most powerful feature is its OpenAI-compatible API server, allowing drop-in replacement for OpenAI services. This is the recommended deployment mode for production applications.

# Basic server launch
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Production-optimized launch
# (with --served-model-name, clients must request the model as "llama-3.2-3b")
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --max-num-seqs 256 \
  --api-key sk-your-secret-key \
  --served-model-name llama-3.2-3b

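⚠ Heads up: The --api-key parameter is critical for production. Never reuse a placeholder such as 'token-abc123' or 'sk-your-secret-key' in exposed deployments. Generate a secure key with: openssl rand -base64 32. Command-line arguments are visible in the process list, so consider supplying the key through the VLLM_API_KEY environment variable instead.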
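
If you prefer to generate the key from Python rather than with openssl, the standard library offers an equivalent. This is a small sketch; the `sk-` prefix only mirrors the placeholder used in this guide and is not required by vLLM.

```python
import secrets


def generate_api_key(num_bytes=32):
    """Return a URL-safe random key built from `num_bytes` of OS randomness."""
    return "sk-" + secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_api_key())
```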

Step 7

Docker-based API Server Deployment

Running the API server in Docker provides isolation, reproducibility, and simplified deployment. This is the recommended production setup, especially when using orchestration platforms like Kubernetes.

# Run vLLM API server with Docker
docker run -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9

# Check server logs
docker logs -f vllm-server

# Test the server
curl http://localhost:8000/v1/models

Step 8

Query the API Server

Once the server is running, you can query it using curl, the OpenAI Python SDK, or any HTTP client. The server implements OpenAI's Completions and Chat Completions endpoints, so migration mostly means changing the base URL, API key, and model name.

# Test with curl - chat completions
# ("model" must match --served-model-name if the server was started with one)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-secret-key" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

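# List available models (include the API key if the server was started with --api-key)
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-your-secret-key"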
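
The same chat request can be sent from Python using only the standard library. The sketch below mirrors the curl call above; the function names are hypothetical, and the base URL, key, and model name are the placeholders used in this guide.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message, max_tokens=500):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("http://localhost:8000/v1", "sk-your-secret-key", "meta-llama/Llama-3.2-3B-Instruct", "Explain PagedAttention"))` returns the generated text.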

Step 9

Use OpenAI Python SDK with vLLM

The easiest way to integrate vLLM into existing applications is via the OpenAI Python SDK. Point base_url at your vLLM server and set api_key and model to the values the server was launched with; the rest of the client code stays the same.

from openai import OpenAI

# Configure client to use vLLM server
client = OpenAI(
    api_key="sk-your-secret-key",
    base_url="http://localhost:8000/v1"
)

# Chat completions (streaming)
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about AI"}
    ],
    stream=True,
    temperature=0.7
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

print()

Step 10

Multi-GPU Deployment with Tensor Parallelism

For larger models that don't fit in a single GPU, vLLM supports tensor parallelism to distribute the model across multiple GPUs. This enables serving models like Llama-70B or Mixtral-8x7B efficiently.

# Launch with 4-way tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000

# Docker deployment with multi-GPU
# (--ipc=host gives the container access to host shared memory, which tensor-parallel workers use)
docker run -d \
  --gpus '"device=0,1,2,3"' \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# Check GPU utilization
watch -n 1 nvidia-smi

⚠ Heads up: Tensor parallelism is communication-heavy, so it works best when all GPUs are on the same node with fast interconnects (NVLink recommended). To span multiple nodes, combine it with pipeline parallelism via --pipeline-parallel-size.
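
To make the idea concrete, here is a toy illustration of column-wise tensor parallelism in plain Python: the weight matrix is split by columns across "GPUs", each shard computes its slice of the output, and the slices are concatenated. It is a conceptual sketch with hypothetical helper names, not how vLLM implements its parallel layers.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split a weight matrix into `num_shards` equal column blocks."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Compute x @ w by giving each shard its own slice of the output columns."""
    shards = split_columns(w, num_shards)
    # Each partial result would be computed on a different GPU.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Concatenate the per-shard slices row by row (the "all-gather" step).
    return [sum((part[r] for part in partial_outputs), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(column_parallel_matmul(x, w, num_shards=2) == matmul(x, w))
```

Each shard stores only its own columns of the weights, which is why per-GPU weight memory shrinks roughly in proportion to the tensor-parallel size.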

Step 11

Model Quantization for Memory Efficiency

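vLLM supports several quantization methods that reduce memory footprint and can increase throughput. Relative to 16-bit weights, 8-bit quantization roughly halves weight memory and 4-bit quantization cuts it by about 75%, usually with modest quality loss; the freed memory can go to the KV cache, enabling larger batch sizes.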

# AWQ 4-bit quantization (pre-quantized model)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --dtype half

# GPTQ 4-bit quantization
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq

# FP8 quantization (NVIDIA H100/Ada GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8

# Check GPU memory usage while the model is loaded
# (vLLM also reports the memory taken by the model weights in its startup log)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
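
The memory savings follow from simple arithmetic on bits per weight. The sketch below estimates weight memory only; it ignores quantization metadata (scales, zero points), activations, and the KV cache, so treat the results as lower bounds.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def reduction_vs_fp16(bits_per_weight):
    """Fraction of weight memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(70e9, bits)
        saved = reduction_vs_fp16(bits)
        print(f"70B parameters at {bits}-bit: {size:.0f} GB ({saved:.0%} saved)")
```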

Step 12

Advanced Configuration Options

Fine-tune vLLM behavior with advanced parameters for memory management, scheduling, and performance optimization. These settings can dramatically impact throughput and latency based on your workload.

# Full production configuration example
# (--trust-remote-code is omitted: it executes code shipped in the model repository,
#  so add it only for models that require it and that you trust)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --disable-log-requests \
  --dtype bfloat16 \
  --kv-cache-dtype auto \
  --max-parallel-loading-workers 4
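
When choosing --gpu-memory-utilization, --max-model-len, and --max-num-seqs, it helps to estimate how many tokens of KV cache fit on the GPU. The following is a rough sketch: the formula counts one key and one value vector per layer and KV head, and the model dimensions used in the example are assumed values that you should replace with those from your model's config.json. It ignores activation memory and other runtime overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_memory_gb, gpu_memory_utilization, weights_gb, bytes_per_token):
    """Tokens that fit in the memory left after the weights (0 if nothing is left)."""
    budget_bytes = (gpu_memory_gb * gpu_memory_utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


if __name__ == "__main__":
    # Assumed dimensions for illustration only.
    per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
    tokens = max_cached_tokens(24, 0.95, 6.5, per_token)
    print(f"{per_token} bytes per token, about {tokens} cacheable tokens")
    print(f"about {tokens // 8192} concurrent sequences at the full 8192-token length")
```

If the estimate is far below max-num-seqs multiplied by the typical sequence length, requests will queue or be preempted, and lowering --max-model-len or --max-num-seqs is usually the first adjustment to try.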

Step 13

Configure Sampling Parameters

Control generation behavior with sampling parameters that affect output quality, diversity, and speed. These can be set per-request or as server defaults.

from vllm import SamplingParams

# Deterministic sampling (temperature=0)
sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=1024
)

# Creative sampling
sampling_params = SamplingParams(
    temperature=0.9,
    top_p=0.95,
    top_k=50,
    max_tokens=2048,
    presence_penalty=0.1,
    frequency_penalty=0.1
)

# Constrained generation with stop sequences
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    stop=["\n\n", "```", "END"],
    skip_special_tokens=True
)
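
To see what temperature and top_p actually do, the sketch below applies them to a small list of logits in plain Python. It illustrates the general technique (temperature-scaled softmax followed by nucleus filtering) with hypothetical function names; it is not vLLM's sampler, and greedy decoding (temperature=0) is a separate argmax case that this sketch does not cover.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    probs = softmax_with_temperature([2.0, 1.0, 0.0, -1.0], temperature=0.7)
    print(top_p_filter(probs, top_p=0.9))
```

The returned dictionary maps token indices to renormalized probabilities; a sampler would then draw the next token from it.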

Step 14
Understanding PagedAttention
PagedAttention is vLLM's core innovation: it manages KV cache memory in fixed-size, non-contiguous blocks (like virtual memory paging) instead of pre-allocating one large contiguous chunk per sequence. This nearly eliminates memory fragmentation and lets blocks be allocated on demand as sequences grow; the vLLM paper reports 2-4x higher throughput at comparable latency than earlier systems such as FasterTransformer and Orca. Combined with continuous batching, vLLM keeps the GPU busy by admitting new requests as soon as others complete.
```
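# Monitor PagedAttention block allocation
# The server logs periodically report KV-cache usage.
# Debug logging is controlled by the VLLM_LOGGING_LEVEL environment variable;
# --num-gpu-blocks-override is intended for testing, so omit it in normal use.
VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --block-size 16 \
  --num-gpu-blocks-override 2048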

# Key metrics to watch:
# - GPU blocks used / total
# - Number of running sequences
# - Batched token throughput
```
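
The paging idea can be sketched with a toy block allocator: each sequence owns a block table that maps its logical blocks to whatever physical blocks happen to be free, and a new block is claimed only when the previous one is full. This is a simplified illustration with a hypothetical class, not vLLM's actual block manager.

```python
class BlockAllocator:
    """Toy KV-cache allocator that hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one more token of the given sequence."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8, block_size=16)
    for _ in range(20):
        allocator.append_token("request-a")
    allocator.append_token("request-b")
    print(allocator.block_tables)  # blocks need not be contiguous
    allocator.free("request-a")
    print(len(allocator.free_blocks), "blocks free")
```

Because a sequence never holds more than one partially filled block, wasted memory is bounded by one block per sequence rather than by the maximum sequence length.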

Step 15

Production Deployment Best Practices

For production deployments, implement proper security, monitoring, and scaling strategies. Use a reverse proxy for TLS termination, implement request rate limiting, and monitor GPU utilization and error rates.

# NGINX reverse proxy configuration
server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        
        # Increase timeout for long generations
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;

        # Disable buffering so streamed responses reach clients token by token
        proxy_buffering off;
    }
}

Step 16

Monitoring and Observability

vLLM exposes Prometheus metrics for monitoring server health, request latency, KV-cache usage, and throughput. Integrate with your existing observability stack for production monitoring.

# The Prometheus /metrics endpoint is served by default; the flags below only reduce log noise
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --disable-log-requests \
  --uvicorn-log-level warning

# Metrics available at http://localhost:8000/metrics
curl http://localhost:8000/metrics

# Key metrics:
# - vllm:num_requests_running
# - vllm:num_requests_waiting
# - vllm:gpu_cache_usage_perc
# - vllm:time_to_first_token_seconds
# - vllm:time_per_output_token_seconds

Step 17

Benchmarking Performance

Measure your vLLM deployment's throughput and latency to tune configuration parameters for your specific hardware and workload. The commands below run a basic load test with Apache Bench; the vLLM repository also contains dedicated benchmark scripts in its benchmarks/ directory.

# Install Apache Bench (Debian/Ubuntu package name)
sudo apt-get install -y apache2-utils

# Start the server in the background
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct &

# Wait until the server reports healthy
until curl -sf http://localhost:8000/health > /dev/null; do sleep 5; done

# Write the request body used by the load test
cat > request.json <<'EOF'
{
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "prompt": "Once upon a time",
  "max_tokens": 100
}
EOF

# Send a single request as a smoke test
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d @request.json

# Use Apache Bench for load testing (1000 requests, 10 concurrent)
ab -n 1000 -c 10 -p request.json \
  -T application/json \
  http://localhost:8000/v1/completions
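
Once you have per-request timings from a load test, the numbers worth reporting are throughput and tail latency. The sketch below summarizes a list of measured latencies and output token counts; the function names are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, output_tokens, wall_time_s):
    """Summarize a benchmark run: request rate, token throughput, and latency tails."""
    return {
        "requests_per_s": len(latencies_s) / wall_time_s,
        "output_tokens_per_s": sum(output_tokens) / wall_time_s,
        "p50_latency_s": percentile(latencies_s, 50),
        "p95_latency_s": percentile(latencies_s, 95),
    }


if __name__ == "__main__":
    latencies = [1.2, 0.9, 1.5, 2.8, 1.1]   # seconds per request (example data)
    tokens = [100, 100, 100, 100, 100]      # generated tokens per request
    print(summarize(latencies, tokens, wall_time_s=4.0))
```

Comparing these summaries across runs with different --max-num-seqs or --gpu-memory-utilization values shows which setting helps your workload.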

Step 18

Common Use Cases and Patterns

vLLM excels at various LLM serving scenarios. Here are practical examples for common production patterns including batch processing, real-time chat, and embeddings generation.

# Use Case 1: Batch document summarization
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
documents = [...]  # Load your documents
prompts = [f"Summarize: {doc}" for doc in documents]
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))

# Use Case 2: Real-time streaming chat
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-key")
for chunk in client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="")

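# Use Case 3: Embeddings generation
# (requires a vLLM server started with an embedding model such as BAAI/bge-large-en-v1.5;
#  a server launched with a chat model does not serve this model name)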
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Your text to embed"
)
vector = response.data[0].embedding

Step 19

Troubleshooting Common Issues

Solutions to frequently encountered problems when deploying vLLM. Most issues relate to GPU memory, CUDA compatibility, or model loading errors.

# Issue: CUDA out of memory
# Solution: Reduce gpu-memory-utilization or max-num-seqs
python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --gpu-memory-utilization 0.8 \
  --max-num-seqs 128

# Issue: Model download fails
# Solution: Set Hugging Face token
export HUGGING_FACE_HUB_TOKEN=<your-token>

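# Issue: Slow startup before the first request is served
# Explanation: vLLM loads the model weights and captures CUDA graphs at startup,
# before the server accepts requests; wait until /health responds.
# Use --enforce-eager to skip CUDA graph capture (faster startup, lower throughput)
curl http://localhost:8000/health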

# Issue: CUDA version mismatch
# Solution: Check CUDA version and install matching vLLM build
nvcc --version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Enable debug logging (set through the VLLM_LOGGING_LEVEL environment variable)
VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server \
  --model <model>

⚠ Heads up: If you encounter 'ImportError: cannot import name X from vllm', your installation may be corrupted. Uninstall with pip uninstall vllm -y and reinstall from scratch.

Step 20

Updating vLLM

vLLM is actively developed with frequent releases adding new features, model support, and performance improvements. Stay updated to benefit from the latest optimizations and bug fixes.

# Update via pip
pip install --upgrade vllm

# Update Docker image
docker pull vllm/vllm-openai:latest

# Update from source
cd vllm
git pull origin main
pip install -e .

# Check current version
python -c "import vllm; print(vllm.__version__)"

# View the notes of the latest release
curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest \
  | python -c "import json, sys; print(json.load(sys.stdin)['body'])"