vLLM: High-Throughput LLM Inference and Serving Setup
Complete setup guide for vLLM - a high-throughput and memory-efficient inference and serving engine for Large Language Models. Features PagedAttention for optimized memory management, continuous batching for maximum GPU utilization, and OpenAI-compatible API server. Supports 200+ model architectures with quantization, distributed inference, and multi-GPU deployment.
- Step 1
System Prerequisites
vLLM is optimized for high-performance LLM inference on GPU-accelerated systems. Before installation, verify your environment meets the requirements. vLLM works best on Linux with NVIDIA GPUs, though experimental support exists for AMD GPUs, TPUs, and other accelerators.
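The same checks can be scripted so they run before an install. The following is a minimal sketch, not part of vLLM: `check_environment`, `MIN_PY`, and `MAX_PY` are hypothetical names, and the version range is simply the 3.9-3.12 range quoted in this guide.

```python
import platform
import sys

# Range taken from this guide (3.9-3.12); adjust for your vLLM release.
MIN_PY = (3, 9)
MAX_PY = (3, 12)


def check_environment(py_version, machine):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not (MIN_PY <= tuple(py_version[:2]) <= MAX_PY):
        problems.append(
            f"Python {py_version[0]}.{py_version[1]} is outside "
            f"{MIN_PY[0]}.{MIN_PY[1]}-{MAX_PY[0]}.{MAX_PY[1]}"
        )
    if machine not in ("x86_64", "AMD64"):
        problems.append(f"architecture {machine!r} is not x86_64")
    return problems


if __name__ == "__main__":
    issues = check_environment(sys.version_info, platform.machine())
    print("OK" if not issues else "\n".join(issues))
```

GPU checks are left to nvidia-smi, since querying the driver from Python would need extra packages.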
# Check Python version (3.9-3.12 supported, 3.12 recommended)
python --version

# Check CUDA availability and version
nvidia-smi

# Check GPU compute capability (7.0+ recommended)
nvidia-smi --query-gpu=compute_cap --format=csv

# Verify system architecture
uname -m  # x86_64 recommended

⚠ Heads up: vLLM requires an NVIDIA GPU with CUDA for optimal performance. CPU-only mode is experimental and significantly slower. For production workloads, use GPUs with at least 16GB VRAM.
- Step 2
Installation via pip (Recommended)
The simplest installation method is via pip, which automatically handles PyTorch and CUDA dependencies. vLLM ships with pre-compiled binaries for CUDA 12.1 by default, with CUDA 11.8 binaries also available.
# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM with CUDA 12.1 (default)
pip install vllm

# Or install with CUDA 11.8
# (wheel filenames vary by release; confirm the exact name on the GitHub release page)
export VLLM_VERSION=0.6.5
export PYTHON_VERSION=312
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl

# Verify installation
python -c "import vllm; print(vllm.__version__)"
- Step 3
Installation via Docker (Production Recommended)
Docker provides the most reliable deployment method, especially for production environments. The official vLLM Docker images come pre-configured with all dependencies, the CUDA runtime libraries, and optimized kernels; the NVIDIA driver itself must be installed on the host.
# Pull the latest vLLM Docker image
docker pull vllm/vllm-openai:latest

# Run a simple test to verify GPU access
# (the image's default entrypoint is the API server, so override it)
docker run --gpus all \
    --entrypoint python3 \
    vllm/vllm-openai:latest \
    -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Run interactive container for testing
docker run --gpus all -it \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --entrypoint /bin/bash \
    vllm/vllm-openai:latest

⚠ Heads up: Ensure NVIDIA Container Toolkit is installed on your host system. Install with: sudo apt-get install -y nvidia-container-toolkit && sudo systemctl restart docker
- Step 4
Installation from Source (For Development)
Building from source gives you access to the latest features and allows custom compilation flags. This method is recommended for contributors, researchers, or when you need bleeding-edge features.
# Install build dependencies
# (the version specifier is quoted so the shell does not treat ">" as a redirect)
pip install --upgrade pip
pip install wheel packaging ninja "setuptools-scm>=8"

# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Install in editable mode (for development)
pip install -e .

# Or build and install normally
pip install .

# Verify the installation
vllm --version
- Step 5
Basic Offline Inference
The simplest way to use vLLM is through its Python API for offline batch inference. This mode is ideal for processing large datasets, benchmarking, or when you don't need a server.
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Generate outputs
prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to reverse a string:",
]
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated_text}\n")
- Step 6
Launch OpenAI-Compatible API Server
vLLM's most powerful feature is its OpenAI-compatible API server, allowing drop-in replacement for OpenAI services. This is the recommended deployment mode for production applications.
# Basic server launch
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000

# Production-optimized launch
# (clients must then send "llama-3.2-3b" as the model name)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --api-key sk-your-secret-key \
    --served-model-name llama-3.2-3b

⚠ Heads up: The --api-key parameter is critical for production; without it the server accepts unauthenticated requests. Never reuse placeholder keys such as 'token-abc123' from documentation examples in exposed deployments. Generate a secure key with: openssl rand -base64 32
- Step 7
Docker-based API Server Deployment
Running the API server in Docker provides isolation, reproducibility, and simplified deployment. This is the recommended production setup, especially when using orchestration platforms like Kubernetes.
# Run vLLM API server with Docker
# (--ipc=host gives the container access to the host's shared memory, which PyTorch uses)
docker run -d \
    --name vllm-server \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9

# Check server logs
docker logs -f vllm-server

# Test the server
curl http://localhost:8000/v1/models
- Step 8
Query the API Server
Once the server is running, you can query it using curl, the OpenAI Python SDK, or any HTTP client. The server implements OpenAI-style endpoints such as /v1/chat/completions, /v1/completions, and /v1/models, so most existing clients only need a new base URL and API key.
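For clients without the OpenAI SDK, the request can be assembled with the Python standard library alone. This is a rough sketch: `build_chat_request` is a hypothetical helper, and the URL, key, and model name are the placeholder values used elsewhere in this guide. It only builds the request; the send step is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message, max_tokens=500):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "sk-your-secret-key",
    "meta-llama/Llama-3.2-3B-Instruct",
    "Explain PagedAttention",
)
print(req.get_method(), req.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(req, timeout=300) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```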
# Test with curl - chat completions
# (if the server was started with --served-model-name, use that name as "model")
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-your-secret-key" \
    -d '{
      "model": "meta-llama/Llama-3.2-3B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain PagedAttention"}
      ],
      "temperature": 0.7,
      "max_tokens": 500
    }'

# List available models (the Authorization header is needed when --api-key is set)
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer sk-your-secret-key"
- Step 9
Use OpenAI Python SDK with vLLM
The easiest way to integrate vLLM into existing applications is via the OpenAI Python SDK. Simply change the base_url to point to your vLLM server - no other code changes required.
from openai import OpenAI

# Configure client to use vLLM server
client = OpenAI(
    api_key="sk-your-secret-key",
    base_url="http://localhost:8000/v1"
)

# Chat completions (streaming)
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about AI"}
    ],
    stream=True,
    temperature=0.7
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
- Step 10
Multi-GPU Deployment with Tensor Parallelism
For larger models that don't fit in a single GPU, vLLM supports tensor parallelism to distribute the model across multiple GPUs. This enables serving models like Llama-70B or Mixtral-8x7B efficiently.
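Before picking a tensor-parallel size, it helps to check two things: the attention heads must divide evenly across GPUs, and each GPU holds roughly an equal share of the weights. The sketch below illustrates that arithmetic only; `validate_tensor_parallel` is a hypothetical helper, and the 64-head, 140 GB figures are assumed illustrative values for a 70B model in 16-bit precision.

```python
def validate_tensor_parallel(num_attention_heads, tp_size, weight_gb):
    """Check that heads split evenly and estimate weight memory per GPU."""
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    if num_attention_heads % tp_size != 0:
        raise ValueError(
            f"{num_attention_heads} attention heads cannot be split evenly "
            f"across {tp_size} GPUs"
        )
    return weight_gb / tp_size


# Illustrative numbers (assumptions): a 70B-parameter model in 16-bit
# precision needs about 140 GB for weights and is taken to have 64 heads.
per_gpu = validate_tensor_parallel(num_attention_heads=64, tp_size=4, weight_gb=140.0)
print(f"~{per_gpu:.0f} GB of weights per GPU")
```

The per-GPU figure covers weights only; KV cache and activations need additional headroom on every device.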
# Launch with 4-way tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000

# Docker deployment with multi-GPU
# (--ipc=host provides the shared memory that tensor-parallel workers use to communicate)
docker run -d \
    --gpus '"device=0,1,2,3"' \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4

# Check GPU utilization
watch -n 1 nvidia-smi

⚠ Heads up: Tensor parallelism works best when all GPUs are on the same node with fast interconnects (NVLink recommended), and the tensor-parallel size must evenly divide the model's number of attention heads. To span multiple nodes, combine it with pipeline parallelism via --pipeline-parallel-size.
- Step 11
Model Quantization for Memory Efficiency
vLLM supports various quantization methods to reduce memory footprint and increase throughput. Relative to 16-bit weights, quantization reduces weight memory by roughly 50% (8-bit) to 75% (4-bit), usually with small quality loss, which leaves more room for KV cache and larger batch sizes.
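The size reduction follows directly from bits per weight. As a back-of-the-envelope sketch (the helper names are hypothetical, and real quantized checkpoints are slightly larger because they also store scales and other metadata):

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * 1e9 * bits_per_weight / 8 / 1e9


def reduction_vs_fp16(bits_per_weight):
    """Fractional size reduction relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    print(f"{name:>15}: {weight_memory_gb(70, bits):6.1f} GB "
          f"({reduction_vs_fp16(bits):.0%} smaller than 16-bit)")
```

For a 70B-parameter model this gives about 140 GB at 16-bit, 70 GB at 8-bit, and 35 GB at 4-bit.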
# AWQ 4-bit quantization (pre-quantized model)
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --dtype half

# GPTQ 4-bit quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-GPTQ \
    --quantization gptq

# FP8 quantization (NVIDIA H100/Ada GPUs)
python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8

# Check model memory usage
# vLLM logs the size of the loaded weights at startup; overall GPU memory
# in use can be read with nvidia-smi
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
- Step 12
Advanced Configuration Options
Fine-tune vLLM behavior with advanced parameters for memory management, scheduling, and performance optimization. These settings can dramatically impact throughput and latency based on your workload.
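Settings such as --max-model-len, --max-num-seqs, and --gpu-memory-utilization all trade against the same resource: KV-cache memory. A rough estimate of how much cache one token needs makes the trade-off concrete. The sketch below uses the standard per-token formula (one key and one value vector per layer); the helper names and the model shape are hypothetical, so read the real values from your model's config.json.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_gb, bytes_per_token):
    """How many tokens fit in a given KV-cache budget (1 GB = 1e9 bytes)."""
    return int(kv_budget_gb * 1e9 // bytes_per_token)


# Hypothetical model shape for illustration; read the real values from the
# model's config.json (number of layers, KV heads, and head size).
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
print(per_token)                         # bytes per token
print(max_cached_tokens(10, per_token))  # tokens that fit in a 10 GB budget
```

Dividing the token budget by the typical sequence length gives a first guess at how many sequences can run concurrently.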
# Full production configuration example
# (--trust-remote-code is omitted: add it only for models that ship custom
# code you have reviewed, since it executes code from the model repository)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --max-num-seqs 512 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --disable-log-requests \
    --dtype bfloat16 \
    --kv-cache-dtype auto \
    --max-parallel-loading-workers 4
- Step 13
Configure Sampling Parameters
Control generation behavior with sampling parameters that affect output quality, diversity, and speed. These can be set per-request or as server defaults.
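To see what temperature, top_k, and top_p actually do, here is a toy pure-Python sampler over a list of logits. It is an illustration of the general technique under one common ordering of the filters, not vLLM's implementation (which runs on GPU tensors); `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick an index from `logits` using temperature, top-k and top-p."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    if top_k > 0:
        probs = probs[:top_k]  # keep only the k most likely tokens

    # Nucleus (top-p): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break

    # Renormalise over the survivors and draw one.
    r = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

Lower temperature and smaller top_k or top_p all shrink the set of plausible next tokens, which is why they make output more deterministic.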
from vllm import SamplingParams

# Deterministic sampling (temperature=0)
sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=1024
)

# Creative sampling
sampling_params = SamplingParams(
    temperature=0.9,
    top_p=0.95,
    top_k=50,
    max_tokens=2048,
    presence_penalty=0.1,
    frequency_penalty=0.1
)

# Constrained generation with stop sequences
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    stop=["\n\n", "```", "END"],
    skip_special_tokens=True
)
- Step 14
Understanding PagedAttention
PagedAttention is vLLM's core innovation - it manages KV cache memory in fixed-size blocks that need not be contiguous (much like virtual memory paging) instead of pre-allocating one large contiguous region per request. This nearly eliminates memory fragmentation and lets memory be allocated on demand as sequences grow; the vLLM authors report 2-4x higher throughput at comparable latency than earlier serving systems. Combined with continuous batching, vLLM keeps the GPU busy by admitting new requests as soon as others complete.
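The paging idea can be illustrated with a toy allocator. The sketch below is not vLLM's code: `BlockAllocator` is a hypothetical class that only tracks which physical blocks belong to which sequence, allocating a new block when the current one fills up and returning blocks to the pool when a request finishes.

```python
class BlockAllocator:
    """Toy KV-cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.seq_lens = {}       # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block if needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:      # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("request-a")   # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("request-b")   # 5 tokens  -> 1 block
print(alloc.block_tables)
alloc.free("request-a")
print(len(alloc.free_blocks))          # blocks available again
```

Because a sequence only ever wastes the unused tail of its last block, memory that a contiguous pre-allocation would have reserved stays available to other requests.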
# Monitor PagedAttention block allocation
# The server logs show block utilization in real-time
# (debug logging is enabled through the VLLM_LOGGING_LEVEL environment variable)
VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --block-size 16 \
    --num-gpu-blocks-override 2048

# Key metrics to watch:
# - GPU blocks used / total
# - Number of running sequences
# - Batched token throughput
- Step 15
Production Deployment Best Practices
For production deployments, implement proper security, monitoring, and scaling strategies. Use a reverse proxy for TLS termination, implement request rate limiting, and monitor GPU utilization and error rates.
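Request rate limiting is commonly implemented as a token bucket. The following minimal sketch shows the idea in Python; `TokenBucket` is a hypothetical class, not something vLLM provides, and in practice the limit is often enforced in the reverse proxy or API gateway instead.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Keeping one bucket per API key (for example in a dict keyed by the key) limits each client independently; rejected requests would typically receive HTTP 429.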
# NGINX reverse proxy configuration
server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Increase timeout for long generations
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;

        # Pass streamed (stream=true) responses through without buffering
        proxy_buffering off;
    }
}
- Step 16
Monitoring and Observability
vLLM exposes Prometheus metrics at the /metrics endpoint, covering running and queued requests, KV-cache usage, request latency, and token throughput. Integrate with your existing observability stack for production monitoring.
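For quick checks without a full Prometheus setup, the text format is simple enough to parse by hand. This is a simplified sketch: `parse_vllm_metrics` is a hypothetical helper, it assumes no trailing timestamps on sample lines, and the sample text is made up for illustration rather than captured from a real server.

```python
def parse_vllm_metrics(text):
    """Parse Prometheus text exposition into {metric_name: [(labels, value)]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        if not name.startswith("vllm:"):
            continue
        metrics.setdefault(name, []).append((labels.rstrip("}"), float(value)))
    return metrics


# Made-up sample in the Prometheus text format, for illustration only.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama-3.2-3b"} 4.0
vllm:num_requests_waiting{model_name="llama-3.2-3b"} 1.0
vllm:gpu_cache_usage_perc{model_name="llama-3.2-3b"} 0.37
process_cpu_seconds_total 12.5
"""

parsed = parse_vllm_metrics(SAMPLE)
for name, samples in parsed.items():
    print(name, samples[0][1])
```

A growing vllm:num_requests_waiting alongside high vllm:gpu_cache_usage_perc usually means the server is KV-cache bound.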
# The Prometheus metrics endpoint is served by default; these flags only reduce log noise
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --disable-log-requests \
    --uvicorn-log-level warning

# Metrics available at http://localhost:8000/metrics
curl http://localhost:8000/metrics

# Key metrics:
# - vllm:num_requests_running
# - vllm:num_requests_waiting
# - vllm:gpu_cache_usage_perc
# - vllm:time_to_first_token_seconds
# - vllm:time_per_output_token_seconds
- Step 17
Benchmarking Performance
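Measure your vLLM deployment's throughput and latency characteristics with a load test against the running server. The commands in this step use Apache Bench (ab); the vLLM repository also includes dedicated benchmark scripts in its benchmarks/ directory. The results help tune configuration parameters for your specific hardware and workload.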
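Whatever tool generates the load, the raw per-request timings reduce to a few headline numbers. A small sketch of that reduction, using nearest-rank percentiles; `percentile` and `summarize` are hypothetical helpers and the measurements are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, output_tokens, wall_time_s):
    """Reduce raw per-request measurements to headline numbers."""
    return {
        "requests_per_s": len(latencies_s) / wall_time_s,
        "output_tokens_per_s": sum(output_tokens) / wall_time_s,
        "p50_latency_s": percentile(latencies_s, 50),
        "p99_latency_s": percentile(latencies_s, 99),
    }


# Made-up measurements for illustration: five requests finished in 2 s.
stats = summarize(
    latencies_s=[0.8, 1.1, 0.9, 1.9, 1.0],
    output_tokens=[100, 100, 100, 100, 100],
    wall_time_s=2.0,
)
print(stats)
```

Tracking tail latency (p99) next to throughput matters because raising --max-num-seqs tends to improve the latter at the expense of the former.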
# Apache Bench is provided by apache2-utils on Debian/Ubuntu
# sudo apt-get install -y apache2-utils

# Start the server in the background
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct &

# Wait for server startup
until curl -sf http://localhost:8000/health > /dev/null; do sleep 2; done

# Write the request body used by both commands below
cat > request.json <<'EOF'
{
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "prompt": "Once upon a time",
  "max_tokens": 100
}
EOF

# Send a single request as a smoke test
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d @request.json

# Use Apache Bench for load testing (1000 requests, 10 concurrent)
ab -n 1000 -c 10 -p request.json \
    -T application/json \
    http://localhost:8000/v1/completions
- Step 18
Common Use Cases and Patterns
vLLM excels at various LLM serving scenarios. Here are practical examples for common production patterns including batch processing, real-time chat, and embeddings generation.
# Use Case 1: Batch document summarization
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
documents = [...]  # Load your documents (replace the placeholder with a list of strings)
prompts = [f"Summarize: {doc}" for doc in documents]
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))

# Use Case 2: Real-time streaming chat
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-key")
for chunk in client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="")

# Use Case 3: Embeddings generation
# (requires a server launched with an embedding model, not the chat model above)
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Your text to embed"
)
vector = response.data[0].embedding
- Step 19
Troubleshooting Common Issues
Solutions to frequently encountered problems when deploying vLLM. Most issues relate to GPU memory, CUDA compatibility, or model loading errors.
# Issue: CUDA out of memory
# Solution: Reduce gpu-memory-utilization, max-num-seqs, or max-model-len
python -m vllm.entrypoints.openai.api_server \
    --model <model> \
    --gpu-memory-utilization 0.8 \
    --max-num-seqs 128

# Issue: Model download fails
# Solution: Set Hugging Face token
export HUGGING_FACE_HUB_TOKEN=<your-token>

# Issue: Slow startup or slow first request
# Solution: This is expected - vLLM loads the model weights and captures CUDA
# graphs at startup, before it accepts requests. Wait until /health responds.
# Adding --enforce-eager skips CUDA graph capture for a faster start, at some
# cost in generation speed.
curl http://localhost:8000/health

# Issue: CUDA version mismatch
# Solution: Check CUDA version and install matching vLLM build
nvcc --version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Enable debug logging (set through an environment variable)
VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server \
    --model <model>

⚠ Heads up: If you encounter 'ImportError: cannot import name X from vllm', your installation may be corrupted. Uninstall with pip uninstall vllm -y and reinstall from scratch.
- Step 20
Updating vLLM
vLLM is actively developed with frequent releases adding new features, model support, and performance improvements. Stay updated to benefit from the latest optimizations and bug fixes.
# Update via pip
pip install --upgrade vllm

# Update Docker image
docker pull vllm/vllm-openai:latest

# Update from source
cd vllm
git pull origin main
pip install -e .

# Check current version
python -c "import vllm; print(vllm.__version__)"

# View changelog
curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | grep body