Prometheus - Monitoring System and Time Series Database
Install and configure Prometheus, the leading open-source monitoring and alerting toolkit - covering installation, configuration, PromQL queries, alerting with Alertmanager, and common use cases.
- Step 1
Overview
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Now a Cloud Native Computing Foundation (CNCF) project, it's one of the most popular monitoring solutions for cloud-native applications.
Key capabilities:
- Multi-dimensional data model: Time series data identified by metric name and key/value pairs
- PromQL: Powerful and flexible query language to leverage dimensionality
- No reliance on distributed storage: Each server node is autonomous
- Pull model: HTTP pull model for time series collection
- Service discovery: Automatic target discovery via service discovery or static configuration
- Alerting: Alertmanager for managing alerts and notifications
- Visualization: Built-in PromQL explorer and integration with Grafana
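The multi-dimensional data model can be illustrated with a toy in-memory store. This is only a sketch of the idea that a series is identified by its metric name plus its full label set; the TinyTSDB class and its methods are hypothetical and are not part of Prometheus.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by its metric name plus all of its label pairs."""
    return (name, frozenset(labels.items()))


class TinyTSDB:
    """Toy store (hypothetical): one list of (timestamp, value) samples per series."""

    def __init__(self):
        self.series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self.series[series_key(name, labels)].append((timestamp, value))

    def select(self, name, **matchers):
        """Return series with this name whose labels include every matcher pair."""
        wanted = set(matchers.items())
        return {
            key: samples
            for key, samples in self.series.items()
            if key[0] == name and wanted <= key[1]
        }


db = TinyTSDB()
db.append("http_requests_total", {"method": "GET", "status": "200"}, 1000, 10.0)
db.append("http_requests_total", {"method": "GET", "status": "500"}, 1000, 1.0)
db.append("http_requests_total", {"method": "GET", "status": "200"}, 1015, 14.0)

# Changing any label value creates a separate series.
print(len(db.series))
print(len(db.select("http_requests_total", status="200")))
```

Each distinct combination of label values is stored and queried as its own series, which is why high-cardinality labels (user IDs, request IDs) are usually avoided.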
Why Prometheus:
- CNCF graduated project: Production-ready with a large community
- Kubernetes-native: First-class support for Kubernetes monitoring
- Rich ecosystem: Thousands of exporters for various systems and services
- Flexible: Works for infrastructure, application, and business metrics
- Time-series native: Optimized for numeric metrics, with millisecond-precision timestamps
Official site: https://prometheus.io
GitHub: https://github.com/prometheus/prometheus (64K+ stars)
Documentation: https://prometheus.io/docs/introduction/overview/
CNCF Project: https://www.cncf.io/projects/prometheus/
- Step 2
Quick Installation Options
Multiple installation methods are available, depending on your environment.
Installation options:
- Docker: Quick start for development and testing
- Binary: Precompiled binaries for production
- Helm: Kubernetes deployment via the prometheus-community chart
- Source: Build from source for customization
- Package managers: Available for major Linux distributions
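Whichever option is used, the server needs a moment to start, so deployment scripts often poll its health endpoint before continuing. The following is a minimal sketch, assuming Prometheus listens on localhost:9090 and serves /-/healthy; the helper names http_check and wait_until_healthy are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_check(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the number of the successful attempt, or None if all attempts fail.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Usage (assumes a local Prometheus on the default port):
# wait_until_healthy(lambda: http_check("http://localhost:9090/-/healthy"))
```

Passing the check as a callable keeps the retry loop independent of the network, so it can be exercised with a fake check in tests.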
# Option 1: Docker (quick start)
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest
# Access at http://localhost:9090

# Option 2: Binary (Linux)
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xvfz prometheus-2.52.0.linux-amd64.tar.gz
cd prometheus-2.52.0.linux-amd64
./prometheus --config.file=prometheus.yml --web.listen-address=:9090

# Option 3: Homebrew (macOS)
brew install prometheus
brew services start prometheus

# Option 4: APT (Debian/Ubuntu, from the distribution's own repository; the version may lag behind upstream)
sudo apt update
sudo apt install prometheus

# Option 5: Helm (Kubernetes)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus

# Verify installation
curl http://localhost:9090/-/healthy
# Expect HTTP 200 and a short "healthy" message
- Step 3
Building from Source
Build Prometheus from source for the latest features or custom configurations. This requires Go, Node.js, and npm.
Prerequisites:
- Go: Version specified in go.mod or greater
- Node.js: Version specified in .nvmrc or greater
- npm: Version 10 or greater
# Install dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y git gcc g++ make wget curl

# Install Go (match the version required by go.mod)
wget https://go.dev/dl/go1.22.0.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.22.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Install Node.js and npm (match the version in .nvmrc)
wget https://nodejs.org/dist/v20.10.0/node-v20.10.0-linux-x64.tar.xz
tar -xf node-v20.10.0-linux-x64.tar.xz
sudo mv node-v20.10.0-linux-x64 /usr/local/node
sudo ln -s /usr/local/node/bin/node /usr/local/bin/node
sudo ln -s /usr/local/node/bin/npm /usr/local/bin/npm

# Clone repository
git clone https://github.com/prometheus/prometheus.git
cd prometheus

# Build with web assets
make build

# Run Prometheus (point --config.file at your own configuration file)
./prometheus --config.file=prometheus.yml

# Verify
curl http://localhost:9090/-/healthy
# Expect HTTP 200 and a short "healthy" message
- Step 4
Basic Configuration
Prometheus uses a YAML configuration file to define global settings, alerting, and scraping jobs.
Key configuration sections:
- global: Global scrape and evaluation intervals
- alerting: Alertmanager connection settings
- rule_files: Alert and recording rule files
- scrape_configs: Target discovery and scraping configuration
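Relabeling rules inside scrape_configs are easier to reason about once their mechanics are spelled out: the source label values are joined with a separator (";" by default), the regex must match the whole joined string, and capture groups are substituted into the replacement. The sketch below imitates only the replace action in Python; relabel_replace is a hypothetical helper, and Python's regex engine differs in details from the RE2 syntax Prometheus uses.

```python
import re


def to_python_template(replacement):
    """Translate $1 / ${1} group references into Python's \\g<1> form."""
    return re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one 'replace' relabel rule to a label dict and return a new dict."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored at both ends
    result = dict(labels)
    if match is not None:  # without a match the labels stay unchanged
        result[target_label] = match.expand(to_python_template(replacement))
    return result


# Copy __address__ into the instance label.
target = {"__address__": "localhost:9100"}
print(relabel_replace(target, ["__address__"], r"(.*)", "${1}", "instance"))

# Rewrite the scrape port from a pod annotation (Kubernetes service discovery).
pod = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
rewritten = relabel_replace(
    pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])
```

The second rule leaves the address untouched when the port annotation is missing, because the joined value then no longer matches the regex.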
# prometheus.yml - Basic configuration
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Per-target scrape timeout

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Alert rules
rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

# Scrape configurations
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets:
          - localhost:9090
        labels:
          app: "prometheus"

  # Scrape Node Exporter (system metrics)
  - job_name: "node"
    static_configs:
      - targets:
          - node1:9100
          - node2:9100
        labels:
          group: "production"

  # Scrape application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - app-server:8080
        labels:
          env: "production"

  # File-based service discovery
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.yaml"

  # Relabeling example
  - job_name: "relabel-example"
    static_configs:
      - targets:
          - localhost:9100
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "(.*)"
        replacement: "${1}"
        action: replace
- Step 5
Service Discovery
Prometheus supports multiple service discovery mechanisms for dynamic target detection:
Service discovery types:
- Static: Fixed list of targets in config
- File-based: Targets from local JSON/YAML files
- DNS: DNS SRV record discovery
- Consul: Consul catalog discovery
- Docker: Docker containers via API
- Kubernetes: Native K8s service discovery
- AWS: EC2 and Lightsail instance discovery
- Azure: VM discovery
- GCE: Google Compute Engine discovery
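File-based discovery is the simplest mechanism to drive from a script: Prometheus watches the files and picks up changes without a restart. The following is a sketch, assuming targets are written as JSON (so the files glob in the scrape job would need to match *.json); render_file_sd and write_atomically are hypothetical helper names.

```python
import json
import os
import tempfile


def render_file_sd(groups):
    """Render (targets, labels) pairs in the file-based discovery JSON format."""
    doc = [
        {"targets": sorted(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(doc, indent=2)


def write_atomically(path, text):
    """Write to a temp file and rename it, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp_path, path)


groups = [(["node2:9100", "node1:9100"], {"group": "production"})]
text = render_file_sd(groups)
print(text)

# Demonstration only: write into a throwaway directory instead of /etc/prometheus.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "nodes.json")
write_atomically(demo_path, text)
```

Renaming a finished file into place is a common precaution so that Prometheus does not read a half-written target list.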
# Kubernetes service discovery (native)
scrape_configs:
  # Discover all pods with the prometheus.io/scrape annotation
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use annotation for metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use annotation for port (the existing port in __address__ is optional)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add pod and namespace labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Consul service discovery
  - job_name: "consul-services"
    consul_sd_configs:
      - server: consul.example.com:8500
        datacenter: dc1
        tag_separator: ","
    relabel_configs:
      # Tags are joined and surrounded by the separator, e.g. ",web,prometheus,"
      - source_labels: [__meta_consul_tags]
        action: keep
        regex: ".*,prometheus,.*"

  # Docker service discovery
  - job_name: "docker-containers"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        action: keep
        regex: "true"

  # AWS EC2 discovery (ec2_sd_configs in Prometheus 2.x)
  - job_name: "aws-ec2"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        action: keep
        regex: "prometheus.*"
- Step 6
PromQL Basics
PromQL (Prometheus Query Language) is powerful and flexible. Start with basic queries to understand the data model.
Query types:
- Instant: Return data for current timestamp
- Range: Return data over a time range
- Aggregation: Combine multiple time series
- Recording: Pre-compute complex queries
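Instant and range queries map onto two HTTP API endpoints, /api/v1/query and /api/v1/query_range. The sketch below only builds the request URLs so the parameters are visible; the base URL assumes a local server on the default port, and the two function names are hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, at=None):
    """URL for an instant query; 'at' is an optional Unix timestamp."""
    params = {"query": promql}
    if at is not None:
        params["time"] = at
    return f"{base_url}/api/v1/query?{urlencode(params)}"


def range_query_url(base_url, promql, start, end, step):
    """URL for a range query between two Unix timestamps at a given step."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


BASE = "http://localhost:9090"  # assumption: default local server
instant_url = instant_query_url(BASE, 'up{job="prometheus"}')
range_url = range_query_url(
    BASE, "rate(http_requests_total[5m])", 1700000000, 1700003600, "15s"
)
print(instant_url)
print(range_url)
```

The expression is URL-encoded because PromQL uses braces, quotes, and brackets that are not safe in a query string.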
# Access the Prometheus web UI at http://localhost:9090/graph
# and enter these queries in the expression field.

# Basic metric queries
node_cpu_seconds_total{mode="idle"}   # Idle CPU time counters for every CPU
up{job="prometheus"}                  # 1 if the target is up, 0 if it is down

# Instant vector: one value per series at the evaluation time
rate(node_cpu_seconds_total{mode!="idle"}[5m])   # Per-second non-idle CPU time, averaged over the last 5 minutes

# Range vector: all samples per series within the window
node_memory_MemAvailable_bytes[1h]    # Available-memory samples from the last hour

# Label filtering
http_requests_total{status="200", method="GET"}   # Filter by status and method labels

# Label matching operators
# =   Exact match
# !=  Not equal
# =~  Regex match (case-sensitive, anchored to the whole value)
# !~  Negated regex match (case-sensitive, anchored to the whole value)

# Example: readiness of containers in the production namespace
kube_pod_container_status_ready{namespace="production"}

# Match several mount points with one regex
node_filesystem_size_bytes{mountpoint=~"/(var|etc)"}
- Step 7
PromQL Functions
PromQL provides rich functions for time series manipulation:
Common functions:
- rate(): Per-second average over a range
- increase(): Total increase over a range
- avg(): Average across series
- sum(): Sum across series
- count(): Count of series
- histogram_quantile(): Calculate percentiles
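The difference between rate() and increase(), and the way both cope with counter resets, can be sketched in a few lines. This is a simplified model: Prometheus additionally extrapolates to the edges of the query window, which the sketch leaves out, and the function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples.

    A drop in value is treated as a counter reset, after which the
    counter is assumed to have restarted from zero.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # reset: count everything since the restart
    return total


def counter_rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return None  # at least two samples are needed
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# One sample every 15 s; the process restarts between t=30 and t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(counter_increase(samples))
print(counter_rate(samples))
```

Because resets are detected from drops in value, these functions are only meaningful on counters, not on gauges that legitimately go down.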
# Rate calculations (most common)
rate(http_requests_total[5m])      # Requests per second, averaged over the last 5 minutes
rate(node_cpu_seconds_total[5m])   # CPU seconds used per second, per core and mode

# Increase (total count)
increase(http_requests_total[1h])  # Total requests in the last hour

# Aggregations
sum(rate(http_requests_total[5m]))                    # Total RPS across all instances
sum by (job) (rate(http_requests_total[5m]))          # RPS grouped by job
avg by (instance) (node_memory_MemAvailable_bytes)    # Average available memory per instance

# Percentiles (for histograms)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))   # 95th percentile request duration

# Derived metrics
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100   # Memory usage percentage

# Comparisons
up == 0   # Find down targets

# Time-based functions
time()                   # Current Unix timestamp (the evaluation time)
time() - timestamp(up)   # Seconds since each target's latest "up" sample

# Series grouping
sum without (instance) (rate(node_network_receive_bytes_total[5m]))   # Sum network receive rate, dropping the instance label
- Step 8
Recording Rules
Recording rules pre-compute complex queries and store them as new time series. Useful for performance and to avoid repetitive queries.
Benefits:
- Pre-compute expensive queries
- Reuse common expressions
- Reduce query complexity
- Improve alert reliability
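Quantile estimates from histograms are a typical candidate for pre-computation, because each evaluation has to walk every bucket of every series. The sketch below shows the linear interpolation idea behind histogram_quantile() on a single set of cumulative buckets; it is a simplified model rather than the Prometheus implementation, and it assumes 0 < q <= 1 and buckets sorted by upper bound.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes 0 < q <= 1 and that buckets are sorted and end with
    (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Cumulative request counts per latency bucket (seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95)
```

The result is an estimate whose accuracy depends on the bucket layout: observations are assumed to be spread evenly inside each bucket.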
# recording_rules.yml
groups:
  - name: app_recording_rules
    interval: 30s
    rules:
      # Pre-compute RPS for all endpoints
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (endpoint)
      # Pre-compute error rate (5xx responses per second)
      - record: http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
      # Pre-compute memory usage
      - record: instance:memory_usage:ratio
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
      # Pre-compute CPU usage
      - record: instance:cpu_usage:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      # 95th percentile latency
      - record: api:latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

  # Alert rules
  - name: app_alert_rules
    rules:
      # High error rate: more than 5% of requests fail
      - alert: HighErrorRate
        expr: http_errors:rate5m / http_requests:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value | humanizePercentage }} for endpoint {{ $labels.endpoint }}"
      # High memory usage
      - alert: HighMemoryUsage
        expr: instance:memory_usage:ratio > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }} is down"
          description: "Target {{ $labels.instance }} has been down for more than 2 minutes"
- Step 9
Alertmanager Configuration
Alertmanager handles alerts from Prometheus: deduplication, grouping, silencing, and routing to notification channels.
Features:
- Deduplication: Merge duplicate alerts
- Grouping: Combine related alerts
- Inhibition: Suppress alerts while a related, more severe alert is firing
- Routing: Route alerts to different receivers
- Silencing: Temporary alert suppression
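Grouping and inhibition are the two features that most often surprise new users, so a small model may help. The sketch below is not Alertmanager code: it assumes alerts are plain dictionaries with a labels field, uses exact label equality only, and the function names are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_inhibited(target, alerts, source_match, target_match, equal):
    """True if some other alert matching source_match mutes the target."""

    def matches(alert, wanted):
        return all(alert["labels"].get(k) == v for k, v in wanted.items())

    if not matches(target, target_match):
        return False
    return any(
        matches(source, source_match)
        and all(source["labels"].get(k) == target["labels"].get(k) for k in equal)
        for source in alerts
        if source is not target
    )


warning = {"labels": {"alertname": "HighMemoryUsage", "severity": "warning", "instance": "node1"}}
critical = {"labels": {"alertname": "HighMemoryUsage", "severity": "critical", "instance": "node1"}}
other = {"labels": {"alertname": "HighMemoryUsage", "severity": "warning", "instance": "node3"}}
down = {"labels": {"alertname": "TargetDown", "severity": "critical", "instance": "node2"}}
alerts = [warning, critical, other, down]

groups = group_alerts(alerts, ["alertname"])
print({key: len(members) for key, members in groups.items()})

rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "instance"],
)
print(is_inhibited(warning, alerts, **rule))
print(is_inhibited(other, alerts, **rule))
```

The warning on node1 is muted because a critical alert with the same alertname and instance is firing, while the warning on node3 still goes out.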
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "password"

# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Inhibition: a critical alert mutes the matching warning
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# Alert routing
route:
  group_by: ['alertname', 'severity', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: slack-warnings
    # Database alerts to specific channel
    - match:
        alertname: DatabaseConnectionError
      receiver: db-team
    # Kubernetes alerts
    - match:
        kubernetes: 'true'
      receiver: k8s-team

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'db-team'
    email_configs:
      - to: 'dba@example.com'
  - name: 'k8s-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#k8s-alerts'
- Step 10
Node Exporter Installation
Node Exporter exposes hardware and OS metrics. Essential for infrastructure monitoring.
Metrics collected:
- CPU, memory, disk, network
- Filesystem usage
- Load averages
- Process information
- Hardware sensors
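Node Exporter serves these metrics as plain text, one sample per line, which makes the output easy to inspect by hand or with a short script. The parser below is a deliberately simplified sketch: it ignores HELP and TYPE comments, optional timestamps, and escaped quotes inside label values, and parse_metrics is a hypothetical helper.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Example input in the text exposition format (values are made up).
EXAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1200.5
node_cpu_seconds_total{cpu="0",mode="user"} 300.25
node_load1 0.42
"""

parsed = parse_metrics(EXAMPLE)
for name, labels, value in parsed:
    print(name, labels, value)
```

For anything beyond quick inspection, a maintained parser from a Prometheus client library is the safer choice.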
# Option 1: Docker
docker run -d \
  --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host

# Option 2: Binary (Linux)
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.1.linux-amd64.tar.gz
cd node_exporter-1.8.1.linux-amd64

# Install the binary where the service expects it and create the service user
sudo mkdir -p /opt/node_exporter
sudo cp node_exporter /opt/node_exporter/
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/opt/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl -s http://localhost:9100/metrics | head -20

# Common metrics to query
node_cpu_seconds_total{mode!="idle"}
node_memory_MemTotal_bytes
node_filesystem_avail_bytes
node_network_receive_bytes_total
node_load1
node_boot_time_seconds
- Step 11
Prometheus for Kubernetes
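Prometheus is the de facto standard for Kubernetes monitoring. The prometheus-community Helm charts provide ready-made installations: the prometheus chart deploys the server and its companion components, and the kube-prometheus-stack chart adds the Prometheus Operator.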
What gets deployed:
- Prometheus server
- Alertmanager
- Node Exporter DaemonSet
- kube-state-metrics
- ServiceMonitor/PodMonitor CRDs (with the kube-prometheus-stack chart)
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with default values
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace

# Install with custom values
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace \
  --set server.persistentVolume.enabled=true \
  --set server.retention=15d \
  --set alertmanager.enabled=true \
  --values values.yaml

# Example values.yaml (prometheus chart)
server:
  persistentVolume:
    enabled: true
    size: 50Gi
  retention: 15d
  resources:
    limits:
      memory: 2Gi
    requests:
      memory: 1Gi
alertmanager:
  enabled: true
  persistence:
    enabled: true
    size: 1Gi

# Access via port-forward (with release name "prometheus" the server Service is prometheus-server on port 80)
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Access via Ingress (enable in values.yaml)
# server:
#   ingress:
#     enabled: true
#     hosts:
#       - prometheus.example.com

# Operator-based alternative: the kube-prometheus-stack chart installs the
# Prometheus Operator together with its CRDs (ServiceMonitor, PodMonitor)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values stack-values.yaml

# Example stack-values.yaml (kube-prometheus-stack chart)
prometheusOperator:
  enabled: true
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

# To install only the Prometheus Operator and its CRDs without Helm
# (use "create": the bundle is too large for client-side "apply")
kubectl create -f https://github.com/prometheus-operator/prometheus-operator/releases/download/v0.72.0/bundle.yaml
- Step 12
Common Exporters
The Prometheus ecosystem includes exporters for virtually any system or service:
Popular exporters:
- node_exporter: Hardware/OS metrics
- mysqld_exporter: MySQL database metrics
- postgres_exporter: PostgreSQL metrics
- redis_exporter: Redis metrics
- nginx_exporter: NGINX metrics
- cadvisor: Container metrics
- blackbox_exporter: Probing/health checks
- snmp_exporter: SNMP device metrics
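When no ready-made exporter exists, an application can serve its own /metrics endpoint. The sketch below uses only Python's standard library to show what an exporter does at its core; the myapp_* metric names, the fixed values, and port 8000 are hypothetical, and real applications would normally use an official client library rather than hand-written text.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth, jobs_done):
    """Render two hypothetical metrics in the text exposition format."""
    lines = [
        "# HELP myapp_queue_depth Jobs currently waiting in the queue.",
        "# TYPE myapp_queue_depth gauge",
        f"myapp_queue_depth {queue_depth}",
        "# HELP myapp_jobs_done_total Jobs completed since start.",
        "# TYPE myapp_jobs_done_total counter",
        f"myapp_jobs_done_total {jobs_done}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics and 404 everywhere else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Fixed numbers for illustration; a real exporter reads live state.
        body = render_metrics(queue_depth=3, jobs_done=42).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(queue_depth=3, jobs_done=42), end="")
# To run the exporter: serve(8000), then scrape http://localhost:8000/metrics
```

A scrape job for it looks like any other static target, for example a targets entry of localhost:8000.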
# prometheus.yml - Exporter scrape configs
scrape_configs:
  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
    metrics_path: '/metrics'

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Nginx Exporter
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Blackbox Exporter (probing)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a 2xx response
    static_configs:
      - targets:
          - https://prometheus.io
          - https://github.com
    relabel_configs:
      # Pass the original target as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

# Docker run examples for exporters:
# (DATA_SOURCE_NAME applies to older mysqld_exporter releases; newer ones read credentials from a config file)
# docker run -d -p 9104:9104 -e DATA_SOURCE_NAME="user:pass@tcp(mysql:3306)/" prom/mysqld-exporter
# docker run -d -p 9187:9187 -e DATA_SOURCE_URI="postgres:5432/default?sslmode=disable" prometheuscommunity/postgres-exporter
# docker run -d -p 9121:9121 -e REDIS_ADDR="redis:6379" oliver006/redis_exporter
# (assumes NGINX exposes stub_status at /stub_status)
# docker run -d -p 9113:9113 -e SCRAPE_URI="http://nginx:80/stub_status" nginx/nginx-prometheus-exporter
# docker run -d -p 8080:8080 -v /:/rootfs:ro -v /var/run:/var/run:ro -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro gcr.io/cadvisor/cadvisor
- Step 13
Grafana Integration
Grafana provides visualization for Prometheus data, and the two are commonly deployed together.
Features:
- Beautiful dashboards
- Alerting rules
- Multi-data source support
- Variable interpolation
- Templated queries
# Install Grafana with Prometheus data source
# Docker Compose example
cat > docker-compose.yml << 'EOF'
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
volumes:
  prom_data:
  grafana_data:
EOF

# Auto-provision Prometheus data source in Grafana
mkdir -p provisioning/datasources
cat > provisioning/datasources/datasources.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
EOF

docker compose up -d   # or "docker-compose up -d" with the legacy standalone binary

# Access Grafana at http://localhost:3000
# Login: admin/admin (change this password outside of local testing)

# Import dashboards from Grafana.com:
# Node Exporter Full: https://grafana.com/grafana/dashboards/1860
# Prometheus Metrics: https://grafana.com/grafana/dashboards/2
# Kubernetes Cluster: https://grafana.com/grafana/dashboards/315
# Save dashboard JSON and import via:
# Grafana UI → Dashboards → Import
- Step 14
High Availability Setup
For production, configure Prometheus for high availability:
HA considerations:
- Multiple Prometheus instances
- Remote write for backup
- Thanos for long-term storage
- Federation for horizontal scaling
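Running several identical Prometheus instances means every series exists once per replica, so whatever sits on top has to merge them at query time. The sketch below illustrates the idea under strong simplifying assumptions: replicas are told apart by a replica label, and their sample timestamps are assumed to line up exactly. Real replicas scrape at slightly different instants, and Thanos uses its own, more robust deduplication logic; the deduplicate function here is hypothetical.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    series_list: list of (labels_dict, [(timestamp, value), ...]).
    Returns {identity: sorted samples}, where identity is the label set
    without the replica label.
    """
    merged = {}
    for labels, samples in series_list:
        identity = frozenset(
            (key, value) for key, value in labels.items() if key != replica_label
        )
        points = merged.setdefault(identity, {})
        for timestamp, value in samples:
            # The first replica to report a timestamp wins; later ones fill gaps.
            points.setdefault(timestamp, value)
    return {identity: sorted(points.items()) for identity, points in merged.items()}


replica_a = (
    {"__name__": "node_load1", "instance": "node1:9100", "replica": "a"},
    [(0, 0.10), (15, 0.12), (45, 0.20)],  # missed the scrape at t=30
)
replica_b = (
    {"__name__": "node_load1", "instance": "node1:9100", "replica": "b"},
    [(0, 0.10), (15, 0.12), (30, 0.16), (45, 0.20)],
)

result = deduplicate([replica_a, replica_b])
for identity, samples in result.items():
    print(sorted(identity), samples)
```

The gap left by one replica is filled from the other, which is the practical benefit of running a pair.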
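# HA configuration with remote write (prometheus.yml on each replica)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod          # example values; each replica needs a distinct replica label
    replica: prometheus-1

remote_write:
  - url: http://thanos-receive-1:19291/api/v1/receive
    remote_timeout: 30s
    send_exemplars: true
  - url: http://thanos-receive-2:19291/api/v1/receive
    remote_timeout: 30s

# Federation for scaling: a scrape job that pulls selected series
# from another server's /federate endpoint
scrape_configs:
  - job_name: "federate-prometheus2"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job="node"}'
    static_configs:
      - targets:
          - prometheus-2:9090

# Thanos sidecar deployment
# (volumes for /prometheus, the Prometheus config, and /etc/thanos are omitted and must be added)
cat > thanos-sidecar.yml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-sidecar
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
        - name: thanos-sidecar
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - 'sidecar'
            - '--prometheus.url=http://localhost:9090'
            - '--tsdb.path=/prometheus'
            - '--objstore.config-file=/etc/thanos/objstore.yml'
EOF

# Object store config for S3
cat > objstore.yml << 'EOF'
type: S3
config:
  bucket: prometheus-data
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  insecure: false
EOF
- Step 15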
Resources & Next Steps
Documentation:
- Official site: https://prometheus.io
- Documentation: https://prometheus.io/docs/
- GitHub: https://github.com/prometheus/prometheus
Community:
- Prometheus Slack: https://slack.prometheus.io/
Tools:
- Grafana - Visualization: https://grafana.com/
- Alertmanager - Alerting
- Thanos - Long-term storage: https://thanos.io/
- VictoriaMetrics - Alternative TSDB: https://victoriametrics.com/
Next guides:
- Prometheus for Kubernetes with kube-prometheus-stack
- Alertmanager advanced routing and silencing
- Thanos for multi-tenant Prometheus
- Recording rules and performance optimization