
Prometheus - Monitoring System and Time Series Database

Install and configure Prometheus, the leading open-source monitoring and alerting toolkit - covering installation, configuration, PromQL queries, alerting with Alertmanager, and common use cases.

Step 1
Overview
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Now a Cloud Native Computing Foundation (CNCF) project, it's one of the most popular monitoring solutions for cloud-native applications.

Key capabilities:
- Multi-dimensional data model: Time series data identified by metric name and key/value pairs
- PromQL: Powerful and flexible query language to leverage dimensionality
- No distributed storage: Single server nodes are autonomous
- Pull model: HTTP pull model for time series collection
- Service discovery: Automatic target discovery via service discovery or static configuration
- Alerting: Alertmanager for managing alerts and notifications
- Visualization: Built-in PromQL explorer and integration with Grafana

Why Prometheus:
- CNCF graduated project: Production-ready with broad community support
- Kubernetes-native: First-class support for Kubernetes monitoring
- Rich ecosystem: Thousands of exporters for various systems and services
- Flexible: Works for infrastructure, application, and business metrics
- Time-series native: Optimized for metrics with sub-second resolution
```
Official site: https://prometheus.io
GitHub: https://github.com/prometheus/prometheus (64K+ stars)
Documentation: https://prometheus.io/docs/introduction/overview/
CNCF Project: https://www.cncf.io/projects/prometheus/
```

Step 2

Quick Installation Options

Several installation methods are available; choose one based on your environment:

Docker: Quick start for development and testing
Binary: Precompiled binaries for production
Helm: Kubernetes deployment via the community chart
Source: Build from source for customization
Package managers: Available for major Linux distributions and Homebrew

# Option 1: Docker (quick start)
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest

# Access at http://localhost:9090

# Option 2: Binary (Linux)
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xvfz prometheus-2.52.0.linux-amd64.tar.gz
cd prometheus-2.52.0.linux-amd64
./prometheus --config.file=prometheus.yml --web.listen-address=:9090

# Option 3: Homebrew (macOS)
brew install prometheus
brew services start prometheus

# Option 4: APT (Debian/Ubuntu, packaged in the distribution repositories)
sudo apt update
sudo apt install prometheus

# Option 5: Helm (Kubernetes)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus

# Verify installation
curl http://localhost:9090/-/healthy
# Output: OK
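
The health check above can also be scripted, for example as a readiness gate in a deployment script. The following is a minimal sketch using only the Python standard library; the function name `is_healthy` and the default timeout are illustrative assumptions, while the `/-/healthy` path is the one used in the curl command above.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True when GET <base_url>/-/healthy answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as unhealthy.
        return False


if __name__ == "__main__":
    # Assumes Prometheus listens on the default address used in this guide.
    print(is_healthy("http://localhost:9090"))
```

Returning `False` instead of raising keeps the function easy to use inside a retry loop.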

Step 3

Building from Source

Build Prometheus from source for the latest features or custom configurations. This requires Go, Node.js, and npm.

Prerequisites:

Go: Version specified in go.mod or greater
Node.js: Version specified in .nvmrc or greater
npm: Version 10 or greater

# Install dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y git gcc g++ make wget curl

# Install Go (version from go.mod)
wget https://go.dev/dl/go1.22.0.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.22.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Install Node.js and npm
wget https://nodejs.org/dist/v20.10.0/node-v20.10.0-linux-x64.tar.xz
tar -xf node-v20.10.0-linux-x64.tar.xz
sudo mv node-v20.10.0-linux-x64 /usr/local/node
sudo ln -s /usr/local/node/bin/node /usr/local/bin/node
sudo ln -s /usr/local/node/bin/npm /usr/local/bin/npm

# Clone repository
git clone https://github.com/prometheus/prometheus.git
cd prometheus

# Build with web assets
make build

# Run Prometheus
./prometheus --config.file=prometheus.yml

# Verify
curl http://localhost:9090/-/healthy
# Output: OK

Step 4

Basic Configuration

Prometheus uses a YAML configuration file to define global settings, alerting, and scraping jobs.

Key configuration sections:

global: Global scrape and evaluation intervals
alerting: Alertmanager connection settings
rule_files: Alert and recording rule files
scrape_configs: Target discovery and scraping configuration

# prometheus.yml - Basic configuration

global:
  scrape_interval: 15s     # How often to scrape targets
  evaluation_interval: 15s # How often to evaluate rules
  scrape_timeout: 10s      # Per-target scrape timeout

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

# Alert rules
rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

# Scrape configurations
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets:
          - localhost:9090
        labels:
          app: "prometheus"

  # Scrape Node Exporter (system metrics)
  - job_name: "node"
    static_configs:
      - targets:
          - node1:9100
          - node2:9100
        labels:
          group: "production"

  # Scrape application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - app-server:8080
        labels:
          env: "production"

  # File-based service discovery
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.yaml"

  # Relabeling example
  - job_name: "relabel-example"
    static_configs:
      - targets:
          - localhost:9100
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "(.*)"
        replacement: "${1}"
        action: replace
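
The `file-sd` job above reads its targets from files on disk, which makes it easy to generate them from an inventory. The sketch below builds a target file in the JSON form that file-based discovery accepts (a list of objects with `targets` and `labels`). The inventory contents and the helper name `build_file_sd` are hypothetical, and because the job above globs `*.yaml`, a JSON file would need a matching glob such as `*.json`.

```python
import json


def build_file_sd(groups):
    """Return file_sd JSON: a list of {"targets": [...], "labels": {...}} objects."""
    entries = []
    for labels, hosts in groups:
        entries.append({"targets": sorted(hosts), "labels": dict(labels)})
    return json.dumps(entries, indent=2)


# Hypothetical inventory: two production nodes and one staging node.
inventory = [
    ({"env": "production"}, ["node2:9100", "node1:9100"]),
    ({"env": "staging"}, ["node3:9100"]),
]

document = build_file_sd(inventory)
print(document)
```

Writing the result to a file under the watched directory is enough for Prometheus to pick up the change; no restart is needed for file-based discovery.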

Step 5

Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic target detection:

Service discovery types:

Static: Fixed list of targets in config
File-based: Targets from local JSON/YAML files
DNS: DNS SRV record discovery
Consul: Consul catalog discovery
Docker: Docker containers via API
Kubernetes: Native K8s service discovery
AWS: EC2 and Lightsail discovery
Azure: VM discovery
GCE: Google Compute Engine discovery

# Kubernetes service discovery (native)
scrape_configs:
  # Discover all pods with prometheus.io/scrape annotation
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Use annotation for metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Use annotation for port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

      # Add pod and namespace labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Consul service discovery
  - job_name: "consul-services"
    consul_sd_configs:
      - server: consul.example.com:8500
        datacenter: dc1
        tag_separator: ","
    relabel_configs:
      # __meta_consul_tags is wrapped in the separator, e.g. ",web,prometheus,"
      - source_labels: [__meta_consul_tags]
        action: keep
        regex: ".*,prometheus,.*"

  # Docker service discovery
  - job_name: "docker-containers"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        action: keep
        regex: "true"

  # AWS EC2 discovery
  - job_name: "aws-ec2"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        action: keep
        regex: "prometheus.*"
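
Relabel rules such as the port-annotation rewrite above are easier to reason about when traced by hand. The following Python sketch imitates the `replace` action under simplifying assumptions: source label values are joined with `;`, the regex must match the whole joined string, and `$1`-style references are expanded. The function name `relabel_replace` is hypothetical, Python's `re` stands in for Prometheus's RE2 engine, and the regex is written with an optional port group, `(?::\d+)?`.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Simplified 'replace' action: join sources, match the anchored regex, write target."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left untouched
    # Convert Prometheus-style $1 / ${1} references to Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


pod = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
rewritten = relabel_replace(
    pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])  # 10.0.0.5:9102
```

A pod without the port annotation produces a joined value ending in `;`, which does not match, so its address stays as discovered.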

Step 6

PromQL Basics

PromQL (Prometheus Query Language) is powerful and flexible. Start with basic queries to understand the data model.

Query types:

Instant: Return data for current timestamp
Range: Return data over a time range
Aggregation: Combine multiple time series
Recording: Pre-compute complex queries

# Access Prometheus web UI at http://localhost:9090/graph
# Then use these queries in the expression field:

# Basic metric queries
node_cpu_seconds_total{mode="idle"}
# Get all CPU idle metrics

up{job="prometheus"}
# Check if prometheus targets are up (1=up, 0=down)

# Instant vector - current values
rate(node_cpu_seconds_total{mode!="idle"}[5m])
# Per-second rate of non-idle CPU time, averaged over the last 5 minutes

# Range vector - time series over time
node_memory_MemAvailable_bytes[1h]
# Memory available over last hour

# Label filtering
http_requests_total{status="200", method="GET"}
# Filter by status and method labels

# Label matching operators
# =  Exact match
# != Not equal
# =~ Regex match (case-sensitive)
# !~ Not regex match (case-sensitive)

# Example: Get all metrics for production namespace
kube_pod_container_status_ready{namespace="production"}

# Get metrics matching multiple patterns
node_filesystem_size_bytes{mountpoint=~"/(var|etc)"}
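
One detail of the matching operators is easy to miss: regex matchers are anchored, so the pattern must match the entire label value. The sketch below models the four operators in Python to make that visible. The helper names and sample series are hypothetical, and Python's `re` is only an approximation of the RE2 syntax Prometheus uses.

```python
import re


def matches(label_value, op, pattern):
    """Evaluate one label matcher; regex operators match the whole value."""
    if op == "=":
        return label_value == pattern
    if op == "!=":
        return label_value != pattern
    if op == "=~":
        return re.fullmatch(pattern, label_value) is not None
    if op == "!~":
        return re.fullmatch(pattern, label_value) is None
    raise ValueError(f"unknown operator: {op}")


def select(series, matchers):
    """Keep series whose labels satisfy every matcher; a missing label acts as ''."""
    return [
        labels for labels in series
        if all(matches(labels.get(name, ""), op, pattern) for name, op, pattern in matchers)
    ]


series = [
    {"mountpoint": "/var"},
    {"mountpoint": "/var/lib"},
    {"mountpoint": "/etc"},
]
selected = select(series, [("mountpoint", "=~", "/(var|etc)")])
print([s["mountpoint"] for s in selected])  # ['/var', '/etc']
```

`/var/lib` is excluded because the pattern does not cover the full value; a pattern such as `/(var|etc).*` would include it.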

Step 7

PromQL Functions

PromQL provides rich functions for time series manipulation:

Common functions:

rate(): Per-second average over a range
increase(): Total increase over a range
avg(): Average across series
sum(): Sum across series
count(): Count of series
histogram_quantile(): Calculate percentiles

# Rate calculations (most common)
rate(http_requests_total[5m])
# Requests per second over last 5 minutes

rate(node_cpu_seconds_total[5m])
# CPU seconds consumed per second, per core and mode

# Increase (total count)
increase(http_requests_total[1h])
# Total requests in last hour

# Aggregations
sum(rate(http_requests_total[5m]))
# Total RPS across all instances

sum by (job) (rate(http_requests_total[5m]))
# RPS grouped by job

avg by (instance) (node_memory_MemAvailable_bytes)
# Average available memory per instance

# Percentiles (for histograms)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 95th percentile request duration

# Derived metrics
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Memory usage percentage

# Comparisons
up == 0
# Find down targets

# Time-based functions
time()
# Current Unix timestamp at evaluation time

time() - timestamp(up)
# Seconds since each target's most recent scrape sample

# Series grouping
sum without (instance) (rate(node_network_receive_bytes_total[5m]))
# Sum network bytes, ignore instance label
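
To build intuition for `rate()` and `increase()`, the following sketch computes both from raw counter samples, including the counter-reset handling that makes them safe to use across process restarts. It is a simplified model rather than Prometheus's implementation: it divides by the time between the first and last sample and skips the extrapolation to the window boundaries that Prometheus applies. The function names and sample data are hypothetical.

```python
def counter_increase(samples):
    """Total increase of a counter; samples are (timestamp_seconds, value) in time order."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; the new value counts in full.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Counter scraped every 15s; the process restarts between t=30 and t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(counter_increase(samples))       # 100.0
print(round(simple_rate(samples), 3))  # 1.667
```

A naive `last - first` would give -60 here, which is why counters should be queried through `rate()` or `increase()` rather than subtracted directly.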

Step 8

Recording Rules

Recording rules pre-compute complex queries and store them as new time series. Useful for performance and to avoid repetitive queries.

Benefits:

Pre-compute expensive queries
Reuse common expressions
Reduce query complexity
Improve alert reliability

# recording_rules.yml
groups:
  - name: app_recording_rules
    interval: 30s
    rules:
      # Pre-compute RPS for all endpoints
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (endpoint)

      # Pre-compute error rate
      - record: http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)

      # Pre-compute memory usage
      - record: instance:memory_usage:ratio
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

      # Pre-compute CPU usage
      - record: instance:cpu_usage:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # 95th percentile latency
      - record: api:latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Alert rules
  - name: app_alert_rules
    rules:
      # High error rate (errors as a fraction of all requests)
      - alert: HighErrorRate
        expr: http_errors:rate5m / http_requests:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for endpoint {{ $labels.endpoint }}"

      # High memory usage
      - alert: HighMemoryUsage
        expr: instance:memory_usage:ratio > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }} is down"
          description: "Target {{ $labels.instance }} has been down for more than 2 minutes"
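
The `for:` clause in the rules above delays an alert until its expression has been true continuously for the given duration. The sketch below models that behaviour as a small state machine over successive rule evaluations. It is an illustration under assumptions (the function name, the 60-second evaluation spacing, and the sample outcomes are hypothetical), not Prometheus's own code.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp_seconds, condition_is_true) pairs to alert states."""
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= for_seconds else "pending")
    return states


# 'up == 0' evaluated once a minute with 'for: 2m' (120 seconds).
evaluations = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
print(alert_states(evaluations, 120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

The brief recovery at t=120 restarts the clock, so a flapping target does not page anyone until it stays down for the full two minutes.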

Step 9

Alertmanager Configuration

Alertmanager handles alerts from Prometheus: deduplication, grouping, silencing, and routing to notification channels.

Features:

Deduplication: Merge duplicate alerts
Grouping: Combine related alerts
Inhibition: Suppress alerts that are implied by another firing alert
Routing: Route alerts to different receivers
Silencing: Temporary alert suppression

# alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "password"

# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# Alert routing
route:
  group_by: ['alertname', 'severity', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts go to pagerduty
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    
    # Warning alerts to slack
    - match:
        severity: warning
      receiver: slack-warnings
    
    # Database alerts to specific channel
    - match:
        alertname: DatabaseConnectionError
      receiver: db-team
    
    # Kubernetes alerts
    - match:
        kubernetes: 'true'
      receiver: k8s-team

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'db-team'
    email_configs:
      - to: 'dba@example.com'

  - name: 'k8s-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#k8s-alerts'

Step 10

Node Exporter Installation

Node Exporter exposes hardware and OS metrics. Essential for infrastructure monitoring.

Metrics collected:

CPU, memory, disk, network
Filesystem usage
Load averages
Process information
Hardware sensors

# Option 1: Docker
docker run -d \
  --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host

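# Option 2: Binary (Linux)
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.1.linux-amd64.tar.gz
cd node_exporter-1.8.1.linux-amd64

# Install the binary and create a dedicated unprivileged user
sudo mkdir -p /opt/node_exporter
sudo cp node_exporter /opt/node_exporter/
sudo useradd --system --user-group --no-create-home --shell /bin/false node_exporter

# Create systemd service (written as root via tee)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/opt/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter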

# Verify
curl http://localhost:9100/metrics | head -20

# Common metrics to query
node_cpu_seconds_total{mode!="idle"}
node_memory_MemTotal_bytes
node_filesystem_avail_bytes
node_network_receive_bytes_total
node_load1
node_boot_time_seconds
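
The `/metrics` output fetched with curl above is plain text, one sample per line, which makes it simple to inspect programmatically. The sketch below parses a small excerpt into `(name, labels, value)` tuples. It is a deliberately simplified parser: it ignores `# HELP` and `# TYPE` lines, does not handle escaped quotes in label values, and skips lines carrying an optional timestamp. The sample text and the function name `parse_metrics` are illustrative.

```python
import re

# metric_name{label="value",...} value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # unsupported line shape in this simplified parser
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


excerpt = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
"""

for sample in parse_metrics(excerpt):
    print(sample)
```

For anything beyond quick inspection, a maintained parser such as the one in the official Python client library is the safer choice.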

Step 11

Prometheus for Kubernetes

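Prometheus is the de facto standard for Kubernetes monitoring. The prometheus-community/prometheus Helm chart installs the server together with its most common companions.

What gets deployed:

Prometheus server
Alertmanager
Node Exporter DaemonSet
kube-state-metrics
Pushgateway

ServiceMonitor and PodMonitor CRDs belong to the Prometheus Operator, which ships with the separate kube-prometheus-stack chart rather than this one.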

# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with default values
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace

# Or install with custom values (use instead of the command above)
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace \
  --set server.persistentVolume.enabled=true \
  --set server.retention=15d \
  --set alertmanager.enabled=true \
  --values values.yaml

# Example values.yaml
server:
  persistentVolume:
    enabled: true
    size: 50Gi
  retention: 15d
  resources:
    limits:
      memory: 2Gi
    requests:
      memory: 1Gi

alertmanager:
  enabled: true
  persistence:
    enabled: true
    size: 1Gi

# Bundled sub-charts can be toggled individually
kube-state-metrics:
  enabled: true

prometheus-node-exporter:
  enabled: true

# Access via port-forward (the chart names the service <release>-server and exposes port 80)
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Access via Ingress (enable in values.yaml)
# server:
#   ingress:
#     enabled: true
#     hosts:
#       - prometheus.example.com

Step 12

Common Exporters

The Prometheus ecosystem includes exporters for virtually any system or service:

Popular exporters:

node_exporter: Hardware/OS metrics
mysqld_exporter: MySQL database metrics
postgres_exporter: PostgreSQL metrics
redis_exporter: Redis metrics
nginx_exporter: NGINX metrics
cadvisor: Container metrics
blackbox_exporter: Probing/health checks
snmp_exporter: SNMP device metrics

# prometheus.yml - Exporter scrape configs

scrape_configs:
  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
    metrics_path: '/metrics'

  # PostgreSQL Exporter  
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Nginx Exporter
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Blackbox Exporter (probing)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a 2xx response
    static_configs:
      - targets:
          - https://prometheus.io
          - https://github.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

# Docker run examples for exporters:
# docker run -d -p 9104:9104 -e DATA_SOURCE_NAME="user:pass@tcp(mysql:3306)/" prom/mysqld-exporter
# docker run -d -p 9187:9187 -e DATA_SOURCE_URI="postgres:5432/default?sslmode=disable" prometheuscommunity/postgres-exporter
# docker run -d -p 9121:9121 -e REDIS_ADDR="redis:6379" oliver006/redis_exporter
# docker run -d -p 9113:9113 -e SCRAPE_URI="http://nginx:80/stub_status" nginx/nginx-prometheus-exporter
# docker run -d -p 8080:8080 --volume /var/run/docker.sock:/var/run/docker.sock gcr.io/cadvisor/cadvisor

Step 13

Grafana Integration

Grafana provides powerful visualization for Prometheus data. The combination is the industry standard for observability.

Features:

Beautiful dashboards
Alerting rules
Multi-data source support
Variable interpolation
Templated queries

# Install Grafana with Prometheus data source

# Docker Compose example
cat > docker-compose.yml << 'EOF'
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus

volumes:
  prom_data:
  grafana_data:
EOF

# Auto-provision Prometheus data source in Grafana
mkdir -p provisioning/datasources
cat > provisioning/datasources/datasources.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
EOF

docker-compose up -d

# Access Grafana at http://localhost:3000
# Login: admin/admin

# Import dashboards from Grafana.com:
# Node Exporter Full: https://grafana.com/grafana/dashboards/1860
# Prometheus Metrics: https://grafana.com/grafana/dashboards/2
# Kubernetes Cluster: https://grafana.com/grafana/dashboards/315

# Save dashboard JSON and import via:
# Grafana UI → Dashboards → Import

Step 14

High Availability Setup

For production, configure Prometheus for high availability:

HA considerations:

Multiple Prometheus instances
Remote write for backup
Thanos for long-term storage
Federation for horizontal scaling

# HA configuration with remote write
global:
  scrape_interval: 15s
  evaluation_interval: 15s

remote_write:
  - url: http://thanos-receive-1:19291/api/v1/write
    timeout: 30s
    send_exemplars: true
  - url: http://thanos-receive-2:19291/api/v1/write
    timeout: 30s

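# Federation for scaling: scrape selected series from another Prometheus
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job="prometheus"}'
    static_configs:
      - targets: ["prometheus-2:9090"]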

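# Thanos sidecar deployment
cat > thanos-sidecar.yml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-sidecar
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
          volumeMounts:
            - name: data
              mountPath: /prometheus
        - name: thanos-sidecar
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - 'sidecar'
            - '--tsdb.path=/prometheus'
            - '--prometheus.url=http://localhost:9090'
            - '--objstore.config-file=/etc/thanos/objstore.yml'
          # objstore.yml must also be mounted at /etc/thanos (for example from a Secret)
          volumeMounts:
            - name: data
              mountPath: /prometheus
      volumes:
        - name: data
          emptyDir: {}
EOF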

# Object store config for S3
cat > objstore.yml << 'EOF'
type: S3
config:
  bucket: prometheus-data
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  insecure: false
EOF
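
Federation, listed among the HA considerations above, works by having one Prometheus server scrape the `/federate` endpoint of another, with one or more `match[]` selectors choosing which series to pull. The sketch below builds such a URL so the selection can be checked with a browser or curl before it goes into a scrape job. The host name and selectors are hypothetical examples, as is the function name `federate_url`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


selectors = ['{job="node"}', '{__name__=~"job:.*"}']
url = federate_url("http://prometheus-2:9090", selectors)
print(url)

# Decoding the query string recovers the original selectors.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Keeping the selectors narrow, for example limiting them to aggregated recording-rule series, keeps the federated scrape small.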

Step 15
Resources & Next Steps
Documentation:
Community:
Tools:
- Grafana - Visualization
- Alertmanager - Alerting
- Thanos - Long-term storage
- VictoriaMetrics - Alternative TSDB
Next guides:
- Prometheus for Kubernetes with kube-prometheus-stack
- Alertmanager advanced routing and silencing
- Thanos for multi-tenant Prometheus
- Recording rules and performance optimization
```
GitHub: https://github.com/prometheus/prometheus
Official site: https://prometheus.io
Documentation: https://prometheus.io/docs/
Prometheus Slack: https://slack.prometheus.io/
Grafana: https://grafana.com/
Thanos: https://thanos.io/
VictoriaMetrics: https://victoriametrics.com/
```