
Elasticsearch - Distributed Search and Analytics Engine

Install and configure Elasticsearch, the powerful open-source distributed RESTful search and analytics engine built on Apache Lucene - covering installation, indexing, search queries, aggregations, cluster management, and production best practices.

  1. Step 1

    Overview

    Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. Originally developed by Shay Banon in 2010, it has become the leading solution for full-text search, log analytics, and real-time data analysis. With 70,000+ GitHub stars, Elasticsearch powers search features for companies like Netflix, Uber, GitHub, and Wikipedia.

    Key capabilities:

    • Distributed architecture: Horizontal scaling across multiple nodes with automatic sharding and replication
    • Full-text search: Advanced text analysis with 40+ language analyzers and custom tokenizers
    • Real-time indexing: Near real-time data ingestion and search with sub-second latency
    • RESTful API: Simple JSON-based HTTP interface for all operations
    • Aggregations: Powerful analytics framework for metrics, bucketing, and pipeline aggregations
    • Schema-free JSON: Dynamic mapping with automatic field type detection
    • Multi-tenancy: Index-level isolation for multiple datasets in one cluster

    Why Elasticsearch:

    • Battle-tested: Powers some of the world's largest search deployments (billions of documents)
    • ELK Stack: Native integration with Logstash and Kibana for complete observability
    • Rich ecosystem: Clients for Java, Python, JavaScript, Go, Ruby, and 10+ languages
    • Scalable: Scales from single-node development to multi-datacenter production clusters
    • Versatile: Search engines, log analytics, metrics, security analytics, business intelligence
    Official site: https://www.elastic.co/elasticsearch
    GitHub: https://github.com/elastic/elasticsearch (70K+ stars)
    Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
    Download: https://www.elastic.co/downloads/elasticsearch
  2. Step 2

    Technology Stack

    Elasticsearch is built on a sophisticated stack of Java technologies optimized for distributed systems and high-performance search.

    Core platform:

    • Java 21+ (bundled with distribution)
    • Apache Lucene (text search library and inverted index engine)
    • Netty for async HTTP and transport layer
    • Jackson for JSON serialization
    • Log4j2 for structured logging

    Distributed systems:

    • Custom cluster coordination (Zen Discovery before 7.0; since then a quorum-based coordination layer with Raft-like master election)
    • Segment-based storage with automatic merge policies
    • Sequence numbers and primary terms for document versioning and optimistic concurrency control
    • Cross-cluster replication for disaster recovery

    Data structures:

    • Inverted index for text search (term → document IDs)
    • Doc values for sorting and aggregations (column-oriented storage)
    • BKD trees for numeric and geo-spatial indexing
    • Finite state transducers (FST) for efficient term lookups

    Query execution:

    • Two-phase distributed search (query then fetch)
    • Relevance scoring with BM25 (the default since 5.0, when it replaced classic TF-IDF)
    • Vector search with k-NN for semantic/machine learning use cases
    • Query cache and request cache for performance
    Architecture:
    ├── Core: Java 21, Apache Lucene
    ├── Network: Netty (HTTP + Transport)
    ├── Serialization: Jackson (JSON)
    ├── Coordination: quorum-based (Raft-like)
    └── Storage: Segment-based with LSM-tree patterns
    
    Data structures:
    ├── Inverted index (text search)
    ├── Doc values (aggregations)
    ├── BKD trees (numerics, geo)
    └── FST (term lookups)
    
    Query:
    ├── Distributed search (query → fetch)
    ├── Scoring: BM25 (default; replaced TF-IDF)
    └── Vector: k-NN, ANN algorithms
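
The two-phase "query then fetch" flow can be pictured with a small simulation. This is a toy sketch rather than Elasticsearch code: the shards are plain Python dicts, the scores and documents are made up, and `query_phase`, `fetch_phase`, and `search` are hypothetical names chosen for this illustration.

```python
import heapq

# Toy "shards": each maps doc ID -> (score, stored document). Hypothetical data.
SHARDS = [
    {"a1": (2.4, {"name": "Wireless Headphones"}), "a2": (0.7, {"name": "USB-C Cable"})},
    {"b1": (1.9, {"name": "Wireless Mouse"}), "b2": (3.1, {"name": "Wireless Earbuds"})},
]

def query_phase(shard_id, shard, size):
    """Each shard returns only lightweight (score, shard_id, doc_id) tuples for its local top hits."""
    hits = [(score, shard_id, doc_id) for doc_id, (score, _doc) in shard.items()]
    return heapq.nlargest(size, hits)

def fetch_phase(global_hits):
    """The coordinating node fetches full documents only for the global winners."""
    return [SHARDS[shard_id][doc_id][1] for _score, shard_id, doc_id in global_hits]

def search(size):
    per_shard = [query_phase(i, shard, size) for i, shard in enumerate(SHARDS)]
    merged = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    return fetch_phase(merged)

print(search(2))
# [{'name': 'Wireless Earbuds'}, {'name': 'Wireless Headphones'}]
```

The point of the split is that only IDs and scores cross the network in the first phase; full documents are loaded just for the hits that survive the global merge.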
  3. Step 3

    Quick Installation Options

    Several installation methods are available, depending on your environment and use case.

    Installation options:

    • Docker: Fastest for development and testing
    • Binary archive: Direct download for any platform
    • Package managers: APT and YUM for Linux servers, Homebrew for local macOS development
    • Kubernetes: Elastic Cloud on Kubernetes (ECK) operator
    • Elastic Cloud: Fully managed SaaS offering

    System requirements:

    • 2+ GB RAM (4+ GB recommended for production)
    • 64-bit OS (Linux, macOS, Windows)
    • Java bundled with distribution (no separate install needed)
    • Sufficient disk space for indices (varies by use case)
    # Option 1: Docker (quick start)
    docker run -d \
      --name elasticsearch \
      -p 9200:9200 -p 9300:9300 \
      -e "discovery.type=single-node" \
      -e "xpack.security.enabled=false" \
      docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    
    # Verify installation
    curl http://localhost:9200
    # Output: cluster info JSON with version, tagline
    
    # Option 2: Binary archive (Linux x86_64 shown; other platforms have their own archives)
    wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.13.0-linux-x86_64.tar.gz
    tar -xzf elasticsearch-8.13.0-linux-x86_64.tar.gz
    cd elasticsearch-8.13.0/
    ./bin/elasticsearch
    
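    # Option 3: Homebrew (macOS)
    # Note: check which version the tap installs; it may lag behind the 8.x release used elsewhere in this guide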
    brew tap elastic/tap
    brew install elastic/tap/elasticsearch-full
    brew services start elastic/tap/elasticsearch-full
    
    # Option 4: APT (Debian/Ubuntu)
    wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
    sudo apt update && sudo apt install elasticsearch
    sudo systemctl enable elasticsearch
    sudo systemctl start elasticsearch
    
    # Option 5: YUM (RHEL/CentOS)
    sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
    sudo tee /etc/yum.repos.d/elasticsearch.repo > /dev/null << 'EOF'
    [elasticsearch]
    name=Elasticsearch repository for 8.x packages
    baseurl=https://artifacts.elastic.co/packages/8.x/yum
    gpgcheck=1
    gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
    enabled=1
    autorefresh=1
    type=rpm-md
    EOF
    sudo yum install elasticsearch
    sudo systemctl enable elasticsearch
    sudo systemctl start elasticsearch
    
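    # Verify (works as-is when security is disabled, as in the Docker example)
    curl http://localhost:9200
    # 8.x package installs enable TLS and authentication by default; there, use:
    # sudo curl --cacert /etc/elasticsearch/certs/http_ca.crt -u elastic https://localhost:9200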
  4. Step 4

    Basic Configuration

    Elasticsearch reads its configuration from the config/ directory. The key files are elasticsearch.yml (main config, in YAML) and jvm.options (JVM flags, one per line).

    Essential settings:

    • cluster.name: Cluster identifier (nodes with same name join together)
    • node.name: Human-readable node identifier
    • path.data: Where indices are stored (critical for backups)
    • path.logs: Log file location
    • network.host: Network binding address
    • http.port: HTTP API port (default 9200)
    • discovery.seed_hosts: Bootstrap cluster discovery
    • cluster.initial_master_nodes: Initial master-eligible nodes
    # config/elasticsearch.yml - Basic single-node configuration
    
    cluster.name: my-application
    node.name: node-1
    
    # Data and logs paths
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    
    # Network settings (0.0.0.0 listens on all interfaces; prefer localhost unless remote access is needed)
    network.host: 0.0.0.0
    http.port: 9200
    
    # Single-node cluster (development)
    discovery.type: single-node
    
    # Security (disable for development, enable for production)
    xpack.security.enabled: false
    xpack.security.enrollment.enabled: false
    
    # --- Production multi-node configuration ---
    
    # config/elasticsearch.yml on each node of a 3-node cluster
    cluster.name: production-cluster
    node.name: ${HOSTNAME}  # Set via environment variable
    
    # Node roles (can combine multiple)
    node.roles: [ master, data, ingest ]
    
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    
    network.host: 0.0.0.0
    http.port: 9200
    transport.port: 9300
    
    # Cluster discovery
    discovery.seed_hosts:
      - es-node1:9300
      - es-node2:9300
      - es-node3:9300
    
    # Only used when the cluster bootstraps for the first time; remove once the cluster has formed
    cluster.initial_master_nodes:
      - es-node1
      - es-node2
      - es-node3
    
    # Security
    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.verification_mode: certificate
    xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
  5. Step 5

    JVM and Memory Configuration

    Elasticsearch is a Java application, so JVM tuning is critical for performance. The heap size is the most important setting.

    Heap size rules:

    • Set -Xms and -Xmx to the same value (prevents heap resizing)
    • Never exceed 50% of physical RAM (leave space for OS file cache)
    • Never exceed ~31 GB (compressed object pointers threshold)
    • For a 64 GB server, set heap to 31 GB
    • For a 16 GB server, set heap to 8 GB
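    # config/jvm.options.d/heap.options - JVM heap settings
    # (custom flags belong in a file under jvm.options.d/ rather than in jvm.options itself)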
    
    # Set heap size (example: 8 GB for a 16 GB server)
    -Xms8g
    -Xmx8g
    
    # Production recommended settings
    -XX:+UseG1GC
    -XX:G1ReservePercent=25
    -XX:InitiatingHeapOccupancyPercent=30
    
    # Heap dump on out-of-memory
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # GC logging (helpful for troubleshooting)
    -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    
    # Environment variable approach (overrides jvm.options)
    export ES_JAVA_OPTS="-Xms8g -Xmx8g"
    ./bin/elasticsearch
    
    # Docker environment variable
    docker run -d \
      -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" \
      docker.elastic.co/elasticsearch/elasticsearch:8.13.0
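
The heap size rules in this step reduce to simple arithmetic. The sketch below encodes them, assuming whole gigabytes; `recommended_heap_gb` and `jvm_heap_flags` are illustrative names, not part of any Elasticsearch tooling.

```python
def recommended_heap_gb(physical_ram_gb, compressed_oops_limit_gb=31):
    """Half of physical RAM, capped at the ~31 GB compressed-pointer threshold."""
    return min(physical_ram_gb // 2, compressed_oops_limit_gb)

def jvm_heap_flags(physical_ram_gb):
    """Render matching -Xms/-Xmx flags so the heap never resizes."""
    heap = recommended_heap_gb(physical_ram_gb)
    return f"-Xms{heap}g -Xmx{heap}g"

for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {jvm_heap_flags(ram)}")
# 16 GB RAM -> -Xms8g -Xmx8g
# 64 GB RAM -> -Xms31g -Xmx31g
# 128 GB RAM -> -Xms31g -Xmx31g
```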
  6. Step 6

    First Steps: Creating an Index and Adding Documents

    Elasticsearch stores data in indices (roughly comparable to tables in a relational database) that contain documents (comparable to rows). Documents are JSON objects. Let's create an index and add some documents.

    Key concepts:

    • Index: Collection of documents with similar characteristics
    • Document: Basic unit of information (JSON)
    • Field: Key-value pair in a document
    • Mapping: Schema definition (field types and settings)
    • Shard: Index subdivision for horizontal scaling
    • Replica: Shard copy for high availability
    # Create an index with explicit mapping
    curl -X PUT "localhost:9200/products" -H 'Content-Type: application/json' -d'
    {
      "mappings": {
        "properties": {
          "name": { "type": "text" },
          "description": { "type": "text" },
          "price": { "type": "float" },
          "category": { "type": "keyword" },
          "tags": { "type": "keyword" },
          "in_stock": { "type": "boolean" },
          "created_at": { "type": "date" }
        }
      }
    }'
    
    # Add a document (POST generates auto ID)
    curl -X POST "localhost:9200/products/_doc" -H 'Content-Type: application/json' -d'
    {
      "name": "Wireless Headphones",
      "description": "High-quality noise-cancelling headphones",
      "price": 299.99,
      "category": "electronics",
      "tags": ["audio", "wireless", "bluetooth"],
      "in_stock": true,
      "created_at": "2024-01-15T10:30:00Z"
    }'
    
    # Add a document with specific ID
    curl -X PUT "localhost:9200/products/_doc/1" -H 'Content-Type: application/json' -d'
    {
      "name": "USB-C Cable",
      "description": "Fast charging cable 2m",
      "price": 19.99,
      "category": "accessories",
      "tags": ["cable", "usb-c"],
      "in_stock": true,
      "created_at": "2024-01-16T14:20:00Z"
    }'
    
    # Bulk indexing (faster for multiple documents; the body is newline-delimited JSON
    # and the last line must also end with a newline)
    curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @- << 'EOF'
    { "index": { "_index": "products" } }
    { "name": "Laptop Stand", "price": 49.99, "category": "accessories", "in_stock": true }
    { "index": { "_index": "products" } }
    { "name": "Mechanical Keyboard", "price": 149.99, "category": "electronics", "in_stock": false }
    EOF
    
    # Retrieve a document by ID
    curl -X GET "localhost:9200/products/_doc/1"
    
    # Get index mapping
    curl -X GET "localhost:9200/products/_mapping"
    
    # Get index stats
    curl -X GET "localhost:9200/products/_stats"
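
The bulk request body above is newline-delimited JSON: one action line followed by one document line, repeated, with a trailing newline. A minimal sketch of building that body with only the standard library follows; `to_bulk_ndjson` is a hypothetical helper name, and the sample documents are the ones from the curl example.

```python
import json

def to_bulk_ndjson(index, docs):
    """Build a _bulk body: an action line plus a source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the final line to end with a newline as well.
    return "\n".join(lines) + "\n"

body = to_bulk_ndjson("products", [
    {"name": "Laptop Stand", "price": 49.99, "category": "accessories", "in_stock": True},
    {"name": "Mechanical Keyboard", "price": 149.99, "category": "electronics", "in_stock": False},
])
print(body, end="")
```

The resulting string can be sent as the request body to the `_bulk` endpoint by any HTTP client.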
  7. Step 7

    Search Queries: From Simple to Complex

    Elasticsearch provides a rich Query DSL (Domain Specific Language) for searching documents. Queries range from simple text matches to complex boolean logic.

    Query types:

    • Match: Full-text search with analysis
    • Term: Exact match (no analysis)
    • Range: Numeric or date ranges
    • Bool: Combine queries with AND/OR/NOT logic
    • Wildcard: Pattern matching with * and ?
    • Fuzzy: Approximate matching (typo tolerance)
    • Nested: Query nested objects
    • Geo: Geographic queries
    # Simple match query (full-text search)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match": {
          "description": "wireless headphones"
        }
      }
    }'
    
    # Match with size and pagination
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "from": 0,
      "size": 10,
      "query": {
        "match": {
          "name": "cable"
        }
      }
    }'
    
    # Multi-match (search across multiple fields)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "multi_match": {
          "query": "wireless",
          "fields": ["name", "description"]
        }
      }
    }'
    
    # Term query (exact match, no analysis)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "term": {
          "category": "electronics"
        }
      }
    }'
    
    # Range query
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "range": {
          "price": {
            "gte": 50,
            "lte": 200
          }
        }
      }
    }'
    
    # Bool query (complex logic)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "description": "wireless" } }
          ],
          "filter": [
            { "term": { "in_stock": true } },
            { "range": { "price": { "lte": 500 } } }
          ],
          "must_not": [
            { "term": { "category": "refurbished" } }
          ],
          "should": [
            { "match": { "tags": "bluetooth" } }
          ],
          "minimum_should_match": 1
        }
      }
    }'
    
    # Wildcard query (patterns with a leading * must scan many terms and can be slow)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "wildcard": {
          "name": "*phone*"
        }
      }
    }'
    
    # Fuzzy query (typo tolerance)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "fuzzy": {
          "name": {
            "value": "hedphones",
            "fuzziness": "AUTO"
          }
        }
      }
    }'
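
Query DSL bodies are plain JSON, so applications usually assemble them from parameters rather than hard-coding them. The sketch below builds a bool query similar to the one above; `product_query` and its parameters are names invented for this example, and the field names assume the `products` mapping from the earlier step.

```python
import json

def product_query(text, max_price=None, in_stock_only=True):
    """Build a bool query: a scored full-text match plus unscored filter clauses."""
    filters = []
    if in_stock_only:
        filters.append({"term": {"in_stock": True}})
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"description": text}}],
                "filter": filters,
            }
        }
    }

print(json.dumps(product_query("wireless", max_price=500), indent=2))
```

Clauses under `filter` do not contribute to the relevance score, which is why yes/no conditions such as stock status and price limits go there.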
  8. Step 8

    Aggregations: Analytics and Metrics

    Aggregations provide analytics over your data. Think of them as SQL GROUP BY on steroids. There are three types: metric (calculate metrics), bucket (group documents), and pipeline (aggregate aggregation results).

    Common aggregations:

    • Metrics: avg, sum, min, max, stats, cardinality, percentiles
    • Bucket: terms (group by field), date_histogram, range, filters
    • Pipeline: derivative, cumulative_sum, moving_fn (the older moving_avg was removed in 8.0)
    # Terms aggregation (group by category)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        }
      }
    }'
    
    # Average price per category
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          },
          "aggs": {
            "avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }'
    
    # Stats aggregation (min, max, avg, sum, count)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggs": {
        "price_stats": {
          "stats": {
            "field": "price"
          }
        }
      }
    }'
    
    # Date histogram (time-series data)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggs": {
        "products_over_time": {
          "date_histogram": {
            "field": "created_at",
            "calendar_interval": "month"
          },
          "aggs": {
            "total_revenue": {
              "sum": {
                "field": "price"
              }
            }
          }
        }
      }
    }'
    
    # Percentiles aggregation
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggs": {
        "price_percentiles": {
          "percentiles": {
            "field": "price",
            "percents": [50, 75, 90, 95, 99]
          }
        }
      }
    }'
    
    # Range aggregation
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggs": {
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              { "to": 50 },
              { "from": 50, "to": 100 },
              { "from": 100 }
            ]
          }
        }
      }
    }'
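
Aggregation results come back under an `aggregations` key, with one entry per named aggregation and sub-aggregations nested inside each bucket. The sketch below flattens the "average price per category" result; `SAMPLE_RESPONSE` is a hand-written illustration of the response shape, not output captured from a real cluster, and `avg_price_by_category` is a hypothetical helper.

```python
# Hand-written example of the response shape (trimmed to the relevant keys).
SAMPLE_RESPONSE = {
    "hits": {"total": {"value": 4, "relation": "eq"}, "hits": []},
    "aggregations": {
        "categories": {
            "buckets": [
                {"key": "accessories", "doc_count": 2, "avg_price": {"value": 34.99}},
                {"key": "electronics", "doc_count": 2, "avg_price": {"value": 224.99}},
            ]
        }
    },
}

def avg_price_by_category(response):
    """Map each terms bucket key to its nested avg_price value."""
    buckets = response["aggregations"]["categories"]["buckets"]
    return {b["key"]: round(b["avg_price"]["value"], 2) for b in buckets}

print(avg_price_by_category(SAMPLE_RESPONSE))
# {'accessories': 34.99, 'electronics': 224.99}
```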
  9. Step 9

    Index Management: Mappings, Aliases, and Templates

    Effective index management is crucial for performance and maintainability. This includes defining mappings, using aliases for zero-downtime reindexing, and templates for consistent settings.

    Best practices:

    • Define explicit mappings (don't rely on dynamic mapping for production)
    • Use index aliases for production indices
    • Create index templates for time-series data
    • Set appropriate shard counts (over-sharding hurts performance)
    • Use index lifecycle management (ILM) for data retention
    # Update mapping (add new field to existing index)
    curl -X PUT "localhost:9200/products/_mapping" -H 'Content-Type: application/json' -d'
    {
      "properties": {
        "manufacturer": {
          "type": "keyword"
        },
        "rating": {
          "type": "float"
        }
      }
    }'
    
    # Create index alias
    curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
    {
      "actions": [
        {
          "add": {
            "index": "products",
            "alias": "products-latest"
          }
        }
      ]
    }'
    
    # Atomic alias switch (zero-downtime reindex)
    curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
    {
      "actions": [
        { "remove": { "index": "products-v1", "alias": "products" } },
        { "add": { "index": "products-v2", "alias": "products" } }
      ]
    }'
    
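    # Create index template
    # (8.x ships built-in templates for logs-*-* with priority 100; a higher priority keeps this one in effect)
    curl -X PUT "localhost:9200/_index_template/logs_template" -H 'Content-Type: application/json' -d'
    {
      "index_patterns": ["logs-*"],
      "priority": 501,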
      "template": {
        "settings": {
          "number_of_shards": 1,
          "number_of_replicas": 1,
          "refresh_interval": "5s"
        },
        "mappings": {
          "properties": {
            "timestamp": { "type": "date" },
            "level": { "type": "keyword" },
            "message": { "type": "text" },
            "service": { "type": "keyword" }
          }
        }
      }
    }'
    
    # Now any index matching logs-* gets these settings
    curl -X PUT "localhost:9200/logs-2024-01-15"
    
    # Reindex data from old index to new
    curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
    {
      "source": {
        "index": "products-old"
      },
      "dest": {
        "index": "products-new"
      }
    }'
    
    # Delete index
    curl -X DELETE "localhost:9200/products-old"
    
    # Get all indices
    curl -X GET "localhost:9200/_cat/indices?v"
    
    # Get cluster health
    curl -X GET "localhost:9200/_cluster/health?pretty"
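
The zero-downtime reindex pattern above boils down to: create the next versioned index, reindex into it, then move the alias in one atomic `_aliases` call. The sketch below only generates the names and request body; `next_version` and `alias_swap_actions` are hypothetical helpers, and the `-vN` suffix is the naming convention assumed from the example.

```python
import re

def next_version(index_name):
    """'products-v1' -> 'products-v2' (assumes a trailing -v<number> suffix)."""
    match = re.fullmatch(r"(.+)-v(\d+)", index_name)
    if match is None:
        raise ValueError(f"{index_name!r} does not end in -v<number>")
    return f"{match.group(1)}-v{int(match.group(2)) + 1}"

def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases; both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

new_index = next_version("products-v1")
print(alias_swap_actions("products", "products-v1", new_index))
```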
  10. Step 10

    Production Cluster Setup

    Production deployments require a multi-node cluster for high availability and scalability. A typical setup includes dedicated master, data, and ingest nodes.

    Node roles:

    • Master: Cluster state management (lightweight, 3+ nodes for quorum)
    • Data: Store indices and execute queries (most resources)
    • Ingest: Pre-process documents (optional)
    • Coordinating: Route requests (no data, no master)
    • ML: Machine learning (optional)

    Minimum production cluster: 3 master-eligible nodes + 2+ data nodes

    # Master node config (es-master-1)
    cluster.name: production
    node.name: master-1
    node.roles: [ master ]
    
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    
    network.host: 0.0.0.0
    http.port: 9200
    transport.port: 9300
    
    discovery.seed_hosts:
      - master-1:9300
      - master-2:9300
      - master-3:9300
    
    cluster.initial_master_nodes:
      - master-1
      - master-2
      - master-3
    
    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    
    ---
    
    # Data node config (es-data-1)
    cluster.name: production
    node.name: data-1
    node.roles: [ data, ingest ]
    
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    
    network.host: 0.0.0.0
    http.port: 9200
    transport.port: 9300
    
    discovery.seed_hosts:
      - master-1:9300
      - master-2:9300
      - master-3:9300
    
    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    
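    # Hot/warm architecture: since 7.10, data tiers are normally assigned through roles,
    # e.g. node.roles: [ data_hot, data_content, ingest ]; a custom attribute is the older approach
    node.attr.data: hot  # or warm, cold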
    
    ---
    
    # Coordinating-only node (load balancer)
    cluster.name: production
    node.name: coordinator-1
    node.roles: [ ]  # No roles = coordinating only
    
    network.host: 0.0.0.0
    http.port: 9200
    transport.port: 9300
    
    discovery.seed_hosts:
      - master-1:9300
      - master-2:9300
      - master-3:9300
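
The "3+ master-eligible nodes for quorum" guideline follows from majority arithmetic. The sketch below shows only that arithmetic; Elasticsearch manages its voting configuration itself, and the function names are made up for this example.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)

for n in (1, 2, 3, 5):
    print(f"{n} master-eligible: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 1 master-eligible: quorum=1, tolerates 0 failure(s)
# 2 master-eligible: quorum=2, tolerates 0 failure(s)
# 3 master-eligible: quorum=2, tolerates 1 failure(s)
# 5 master-eligible: quorum=3, tolerates 2 failure(s)
```

Two master-eligible nodes tolerate no more failures than one, which is why three is the practical minimum.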
  11. Step 11

    Security: Authentication and TLS

    Elasticsearch security features (X-Pack Security) provide authentication, authorization, and encryption. Essential for production deployments.

    Security layers:

    • TLS: Encrypt HTTP and transport communication
    • Authentication: Built-in, LDAP, Active Directory, SAML, OpenID Connect
    • Authorization: Role-based access control (RBAC)
    • Audit logging: Track security events
    • Field/document level security: Fine-grained access control
    # Generate certificates for inter-node communication
    cd /usr/share/elasticsearch
    bin/elasticsearch-certutil ca --pem
    # Creates elastic-stack-ca.zip
    
    unzip elastic-stack-ca.zip
    bin/elasticsearch-certutil cert \
      --ca-cert ca/ca.crt \
      --ca-key ca/ca.key \
      --pem \
      --name node1 \
      --dns node1.example.com \
      --ip 192.168.1.10
    
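    # The cert command writes certificate-bundle.zip by default; extract it first
    unzip certificate-bundle.zip
    
    # Copy certificates to config/certs/
    mkdir -p config/certs
    cp node1/node1.crt node1/node1.key ca/ca.crt config/certs/
    chmod 644 config/certs/*.crt
    # Keep the private key readable only by the user that runs Elasticsearch
    chmod 600 config/certs/node1.key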
    
    # Enable security in elasticsearch.yml
    xpack.security.enabled: true
    
    # TLS for transport (inter-node)
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.verification_mode: certificate
    xpack.security.transport.ssl.certificate: certs/node1.crt
    xpack.security.transport.ssl.key: certs/node1.key
    xpack.security.transport.ssl.certificate_authorities: [ "certs/ca.crt" ]
    
    # TLS for HTTP (client connections)
    xpack.security.http.ssl.enabled: true
    xpack.security.http.ssl.certificate: certs/node1.crt
    xpack.security.http.ssl.key: certs/node1.key
    xpack.security.http.ssl.certificate_authorities: [ "certs/ca.crt" ]
    
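    # Set the password for the built-in elastic user
    # (elasticsearch-setup-passwords is deprecated as of 8.0)
    bin/elasticsearch-reset-password -u elastic
    # Or choose the password interactively:
    bin/elasticsearch-reset-password -u elastic -i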
    
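    # Create custom user
    # (-k skips certificate verification and is for local testing only; otherwise pass --cacert config/certs/ca.crt)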
    curl -X POST "https://localhost:9200/_security/user/john" \
      -u elastic:password -k \
      -H 'Content-Type: application/json' -d'
    {
      "password" : "s3cr3t",
      "roles" : [ "kibana_admin", "monitoring_user" ],
      "full_name" : "John Doe",
      "email" : "john@example.com"
    }'
    
    # Create custom role
    curl -X POST "https://localhost:9200/_security/role/products_read" \
      -u elastic:password -k \
      -H 'Content-Type: application/json' -d'
    {
      "indices": [
        {
          "names": [ "products*" ],
          "privileges": [ "read" ]
        }
      ]
    }'
    
    # Test authenticated request
    curl -u john:s3cr3t -k https://localhost:9200/_cluster/health
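
The `-u user:password` flag in the curl calls above sends HTTP Basic authentication. As a rough illustration of what goes over the wire, and of why TLS matters for these requests, the sketch below builds the header with the standard library only; `basic_auth_header` is a name made up for this example.

```python
import base64

def basic_auth_header(username, password):
    """Encode credentials the way HTTP Basic auth expects: base64('user:password')."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

header = basic_auth_header("john", "s3cr3t")
print(header)

# Base64 is an encoding, not encryption: anyone who sees the header can reverse it.
print(base64.b64decode(header["Authorization"].split(" ", 1)[1]).decode("utf-8"))
```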
  12. Step 12

    Monitoring and Observability

    Monitor Elasticsearch health and performance using built-in APIs and the Elastic Stack (formerly ELK Stack).

    Key metrics to monitor:

    • Cluster health (green/yellow/red)
    • Node CPU, memory, disk usage
    • JVM heap usage and GC times
    • Query latency and throughput
    • Indexing rate and latency
    • Shard count and size
    • Rejected threads (thread pool saturation)
    # Cluster health
    curl -X GET "localhost:9200/_cluster/health?pretty"
    # Status: green (all good), yellow (replicas missing), red (primary missing)
    
    # Node stats (detailed metrics)
    curl -X GET "localhost:9200/_nodes/stats?pretty"
    
    # Index stats
    curl -X GET "localhost:9200/products/_stats?pretty"
    
    # Thread pool stats (watch for rejections)
    curl -X GET "localhost:9200/_cat/thread_pool?v&h=name,queue,active,rejected,completed"
    
    # Pending tasks (should be near zero)
    curl -X GET "localhost:9200/_cluster/pending_tasks"
    
    # Hot threads (troubleshoot CPU spikes)
    curl -X GET "localhost:9200/_nodes/hot_threads"
    
    # Recovery status (ongoing shard movements)
    curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"
    
    # Allocation explanation (why shard isn't allocated)
    curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
    
    # Enable slow logs for an index. These are index-level settings: they cannot go in
    # elasticsearch.yml and are set per index (or through an index template)
    curl -X PUT "localhost:9200/products/_settings" -H 'Content-Type: application/json' -d'
    {
      "index.search.slowlog.threshold.query.warn": "10s",
      "index.search.slowlog.threshold.query.info": "5s",
      "index.search.slowlog.threshold.query.debug": "2s",
      "index.search.slowlog.threshold.fetch.debug": "500ms",
      "index.indexing.slowlog.threshold.index.warn": "10s",
      "index.indexing.slowlog.threshold.index.info": "5s"
    }'
    
    # Metricbeat for comprehensive monitoring
    # Download and configure Metricbeat, then enable elasticsearch module
    metricbeat modules enable elasticsearch
    metricbeat setup
    metricbeat -e
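
Thread pool rejections are one of the easier signals to check automatically. The sketch below parses the column layout requested by the `_cat/thread_pool` call above; `SAMPLE_OUTPUT` is a hand-written illustration rather than real cluster output, and `pools_with_rejections` is a hypothetical helper.

```python
# Hand-written sample in the layout of
# _cat/thread_pool?v&h=name,queue,active,rejected,completed
SAMPLE_OUTPUT = """
name   queue active rejected completed
search     0      1        0      1520
write     12      4       37      9811
"""

def pools_with_rejections(cat_output):
    """Return {pool name: rejected count} for pools that rejected at least one task."""
    lines = [line for line in cat_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    return {row["name"]: int(row["rejected"]) for row in rows if int(row["rejected"]) > 0}

print(pools_with_rejections(SAMPLE_OUTPUT))
# {'write': 37}
```

The rejected counter is cumulative, so an alert would normally compare successive samples rather than test for a non-zero value.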
  13. Step 13

    Client Libraries and Language SDKs

    Elasticsearch provides official clients for many programming languages. All clients support the full REST API with language-specific idioms.

    Official clients:

    • Java (Java API Client; the older High-Level REST Client is deprecated)
    • Python (elasticsearch-py)
    • JavaScript/Node.js (@elastic/elasticsearch)
    • Go (go-elasticsearch)
    • Ruby (elasticsearch-ruby)
    • PHP (elasticsearch-php)
    • .NET (Elastic.Clients.Elasticsearch for 8.x; NEST and Elasticsearch.Net for 7.x)
    • Rust (elasticsearch-rs)
    # Python client example
    from elasticsearch import Elasticsearch
    
    # Create client
    es = Elasticsearch(
        ["http://localhost:9200"],
        basic_auth=("elastic", "password")
    )
    
    # Check cluster health
    health = es.cluster.health()
    print(f"Cluster status: {health['status']}")
    
    # Index a document
    response = es.index(
        index="products",
        id="1",
        document={
            "name": "Laptop",
            "price": 999.99,
            "category": "electronics"
        },
        refresh="wait_for"  # make the document visible to the searches below
    )
    print(f"Indexed: {response['result']}")
    
    # Search
    response = es.search(
        index="products",
        query={
            "match": {
                "name": "laptop"
            }
        }
    )
    
    for hit in response['hits']['hits']:
        print(f"{hit['_source']['name']}: ${hit['_source']['price']}")
    
    # Aggregation
    response = es.search(
        index="products",
        size=0,
        aggs={
            "categories": {
                "terms": {
                    "field": "category"
                }
            }
        }
    )
    
    for bucket in response['aggregations']['categories']['buckets']:
        print(f"{bucket['key']}: {bucket['doc_count']} products")
  14. Step 14

    JavaScript/Node.js Client Example

    The official JavaScript client targets Node.js and ships with TypeScript type definitions. It is not intended to run in browsers, where cluster credentials would be exposed to end users.

    // npm install @elastic/elasticsearch
    
    const { Client } = require('@elastic/elasticsearch');
    
    // Create client
    const client = new Client({
      node: 'http://localhost:9200',
      auth: {
        username: 'elastic',
        password: 'password'
      }
    });
    
    // Index a document
    async function indexDocument() {
      const result = await client.index({
        index: 'products',
        id: '1',
        document: {
          name: 'Smartphone',
          price: 699.99,
          category: 'electronics',
          in_stock: true
        },
        refresh: 'wait_for' // make the document searchable before the next calls
      });
      console.log('Indexed:', result.result);
    }
    
    // Search with bool query
    async function search() {
      const result = await client.search({
        index: 'products',
        query: {
          bool: {
            must: [
              { match: { category: 'electronics' } }
            ],
            filter: [
              { range: { price: { lte: 1000 } } }
            ]
          }
        },
        sort: [
          { price: 'asc' }
        ],
        size: 10
      });
    
      result.hits.hits.forEach(hit => {
        console.log(`${hit._source.name}: $${hit._source.price}`);
      });
    }
    
    // Aggregation with sub-aggregation
    async function aggregate() {
      const result = await client.search({
        index: 'products',
        size: 0,
        aggs: {
          by_category: {
            terms: { field: 'category' },
            aggs: {
              avg_price: {
                avg: { field: 'price' }
              }
            }
          }
        }
      });
    
      result.aggregations.by_category.buckets.forEach(bucket => {
        console.log(`${bucket.key}: ${bucket.doc_count} items, avg price $${bucket.avg_price.value.toFixed(2)}`);
      });
    }
    
    // Run examples
    async function main() {
      await indexDocument();
      await search();
      await aggregate();
    }
    
    main().catch(console.error);
  15. Step 15

    Common Use Cases

    Elasticsearch excels in several key domains:

    1. Full-text search: Power website search, e-commerce product search, documentation search. Think GitHub code search, Stack Overflow, Netflix search.

    2. Log and event analytics: Centralize logs from applications and infrastructure. Elasticsearch is the "E" in the ELK/Elastic Stack (Elasticsearch + Logstash + Kibana).

    3. Metrics and APM: Store and analyze application performance metrics, infrastructure metrics, business metrics.

    4. Security analytics: SIEM (Security Information and Event Management), threat detection, audit logs. Elastic Security provides pre-built detections.

    5. Business analytics: Real-time dashboards, customer behavior analytics, sales analytics. Kibana provides the visualization layer.

    6. Geospatial: Location-based search, geographic analytics, ride-sharing, delivery optimization.

    7. Machine learning: Anomaly detection, forecasting, outlier detection via Elastic's machine learning features (formerly X-Pack ML), which require an appropriate license.

  16. Step 16

    Performance Tuning Best Practices

    Optimize Elasticsearch for your specific workload:

    Indexing performance:

    • Increase refresh_interval during bulk indexing (default 1s → 30s or -1)
    • Disable replicas during initial load, re-enable after
    • Use bulk API instead of individual index requests
    • Increase index.translog.flush_threshold_size for write-heavy loads

    Query performance:

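    • Put yes/no conditions in filter context instead of query context when possible (filters skip relevance scoring, and frequently used ones can be cached)
    • Avoid deep pagination (use search_after instead of from/size, which is capped by index.max_result_window, 10,000 by default)
    • Sort and aggregate on fields backed by doc values (keyword, numeric, date) rather than on analyzed text fields
    • Limit the _source fields returned (_source_includes URL parameter, or _source in the request body)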
    • Use index aliases for zero-downtime reindexing

    Shard sizing:

    • Target 20-40 GB per shard for search workloads
    • Target 40-50 GB per shard for logging workloads
    • Avoid over-sharding (1000s of tiny shards hurt performance)
    • Use shrink API to reduce shard count
    • Use rollover for time-series data
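
    The shard size targets above translate into a primary shard count with simple division. This is a rough sizing sketch, not an official formula; `estimatePrimaryShards` is a hypothetical helper, and the index sizes are made-up inputs.

```javascript
// Estimate a primary shard count from the expected index size and a target shard size.
function estimatePrimaryShards(indexSizeGb, targetShardGb) {
  if (indexSizeGb <= 0 || targetShardGb <= 0) {
    throw new RangeError('sizes must be positive');
  }
  // Round up so no shard exceeds the target; every index needs at least one primary.
  return Math.max(1, Math.ceil(indexSizeGb / targetShardGb));
}

// 600 GB search index at 30 GB per shard (middle of the 20-40 GB range)
console.log(estimatePrimaryShards(600, 30)); // 20
// 2000 GB of logs at 50 GB per shard
console.log(estimatePrimaryShards(2000, 50)); // 40
// A small index still gets one primary shard
console.log(estimatePrimaryShards(5, 30)); // 1
```

    Replicas multiply the total shard count, so the cluster-wide number is the result times (1 + number_of_replicas).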

    Memory:

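    • Give the JVM heap no more than 50% of RAM and leave the rest for the OS file cache; also keep the heap below the compressed-oops threshold (under 32 GB; the exact cutoff varies by system)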
    • Monitor JVM heap usage (target <75%)
    • Use G1GC for heaps >4GB
    • Consider disabling swapping (bootstrap.memory_lock: true)
    # Disable refresh during bulk indexing
    curl -X PUT "localhost:9200/products/_settings" -H 'Content-Type: application/json' -d'
    {
      "index": {
        "refresh_interval": "-1",
        "number_of_replicas": 0
      }
    }'
    
    # Bulk index (do your indexing here)
    
    # Re-enable refresh and replicas
    curl -X PUT "localhost:9200/products/_settings" -H 'Content-Type: application/json' -d'
    {
      "index": {
        "refresh_interval": "1s",
        "number_of_replicas": 1
      }
    }'
    
    # Force merge after bulk indexing (only on indices that will no longer receive writes)
    curl -X POST "localhost:9200/products/_forcemerge?max_num_segments=1"
    
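    # Use search_after for deep pagination (more efficient than from/size)
    # Page 1: sort on a field plus a unique tiebreaker
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 10,
      "sort": [
        { "created_at": "asc" },
        { "_id": "asc" }
      ]
    }'
    # Page 2: repeat the same sort and pass the last hit's "sort" array as search_after
    # (the values below are placeholders)
    curl -X GET "localhost:9200/products/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 10,
      "sort": [{ "created_at": "asc" }, { "_id": "asc" }],
      "search_after": [1705312800000, "42"]
    }'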
    
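    # Disable swapping: add this line to elasticsearch.yml
    #   bootstrap.memory_lock: true
    
    # Then, on Linux with systemd, raise the memlock limit with an override:
    sudo systemctl edit elasticsearch
    # In the editor, add:
    #   [Service]
    #   LimitMEMLOCK=infinity
    
    # Reload systemd and restart Elasticsearch to apply the change
    sudo systemctl daemon-reload
    sudo systemctl restart elasticsearch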
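
    Paging with search_after means feeding each page's last sort values into the next request. The sketch below builds those request bodies; `buildPageRequest` is a hypothetical helper, and the `lastHit` values are placeholders standing in for a real response.

```javascript
// Build the request body for one page of a search_after scan.
// `sort` must be identical on every page and end with a unique tiebreaker field.
function buildPageRequest(sort, size, lastHit) {
  const body = { size, sort };
  if (lastHit) {
    // search_after takes the "sort" array of the last hit on the previous page
    body.search_after = lastHit.sort;
  }
  return body;
}

const sort = [{ created_at: 'asc' }, { _id: 'asc' }];

// First page: no search_after
const firstPage = buildPageRequest(sort, 10);

// Placeholder for the last hit returned by the first page
const lastHit = { _id: '42', sort: [1705312800000, '42'] };

// Second page: continue after that hit
const nextPage = buildPageRequest(sort, 10, lastHit);
console.log(JSON.stringify(nextPage));
```

    In a real loop, each body would be passed to the search call and the loop would stop when a page returns no hits.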
  17. Step 17

    Backup and Restore

    Elasticsearch snapshots provide backup and disaster recovery. Snapshots are incremental and stored in a repository (filesystem, S3, GCS, Azure).

    Best practices:

    • Automate snapshots (daily or hourly)
    • Store snapshots off-cluster (S3, GCS, Azure)
    • Test restores regularly
    • Use Snapshot Lifecycle Management (SLM) for automation
    # Prerequisite: add the repository path to elasticsearch.yml on every master and data node, then restart:
    # path.repo: ["/mount/backups/elasticsearch"]
    
    # Register snapshot repository (filesystem)
    curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/mount/backups/elasticsearch"
      }
    }'
    
    # Create snapshot
    curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
    
    # Snapshot specific indices
    curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_2" -H 'Content-Type: application/json' -d'
    {
      "indices": "products,logs-*",
      "ignore_unavailable": true,
      "include_global_state": false
    }'
    
    # List snapshots
    curl -X GET "localhost:9200/_snapshot/my_backup/_all?pretty"
    
    # Restore snapshot
    curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d'
    {
      "indices": "products",
      "ignore_unavailable": true,
      "include_global_state": false,
      "rename_pattern": "products",
      "rename_replacement": "restored-products"
    }'
    
    # S3 repository (AWS)
    curl -X PUT "localhost:9200/_snapshot/s3_backup" -H 'Content-Type: application/json' -d'
    {
      "type": "s3",
      "settings": {
        "bucket": "my-es-backups",
        "region": "us-east-1",
        "base_path": "elasticsearch/snapshots"
      }
    }'
    
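    # Elasticsearch 8.x includes S3 repository support; on 7.x and earlier, install the plugin first:
    # bin/elasticsearch-plugin install repository-s3
    # Store AWS credentials in the keystore on each node:
    # bin/elasticsearch-keystore add s3.client.default.access_key
    # bin/elasticsearch-keystore add s3.client.default.secret_key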
    
    # Delete old snapshots
    curl -X DELETE "localhost:9200/_snapshot/my_backup/snapshot_1"
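
    Snapshot Lifecycle Management replaces manual snapshot and delete calls with a scheduled policy. The sketch below builds a policy body of the kind accepted by `PUT _slm/policy/<policy_id>`. The function name, policy naming pattern, schedule, and retention numbers are illustrative assumptions and should be adapted to the actual backup requirements.

```javascript
// Build an SLM policy body: when to snapshot, what to include, and how long to keep it.
function buildSlmPolicy({ repository, indices, keepDays, minCount, maxCount }) {
  return {
    schedule: '0 30 1 * * ?',        // cron expression: every day at 01:30
    name: '<nightly-snap-{now/d}>',  // date math gives each snapshot a dated name
    repository,                      // must match a registered repository
    config: {
      indices,
      ignore_unavailable: true,
      include_global_state: false
    },
    retention: {
      expire_after: `${keepDays}d`,  // delete snapshots older than this
      min_count: minCount,           // but always keep at least this many
      max_count: maxCount            // and never keep more than this many
    }
  };
}

const policy = buildSlmPolicy({
  repository: 'my_backup',
  indices: ['products', 'logs-*'],
  keepDays: 30,
  minCount: 5,
  maxCount: 50
});
console.log(JSON.stringify(policy, null, 2));
```

    The printed JSON can be sent as the request body with curl in the same way as the repository registration calls in this step.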
  18. Step 18

    Kubernetes Deployment with ECK

    Elastic Cloud on Kubernetes (ECK) is the official operator for deploying and managing Elasticsearch on Kubernetes. It automates deployment, upgrades, scaling, and monitoring.

    # Install ECK operator
    kubectl create -f https://download.elastic.co/downloads/eck/2.12.0/crds.yaml
    kubectl apply -f https://download.elastic.co/downloads/eck/2.12.0/operator.yaml
    
    # Verify operator is running
    kubectl -n elastic-system logs -f statefulset.apps/elastic-operator
    
    # Create the namespace used by the manifest below
    kubectl create namespace elastic
    
    # Deploy Elasticsearch cluster
    cat <<EOF | kubectl apply -f -
    apiVersion: elasticsearch.k8s.elastic.co/v1
    kind: Elasticsearch
    metadata:
      name: production
      namespace: elastic
    spec:
      version: 8.13.0
      nodeSets:
      - name: master
        count: 3
        config:
          node.roles: ["master"]
        volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 10Gi
            storageClassName: fast-ssd
      - name: data
        count: 3
        config:
          node.roles: ["data", "ingest"]
        volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 100Gi
            storageClassName: fast-ssd
        podTemplate:
          spec:
            containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 8Gi
                  cpu: 2
                limits:
                  memory: 8Gi
                  cpu: 4
              env:
              - name: ES_JAVA_OPTS
                value: "-Xms4g -Xmx4g"
    EOF
    
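    # Get cluster password (the secret lives in the cluster's namespace)
    PASSWORD=$(kubectl get secret production-es-elastic-user -n elastic -o go-template='{{.data.elastic | base64decode}}')
    echo "Elasticsearch password: $PASSWORD"
    
    # Access via port-forward (this command blocks; run it in a separate terminal)
    kubectl port-forward -n elastic service/production-es-http 9200
    # -k skips verification of the self-signed certificate; use it for local testing only
    curl -u "elastic:$PASSWORD" -k "https://localhost:9200"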
    
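    # Scale data nodes to 5
    # Use a JSON patch here: a merge patch would replace the whole nodeSets array
    # and drop the master node set. Index 1 is the "data" node set in the manifest above.
    kubectl patch elasticsearch production -n elastic --type='json' \
      -p='[{"op": "replace", "path": "/spec/nodeSets/1/count", "value": 5}]'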
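
    Changing the count of one node set without touching the others can be expressed as a JSON Patch (RFC 6902) that addresses the node set by its position in the array. The sketch below looks up that position by name; `buildScalePatch` is a hypothetical helper, and the `nodeSets` array mirrors the names used in the manifest above.

```javascript
// Build a JSON Patch that changes the count of one node set, located by name.
function buildScalePatch(nodeSets, name, count) {
  const index = nodeSets.findIndex((nodeSet) => nodeSet.name === name);
  if (index === -1) {
    throw new Error(`nodeSet "${name}" not found`);
  }
  // Only the count of the matching node set is replaced; all other fields stay as they are.
  return [{ op: 'replace', path: `/spec/nodeSets/${index}/count`, value: count }];
}

// Node sets in the same order as the manifest
const nodeSets = [
  { name: 'master', count: 3 },
  { name: 'data', count: 3 }
];

const patch = buildScalePatch(nodeSets, 'data', 5);
console.log(JSON.stringify(patch));
// [{"op":"replace","path":"/spec/nodeSets/1/count","value":5}]
```

    The printed array is what `kubectl patch --type='json' -p` expects as its argument.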
  19. Step 19

    Resources & Next Steps

    Documentation:

    • Reference documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/
    • Official site: https://www.elastic.co/elasticsearch
    • Downloads: https://www.elastic.co/downloads/elasticsearch

    Community:

    • Forum: https://discuss.elastic.co/
    • GitHub: https://github.com/elastic/elasticsearch

    Related tools:

    • Kibana - Visualization and dashboards
    • Logstash - Data pipeline and ingestion
    • Beats - Lightweight data shippers
    • APM - Application performance monitoring
    • Fleet - Centralized management for Elastic Agents

    Learning:

    • Training: https://www.elastic.co/training

    Next guides:

    • ELK Stack: Complete log analytics pipeline
    • Kibana: Building dashboards and visualizations
    • Logstash: Data ingestion and transformation
    • Elasticsearch performance tuning deep dive
