Building a Production ELK Stack for Enterprise Log Management

Why the ELK Stack Still Wins

Despite the rise of managed observability SaaS offerings, the Elasticsearch–Logstash–Kibana (ELK) stack remains the go-to choice for organisations that need full control over their data pipeline. At RNX we've deployed ELK in banking, energy, and large e-commerce environments — and the patterns we've refined over those engagements are what this article is about.

Cluster Topology for Reliability

The most common mistakes we see in production ELK deployments are under-sizing the cluster and combining several node roles on the same machines. A resilient production cluster gives each role dedicated nodes:

Master nodes (3): hold cluster state and decide shard allocation; three master-eligible nodes keep a quorum of two if one fails. Never place data on these
Data nodes (≥3): hot/warm/cold tiering based on query latency SLAs
Ingest nodes: run pre-processing pipelines, offloading that work from data nodes
Coordinating nodes: receive client requests, fan out searches, and merge the results

Tip

For most enterprise workloads, start with 3 master + 3 hot-data + 2 coordinating nodes. Add warm-tier data nodes as index age policies kick in.
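The role separation above can be checked mechanically before any node is provisioned. The following is a minimal sketch, not a deployment tool: the role names follow Elasticsearch's `node.roles` setting (an empty role list means coordinating-only), while the `Node` class, the node names, and the specific checks are assumptions made for illustration.

```python
from dataclasses import dataclass

# Data-tier role names as used by Elasticsearch's node.roles setting.
DATA_ROLES = {"data", "data_content", "data_hot", "data_warm", "data_cold", "data_frozen"}


@dataclass(frozen=True)
class Node:
    name: str                        # hypothetical node name
    roles: frozenset = frozenset()   # empty set = coordinating-only node


def master_quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def check_topology(nodes):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    masters = [n for n in nodes if "master" in n.roles]
    if len(masters) < 3:
        problems.append("fewer than 3 master-eligible nodes")
    for n in masters:
        if n.roles & DATA_ROLES:
            problems.append(f"{n.name}: master-eligible node also holds data")
    if sum(1 for n in nodes if "data_hot" in n.roles) < 3:
        problems.append("fewer than 3 hot data nodes")
    if not any(not n.roles for n in nodes):
        problems.append("no coordinating-only node")
    return problems


def starting_topology():
    """The 3 master + 3 hot-data + 2 coordinating starting layout."""
    nodes = [Node(f"master-{i}", frozenset({"master"})) for i in range(1, 4)]
    nodes += [Node(f"hot-{i}", frozenset({"data_hot", "data_content"})) for i in range(1, 4)]
    nodes += [Node(f"coord-{i}") for i in range(1, 3)]
    return nodes


if __name__ == "__main__":
    print(check_topology(starting_topology()))
    print("quorum with 3 masters:", master_quorum(3))
```

Running a check like this in CI against an inventory file is one way to catch a master node that has quietly picked up a data role; the thresholds would need adjusting to your own sizing.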

Index Lifecycle Management

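ILM is the single highest-leverage feature in Elasticsearch for controlling storage costs. A well-designed ILM policy moves each index through the hot → warm → cold phases automatically (with an optional frozen phase backed by searchable snapshots) and deletes it once retention expires. The policy below rolls over daily or at 50 GB, shrinks to one shard after 7 days, moves to cold after 30 days, and deletes after 90 days.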

json

{
  "policy": {
    "phases": {
      "hot":  { "actions": { "rollover": { "max_age": "1d", "max_size": "50gb" } } },
      "warm": { "min_age": "7d",  "actions": { "shrink": { "number_of_shards": 1 } } },
      "cold": { "min_age": "30d", "actions": { "set_priority": { "priority": 0 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
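A policy like this is easy to get subtly wrong when retention numbers change, so a small sanity check before uploading it can help. The sketch below is an illustration under stated assumptions: it reads only the `min_age` values, supports only the `d`, `h`, `m`, and `s` units, and the function names and the trimmed `EXAMPLE_POLICY` are hypothetical, not part of any Elasticsearch client.

```python
import re
from datetime import timedelta

# Order in which ILM executes phases.
PHASE_ORDER = ["hot", "warm", "cold", "frozen", "delete"]
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}

# Same phase timings as the policy above; actions omitted for brevity.
EXAMPLE_POLICY = {
    "policy": {
        "phases": {
            "hot": {},
            "warm": {"min_age": "7d"},
            "cold": {"min_age": "30d"},
            "delete": {"min_age": "90d"},
        }
    }
}


def parse_age(value: str) -> timedelta:
    """Convert an age such as '7d' into a timedelta (subset of ILM units)."""
    match = re.fullmatch(r"(\d+)([dhms])", value)
    if match is None:
        raise ValueError(f"unsupported age: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def phase_schedule(policy: dict):
    """List (phase, min_age) pairs in execution order; missing min_age counts as 0."""
    phases = policy["policy"]["phases"]
    return [
        (name, parse_age(phases[name].get("min_age", "0d")))
        for name in PHASE_ORDER
        if name in phases
    ]


def check_phase_order(policy: dict):
    """Flag any phase whose min_age is earlier than the phase before it."""
    schedule = phase_schedule(policy)
    return [
        f"{name} starts before {prev_name}"
        for (prev_name, prev_age), (name, age) in zip(schedule, schedule[1:])
        if age < prev_age
    ]


if __name__ == "__main__":
    for name, age in phase_schedule(EXAMPLE_POLICY):
        print(f"{name:>6}: from day {age.days}")
    print(check_phase_order(EXAMPLE_POLICY))
```

Keep in mind that once an index has rolled over, ILM measures `min_age` from the rollover time rather than from index creation, so "7d" here means seven days after the index stopped receiving writes.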

Logstash vs. Beats vs. OpenTelemetry

Logstash is powerful but heavyweight. For high-throughput environments we often layer Filebeat/Metricbeat at the edge (low CPU footprint) with Logstash as a central transformation layer, and increasingly adopt OpenTelemetry collectors to future-proof the pipeline.

Note

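OpenTelemetry is the direction the industry is moving. If you're starting a greenfield deployment today, build your collection layer on OpenTelemetry Collectors and ship to the Elastic Stack either over OTLP (to an OTLP-capable endpoint such as APM Server) or with the Collector's Elasticsearch exporter.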
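To make the "central transformation layer" idea concrete, here is a rough sketch of the parse-and-normalise step that such a layer performs, written in Python purely for illustration (in practice this would be a Logstash filter or a Collector processor). The log line format, the `transform` function, and the chosen output fields are assumptions; the failure tags mirror the defaults that Logstash's grok and date filters add.

```python
import re
from datetime import datetime, timezone

# Hypothetical line format: "<ISO-8601 timestamp> <LEVEL> <service> <message>"
LINE = re.compile(
    r"(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)"
)


def transform(raw: str, environment: str = "prod") -> dict:
    """Turn one raw log line into a structured event with ECS-style field names."""
    line = raw.strip()
    match = LINE.fullmatch(line)
    if match is None:
        # Keep the raw text so nothing is lost, and tag it for later triage.
        return {"message": line, "tags": ["_grokparsefailure"]}
    try:
        ts = datetime.fromisoformat(match["ts"]).astimezone(timezone.utc)
    except ValueError:
        return {"message": line, "tags": ["_dateparsefailure"]}
    return {
        "@timestamp": ts.isoformat(),
        "log.level": match["level"].lower(),
        "service.name": match["service"],
        "message": match["message"],
        "labels": {"environment": environment},
    }


if __name__ == "__main__":
    print(transform("2024-05-01T12:00:00+02:00 ERROR checkout payment failed"))
    print(transform("not a structured line"))
```

The design point is that edge shippers stay cheap because they only forward raw lines, while parsing, timestamp normalisation, and enrichment happen once in the central layer.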

Monitoring the Monitor

A production ELK cluster generates its own monitoring data. We always deploy a separate monitoring cluster (even a small one) so that cluster health metrics stay isolated from the primary ingestion pipeline and remain available when the production cluster is in trouble. Use Metricbeat or Elastic Agent to ship Stack Monitoring metrics to it, and configure Watcher alerts for JVM heap usage, disk watermarks, and shard count.
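The three alert conditions can be expressed as simple threshold checks. The sketch below is not Watcher syntax; it only illustrates the logic an alert would encode. The disk percentages are Elasticsearch's default low, high, and flood-stage watermarks, and 1000 is the default `cluster.max_shards_per_node`, used here as a per-node budget. The 75% heap threshold, the 80% shard warning ratio, and the `NodeStats` class are assumptions to tune for your environment.

```python
from dataclasses import dataclass

# Elasticsearch default disk watermarks, checked from most to least severe.
DISK_WATERMARKS = [(95.0, "flood_stage"), (90.0, "high"), (85.0, "low")]
HEAP_WARN_PCT = 75.0          # assumption: sustained heap above this needs attention
MAX_SHARDS_PER_NODE = 1000    # default value of cluster.max_shards_per_node
SHARD_WARN_RATIO = 0.8        # assumption: warn at 80% of the shard budget


@dataclass
class NodeStats:
    name: str              # hypothetical node name
    heap_used_pct: float
    disk_used_pct: float
    shard_count: int


def evaluate(stats: NodeStats):
    """Return short alert codes for one node; an empty list means healthy."""
    alerts = []
    if stats.heap_used_pct >= HEAP_WARN_PCT:
        alerts.append("heap")
    for threshold, label in DISK_WATERMARKS:
        if stats.disk_used_pct >= threshold:
            alerts.append(f"disk:{label}")
            break  # report only the most severe watermark crossed
    if stats.shard_count >= MAX_SHARDS_PER_NODE * SHARD_WARN_RATIO:
        alerts.append("shards")
    return alerts


if __name__ == "__main__":
    for node in (
        NodeStats("hot-1", 80.0, 91.0, 100),
        NodeStats("hot-2", 40.0, 50.0, 900),
    ):
        print(node.name, evaluate(node))
```

Whatever tool evaluates these rules, it should run against the monitoring cluster's copy of the metrics so that alerts keep firing when the production cluster is degraded.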
