E-commerce Platform: 40% Faster Incident Detection

A mid-market Czech e-commerce company was flying blind during peak traffic. We redesigned their observability stack and cut mean time to detect from 22 minutes to 13.

Elasticsearch · Kibana · Logstash · New Relic APM

Key Results

40%

Faster incident detection

22→13 min

MTTD reduced

65%

Reduction in alert noise

Problem

Invisible failures during peak load

The client's engineering team had deployed an ELK stack 18 months prior, but it had grown organically into an unmaintainable monolith. All services wrote to a single index, retention was undefined, and dashboard performance was poor enough that engineers had stopped using Kibana during incidents.

During a Black Friday peak, a misconfigured payment service silently returned errors for 22 minutes. The failure was caught only when an engineer noticed a spike in customer support tickets, not through monitoring.

Single fat index with no per-service isolation

Alert thresholds set to arbitrary static values; 200+ false positives per week

No distributed tracing — impossible to correlate across microservices

Kibana dashboards timing out under query load during incidents

No log retention policy; incidents older than 3 days unrecoverable

Solution

Full observability stack redesign in 8 weeks

We started with a two-day architecture workshop to map data flows and define SLOs. From there we rebuilt the ELK cluster with proper role separation, introduced one data stream per service, and layered New Relic APM on the three highest-criticality services (checkout, payment, inventory).
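As a rough illustration of per-service data streams tied to lifecycle policies, the sketch below builds the request bodies for an Elasticsearch ILM policy and a composable index template. The naming scheme (`logs-<service>-*`, `logs-<service>-policy`), the retention ages, and the shard-size limit are assumptions made for the example; the case study does not state the values actually used.

```python
import json

# The three services named in the case study; the naming scheme below is hypothetical.
SERVICES = ["checkout", "payment", "inventory"]


def ilm_policy(hot_max_age: str = "1d", warm_after: str = "7d",
               delete_after: str = "30d") -> dict:
    """Body for PUT _ilm/policy/<name>: roll over in hot, shrink work in warm, then delete.

    All ages are illustrative assumptions, not the client's real retention.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": hot_max_age,
                            "max_primary_shard_size": "50gb",
                        }
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }


def index_template(service: str) -> dict:
    """Body for PUT _index_template/<name>: one data stream pattern per service."""
    return {
        "index_patterns": [f"logs-{service}-*"],
        "data_stream": {},  # matching indices are created as data streams
        "priority": 200,    # assumed high enough to win over broader templates
        "template": {
            "settings": {"index.lifecycle.name": f"logs-{service}-policy"}
        },
    }


if __name__ == "__main__":
    for svc in SERVICES:
        print(f"PUT _ilm/policy/logs-{svc}-policy")
        print(json.dumps(ilm_policy(), indent=2))
        print(f"PUT _index_template/logs-{svc}")
        print(json.dumps(index_template(svc), indent=2))
```

Keeping one policy per service means a noisy service can be given shorter retention without touching the others, which is the isolation the single shared index lacked.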

Cluster redesign

3 master + 4 hot-data + 2 warm-data + 2 coordinating nodes; ILM policies per service

Structured logging

Worked with dev teams to standardise JSON log schema across 14 microservices

APM instrumentation

New Relic APM agents deployed on checkout, payment, and inventory services; distributed tracing enabled

Alert redesign

Replaced 200+ static alerts with 18 symptom-based NRQL conditions + 4 Elasticsearch Watcher rules

Dashboard rebuild

Purpose-built Kibana dashboards for ops/SRE, pre-filtered per service to prevent query overload
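To make the structured-logging step above concrete, here is a minimal sketch of a shared JSON log format using only the Python standard library. The field names (`@timestamp`, `service`, `level`, `message`, `trace_id`) are assumptions for illustration; the schema actually agreed with the dev teams is not given in this case study.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical correlation field, set via `extra=` when a trace exists.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON document per line to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("payment")
    log.error("card authorisation failed", extra={"trace_id": "abc123"})
```

With every service emitting the same keys, per-service dashboards can filter on `service` and `level` directly instead of parsing free-text messages.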

Output

From blind spots to full visibility

Three months post-deployment, the team runs peak-traffic periods with confidence. The payment incident that took 22 minutes to detect would now trigger a New Relic alert within 90 seconds via anomaly detection on error rate.
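The idea behind anomaly detection on error rate can be sketched with a rolling baseline: each window's error rate is compared against the mean and spread of recent normal windows. This is a simplified illustration, not the vendor's detection algorithm, and the window count, sigma threshold, and minimum-rate floor are assumed values.

```python
from collections import deque
from statistics import mean, pstdev


class ErrorRateDetector:
    """Flag windows whose error rate sits far above a rolling baseline."""

    def __init__(self, baseline_windows: int = 20, threshold_sigmas: float = 3.0,
                 min_rate: float = 0.01, warmup: int = 5) -> None:
        self.history = deque(maxlen=baseline_windows)
        self.threshold_sigmas = threshold_sigmas
        self.min_rate = min_rate  # floor so a near-zero baseline does not alert on noise
        self.warmup = warmup      # windows required before any alert can fire

    def observe(self, errors: int, requests: int) -> bool:
        """Record one window and return True if it looks anomalous."""
        rate = errors / requests if requests else 0.0
        anomalous = False
        if len(self.history) >= self.warmup:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            limit = max(baseline + self.threshold_sigmas * spread, self.min_rate)
            anomalous = rate > limit
        if not anomalous:
            # Only normal windows feed the baseline, so an ongoing incident
            # cannot drag the threshold upwards.
            self.history.append(rate)
        return anomalous


if __name__ == "__main__":
    detector = ErrorRateDetector()
    for _ in range(10):
        detector.observe(errors=1, requests=1000)      # normal traffic
    print(detector.observe(errors=200, requests=1000))  # payment errors spike
```

Time to detect is then bounded by the window length plus the alert evaluation interval, so shorter windows buy faster detection at the cost of noisier signals.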

13 min

Mean time to detect

22

Active alert rules (down from 200+)

99.97%

Checkout availability (monitored)

60%

Kibana query time reduction
