E-commerce | 8-week engagement | 150–300 employees

E-commerce Platform: 40% Faster Incident Detection

A mid-market Czech e-commerce company was flying blind during peak traffic. We redesigned their observability stack and cut mean time to detect from 22 minutes to 13.

Elasticsearch, Kibana, Logstash, New Relic APM

Key Results

40%: Faster incident detection
22→13 min: MTTD reduced
65%: Reduction in alert noise

Problem

Invisible failures during peak load

The client's engineering team had deployed an ELK stack 18 months prior, but it had grown organically into an unmaintainable monolith. All services wrote to a single index, retention was undefined, and dashboard performance was poor enough that engineers had stopped using Kibana during incidents.

During a Black Friday peak, a misconfigured payment service silently returned errors for 22 minutes. The team found out from a spike in customer support tickets, not from monitoring.

Single fat index with no per-service isolation

Alert thresholds set to arbitrary static values; 200+ false positives per week

No distributed tracing — impossible to correlate across microservices

Kibana dashboards timing out under query load during incidents

No log retention policy; log data for incidents older than 3 days was unrecoverable

Solution

Full observability stack redesign in 8 weeks

We started with a two-day architecture workshop to map data flows and define SLOs. From there we rebuilt the ELK cluster with proper role separation, introduced data streams per service, and layered New Relic APM on the three highest-criticality services (checkout, payment, inventory).
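To make "data streams per service" concrete, here is a minimal sketch of how one index template per service could be generated. It is an illustration, not the client's actual configuration: the `prod` namespace, the priority value, and the policy naming are assumptions. The dictionary is shaped like a request body for Elasticsearch's index template API.

```python
# Services named in the case study; the full list of 14 is not given.
SERVICES = ["checkout", "payment", "inventory"]


def data_stream_name(service: str, namespace: str = "prod") -> str:
    """Name a stream using the <type>-<dataset>-<namespace> convention.

    The "prod" namespace is an assumption for illustration.
    """
    return f"logs-{service}-{namespace}"


def index_template(service: str) -> dict:
    """Build a body for PUT _index_template/<name>, one template per service."""
    return {
        "index_patterns": [f"logs-{service}-*"],
        "data_stream": {},  # indices matching the pattern are created as data streams
        "priority": 500,    # assumed value; must outrank built-in templates matching logs-*-*
        "template": {
            "settings": {
                # Assumed policy name linking the stream to a per-service ILM policy.
                "index.lifecycle.name": f"logs-{service}-policy",
            }
        },
    }


templates = {f"logs-{svc}": index_template(svc) for svc in SERVICES}
```

Because each service writes to its own stream, a noisy service can no longer slow down queries against the others, which addresses the single-index problem described earlier.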

1. Cluster redesign

3 master + 4 hot-data + 2 warm-data + 2 coordinating nodes; ILM policies per service
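A per-service ILM policy could look roughly like the sketch below. The retention periods and rollover size are hypothetical placeholders, since the case study does not state the values used. The structure follows the request body of Elasticsearch's ILM policy API.

```python
def ilm_policy(rollover_age: str, warm_after: str, delete_after: str) -> dict:
    """Build a body for PUT _ilm/policy/<name> with hot, warm, and delete phases."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over to a new backing index by age or shard size.
                        "rollover": {
                            "max_age": rollover_age,
                            "max_primary_shard_size": "50gb",  # assumed size
                        }
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    # Merge segments once the index is no longer written to.
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Hypothetical retention tiers: (rollover age, move to warm, delete).
RETENTION = {
    "payment": ("1d", "7d", "90d"),
    "checkout": ("1d", "7d", "30d"),
}

policies = {
    f"logs-{service}-policy": ilm_policy(*tiers)
    for service, tiers in RETENTION.items()
}
```

Defining retention explicitly per service is what replaces the previous situation, where retention was undefined and old incident data could not be recovered.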

2. Structured logging

Worked with dev teams to standardise JSON log schema across 14 microservices
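For a Python service, a standardised JSON log line might be produced as in the sketch below. The schema shown is an assumption: the field names are borrowed from the Elastic Common Schema for illustration, and the actual schema agreed with the dev teams is not described in this case study.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with a fixed set of keys."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "log.level": record.levelname.lower(),
            "service.name": self.service,
            "message": record.getMessage(),
            # Present only if the caller passes extra={"trace_id": ...}.
            "trace.id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger whose only handler emits JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("payment").error("card authorisation failed", extra={"trace_id": "abc123"})` then emits one JSON object per line, so every service can be parsed and filtered the same way downstream.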

3. APM instrumentation

New Relic APM agents deployed on the checkout, payment, and inventory services; distributed tracing enabled

4. Alert redesign

Replaced 200+ static alerts with 18 symptom-based NRQL conditions + 4 Elasticsearch Watcher rules
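The difference between a static alert and a symptom-based one can be sketched in a few lines. This is not the NRQL or Watcher logic that was deployed; it is an illustration of the idea, and every threshold in it is a hypothetical value. Instead of firing on a raw error count, the condition fires only when the user-facing error ratio stays high for several consecutive windows with meaningful traffic.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts for one evaluation window (e.g. one minute)."""
    requests: int
    errors: int


def error_ratio_breached(
    windows: list[Window],
    max_ratio: float = 0.02,   # assumed tolerated error ratio
    min_requests: int = 100,   # ignore windows with too little traffic
    consecutive: int = 3,      # windows in a row required to fire
) -> bool:
    """Return True once the error ratio exceeds max_ratio for
    `consecutive` windows in a row; any healthy or low-traffic window
    resets the streak."""
    streak = 0
    for window in windows:
        busy = window.requests >= min_requests
        if busy and window.errors / window.requests > max_ratio:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0
    return False
```

Requiring a sustained breach on a ratio, rather than a single spike in a count, is one common way to cut false positives of the kind that produced 200+ noisy alerts per week.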

5. Dashboard rebuild

Purpose-built Kibana dashboards for ops/SRE, pre-filtered per service to prevent query overload
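The effect of pre-filtering can be shown with the kind of Elasticsearch query a per-service dashboard panel might issue. This is a sketch under assumptions: the field names `service.name` and `log.level`, the 15-minute lookback, and the one-minute buckets are illustrative, not the dashboards' actual settings.

```python
def service_error_query(service: str, lookback: str = "15m") -> dict:
    """Build a search body that counts errors per minute for one service.

    The service and time filters run in filter context, so the cluster
    only scans recent documents for that service instead of everything.
    """
    return {
        "size": 0,  # aggregations only, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service.name": service}},
                    {"range": {"@timestamp": {"gte": f"now-{lookback}"}}},
                ]
            }
        },
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
                "aggs": {
                    "errors": {"filter": {"term": {"log.level": "error"}}}
                },
            }
        },
    }


payment_panel = service_error_query("payment")
```

Bounding every panel by service and time range keeps query cost predictable during an incident, when many engineers open the same dashboards at once.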

Output

From blind spots to full visibility

Three months post-deployment, the team runs its peak-traffic periods with confidence. The payment incident that took 22 minutes to detect would now trigger an Elasticsearch alert within 90 seconds via anomaly detection on error rate.

13 min: Mean time to detect
18: Active alert rules (down from 200+)
99.97%: Checkout availability (monitored)
60%: Kibana query time reduction
