Alert Fatigue: How to Build Smarter Monitoring Rules

The Real Cost of Alert Fatigue

Alert fatigue is the silent killer of SRE effectiveness. When on-call engineers receive dozens of non-actionable alerts per shift, they stop trusting the system, and real incidents get missed. We've walked into environments where more than 70% of alerts were either known false positives or had no documented remediation path.

The Alerting Tiers Framework

Tier 1 — Page immediately: user-visible outage, data loss risk, security incident
Tier 2 — Ticket + Slack: degraded performance, threshold approaching, non-critical errors
Tier 3 — Dashboard only: informational trends, capacity planning signals
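
To make the tiers concrete, here is a minimal Python sketch of a routing function. The Alert fields and the classify helper are hypothetical names chosen for illustration, not part of any monitoring product; a real setup would derive these flags from alert labels or annotations.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = 1       # Tier 1: page immediately
    TICKET = 2     # Tier 2: ticket + Slack
    DASHBOARD = 3  # Tier 3: dashboard only


@dataclass
class Alert:
    # Hypothetical fields; map them from your own alert labels.
    name: str
    user_visible_outage: bool = False
    data_loss_risk: bool = False
    security_incident: bool = False
    degraded: bool = False  # degraded performance, threshold approaching, non-critical errors


def classify(alert: Alert) -> Tier:
    """Return the tier for an alert, checking the most severe conditions first."""
    if alert.user_visible_outage or alert.data_loss_risk or alert.security_incident:
        return Tier.PAGE
    if alert.degraded:
        return Tier.TICKET
    # Everything else is informational and stays on a dashboard.
    return Tier.DASHBOARD
```

Checking Tier 1 conditions first means an alert that is both degraded and user-visible still pages, which is usually the safer default.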

Warning

If more than 20% of your pages go unacknowledged within 5 minutes, you have a tier classification problem. Conduct a quarterly alert audit.
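
The 20% check can be computed from exported page history. The sketch below assumes each page is reduced to its acknowledgement delay in seconds, with None meaning the page was never acknowledged; the function names and input shape are assumptions, so adapt them to whatever your paging tool exports.

```python
from typing import Optional, Sequence

ACK_WINDOW_SECONDS = 5 * 60  # 5-minute acknowledgement window
UNACKED_LIMIT = 0.20         # more than 20% unacknowledged suggests a tier problem


def unacked_rate(ack_delays: Sequence[Optional[float]],
                 window: float = ACK_WINDOW_SECONDS) -> float:
    """Fraction of pages not acknowledged within the window.

    Each entry is the delay in seconds before acknowledgement,
    or None if the page was never acknowledged.
    """
    if not ack_delays:
        return 0.0
    missed = sum(1 for delay in ack_delays if delay is None or delay > window)
    return missed / len(ack_delays)


def needs_tier_review(ack_delays: Sequence[Optional[float]]) -> bool:
    """True when the unacknowledged rate exceeds the 20% limit."""
    return unacked_rate(ack_delays) > UNACKED_LIMIT
```

Running this over the pages collected since the last audit gives a single number to track from quarter to quarter.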

Symptom-Based vs. Cause-Based Alerts

Alert on user-visible symptoms (error rate, p99 latency, availability) rather than internal causes (CPU, memory, disk). Cause-based alerts create noise; symptom-based alerts create urgency. Use dashboards for causes and pages for symptoms.
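
As a rough illustration, a symptom check can be written purely in terms of what users experience. The sketch below computes error rate and p99 latency from a window of request samples; the 1% and 500 ms thresholds, the function names, and the choice to count only 5xx responses as errors are all assumptions for the example.

```python
from typing import Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of requests that returned a 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def p99(latencies_ms: Sequence[float]) -> float:
    """99th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples in window")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -((-99 * len(ordered)) // 100)
    return ordered[rank - 1]


def should_page(status_codes: Sequence[int],
                latencies_ms: Sequence[float],
                max_error_rate: float = 0.01,
                max_p99_ms: float = 500.0) -> bool:
    """Page only when a user-visible symptom breaches its threshold."""
    return error_rate(status_codes) > max_error_rate or p99(latencies_ms) > max_p99_ms
```

Nothing here reads CPU, memory, or disk; those stay on dashboards for diagnosis once a symptom has fired.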

Dynamic Baselines

Static thresholds decay over time as traffic patterns change. Both New Relic AI and Elastic's machine learning features can establish dynamic baselines automatically. We configure ML jobs on key metrics during the first 2 weeks of an engagement, then tune alert conditions against those baselines rather than arbitrary numbers.
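
The idea behind a dynamic baseline can be shown with a deliberately simple stand-in: a rolling mean and standard deviation. This is not how New Relic or Elastic model baselines; it is a minimal sketch with hypothetical names and defaults (window size, 3-sigma band, minimum sample count) that shows the threshold moving with the data instead of staying fixed.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags values that fall outside a band around a rolling mean."""

    def __init__(self, window: int = 288, sigmas: float = 3.0, min_samples: int = 30):
        self.history = deque(maxlen=window)  # oldest samples drop off automatically
        self.sigmas = sigmas
        self.min_samples = min_samples

    def is_anomalous(self, value: float) -> bool:
        """Compare value against the baseline built from past samples, then record it."""
        anomalous = False
        # Stay quiet until enough history exists to form a baseline.
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sd = stdev(self.history)
            anomalous = abs(value - mu) > self.sigmas * sd
        self.history.append(value)
        return anomalous
```

A rolling window like this ignores daily and weekly seasonality, which is one reason the ML-based features mentioned above are usually a better fit for production traffic.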
