The Real Cost of Alert Fatigue
Alert fatigue is the silent killer of SRE effectiveness. When on-call engineers receive dozens of non-actionable alerts per shift, they stop trusting the system, and real incidents get missed. We've walked into environments where more than 70% of alerts were either known false positives or had no documented remediation path.
The Alerting Tiers Framework
- Tier 1 — Page immediately: user-visible outage, data loss risk, security incident
- Tier 2 — Ticket + Slack: degraded performance, threshold approaching, non-critical errors
- Tier 3 — Dashboard only: informational trends, capacity planning signals
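As a rough illustration, the three tiers can be expressed as a small routing function. This is a minimal sketch, not a prescribed implementation: the `Alert` fields and the `classify` helper are hypothetical names chosen to mirror the criteria listed above, and a real system would derive these flags from alert labels or rule metadata.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = 1       # Tier 1: page immediately
    TICKET = 2     # Tier 2: ticket + Slack
    DASHBOARD = 3  # Tier 3: dashboard only


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; each flag mirrors one tier criterion."""
    name: str
    user_visible_outage: bool = False
    data_loss_risk: bool = False
    security_incident: bool = False
    degraded_performance: bool = False
    threshold_approaching: bool = False
    non_critical_errors: bool = False


def classify(alert: Alert) -> Tier:
    """Return the highest-urgency tier whose criteria the alert meets."""
    if alert.user_visible_outage or alert.data_loss_risk or alert.security_incident:
        return Tier.PAGE
    if (alert.degraded_performance or alert.threshold_approaching
            or alert.non_critical_errors):
        return Tier.TICKET
    # Anything else is informational: trends and capacity planning signals.
    return Tier.DASHBOARD
```

An alert that matches criteria from more than one tier is routed to the most urgent one, so a degraded service that also risks data loss still pages.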
If more than 20% of your pages are not acknowledged within 5 minutes, you have a tier classification problem. Conduct a quarterly alert audit to catch it.
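The 20% check is easy to compute from page history during an audit. The sketch below assumes each page is available as a `(fired_at, acked_at)` pair, with `acked_at` set to `None` when nobody acknowledged it. The record shape and function names are illustrative assumptions, not the export format of any particular paging tool.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

ACK_WINDOW = timedelta(minutes=5)  # acknowledgement window from the text
UNACKED_LIMIT = 0.20               # 20% limit from the text

# Hypothetical record shape: (fired_at, acked_at or None if never acknowledged)
PageRecord = Tuple[datetime, Optional[datetime]]


def unacked_fraction(pages: Sequence[PageRecord],
                     window: timedelta = ACK_WINDOW) -> float:
    """Fraction of pages that were not acknowledged within `window`."""
    if not pages:
        return 0.0
    late = sum(
        1 for fired_at, acked_at in pages
        if acked_at is None or acked_at - fired_at > window
    )
    return late / len(pages)


def has_tier_problem(pages: Sequence[PageRecord],
                     limit: float = UNACKED_LIMIT) -> bool:
    """True when more than `limit` of pages missed the acknowledgement window."""
    return unacked_fraction(pages) > limit
```

Running this per alert rule, rather than over all pages at once, tends to show which specific rules are candidates for demotion to Tier 2 or Tier 3.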
Symptom-Based vs. Cause-Based Alerts
Alert on user-visible symptoms (error rate, p99 latency, availability) rather than internal causes (CPU, memory, disk). Cause-based alerts create noise; symptom-based alerts create urgency. Use dashboards for causes and pages for symptoms.
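To make the distinction concrete, a symptom-based paging condition looks only at what users experience. The sketch below computes error rate and p99 latency (nearest-rank method) over one evaluation window and pages on either. The thresholds `max_error_rate=0.01` and `max_p99_ms=500.0` are placeholder assumptions and should come from your own SLOs. CPU, memory, and disk do not appear in the condition at all.

```python
from typing import Sequence


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Share of requests that failed in the evaluation window."""
    return failed_requests / total_requests if total_requests else 0.0


def p99(latencies_ms: Sequence[float]) -> float:
    """99th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def should_page(total_requests: int,
                failed_requests: int,
                latencies_ms: Sequence[float],
                max_error_rate: float = 0.01,   # assumed threshold
                max_p99_ms: float = 500.0) -> bool:  # assumed threshold
    """Page only when a user-visible symptom breaches its threshold."""
    return (error_rate(total_requests, failed_requests) > max_error_rate
            or p99(latencies_ms) > max_p99_ms)
```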
Dynamic Baselines
Static thresholds lose accuracy over time as traffic patterns change. Both New Relic AI and Elastic's machine learning features can establish dynamic baselines automatically. We configure ML jobs on key metrics during the first 2 weeks of an engagement, then tune alert conditions against those baselines rather than against arbitrary fixed numbers.
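The idea behind a dynamic baseline can be shown with a much simpler stand-in: a rolling mean and standard deviation. This is not the algorithm used by New Relic or Elastic, whose models also account for seasonality and trend. It is a minimal sketch of comparing each new sample to recent history instead of to a fixed number. The class name and the `window`, `tolerance`, and `min_samples` defaults are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flags values that deviate from a rolling mean by more than
    `tolerance` standard deviations. Illustrative only."""

    def __init__(self, window: int = 288, tolerance: float = 3.0,
                 min_samples: int = 30) -> None:
        self.history = deque(maxlen=window)  # oldest samples drop off automatically
        self.tolerance = tolerance
        self.min_samples = min_samples       # warm-up period before flagging anything

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous against the current baseline,
        then record it so the baseline keeps following the metric."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            # Floor sigma so a perfectly flat history does not divide the
            # world into "identical" and "anomalous" on tiny float noise.
            sigma = max(pstdev(self.history), 1e-9)
            anomalous = abs(value - mu) > self.tolerance * sigma
        self.history.append(value)
        return anomalous
```

Because anomalous samples are also recorded, a sustained shift in traffic eventually becomes the new baseline. That matches the goal of thresholds that follow changing traffic, but it also means a long-running incident will stop being flagged once it fills the window.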