Alerting in System Design: Thresholds, SLO Burn Rate, On-Call & Alert Fatigue (Visualized)
Alerting turns raw metrics into actionable notifications that wake the right human at the right time. This guide covers thresholds vs anomaly detection, symptom-based vs cause-based alerts, SLO burn-rate alerting, severity and on-call routing, deduplication and silencing, and how to fight alert fatigue.
Alerting is the practice of automatically turning monitoring data (metrics, logs, and traces) into actionable notifications when a system enters an undesirable state. A good alert reaches a human (or an automated responder) the moment something is wrong, with enough context to act, and stays silent the rest of the time. It is the bridge between passive observability and active incident response.
Collecting metrics tells you what your system is doing; alerting decides which of those signals deserve human attention. The hard part is not detecting anomalies; it is detecting the right anomalies, routing them to the right responder, and not drowning that responder in noise. This article walks through how alerts are defined, evaluated, routed, and tuned.
From Metrics to Actionable Notifications
An alerting pipeline has four stages: (1) a time-series database like Prometheus continuously evaluates rules against incoming metrics, (2) a rule that holds true for some duration fires an alert, (3) an alert manager deduplicates, groups, and routes it, and (4) a notifier such as PagerDuty pages the on-call engineer. The art lives in choosing what condition is worth firing on.
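Stage (2), a rule that must hold for some duration before it fires, can be sketched as a small state machine. This is a simplified illustration and not Prometheus's actual implementation; the `RuleState` class, the `run` helper, and the sample values are hypothetical, though the state names mirror the inactive, pending, and firing states Prometheus reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    """Tracks one rule through inactive -> pending -> firing."""
    for_seconds: float
    pending_since: Optional[float] = None

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            # A single false evaluation resets the timer.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


def run(samples, threshold, for_seconds):
    """samples: list of (timestamp_seconds, value), oldest first."""
    state = RuleState(for_seconds)
    return [state.evaluate(value > threshold, ts) for ts, value in samples]


# Hypothetical error-rate samples (percent), evaluated every 30 seconds.
samples = [(0, 0.2), (30, 1.5), (60, 0.3), (90, 1.4),
           (120, 1.6), (150, 1.8), (180, 1.9), (210, 0.1)]
states = run(samples, threshold=1.0, for_seconds=90)
print(states)
```

The brief spike at t=30 never leaves the pending state, so it never reaches the alert manager; only the breach that persists from t=90 to t=180 fires.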
Thresholds vs Anomaly Detection
The simplest alert is a static threshold: fire when p99 latency exceeds 500 ms or error rate exceeds 1%. Thresholds are easy to reason about and easy to tune, but they are blind to context: a value that is normal at peak traffic may be alarming at 3am. Anomaly detection instead learns a baseline and fires when a metric deviates from its expected range, catching problems a fixed line would miss but at the cost of complexity and false positives. A pragmatic middle ground is rate-of-change alerting: fire when a metric moves too fast, regardless of its absolute value.
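To make the three approaches concrete, here is a minimal sketch of each as a plain function. The z-score check stands in for "anomaly detection" only as the simplest possible baseline model; real systems usually account for seasonality. All function names, the 3-sigma default, and the sample latencies are assumptions for illustration.

```python
from statistics import mean, stdev


def static_threshold(value, limit):
    """Fire when the metric crosses a fixed line."""
    return value > limit


def rate_of_change(previous, current, interval_s, max_per_second):
    """Fire when the metric moves faster than max_per_second, in either direction."""
    return abs(current - previous) / interval_s > max_per_second


def zscore_anomaly(history, value, max_sigma=3.0):
    """Fire when value is more than max_sigma standard deviations from the recent mean."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Hypothetical p99 latency (ms) at 3am: quiet baseline, then a jump to 300 ms.
history = [100, 102, 98, 101, 99]
print(static_threshold(300, limit=500))                              # below the fixed line
print(zscore_anomaly(history, 300))                                  # far outside the baseline
print(rate_of_change(99, 300, interval_s=60, max_per_second=1.0))    # moved too fast
```

With these numbers the 500 ms threshold stays silent while the other two detectors fire, which is the 3am blind spot described above.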
Symptom-Based vs Cause-Based Alerting
Symptom-based alerts fire on what your users actually experience: high error rate, slow responses, failed checkouts. Cause-based alerts fire on internal conditions that might lead to a symptom: high CPU, low disk, a full queue. Google's SRE guidance is to alert primarily on symptoms, because they are user-visible and far fewer in number; reserve cause-based alerts for conditions that are imminent and unambiguous (for example, the disk will be full in 30 minutes). Alerting on every cause produces a flood of pages for problems the system absorbs on its own.
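The "disk will be full in 30 minutes" case is a cause-based alert that earns its page because it predicts an imminent symptom. A rough sketch of that prediction follows, extrapolating from the first and last samples only. Prometheus offers `predict_linear` for this kind of extrapolation using a regression over the window; the two-point version here is a deliberate simplification, and the function names and byte counts are hypothetical.

```python
def seconds_until_full(samples, capacity_bytes):
    """samples: list of (timestamp_s, used_bytes), oldest first.

    Returns the estimated seconds until the disk is full, or None if
    usage is flat or shrinking (or there is too little data).
    """
    if len(samples) < 2:
        return None
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    growth_per_s = (used1 - used0) / (t1 - t0)
    if growth_per_s <= 0:
        return None
    return (capacity_bytes - used1) / growth_per_s


def should_page(samples, capacity_bytes, horizon_s=30 * 60):
    """Page only when the disk is projected to fill within the horizon."""
    eta = seconds_until_full(samples, capacity_bytes)
    return eta is not None and eta <= horizon_s


CAPACITY = 100_000_000_000  # 100 GB, hypothetical
fast_fill = [(0, 90_000_000_000), (600, 95_000_000_000)]   # 5 GB in 10 minutes
slow_fill = [(0, 90_000_000_000), (600, 90_100_000_000)]   # 0.1 GB in 10 minutes
print(should_page(fast_fill, CAPACITY), should_page(slow_fill, CAPACITY))
```

Both disks sit above 90% used, so a static "disk > 90%" rule would page for each; the projection pages only for the one that is about ten minutes from full.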
Alerting on SLO Burn Rate
An SLO (service level objective) defines an acceptable level of reliability, say 99.9% of requests succeeding over 30 days. The 0.1% you are allowed to fail is your error budget. Instead of paging the instant a single request fails, modern alerting watches the burn rate: how fast you are consuming that budget relative to the pace that would spend it exactly over the window. A burn rate of 1× exhausts the budget exactly at the end of the window; a burn rate of 14× exhausts a 30-day budget in roughly two days (30 / 14 ≈ 2.1). Fast burn pages immediately and loudly; slow burn opens a ticket. This ties alert urgency directly to user-facing impact and is far quieter than raw threshold alerts.
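The arithmetic behind burn rate is short enough to write out. The sketch below assumes the common definition of burn rate as the observed error ratio divided by the error budget; the function names are illustrative, and the ticket cutoff of 1.0 is an assumption chosen for the example (the 14.4 page cutoff matches the rule shown later in this article).

```python
def error_budget(slo):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / error_budget(slo)


def days_to_exhaustion(rate, window_days=30.0):
    """Days until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def urgency(rate, page_at=14.4, ticket_at=1.0):
    """Map a burn rate to a response. Both cutoffs are tunable assumptions."""
    if rate >= page_at:
        return "page"
    if rate > ticket_at:
        return "ticket"
    return "none"


SLO = 0.999
for observed_error_ratio in (0.02, 0.002, 0.0005):
    rate = burn_rate(observed_error_ratio, SLO)
    print(f"{observed_error_ratio:.4f} errors -> burn {rate:.1f}x, "
          f"{days_to_exhaustion(rate):.1f} days left, {urgency(rate)}")
```

A 2% error ratio against a 99.9% SLO is a 20× burn that empties the budget in a day and a half, so it pages; 0.2% is a 2× burn that takes 15 days, so it becomes a ticket; 0.05% is under budget and stays quiet.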
Severity, Routing & On-Call
Not every alert should wake someone at 3am. Alerts carry a severity, typically critical (page immediately), warning (notify, look during business hours), or info (log only). Routing rules in Alertmanager match labels (team, service, severity) and send each alert to the right destination: critical alerts page the on-call engineer through PagerDuty or Opsgenie with an escalation policy, warnings drop into a Slack channel, the rest into a dashboard. The on-call rotation and escalation chain ensure that if the first responder does not acknowledge within minutes, the page escalates to a secondary.
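Label-based routing and escalation both reduce to simple lookups. The sketch below is a toy model, not Alertmanager's or PagerDuty's real logic: first-match routing on exact label equality, and an escalation chain that advances one level per unacknowledged timeout. The `Route` class, the five-minute timeout, and the chain names are hypothetical; the receiver names follow the configuration used in this article.

```python
from dataclasses import dataclass


@dataclass
class Route:
    matchers: dict   # label name -> required value
    receiver: str


def pick_receiver(labels, routes, default):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route.matchers.items()):
            return route.receiver
    return default


def who_to_page(chain, minutes_unacked, ack_timeout_min=5):
    """Escalate one level per missed acknowledgement window, stopping at the last level."""
    level = min(int(minutes_unacked // ack_timeout_min), len(chain) - 1)
    return chain[level]


routes = [Route({"severity": "critical"}, "pagerduty-oncall")]
critical = {"severity": "critical", "team": "payments"}
warning = {"severity": "warning", "team": "payments"}

print(pick_receiver(critical, routes, default="slack-warnings"))
print(pick_receiver(warning, routes, default="slack-warnings"))
print(who_to_page(["primary", "secondary"], minutes_unacked=7))
```

Keeping severity as a label rather than baking it into the alert name is what lets one routing tree serve every team.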
# Prometheus alerting rule: symptom-based, with a 'for' to avoid flapping
groups:
  - name: api-slo
    rules:
      - alert: HighErrorRateFastBurn
        # A sustained 14.4x burn of a 99.9% SLO (measured over 1h) exhausts a 30-day budget in ~2 days.
        # Aggregating by service keeps the label that the summary annotation uses.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (service) (rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Fast error-budget burn on {{ $labels.service }}"
          runbook: "https://runbooks.internal/payments/error-budget"

# Alertmanager routing: page critical, Slack the rest; group + dedupe
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
  routes:
    - matchers: [ severity="critical" ]
      receiver: pagerduty-oncall
Deduplication, Grouping & Silencing
When a database goes down, every one of fifty app servers may fire the same alert at once. Without help, that is fifty pages. An alert manager solves this with three mechanisms: deduplication collapses identical alerts into one, grouping bundles related alerts (same cluster, same cause) into a single notification, and silencing temporarily mutes known issues, such as during a planned deploy or maintenance window, so they don't page. The result is one meaningful notification instead of a storm.
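The fifty-server storm can be reproduced in a few lines. This sketch models the three mechanisms in their simplest form (exact-label fingerprints, grouping by a label tuple, and equality-only silences) and is not how Alertmanager is implemented internally. The alert name `DatabaseUnreachable`, the `checkout` service, and the instance names are made up for the example.

```python
from collections import defaultdict


def fingerprint(alert):
    """Alerts with identical label sets count as the same alert."""
    return tuple(sorted(alert["labels"].items()))


def is_silenced(alert, silences):
    """A silence is a dict of label -> value; it mutes alerts matching every entry."""
    return any(
        all(alert["labels"].get(name) == value for name, value in silence.items())
        for silence in silences
    )


def notifications(alerts, group_by, silences=()):
    """Drop duplicates and silenced alerts, then bundle the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fp = fingerprint(alert)
        if fp in seen or is_silenced(alert, silences):
            continue
        seen.add(fp)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Fifty app servers each report the outage, and each report is delivered twice.
alerts = [
    {"labels": {"alertname": "DatabaseUnreachable", "service": "checkout",
                "instance": f"app-{i}"}}
    for i in range(50)
] * 2

grouped = notifications(alerts, group_by=["alertname", "service"])
print(len(alerts), "alerts ->", len(grouped), "notification")

muted = notifications(alerts, group_by=["alertname", "service"],
                      silences=[{"service": "checkout"}])
print("during maintenance:", len(muted), "notifications")
```

One hundred incoming alerts collapse to a single notification listing fifty affected instances, and to none while the silence is active.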
Alert Fatigue & Reducing Noise
Alert fatigue is what happens when engineers receive so many alerts, especially false or non-actionable ones, that they start ignoring them. It is the single most dangerous failure mode in alerting, because the one real page gets lost in the noise. The cure is discipline: make every alert actionable and tied to user impact, use a 'for' duration to suppress flapping, prefer symptoms over causes, alert on SLO burn rate rather than raw thresholds, and ruthlessly delete or downgrade alerts that page without anyone needing to act. A good rule of thumb: if an alert fires and the on-call response is "nothing to do, it recovered," that alert should not page.
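The rule of thumb above can be turned into a periodic audit. The sketch below assumes you can export a history of fired pages together with whether the responder actually had to do something; that data source, the `noisy_alerts` name, the `HighCPU` alert, and the 50% cutoff are all assumptions for illustration.

```python
from collections import Counter


def noisy_alerts(history, min_actionable=0.5):
    """history: iterable of (alert_name, action_taken) pairs, one per fired page.

    Returns the alert names whose share of actionable firings is below
    min_actionable, sorted alphabetically. These are candidates to delete
    or downgrade from paging.
    """
    fired, acted = Counter(), Counter()
    for name, action_taken in history:
        fired[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(name for name in fired if acted[name] / fired[name] < min_actionable)


# Hypothetical month of pages: HighCPU needed action once in ten firings.
history = (
    [("HighCPU", False)] * 9
    + [("HighCPU", True)]
    + [("HighErrorRateFastBurn", True)] * 3
)
print(noisy_alerts(history))
```

Here only `HighCPU` is flagged, which matches the intuition that a cause-based alert the system usually absorbs on its own should not be paging anyone.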
Good Alerts vs Bad Alerts
| Property | Good alert | Bad alert |
|---|---|---|
| Trigger | User-visible symptom (errors, latency) | Internal cause that may not matter (CPU 80%) |
| Actionability | Always has a clear action + runbook | Fires, then 'recovered, nothing to do' |
| Noise control | Has a 'for' duration; deduped & grouped | Flaps on every spike; one per server |
| Urgency | Severity matches real impact (burn rate) | Everything is 'critical', pages at 3am |
| Outcome | Trusted: responders act on it | Ignored: fuels alert fatigue |
Runbooks
Every paging alert should link to a runbook: a short document that tells the responder what the alert means, how to confirm the problem, the first diagnostic steps, and how to mitigate. Runbooks turn a 3am page from a panic into a checklist, shrink mean time to recovery, and let less-experienced engineers handle incidents. In Prometheus this is just a runbook annotation on the rule; in Grafana you can attach it to the alert. An alert without a runbook is a question with no answer attached.
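One way to enforce this is a small lint step in CI. The sketch below works on rule groups that have already been parsed into Python dictionaries (for example by a YAML library, which is left out so the example stays self-contained). It assumes the annotation key is `runbook`, as in this article's rule, and that `critical` is the only paging severity; the function name and the second alert are hypothetical.

```python
def rules_missing_runbook(groups, paging_severities=("critical",)):
    """Return names of paging alert rules that have no non-empty 'runbook' annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # skip recording rules, which have no 'alert' key
            severity = rule.get("labels", {}).get("severity")
            runbook = rule.get("annotations", {}).get("runbook", "")
            if severity in paging_severities and not runbook.strip():
                missing.append(rule["alert"])
    return missing


# Structure mirrors a parsed Prometheus rule file.
groups = [{
    "name": "api-slo",
    "rules": [
        {"alert": "HighErrorRateFastBurn",
         "labels": {"severity": "critical"},
         "annotations": {"runbook": "https://runbooks.internal/payments/error-budget"}},
        {"alert": "DatabaseUnreachable",          # hypothetical rule with no runbook
         "labels": {"severity": "critical"},
         "annotations": {"summary": "Database is not accepting connections"}},
    ],
}]
print(rules_missing_runbook(groups))
```

Failing the build when this list is non-empty keeps the "every paging alert has a runbook" rule from eroding over time.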
Comparing Alert Strategies
| Strategy | Fires when | Best for | Watch out for |
|---|---|---|---|
| Static threshold | Metric crosses a fixed line | Hard limits (disk full, queue cap) | Wrong line = noise or missed issues |
| Rate-of-change | Metric moves too fast | Sudden traffic or error spikes | Misses slow gradual degradation |
| Anomaly detection | Metric deviates from learned baseline | Seasonal traffic, unknown normal | False positives; complex to tune |
| SLO burn rate | Error budget burns too fast | User-facing reliability, fewer pages | Requires a defined SLO + good SLIs |
Frequently Asked Questions
What is the difference between monitoring and alerting?
Monitoring is the continuous collection and storage of metrics, logs, and traces so you can observe system behavior. Alerting sits on top of monitoring: it evaluates rules against that data and notifies a human or automation when something needs attention. Monitoring tells you everything; alerting tells you only what is worth waking someone up for.
Why alert on SLO burn rate instead of a simple error-rate threshold?
A fixed error-rate threshold pages on every transient spike and ignores how much reliability budget you actually have left. Burn-rate alerting ties urgency to impact: a fast burn that would exhaust your monthly budget in hours pages loudly, while a slow burn opens a ticket. You get fewer, more meaningful pages and you are alerted before the SLO is actually violated rather than after.
How do I reduce alert fatigue?
Make every paging alert actionable and tied to user impact, alert on symptoms rather than causes, add a 'for' duration to stop flapping, dedupe and group related alerts, silence known maintenance, and route by severity so only true emergencies page. Then review fired alerts regularly and delete or downgrade any that never required action. Quiet, trusted alerts are the goal.
The best alert is the one you trust enough to act on at 3am, and quiet enough that you never doubt it's real. Every page must be actionable, or it is just noise waiting to hide a real outage.
- alokknight Engineering
