Alerting in System Design: Thresholds, SLO Burn Rate, On-Call & Alert Fatigue (Visualized)

Alerting is the practice of automatically turning monitoring data — metrics, logs, and traces — into actionable notifications when a system enters an undesirable state. A good alert reaches a human (or an automated responder) the moment something is wrong, with enough context to act, and stays silent the rest of the time. It is the bridge between passive observability and active incident response.

Collecting metrics tells you what your system is doing; alerting decides which of those signals deserve human attention. The hard part is not detecting anomalies — it is detecting the right anomalies, routing them to the right responder, and not drowning that responder in noise. This article walks through how alerts are defined, evaluated, routed, and tuned.

From Metrics to Actionable Notifications

An alerting pipeline has four stages: (1) a time-series database like Prometheus continuously evaluates rules against incoming metrics, (2) a rule that holds true for some duration fires an alert, (3) an alert manager deduplicates, groups, and routes it, and (4) a notifier such as PagerDuty pages the on-call engineer. The art lives in choosing what condition is worth firing on.

Thresholds vs Anomaly Detection

The simplest alert is a static threshold: fire when p99 latency exceeds 500 ms or the error rate exceeds 1%. Thresholds are easy to reason about and easy to tune, but they are blind to context: a value that is normal at peak traffic may be alarming at 3am. Anomaly detection instead learns a baseline and fires when a metric deviates from its expected range, catching problems a fixed line would miss, at the cost of added complexity and more false positives. A pragmatic middle ground is rate-of-change alerting: fire when a metric moves too fast, regardless of its absolute value.
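The three approaches above can be compared side by side in a few lines of Python. This is a minimal sketch, not how any particular monitoring product implements them: the function names, the z-score baseline, and the sample latencies are all illustrative assumptions.

```python
from statistics import mean, stdev


def static_threshold(value: float, limit: float) -> bool:
    """Fire when the current value is above a fixed line."""
    return value > limit


def rate_of_change(previous: float, current: float, max_delta: float) -> bool:
    """Fire when the metric moved more than max_delta between two samples."""
    return abs(current - previous) > max_delta


def zscore_anomaly(history: list[float], value: float, max_sigma: float = 3.0) -> bool:
    """Fire when value sits more than max_sigma standard deviations from the baseline mean.

    A z-score is only one simple way to model a baseline (an assumption here).
    """
    if len(history) < 2:
        return False  # not enough data to learn a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) / spread > max_sigma


# Hypothetical p99 latency samples (ms) from a quiet night-time period.
night_latency_ms = [80, 82, 79, 81, 80, 83, 78]

# A jump to 300 ms stays under the 500 ms line, so the static threshold is silent,
# while the rate-of-change and baseline checks both flag it.
print(static_threshold(300, limit=500))
print(rate_of_change(80, 300, max_delta=100))
print(zscore_anomaly(night_latency_ms, 300))
```

The same 300 ms reading is silent under one strategy and alarming under the other two, which is the context blindness the paragraph describes.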

A metric crossing a threshold and paging on-call

Phase 1 normal → Phase 2 the metric crosses the threshold → Phase 3 the alert fires after the 'for' duration → Phase 4 it is routed to the on-call engineer. The dashed line is the threshold (90); the live value is shown on the right.
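The four phases can be sketched as a small state machine. The state names mirror the inactive, pending, and firing states that Prometheus reports for alerts, but the class itself, its field names, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForDurationRule:
    """Hypothetical rule: fire only after the value stays above threshold for for_seconds."""
    threshold: float
    for_seconds: float
    pending_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation cycle."""
        if value <= self.threshold:
            self.pending_since = None  # condition cleared, so reset the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


rule = ForDurationRule(threshold=90, for_seconds=120)
# (timestamp in seconds, metric value) pairs, one per evaluation.
samples = [(0, 70), (60, 95), (120, 96), (180, 97), (240, 80)]
states = [rule.evaluate(value, now) for now, value in samples]
print(states)  # ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

The metric crosses 90 at t=60 but only fires at t=180, once it has stayed above the line for the full 120 seconds. A spike that recovers sooner never leaves the pending state.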

Symptom-Based vs Cause-Based Alerting

Symptom-based alerts fire on what your users actually experience: high error rate, slow responses, failed checkouts. Cause-based alerts fire on internal conditions that might lead to a symptom: high CPU, low disk, a full queue. Google's SRE guidance is to alert primarily on symptoms, because they are user-visible and far fewer in number; reserve cause-based alerts for conditions that are imminent and unambiguous (disk will be full in 30 minutes). Alerting on every cause produces a flood of pages for problems the system absorbs on its own.
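An "imminent and unambiguous" cause-based alert such as "disk will be full in 30 minutes" boils down to extrapolating a trend. The sketch below draws a straight line through the first and last sample. PromQL's `predict_linear` serves a similar purpose but fits a linear regression over the whole range, so treat this as a simplified stand-in. The function names, the 100 GB capacity, and the samples are hypothetical.

```python
from typing import Optional


def seconds_until_full(samples: list[tuple[float, float]], capacity: float) -> Optional[float]:
    """Estimate seconds until usage reaches capacity from (timestamp, used) samples.

    Returns None when usage is flat or shrinking, or when there is no usable time span.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    growth_per_second = (used1 - used0) / (t1 - t0)
    if growth_per_second <= 0:
        return None
    return max(0.0, (capacity - used1) / growth_per_second)


def should_page(samples: list[tuple[float, float]], capacity: float,
                horizon_seconds: float = 30 * 60) -> bool:
    """Page only if the resource is projected to fill within the horizon."""
    eta = seconds_until_full(samples, capacity)
    return eta is not None and eta <= horizon_seconds


# Disk usage in GB on a hypothetical 100 GB volume, sampled 10 minutes apart.
filling_fast = [(0, 90.0), (600, 94.0)]   # +4 GB in 10 min, 6 GB left
growing_slow = [(0, 50.0), (600, 50.5)]   # +0.5 GB in 10 min, 49.5 GB left

print(should_page(filling_fast, capacity=100.0))
print(should_page(growing_slow, capacity=100.0))
```

The first volume is projected to fill in about 15 minutes and pages; the second has many hours of headroom and stays quiet even though its usage is also rising.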

Alerting on SLO Burn Rate

An SLO (service level objective) defines an acceptable level of reliability — say 99.9% of requests succeed over 30 days. The 0.1% you are allowed to fail is your error budget. Instead of paging the instant a single request fails, modern alerting watches the burn rate: how fast you are consuming that budget. A burn rate of 1× exhausts the budget exactly at the end of the window; a burn rate of 14× will exhaust a month's budget in roughly two days. Fast burn pages immediately and loudly; slow burn opens a ticket. This ties alert urgency directly to user-facing impact and is far quieter than raw threshold alerts.
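The burn-rate arithmetic is short enough to check by hand. In this sketch the burn rate is the observed error ratio divided by the error budget, and the time to exhaustion is the SLO window divided by the burn rate. The function names and the 1.44% observed error ratio are assumptions chosen to reproduce the 14× figure.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / error_budget(slo_target)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical reading: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(observed_error_ratio=0.0144, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in: {days_to_exhaustion(rate):.2f} days")
print(f"at 1x the budget lasts: {days_to_exhaustion(1.0):.0f} days")
```

A 14.4× burn works out to roughly 2.08 days for a 30-day window, which is where the "roughly two days" figure comes from.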

SLO error-budget burn rate firing before the budget is gone

The budget bar drains as errors occur. Phase 1 slow burn (budget safe) → Phase 2 a fast-burn spike steepens the slope → Phase 3 burn rate exceeds 14× and the alert fires while budget still remains — the whole point of burn-rate alerting.
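One way to turn burn-rate readings into "page" versus "ticket" decisions is to compare two windows, as in this sketch. Requiring a short and a long window to agree is a common multi-window practice rather than something the text above prescribes. The function name and the default thresholds (14.4× for paging, above 1× for a ticket) are assumptions.

```python
def classify_burn(short_window_rate: float, long_window_rate: float,
                  page_at: float = 14.4, ticket_at: float = 1.0) -> str:
    """Map two burn-rate readings to 'page', 'ticket', or 'ok'."""
    # Both windows must agree, so a short blip or an already recovered
    # incident does not wake anyone up.
    if short_window_rate >= page_at and long_window_rate >= page_at:
        return "page"
    # A sustained burn above 1x would use up the budget before the window ends.
    if long_window_rate > ticket_at:
        return "ticket"
    return "ok"


# Hypothetical readings: (short-window burn rate, long-window burn rate).
readings = {
    "fast burn": (20.0, 16.0),
    "slow burn": (2.0, 2.0),
    "brief spike": (20.0, 0.5),
    "healthy": (0.3, 0.4),
}
for label, (short_rate, long_rate) in readings.items():
    print(label, "->", classify_burn(short_rate, long_rate))
```

Only the sustained fast burn pages. The slow burn becomes a ticket, and the brief spike is left alone until the longer window confirms it.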

Severity, Routing & On-Call

Not every alert should wake someone at 3am. Alerts carry a severity — typically critical (page immediately), warning (notify, look during business hours), and info (log only). Routing rules in Alertmanager match labels (team, service, severity) and send each alert to the right destination: critical alerts page the on-call engineer through PagerDuty or Opsgenie with an escalation policy, warnings drop into a Slack channel, the rest into a dashboard. The on-call rotation and escalation chain ensure that if the first responder does not acknowledge within minutes, the page escalates to a secondary.

# Prometheus alerting rule — symptom-based, with a 'for' to avoid flapping
groups:
  - name: api-slo
    rules:
      - alert: HighErrorRateFastBurn
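        # 14.4x burn of a 99.9% SLO over 1h means budget gone in ~2 days.
        # Aggregate by service (assumes the metric carries a 'service' label)
        # so that {{ $labels.service }} in the summary is not empty.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (service) (rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)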
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Fast error-budget burn on {{ $labels.service }}"
          runbook: "https://runbooks.internal/payments/error-budget"

# Alertmanager routing — page critical, Slack the rest; group + dedupe
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
  routes:
    - matchers: [ severity="critical" ]
      receiver: pagerduty-oncall
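The routing tree above can be read as "try each child route in order, take the first whose matchers all match, otherwise fall back to the default receiver". The sketch below models only that behaviour with exact-match labels. It is not Alertmanager's implementation, and it ignores regex matchers, nested routes, and the `continue` option. The `Route` class and `pick_receiver` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """Hypothetical child route: every matcher must equal the alert's label value."""
    matchers: dict[str, str]
    receiver: str


def pick_receiver(labels: dict[str, str], routes: list[Route], default: str) -> str:
    """Return the receiver of the first matching route, else the default receiver."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route.matchers.items()):
            return route.receiver
    return default


routes = [Route(matchers={"severity": "critical"}, receiver="pagerduty-oncall")]

critical = {"alertname": "HighErrorRateFastBurn", "severity": "critical", "team": "payments"}
warning = {"alertname": "QueueDepthHigh", "severity": "warning", "team": "payments"}

print(pick_receiver(critical, routes, default="slack-warnings"))  # pagerduty-oncall
print(pick_receiver(warning, routes, default="slack-warnings"))   # slack-warnings
```

Putting the quieter destination at the top level makes it the fallback, so an alert with a missing or misspelled severity label lands in Slack rather than paging someone by accident.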

Deduplication, Grouping & Silencing

When a database goes down, every one of fifty app servers may fire the same alert at once. Without help, that is fifty pages. An alert manager solves this with three mechanisms: deduplication collapses identical alerts into one, grouping bundles related alerts (same cluster, same cause) into a single notification, and silencing temporarily mutes known issues — during a planned deploy or maintenance window — so they don't page. The result is one meaningful notification instead of a storm.

Many duplicate alerts grouped and deduplicated into one page

Phase 1 fifty servers each fire the same alert → Phase 2 Alertmanager dedupes identical alerts and groups by cause → Phase 3 a single grouped notification pages on-call. Fifty raw alerts collapse to one actionable page.
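The collapse from fifty alerts to one page can be sketched in two steps: drop alerts whose label sets are identical, then bucket what remains by the `group_by` labels. This is an illustrative model under those assumptions, not Alertmanager's code. The `fingerprint` and `group_alerts` helpers and the alert labels are hypothetical.

```python
from collections import defaultdict


def fingerprint(alert: dict[str, str]) -> tuple:
    """Identity of an alert: its full label set, independent of key order."""
    return tuple(sorted(alert.items()))


def group_alerts(alerts: list[dict[str, str]], group_by: list[str]) -> dict[tuple, list[dict[str, str]]]:
    """Deduplicate identical alerts, then bundle the rest by the group_by labels."""
    unique = {fingerprint(alert): alert for alert in alerts}  # deduplication
    groups: dict[tuple, list[dict[str, str]]] = defaultdict(list)
    for alert in unique.values():
        key = tuple(alert.get(label, "") for label in group_by)  # grouping
        groups[key].append(alert)
    return dict(groups)


# Fifty app servers report the same outage, and each alert arrives twice.
alerts = [
    {"alertname": "DatabaseUnreachable", "service": "checkout", "instance": f"app-{i}"}
    for i in range(50)
] * 2

groups = group_alerts(alerts, group_by=["alertname", "service"])
print(len(alerts), "raw alerts")
print(len(groups), "notification(s)")
for key, members in groups.items():
    print(key, "covers", len(members), "instances")
```

One hundred raw alerts reduce to fifty unique ones, and grouping by `alertname` and `service` bundles those into a single notification that still lists every affected instance.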

Alert Fatigue & Reducing Noise

Alert fatigue is what happens when engineers receive so many alerts, especially false or non-actionable ones, that they start ignoring them. It is arguably the most dangerous failure mode in alerting, because the one real page gets lost in the noise. The cure is discipline: make every alert actionable and tied to user impact, use a 'for' duration to suppress flapping, prefer symptoms over causes, alert on SLO burn rate rather than raw thresholds, and ruthlessly delete or downgrade alerts that page without anyone needing to act. A good rule of thumb: if an alert fires and the on-call response is "nothing to do, it recovered," that alert should not page.
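The "nothing to do, it recovered" rule of thumb can be applied to an export of past pages. The sketch below assumes a hypothetical history of (alert name, whether the responder had to act) pairs. The `noisy_alerts` name and both cut-offs are illustrative and would need tuning per team.

```python
from collections import Counter


def noisy_alerts(history: list[tuple[str, bool]], min_fires: int = 5,
                 max_action_ratio: float = 0.2) -> list[str]:
    """Return alert names that fired often but rarely required any action."""
    fires: Counter = Counter()
    acted: Counter = Counter()
    for name, action_taken in history:
        fires[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name
        for name, count in fires.items()
        if count >= min_fires and acted[name] / count <= max_action_ratio
    )


# Hypothetical page history: (alert name, did the responder need to act?).
history = (
    [("HighCPU", False)] * 9
    + [("HighCPU", True)]
    + [("HighErrorRateFastBurn", True)] * 4
    + [("DiskAlmostFull", False)] * 2
)

print(noisy_alerts(history))  # ['HighCPU']
```

`HighCPU` paged ten times and needed action once, so it is a candidate for deletion or a downgrade to a warning. The other two alerts fired too rarely in this sample to judge.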

Good Alerts vs Bad Alerts

Property	Good alert	Bad alert
Trigger	User-visible symptom (errors, latency)	Internal cause that may not matter (CPU 80%)
Actionability	Always has a clear action + runbook	Fires, then 'recovered, nothing to do'
Noise control	Has a 'for' duration; deduped & grouped	Flaps on every spike; one per server
Urgency	Severity matches real impact (burn rate)	Everything is 'critical', pages at 3am
Outcome	Trusted — responders act on it	Ignored — fuels alert fatigue

Runbooks

Every paging alert should link to a runbook: a short document that tells the responder what the alert means, how to confirm the problem, the first diagnostic steps, and how to mitigate. Runbooks turn a 3am page from a panic into a checklist, shrink mean time to recovery (MTTR), and let less-experienced engineers handle incidents. In Prometheus this is just a runbook annotation on the rule; in Grafana you can attach it to the alert. An alert without a runbook is a question with no answer attached.

Comparing Alert Strategies

Strategy	Fires when	Best for	Watch out for
Static threshold	Metric crosses a fixed line	Hard limits (disk full, queue cap)	Wrong line = noise or missed issues
Rate-of-change	Metric moves too fast	Sudden traffic or error spikes	Misses slow gradual degradation
Anomaly detection	Metric deviates from learned baseline	Seasonal traffic, unknown normal	False positives; complex to tune
SLO burn rate	Error budget burns too fast	User-facing reliability, fewer pages	Requires a defined SLO + good SLIs

Frequently Asked Questions

What is the difference between monitoring and alerting?

Monitoring is the continuous collection and storage of metrics, logs, and traces so you can observe system behavior. Alerting sits on top of monitoring: it evaluates rules against that data and notifies a human or automation when something needs attention. Monitoring tells you everything; alerting tells you only what is worth waking someone up for.

Why alert on SLO burn rate instead of a simple error-rate threshold?

A fixed error-rate threshold can page on transient spikes and ignores how much reliability budget you actually have left. Burn-rate alerting ties urgency to impact: a fast burn that would exhaust your monthly budget in a couple of days pages loudly, while a slow burn opens a ticket. You get fewer, more meaningful pages, and you are alerted before the SLO is actually violated rather than after.

How do I reduce alert fatigue?

Make every paging alert actionable and tied to user impact, alert on symptoms rather than causes, add a 'for' duration to stop flapping, dedupe and group related alerts, silence known maintenance, and route by severity so only true emergencies page. Then review fired alerts regularly and delete or downgrade any that never required action. Quiet, trusted alerts are the goal.

The best alert is the one you trust enough to act on at 3am — and quiet enough that you never doubt it's real. Every page must be actionable, or it is just noise waiting to hide a real outage.
— alokknight Engineering