Load Shedding in System Design: Protecting Services Under Extreme Traffic
Load shedding is the deliberate rejection of lower-priority work when a system is overloaded, preserving its ability to serve critical traffic. This guide covers admission control, request prioritization, graceful degradation, and how load shedding compares to autoscaling and backpressure โ with live animations.
Load shedding is the practice of deliberately dropping or rejecting lower-priority requests when a service is overloaded, so that the system can continue to serve its most critical traffic instead of collapsing under the weight of everything at once.
Every production service has a finite capacity. Under normal conditions requests flow in, get processed, and responses go out. But during a traffic spike โ a flash sale, a viral post, a DDoS burst, a downstream cascade โ requests arrive faster than the system can handle. The naive response is to keep accepting everything and hope queues drain. That hope is almost always wrong: queues grow unbounded, latency skyrockets for every user, threads exhaust, the heap fills, and the entire service crashes. Zero users get served.
Load shedding rejects the premise. Instead of crashing, the system decides: "I can handle X requests per second at an acceptable latency. If I am receiving 3X, I will reject 2X of them right now โ preferably the least important ones โ and serve X well." Some users get a fast 503 Service Unavailable. The rest get a fast, correct response. The service survives. This is a conscious engineering trade-off: partial availability beats total outage.
Why Systems Collapse Without Load Shedding
When a server accepts more work than it can process, three failure modes converge. First, the request queue grows without bound, burning memory until the process is OOM-killed. Second, thread pools saturate โ new requests block waiting for a thread, adding latency for everyone. Third, the CPU spends increasing time context-switching between stuck threads rather than doing real work; throughput actually drops as concurrency rises. The system enters a death spiral: the longer it takes to serve requests, the more pile up, the slower it gets, the more pile up.
Empirically, the throughput curve of a real server looks like an inverted U: throughput rises as offered load increases, peaks at the system's saturation point, and then falls sharply as the overhead of managing overload swamps useful work. Load shedding is the mechanism that keeps the system operating near the peak โ and never to the right of it.
Admission Control: The Gate Mechanism
Admission control is the decision layer that sits at the entry point of a service and decides whether each incoming request should be allowed in or rejected outright. It is the concrete implementation of load shedding. Admission control can be stateless (check a rate counter, return 429 if over quota) or stateful (measure current CPU utilisation, queue depth, or in-flight request count, and reject when above a threshold).
Good admission control is fast: it must make the accept/reject decision in microseconds, before the request consumes any significant resources. Common signals used to trigger shedding include: CPU utilisation above 80%, request queue depth above N, p99 latency above the SLO budget, or in-flight request count above a configured concurrency limit. Google's client-side throttling (described in the SRE book) goes further โ clients measure their own success rate and pre-emptively stop sending when rejection climbs.
A minimal Python middleware that sheds by concurrency limit looks like this:
import threading
from http import HTTPStatus
MAX_INFLIGHT = 200 # tune to server capacity
_semaphore = threading.Semaphore(MAX_INFLIGHT)
def admission_middleware(request, call_next):
"""Reject requests when concurrency limit is hit."""
acquired = _semaphore.acquire(blocking=False)
if not acquired:
# shed: fast 503 before touching any business logic
return Response(
status_code=HTTPStatus.SERVICE_UNAVAILABLE,
headers={"Retry-After": "5"},
content="Service overloaded โ please retry",
)
try:
return call_next(request)
finally:
_semaphore.release()Request Prioritization: Choosing What to Shed
Not all requests are equal. A payment confirmation is worth more than a recommendation carousel. A health check is worth more than a batch analytics query. Request prioritization assigns each request a tier before it enters the system, so when capacity must be rationed, the lowest-priority tier is shed first.
Priority can come from many sources: the HTTP path (/checkout vs /recommendations), an authenticated user tier (paying vs free), a request header set at the API gateway, or a flag carried in an internal RPC context. The key principle is that the categorisation cost must be negligible compared to the work being protected. Typically you define three to five priority tiers:
Critical (P0): Never shed. Payments, authentication, health checks. High (P1): Shed last. Core product flows โ search, checkout, user profile reads. Medium (P2): Shed when load exceeds 75% capacity. Personalisation, recommendations, non-critical reads. Low (P3): Shed when load exceeds 50% capacity. Background analytics, batch jobs, prefetch calls. Deferrable (P4): Always shed first. Telemetry aggregation, cache warming, non-urgent webhooks.
Graceful Degradation: Doing Less, Not Dying
Graceful degradation is load shedding's close sibling. Where shedding rejects requests entirely, graceful degradation fulfils them in a reduced form. A social feed might return cached results instead of querying the database. A product page might omit the personalised recommendations section rather than waiting 800 ms for the ML service. An e-commerce site might disable real-time inventory counts and show "In stock" from a cached value instead.
The two techniques compose well: you shed the lowest-priority requests outright (returning 503), and you degrade the medium-priority ones (returning a cheaper version of the response). The result is a service whose user-facing quality degrades gracefully under pressure rather than snapping from "everything works" to "nothing works".
Load Shedding vs Autoscaling vs Backpressure
Engineers often ask: if I can just spin up more servers, why do I need load shedding? The answer is time. Autoscaling takes 60 to 300 seconds to provision, health-check, and register a new instance. A traffic spike can kill a service in under 10 seconds. Load shedding is the bridge โ it keeps the existing capacity healthy until new capacity arrives. Even with autoscaling, you still need shedding.
Backpressure is a related but distinct mechanism. Where shedding rejects at the entry point, backpressure propagates a slow-down signal from a congested downstream back to its callers, asking them to reduce their send rate. Think of a TCP receive window shrinking. Backpressure works well in internal pipelines and streaming systems where the caller can throttle voluntarily. Load shedding works at the edge, where callers (browsers, mobile apps, third-party services) cannot be relied upon to slow down.
| Load Shedding | Autoscaling | Backpressure | |
|---|---|---|---|
| What it does | Rejects work at entry | Adds capacity | Signals caller to slow down |
| Response time | Milliseconds | 1โ5 minutes | Milliseconds |
| Works without caller cooperation | Yes | Yes | No โ caller must honour signal |
| Reduces load on server | Yes โ immediately | No, until new nodes ready | Yes โ if caller complies |
| Best for | Edge / public-facing APIs | Sustained traffic growth | Internal pipelines, queues |
| Prerequisite | Priority classification | Cloud provider / orchestrator | Cooperative callers |
Implementing Load Shedding in Practice
A production-grade load shedding system has four components: a load signal (what metric triggers shedding), a classifier (which priority tier is each request), a policy (at what load thresholds to shed which tiers), and a response (how to reject gracefully with a 503 and a Retry-After header).
Choose your load signal carefully. CPU utilisation is a lagging indicator โ by the time CPU hits 90% the queue is already backing up. In-flight request count (concurrency) is more immediate: the service knows exactly how many requests it is currently processing, and a concurrency limiter can shed before queues form. Queue depth works well for async workers. Latency percentiles (p95, p99) are excellent signals but require a fast feedback loop.
Always return a well-formed error to shed clients. Include a Retry-After header so well-behaved clients back off rather than immediately retrying โ a retry storm can re-overload a recovering service, exactly undoing the shed. This is a critical detail: shedding without Retry-After can turn one spike into several.
Load Shedding Patterns in Real Systems
Several well-known patterns implement load shedding in different architectures. Token bucket / leaky bucket rate limiting at the API gateway is the most common โ each client gets a fixed token budget per second; requests beyond the budget are shed. It protects against a single noisy caller but does not distinguish between high and low priority within a client's budget.
The adaptive concurrency limiter (used by Netflix Concurrency Limits and AWS Adaptive Retry) automatically finds the right concurrency ceiling using latency feedback (similar to TCP AIMD congestion control). It raises the limit when latency is stable and lowers it when latency spikes, without requiring a manually tuned threshold. This is particularly powerful for services where capacity changes over time.
Circuit breakers are sometimes confused with load shedding but serve a different role: they monitor for downstream failures and stop sending traffic to a broken dependency (to avoid cascading failures), whereas load shedding monitors for current overload and stops accepting traffic at the current service's entry point. In a resilient system both patterns are active simultaneously.
| Pattern | Shed trigger | Granularity | Real-world examples |
|---|---|---|---|
| Rate limiting (token bucket) | Per-client quota exceeded | Per caller / per key | API Gateway, Nginx limit_req |
| Concurrency limiter | In-flight count > threshold | Per server instance | Netflix CL, Envoy, Gunicorn --worker-connections |
| Queue depth limiter | Queue length > N | Per worker pool | Celery task_always_eager, RabbitMQ x-max-length |
| Latency-based adaptive | p99 latency > SLO | Dynamic, self-tuning | Netflix Concurrency Limits, AWS Retry |
| CPU / resource limiter | CPU > X% for Y seconds | Per host | Nginx worker_processes, cgroups limits |
| Client-side throttling | Own error rate > threshold | Per client | Google SRE client throttle, gRPC hedging |
Common Pitfalls and How to Avoid Them
Shedding too aggressively: If thresholds are set too low, the system sheds healthy traffic and users notice degraded availability at normal load. Always measure your actual capacity under load before tuning thresholds, and set shedding thresholds at 70โ80% of measured saturation, not at theoretical maximum.
No retry budget: A 503 without Retry-After causes clients to immediately retry, tripling the load on a recovering service. Always include Retry-After: 5 (or an exponential backoff hint) in shed responses. Paired with jitter in the client, this breaks synchronised retry storms.
Shedding the wrong requests: If priority classification is wrong โ for example, marking all authenticated requests as P0 โ the shedding policy will protect the wrong work. Audit your priority assignments with real production traffic; the categories that matter are business-value categories, not technical categories.
Not observing what you shed: Increment a requests_shed_total{priority="p2"} counter on every shed. Alert when the shed rate exceeds X% of total traffic for more than Y seconds โ this is a canary signal that capacity is insufficient and autoscaling or provisioning changes are needed. Without metrics, shedding is invisible and the on-call engineer has no idea it is happening.
Frequently Asked Questions
Is load shedding the same as rate limiting?
They are related but not identical. Rate limiting caps the request rate of a specific caller or key over a time window (e.g. 100 requests per minute per API key) and is primarily a fairness and abuse-prevention tool. Load shedding is triggered by the server's own resource state โ CPU, concurrency, queue depth โ regardless of which caller is sending. Rate limiting is proactive and per-tenant; load shedding is reactive and system-wide. Production systems use both: rate limiting protects against abusive callers, load shedding protects against the server's own capacity ceiling.
What HTTP status code should a shed request return?
Use 503 Service Unavailable with a Retry-After header. 503 signals a transient server-side condition, which tells well-designed clients and proxies that a retry after the indicated delay is appropriate. Avoid 429 Too Many Requests for system-wide shedding โ that code implies the caller specifically exceeded a rate quota, which may not be true. Include a human-readable body explaining the situation so developers can understand what happened from logs and traces.
How do I know if my load shedding thresholds are correct?
Run load tests against a production-like environment to find your service's saturation point โ the offered load at which p99 latency exceeds your SLO or throughput stops increasing. Set your shedding threshold at 70โ80% of that saturation point to leave headroom for burst absorption. Then watch your requests_shed_total counter in production: if it fires regularly at normal traffic levels, the threshold is too low. If the server still struggles during spikes without shedding firing, the threshold is too high. Tune iteratively using real traffic data, not synthetic estimates.
A service that gracefully sheds 20% of requests under duress delivers ten times more value than one that silently degrades for 100% of users until it crashes. Design for partial availability โ it is the only kind that survives the real world.
โ alokknight Engineering
