Rate Limiting in System Design: Algorithms, 429s & Distributed Throttling (Visualized)
Rate limiting caps how many requests a client can make in a given time window, protecting services from abuse, overload, and runaway costs. This guide covers every major algorithm โ token bucket, leaky bucket, fixed window, sliding window โ plus 429 responses, Retry-After headers, and distributed enforcement with Redis.
Rate limiting is a traffic-control technique that caps the number of requests a client โ identified by IP address, API key, user ID, or any other dimension โ can make to a service within a defined time window, so that the service stays available and fair for everyone. Without it, a single misbehaving or compromised client can exhaust your CPU, memory, database connections, or third-party API budgets in seconds.
Rate limiting sits at the intersection of availability, security, and fairness. It stops brute-force login attacks, prevents one customer from monopolising a shared API tier, and gives your infrastructure a predictable load ceiling it can actually be sized for. Almost every public API โ Stripe, GitHub, Twitter, OpenAI โ publishes explicit rate limits, and violating them returns HTTP 429 Too Many Requests.
Where Rate Limiting Lives in Your Stack
You can enforce rate limits at multiple layers. An API gateway (Kong, AWS API Gateway, Nginx) is the most common place โ it intercepts every request before it hits your application, so badly-behaved clients never consume application threads. Application-level middleware is a second line of defence for fine-grained, business-logic-aware limits (e.g., an AI completion endpoint costs 20ร more than a lookup, so it gets a tighter cap). CDN edge nodes can drop obvious volumetric attacks before traffic even enters your data centre.
Algorithm 1: Token Bucket
The token bucket algorithm is the most widely used in production. Imagine a bucket that holds up to N tokens. A refill process adds tokens at a steady rate (e.g., 10 tokens per second). Each incoming request consumes one token. If the bucket has tokens, the request is allowed and a token is removed. If the bucket is empty, the request is rejected with 429 Too Many Requests. The bucket size controls the maximum burst a client can fire at once; the refill rate controls sustained throughput. AWS API Gateway, Stripe, and most cloud providers use token bucket semantics because it naturally handles bursts โ a client that has been quiet accumulates tokens and can then fire a burst up to the bucket capacity.
Token bucket has two key tunables: capacity (max burst) and refill rate (sustained RPS). A capacity of 100 and a refill rate of 10/s means a dormant client can fire 100 requests instantly but then only 10 per second thereafter. This is perfect for interactive APIs where clients batch work and then flush.
Algorithm 2: Leaky Bucket
The leaky bucket algorithm smooths bursty input into a perfectly steady output stream. Incoming requests are placed in a FIFO queue (the bucket). A processor drains the queue at a fixed rate โ say, 10 requests per second โ regardless of how fast requests arrive. If the queue fills up, new arrivals are dropped with 429. Unlike token bucket, leaky bucket guarantees a constant output rate, making it ideal for protecting downstream services that cannot handle bursts โ a payment processor, a third-party mailer, or a rate-limited upstream API. The downside is that even a single-client burst gets serialised and delayed rather than passed through immediately.
The leaky bucket's constant-rate output is what makes it so different from token bucket: no burst is ever forwarded, they are only queued and metered out. This protects fragile downstream systems at the cost of added latency during bursts. A typical implementation uses a Redis list as the queue and a background worker as the drain.
Algorithm 3: Fixed Window Counter
The fixed window counter divides time into non-overlapping windows โ say, one-minute buckets โ and keeps a counter per client per window. Every request increments the counter; when it exceeds the limit the client gets 429. At the window boundary the counter resets to zero. It is extremely cheap to store (one integer per client per window) and trivially implemented in Redis with INCR and EXPIRE. The fatal flaw is the boundary burst problem: a client can send the full quota in the last second of window N and the full quota again in the first second of window N+1, effectively doubling the allowed rate at the seam.
Algorithm 4: Sliding Window Log & Sliding Window Counter
The sliding window log stores a timestamp for every request in a sorted set. When a new request arrives, it first purges timestamps older than the window and counts what remains. If the count is below the limit, the request passes; otherwise 429. This is perfectly accurate โ there is no boundary burst โ but costs O(N) memory per client. The sliding window counter is a lightweight approximation: it keeps the current and previous fixed-window counters and interpolates a weighted estimate of requests in the rolling window, using only two integers per client. Google, Cloudflare, and most mature API gateways use the sliding window counter as the default because it balances accuracy and memory efficiency.
The 429 Response & Retry-After Header
When a request is rejected, the server should return HTTP 429 Too Many Requests with a Retry-After header telling the client how many seconds to wait before retrying. Good API design also includes an X-RateLimit-Limit header (the total cap), X-RateLimit-Remaining (tokens left in the current window), and X-RateLimit-Reset (Unix timestamp of the next window reset). These headers let clients implement self-throttling โ backing off gracefully rather than hammering the server and accumulating ban records. RFC 6585 formally defines 429; RFC 7231 defines Retry-After.
# FastAPI middleware โ token bucket rate limiter backed by Redis
import time
from fastapi import Request, HTTPException
from redis.asyncio import Redis
REDIS: Redis = None # injected at startup
LIMIT = 100 # requests per window
WINDOW = 60 # seconds
async def rate_limit_middleware(request: Request, call_next):
client_ip = request.client.host
key = f"rl:{client_ip}"
# Atomic increment + set TTL on first touch
pipe = REDIS.pipeline()
pipe.incr(key)
pipe.expire(key, WINDOW)
count, _ = await pipe.execute()
remaining = max(0, LIMIT - count)
reset_at = int(time.time()) + WINDOW
if count > LIMIT:
raise HTTPException(
status_code=429,
detail="Too Many Requests",
headers={
"Retry-After": str(WINDOW),
"X-RateLimit-Limit": str(LIMIT),
"X-RateLimit-Remaining": "0",
"X-RateLimit-Reset": str(reset_at),
},
)
response = await call_next(request)
response.headers["X-RateLimit-Limit"] = str(LIMIT)
response.headers["X-RateLimit-Remaining"] = str(remaining)
response.headers["X-RateLimit-Reset"] = str(reset_at)
return responsePer-User, Per-IP, and Per-Key Limits
Real APIs layer multiple dimensions of rate limiting simultaneously. Per-IP limits catch anonymous abuse before authentication but are easily bypassed with proxies. Per-API-key limits are the standard for public APIs โ they tie usage to a paying or authenticated customer. Per-user-ID limits work well for authenticated internal APIs. Per-endpoint limits let you apply tighter caps to expensive operations: a POST /ai/complete endpoint might cap at 10 req/min while GET /users/me allows 1000 req/min. In practice you compose these: global IP cap at the edge, then per-key cap at the gateway, then per-route cap in the application.
Distributed Rate Limiting with Redis
A single-node rate limiter breaks the moment you have more than one application server: each node tracks its own counter, so the global rate cap is multiplied by the number of instances. The fix is a shared, atomic store. Redis is the near-universal choice: its single-threaded command execution makes INCR and EXPIRE atomic without locks, and Lua scripts let you bundle a full sliding-window check into a single network round-trip. For extreme throughput (>1M req/s), teams use Redis Cluster for horizontal partitioning or a token bucket approach in a sidecar (like Envoy's rate-limit filter backed by ratelimit service). An alternative is approximate local rate limiting: each node limits at 1/N of the global cap (where N is the instance count, retrieved from a service registry), accepting slight over-admission in exchange for zero network overhead.
Algorithm Comparison
| Algorithm | Burst handling | Memory | Accuracy | Best for |
|---|---|---|---|---|
| Token Bucket | Allows bursts up to bucket capacity | O(1) per client | Exact | APIs with natural idle-then-burst clients |
| Leaky Bucket | Queues & smooths bursts | O(queue depth) | Exact rate out | Protecting fragile downstream services |
| Fixed Window Counter | Allows 2ร burst at boundary | O(1) per client | Approximate | Simple throttles where boundary edge is acceptable |
| Sliding Window Log | No boundary burst | O(requests in window) | Exact | Low-traffic, high-accuracy requirements |
| Sliding Window Counter | Minimal boundary effect | O(1) per client | ~99% accurate | High-traffic APIs (Cloudflare, Google default) |
Common Pitfalls
Not returning Retry-After: clients blindly retry immediately and amplify load instead of backing off. Always include the header. Rate limiting on a shared IP: NATed offices or mobile carriers may share one IP among thousands of users โ a per-IP cap will block legitimate users. Prefer per-key or per-session limits wherever possible. Forgetting distributed state: if your three app servers each allow 100 req/min, you have effectively set a 300 req/min global limit. Use Redis. Hard-coding limits: bake limits into configuration (environment variables or a feature-flag system) so you can tune them without a deployment. No allow-list: your own health-check endpoints, monitoring agents, and CI pipelines should bypass rate limits or you will get false positives in production dashboards.
Frequently Asked Questions
What is the difference between rate limiting and throttling?
The terms are often used interchangeably, but a useful distinction is that rate limiting rejects excess requests outright with 429, while throttling slows them down โ queuing or delaying them to enforce a maximum throughput. Leaky bucket is technically a throttling mechanism because it queues rather than drops. Most production systems use rate limiting (hard reject) at the edge for unauthenticated or abusive traffic and throttling (delay) internally for service-to-service calls where you want graceful degradation instead of hard failures.
Which rate limiting algorithm should I use in my API?
For most public REST APIs, start with the sliding window counter: it requires two Redis integers per client, is practically accurate, and eliminates the boundary-burst problem. Use token bucket when clients legitimately need burst capacity โ SDKs that batch writes, mobile apps syncing after a period offline. Use leaky bucket when your downstream (a payment gateway, SMS provider) enforces its own strict rate and you need to serialise calls to it. Avoid fixed window counter for any externally-facing API where adversarial boundary exploitation matters.
How does rate limiting protect against DDoS attacks?
Rate limiting is a first-line defence against volumetric and application-layer DDoS. By capping requests per IP or per key at the edge (CDN or API gateway), you prevent a flood of requests from reaching your origin servers. However, a sophisticated DDoS can originate from millions of unique IPs โ defeating simple per-IP limits. Production DDoS mitigation layers rate limiting with IP reputation databases, CAPTCHA challenges, BGP-level traffic scrubbing (Cloudflare, Akamai), and anycast routing to absorb traffic at the edge. Rate limiting alone is necessary but not sufficient for large-scale attacks; it buys time and filters low-sophistication attacks completely.
Rate limiting is not about punishing legitimate users โ it is about ensuring every user gets a fair share of a finite resource. Get the algorithm, the headers, and the distributed state right, and your API becomes both predictable and resilient.
โ alokknight Engineering
