Latency in System Design: Sources, Percentiles, Latency Numbers & Reduction Techniques (Visualized)

Latency is the elapsed time between a client sending a request and receiving the first byte of the response — the end-to-end delay experienced by a single operation. It differs from throughput, which measures how many operations a system completes per unit of time: a highway can move thousands of cars per hour (high throughput) while every individual car still takes 40 minutes to reach the destination (fixed latency). In distributed systems, latency is the dominant factor in user-perceived speed, and even small improvements compound into measurable business outcomes.

Google reported that delaying search results by 400 ms reduced the number of searches per user by 0.59%. Amazon reported that every 100 ms of added latency cost 1% in sales. These numbers explain why large engineering organizations dedicate entire teams to shaving milliseconds. To do that effectively, you first need to know where the time goes.

Latency vs Throughput: Two Different Dimensions

Latency and throughput are related but distinct. You can optimize throughput by batching small writes into one large flush: this raises throughput dramatically but increases latency, because individual writes must wait for the batch to fill. Conversely, you can minimize latency by flushing every write immediately, but each small write carries fixed overhead and throughput drops. Real system design constantly navigates this trade-off. The right balance depends on your access patterns: interactive UIs are latency-sensitive; bulk ETL pipelines are throughput-sensitive.

	Latency	Throughput
Definition	Time for one operation to complete	Operations completed per second
Unit	Milliseconds / microseconds	Requests/sec, MB/s
User impact	How fast does this page feel?	How many users can we serve simultaneously?
Optimized by	Fewer hops, caching, proximity	Parallelism, batching, bigger pipes
Trade-off	Low latency often reduces throughput	High throughput often increases latency

The Five Sources of Latency

Every millisecond in end-to-end latency can be traced to one of five sources. Understanding each one tells you exactly which lever to pull to fix it.

1. Propagation delay — the time for a signal to travel across a physical medium. Light in fiber optic cable travels at roughly 200,000 km/s (two-thirds the speed of light in vacuum). New York to London is ~5,600 km, so the theoretical minimum one-way propagation delay is about 28 ms. You cannot beat physics; you can only move your servers closer to users.

2. Transmission (serialization) delay — the time needed to push all bits of a packet onto the wire. A 1,500-byte Ethernet frame on a 1 Gbps link takes 12 microseconds. On a slower link (e.g., a 4G modem at 10 Mbps) the same frame takes 1.2 ms. Large payloads accumulate significant transmission delay.

3. Processing / CPU delay — time spent executing application code: parsing JSON, running business logic, encrypting a TLS record, rendering a template. This is the category most directly within the developer's control and the one that profiling tools expose most readily.

4. Storage (disk) delay — time for a read or write to persistent storage. An HDD random seek takes 4–10 ms. A modern NVMe SSD completes the same operation in 100–200 microseconds. An uncached database query that triggers a disk read is almost always the single largest latency contributor in web applications.

5. Queueing delay — time a request spends waiting in line because a resource (CPU, disk, network buffer, thread pool) is busy. Queueing delay is the sneakiest source: at low utilization it is near zero, but as utilization crosses ~70–80% it explodes non-linearly (described by Little's Law and the M/M/1 queue model). This is why systems that are "only 80% utilized" still feel slow under spiky load.
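
The non-linear growth of queueing delay can be made concrete with the M/M/1 model, in which the mean time a request waits in the queue is ρ·S / (1 − ρ), where ρ is utilization and S is the mean service time. The sketch below assumes a hypothetical server with a 1 ms mean service time; the function name and the chosen utilization levels are illustrative, and real systems rarely match the M/M/1 assumptions exactly.

```python
def mm1_wait_ms(utilization: float, service_time_ms: float = 1.0) -> float:
    """Mean time spent waiting in queue (excluding service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    # Wq = rho / (mu - lambda) = rho * S / (1 - rho), with S = 1 / mu.
    return utilization * service_time_ms / (1 - utilization)


if __name__ == "__main__":
    for rho in (0.10, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
        print(f"utilization {rho:>4.0%}: mean queueing delay {mm1_wait_ms(rho):6.2f} ms")
```

Under these assumptions the wait equals one service time at 50% utilization, four service times at 80%, and ninety-nine service times at 99%, which is why headroom matters more than the raw utilization figure suggests.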

Request Latency Accumulation Across Hops

Watch a single request travel client → network → queue → server → database and back, with a running millisecond counter showing where each delay accumulates.
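
The same accumulation can be sketched in a few lines of code. The propagation and transmission formulas come from the five sources described earlier; every hop name and every delay in the example request is an illustrative assumption rather than a measurement.

```python
FIBER_SPEED_KM_PER_S = 200_000  # roughly two-thirds of c, as quoted earlier


def propagation_ms(distance_km: float) -> float:
    """One-way propagation delay over fiber."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1_000


def transmission_ms(payload_bytes: int, link_bits_per_s: float) -> float:
    """Time to serialize the payload onto the link."""
    return payload_bytes * 8 / link_bits_per_s * 1_000


def accumulate(hops: list[tuple[str, float]]) -> list[tuple[str, float, float]]:
    """Return (hop name, hop delay, running total) for each hop in order."""
    rows, total = [], 0.0
    for name, delay_ms in hops:
        total += delay_ms
        rows.append((name, delay_ms, total))
    return rows


if __name__ == "__main__":
    # Hypothetical request: every value below is an illustrative assumption.
    hops = [
        ("client -> server propagation", propagation_ms(5_600)),
        ("request transmission (1,500 B at 1 Gbps)", transmission_ms(1_500, 1e9)),
        ("queueing at server", 2.0),
        ("application processing", 5.0),
        ("database read (NVMe)", 0.15),
        ("server -> client propagation", propagation_ms(5_600)),
    ]
    for name, delay, running in accumulate(hops):
        print(f"{name:<42} {delay:8.3f} ms   total {running:8.3f} ms")
```

In this made-up request the two propagation legs account for most of the total, which is the usual picture for cross-ocean traffic.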

Latency Numbers Every Programmer Should Know

Jeff Dean popularized a table of reference latency numbers that every engineer should know by heart. These are order-of-magnitude estimates; actual values vary with hardware generation, but the relative proportions have stayed remarkably stable over time. The key insight is the gap of more than eight orders of magnitude between an L1 cache hit (0.5 ns) and a cross-region round trip (150 ms): a cross-region call is 300 million times slower than reading from the cache closest to the CPU core.

Operation	Approx. Latency	Relative to L1 cache
L1 cache hit	0.5 ns	1×
Branch mispredict	5 ns	10×
L2 cache hit	7 ns	14×
Mutex lock/unlock	25 ns	50×
Main memory (RAM) access	100 ns	200×
Compress 1 KB (Snappy)	3,000 ns (3 µs)	6,000×
SSD random read (NVMe)	100–200 µs	~300,000×
Read 1 MB sequentially from RAM	250,000 ns (250 µs)	500,000×
Same-datacenter round trip	0.5 ms	1,000,000×
HDD random read (seek)	4–10 ms	~10,000,000×
Cross-region round trip (US→EU)	~150 ms	300,000,000×
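
The "Relative to L1 cache" column is plain division, and it can be reproduced with a short script. This is a minimal sketch: the dictionary holds a subset of the rows, and midpoints are assumed where the table gives a range.

```python
import math

L1_CACHE_NS = 0.5

# Midpoint values are used where the table gives a range (an assumption).
LATENCIES_NS = {
    "Main memory access": 100,
    "SSD random read (NVMe)": 150_000,
    "Same-datacenter round trip": 500_000,
    "Cross-region round trip": 150_000_000,
}


def relative_to_l1(latency_ns: float) -> float:
    """How many L1 cache hits fit into one operation of this latency."""
    return latency_ns / L1_CACHE_NS


if __name__ == "__main__":
    for name, ns in LATENCIES_NS.items():
        ratio = relative_to_l1(ns)
        print(f"{name:<28} {ratio:>13,.0f}x  (~10^{math.log10(ratio):.1f})")
```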

Relative Scale of Latency Numbers

Proportional animated bars comparing L1 cache, RAM, SSD, same-DC, and cross-region latency on a logarithmic scale. Each bar animates to its true relative magnitude.

Measuring Latency Correctly: Percentiles and Tail Latency

Averages lie. If 99 requests finish in 1 ms and one request takes 10,000 ms, the average is ~101 ms, a number that describes none of the actual experiences. The solution is percentiles. The p50 (median) is the latency that half of all requests beat, so it describes the typical experience. The p95 is the latency that the slowest 1 in 20 requests meet or exceed, and the p99 is the same boundary for the slowest 1 in 100. In high-traffic systems these tails are large in absolute terms: if your p99 is 2 seconds and you serve 10 million requests per day, then 100,000 requests per day take 2 seconds or longer.
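
The 99-fast-one-slow example is easy to check in code. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring systems often interpolate or use histogram buckets instead, so treat the helper as illustrative.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [1.0] * 99 + [10_000.0]
    print(f"mean = {statistics.mean(latencies_ms):.2f} ms")
    for p in (50, 95, 99, 100):
        print(f"p{p} = {percentile(latencies_ms, p):,.0f} ms")
```

With exactly one outlier in 100 samples, p50 through p99 all come out at 1 ms and only the maximum reveals the 10,000 ms request, so it is worth tracking the maximum or p99.9 alongside p99.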

Tail latency is the name for the high-percentile (p99, p99.9) region of the distribution. It matters in microservice architectures because of the fan-out problem: if a single user request triggers calls to 100 downstream services in parallel, the end-to-end response time is the maximum of all 100 calls — not the average. Even if each individual service has a 1% chance of a slow call, the probability of at least one slow call in 100 is 1 - 0.99^100 ≈ 63%. Tail latency from any one service dominates the overall experience.
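
The fan-out arithmetic generalizes to any number of parallel calls. The helper below is a minimal sketch that assumes slow calls are independent of one another, which real services sharing hosts or networks may not satisfy; the function name is hypothetical.

```python
def p_any_slow(fan_out: int, p_slow: float) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    if fan_out < 0 or not 0 <= p_slow <= 1:
        raise ValueError("fan_out must be >= 0 and p_slow must be in [0, 1]")
    return 1 - (1 - p_slow) ** fan_out


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"fan-out {n:>3}: P(at least one slow call) = {p_any_slow(n, 0.01):.1%}")
```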

Common causes of tail latency include: garbage collection pauses, lock contention, disk I/O on the critical path, resource exhaustion causing queueing, and cold-start effects after a server is added to a pool. Techniques like hedged requests (send the same request to two replicas after a brief delay; use whichever responds first) and request timeouts with retries can tame the tail.
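
A hedged request can be sketched with asyncio. This is an illustration under stated assumptions, not a production client: `replica` is a hypothetical stand-in for a network call, the hedge delay is hard-coded by the caller, and an exception from the first attempt to finish is simply propagated.

```python
import asyncio
from typing import Awaitable, Callable


async def hedged_call(
    primary: Callable[[], Awaitable[str]],
    backup: Callable[[], Awaitable[str]],
    hedge_delay_s: float,
) -> str:
    """Call primary; if it has not finished after hedge_delay_s, also call backup."""
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # primary answered before the hedge was needed

    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the slower attempt so it does not waste work
    return done.pop().result()


async def replica(name: str, latency_s: float) -> str:
    await asyncio.sleep(latency_s)  # hypothetical stand-in for a network call
    return name


async def main() -> None:
    winner = await hedged_call(
        lambda: replica("slow-replica", 0.50),
        lambda: replica("fast-replica", 0.01),
        hedge_delay_s=0.05,
    )
    print(winner)


if __name__ == "__main__":
    asyncio.run(main())
```

Setting the hedge delay near the service's p95 keeps the extra load small, because only the slowest few percent of requests ever trigger the second call.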

Latency Distribution: p50 vs p95 vs p99 Tail

A live histogram of request latencies. Most requests cluster at the fast p50 median, but a long tail extends to p99. Vertical markers show each percentile boundary.

Techniques to Reduce Latency

With a clear picture of where the time goes, the main reduction techniques are listed below, ordered roughly from most to least impactful for typical web systems:

Caching

Caching is usually the single most effective latency reduction technique. Storing frequently read data in memory (Redis, Memcached, or application-local in-process caches) replaces a 4–10 ms disk read with a sub-millisecond memory lookup. Cache hit ratios above 90% can reduce median database latency by 10× or more. The key cache design questions are what to cache (hot reads, pre-computed aggregates), when to invalidate (TTL, write-through, write-behind), and how to prevent cache stampedes (lock-based or probabilistic early expiry).
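
A read-through cache with TTL expiry fits in a few lines. The class below is a minimal in-process sketch with hypothetical names; it is not thread-safe and has no stampede protection, so concurrent misses for the same key would each call the loader.

```python
import time
from typing import Any, Callable


class TTLCache:
    """In-process read-through cache with a per-entry time-to-live."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        """Return the cached value, calling load() only on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = load()  # the slow path, for example a database query
        self._entries[key] = (now + self._ttl_s, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, for example after the underlying row is written."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    cache = TTLCache(ttl_s=30)
    for _ in range(10):
        cache.get("user:42", lambda: {"id": 42, "name": "example"})
    print(f"hits={cache.hits} misses={cache.misses}")
```

Ten reads of the same key produce one miss and nine hits, so only the first read pays the cost of the loader.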

CDN and Geographic Proximity

A Content Delivery Network (CDN) caches static assets (JS, CSS, images, videos) at Points of Presence (PoPs) close to end users. A user in Tokyo who reaches a CDN PoP with a 10 ms round trip instead of an origin server in Virginia with a 180 ms round trip saves 170 ms on every round trip. For dynamic content, techniques like edge computing (Cloudflare Workers, AWS Lambda@Edge) allow even personalized responses to be generated close to the user. The bottom line: physics is the enemy; proximity is the weapon.

Connection Reuse and Protocol Optimization

Every new TCP connection costs one full round trip for the three-way handshake before any application data can flow. A TLS handshake adds another two round trips for a full TLS 1.2 handshake (one with session resumption) or one round trip for TLS 1.3. HTTP keep-alive, connection pooling at the database and service levels, and upgrading to HTTP/2 (multiplexed streams over one connection) or HTTP/3 over QUIC (no transport-level head-of-line blocking between streams, 0-RTT reconnections) all reduce connection overhead. In a microservice mesh with dozens of internal calls per request, poor connection reuse multiplies these costs quickly.
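
A back-of-the-envelope model shows how much connection reuse saves. The sketch assumes the client can send data one round trip after starting the TCP handshake (the final ACK can travel with the first request), ignores server processing time, and treats requests as strictly sequential; the function names and the 50 ms round-trip time are illustrative.

```python
def setup_cost_ms(rtt_ms: float, tls_round_trips: int, tcp_round_trips: int = 1) -> float:
    """Time spent on handshakes before the first request byte can be sent."""
    return rtt_ms * (tcp_round_trips + tls_round_trips)


def total_ms(requests: int, rtt_ms: float, tls_round_trips: int, reuse: bool) -> float:
    """Handshake time plus one round trip per request, for sequential requests to one host."""
    handshakes = 1 if reuse else requests
    return handshakes * setup_cost_ms(rtt_ms, tls_round_trips) + requests * rtt_ms


if __name__ == "__main__":
    # 10 sequential requests over a 50 ms round trip with TLS 1.3 (one handshake round trip).
    print("new connection per request:", total_ms(10, 50, 1, reuse=False), "ms")
    print("one reused connection:     ", total_ms(10, 50, 1, reuse=True), "ms")
```

Under these assumptions, reusing one connection brings the total from 1,500 ms down to 600 ms, because the handshake is paid once instead of ten times.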

Async Processing and Parallelism

If a request requires multiple independent downstream calls, making them in parallel reduces total latency from the sum of the calls to the slowest single call. Suppose a user registration endpoint must call a user service (60 ms) and a preferences service (50 ms) and send a welcome email (40 ms); these timings are illustrative. Running the three in parallel (async/await with Promise.all, Go goroutines, etc.) cuts response time from 150 ms to 60 ms. Separately, offloading non-critical work to background queues (send the welcome email after returning a 201) removes it from the critical path entirely.
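
The sum-versus-maximum effect is easy to observe with asyncio. In this sketch the three services are simulated with sleeps, and the per-call latencies of 60, 50, and 40 ms are assumptions chosen so that the totals match the 150 ms and 60 ms figures above; the service names are hypothetical.

```python
import asyncio
import time

# Hypothetical latencies: the sum is 150 ms and the maximum is 60 ms.
CALLS = [
    ("user-service", 0.060),
    ("preferences-service", 0.050),
    ("email-service", 0.040),
]


async def call_service(name: str, latency_s: float) -> str:
    await asyncio.sleep(latency_s)  # stand-in for a network call
    return name


async def sequential() -> float:
    """Await each call in turn and return the elapsed time in seconds."""
    start = time.perf_counter()
    for name, latency in CALLS:
        await call_service(name, latency)
    return time.perf_counter() - start


async def parallel() -> float:
    """Start all calls at once and return the elapsed time in seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(call_service(name, latency) for name, latency in CALLS))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {asyncio.run(sequential()) * 1000:.0f} ms")
    print(f"parallel:   {asyncio.run(parallel()) * 1000:.0f} ms")
```

`asyncio.gather` raises the first exception it sees by default, so real code needs a decision about partial failure, for example by passing `return_exceptions=True` and inspecting each result.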

Data Locality

Data locality means placing computation close to the data instead of moving the data to the computation. Read replicas in the same region as your application servers, database sharding strategies that co-locate related rows, and denormalized views that pre-join frequently queried data all reduce the number of round trips and the distance data must travel. The N+1 query problem — fetching a list of 100 users and then making 100 separate queries for their addresses — is a classic data-locality failure solved by a single JOIN or a bulk IN clause.

Batching

Batching amortizes per-operation overhead across multiple operations. Instead of writing 1,000 rows one at a time (1,000 round trips to the database × 0.5 ms = 500 ms), a single batched INSERT does the same work in one round trip. The DataLoader pattern (popularized by GraphQL) batches many small reads within a single request tick into one bulk query. Batching trades individual latency for overall efficiency — the right tool when throughput matters more than single-operation response time.
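
The round-trip arithmetic can be sketched with a small write buffer. `CountingDb` is a hypothetical stand-in that only counts round trips at an assumed 0.5 ms each; it performs no I/O, and `BatchWriter` is an illustration of the pattern rather than any particular library's API.

```python
class CountingDb:
    """Hypothetical database client that only counts round trips."""

    def __init__(self, round_trip_ms: float = 0.5):
        self.round_trip_ms = round_trip_ms
        self.round_trips = 0
        self.rows_written = 0

    def insert_many(self, rows: list[dict]) -> None:
        self.round_trips += 1  # one round trip per call, however many rows it carries
        self.rows_written += len(rows)

    @property
    def network_ms(self) -> float:
        return self.round_trips * self.round_trip_ms


class BatchWriter:
    """Buffers rows and sends them to the database in groups of batch_size."""

    def __init__(self, db: CountingDb, batch_size: int):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._db = db
        self._batch_size = batch_size
        self._buffer: list[dict] = []

    def write(self, row: dict) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any buffered rows; call this before shutdown so nothing is lost."""
        if self._buffer:
            self._db.insert_many(self._buffer)
            self._buffer = []


if __name__ == "__main__":
    for batch_size in (1, 100, 1000):
        db = CountingDb()
        writer = BatchWriter(db, batch_size)
        for i in range(1000):
            writer.write({"id": i})
        writer.flush()
        print(f"batch_size={batch_size:>4}: {db.round_trips:>4} round trips, {db.network_ms:.1f} ms")
```

The cost of the larger batch is visible in `write`: a row sits in the buffer until the batch fills or `flush` is called, which is the added individual latency the paragraph above describes.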

Technique	What it attacks	Typical gain	Trade-off
In-memory caching	Disk I/O latency	10–100×	Stale data risk, memory cost
CDN / Edge	Propagation delay	2–10×	Cache invalidation complexity
Connection pooling	TCP/TLS handshake overhead	2–5×	Pool configuration tuning
HTTP/2 or HTTP/3	Head-of-line blocking	1.5–3×	Requires server/client support
Parallel async calls	Sequential fan-out	Up to N× for N parallel calls	Error handling complexity
Batching	Per-request overhead	10–1000× for bulk ops	Increased individual latency
Data locality / co-location	Cross-DC propagation	5–50×	Operational complexity
Hedged requests	Tail latency	p99 → p50	Extra server load (~1–5%)

Frequently Asked Questions

What is the difference between latency and response time?

The terms are often used interchangeably, but there is a technical distinction. Latency strictly refers to the delay introduced by a system — the time a request spends "in flight" or waiting. Response time is the end-to-end duration from the client's perspective, which includes latency plus the service time (actual computation). In practice, engineers say "latency" when they mean the response time measured at the client or load balancer. The key is to be consistent within your team and your monitoring dashboards so you are always comparing the same measurement.

Why does tail latency get worse as systems scale?

As fan-out increases (more microservices, more parallel database shards, more cache nodes), the probability that at least one call lands in a slow percentile grows. If a request touches 50 services and each has a 1% chance of a 500 ms response, then 1 - 0.99^50 ≈ 39.5% of requests hit at least one slow call. A latency that sits at the p99 of each individual service therefore shows up at roughly the p60 of the end-to-end request. Additionally, at high scale, shared resources (CPU caches, network switch buffers, OS scheduler) experience more contention, widening the latency distribution. This is why Google's infrastructure papers emphasize p99 and p99.9 measurement even for internal RPCs.

How should I set latency SLOs for my service?

Start from user research and product context rather than technical capacity. For interactive UIs, 100 ms feels instant, 1 second is the limit of seamless flow, and 10 seconds loses the user's attention (Nielsen's response-time limits; the related Doherty Threshold puts the target for interactive responses at 400 ms). A common pattern is to set your p50 SLO at the user-perceived "instant" threshold, your p95 SLO at "acceptable," and your p99 SLO at the absolute maximum before you consider the request failed. Track error budgets: if your 30-day p99 SLO is 500 ms, every minute in which the measured p99 exceeds 500 ms burns budget. Protect the budget by investing in tail-latency reduction when it starts running low.
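
Error-budget tracking for a latency SLO can be sketched as a count of "bad minutes". The 99.9% compliance target, the per-minute granularity, and the function name are assumptions made for illustration; the text above specifies only the 500 ms threshold and the 30-day window.

```python
def error_budget_remaining(
    p99_per_minute_ms: list[float],
    threshold_ms: float = 500.0,
    target: float = 0.999,  # assumed: 99.9% of minutes must meet the threshold
    window_minutes: int = 30 * 24 * 60,
) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_bad_minutes = (1 - target) * window_minutes
    bad_minutes = sum(1 for p99 in p99_per_minute_ms if p99 > threshold_ms)
    return 1 - bad_minutes / allowed_bad_minutes


if __name__ == "__main__":
    # Hypothetical measurements: 100 healthy minutes and 10 minutes over the threshold.
    observed = [400.0] * 100 + [600.0] * 10
    print(f"budget remaining: {error_budget_remaining(observed):.1%}")
```

With these assumptions the 30-day budget is about 43 bad minutes, so ten slow minutes consume roughly a quarter of it.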

Average latency is a lie your metrics tell you. Measure p99. Optimize for the tail. The slowest experience your system delivers is the one users remember.
— alokknight Engineering