Latency in System Design: Sources, Percentiles, Latency Numbers & Reduction Techniques (Visualized)
Latency is the time between sending a request and receiving the first byte of a response. Understanding what causes it, how to measure it correctly with percentiles, and how to systematically reduce it is one of the most important skills in building fast, user-friendly systems.
Latency is the elapsed time between a client sending a request and receiving the first byte of the response, that is, the end-to-end delay experienced by a single operation. It differs from throughput, which measures how many operations a system completes per unit of time: a highway can move thousands of cars per hour (high throughput) while every individual car still takes 40 minutes to reach the destination (fixed latency). In distributed systems, latency is the dominant factor in user-perceived speed, and even small improvements compound into measurable business outcomes.
Google famously found that adding 400 ms to search results reduced the number of searches per user by 0.59%. Amazon reported that every 100 ms of added latency cost 1% in sales. These numbers explain why large engineering organizations dedicate entire teams to shaving milliseconds. To do that effectively you first need to know where the time goes.
Latency vs Throughput: Two Different Dimensions
Latency and throughput are related but independent. You can optimize throughput by batching small writes into one large flush; this raises throughput dramatically but increases latency because individual writes must wait. Conversely, you can minimize latency by flushing every write immediately, but each small write is expensive and throughput drops. Real system design constantly navigates this trade-off. The right balance depends on your access patterns: interactive UIs are latency-sensitive; bulk ETL pipelines are throughput-sensitive.
| | Latency | Throughput |
|---|---|---|
| Definition | Time for one operation to complete | Operations completed per second |
| Unit | Milliseconds / microseconds | Requests/sec, MB/s |
| User impact | How fast does this page feel? | How many users can we serve simultaneously? |
| Optimized by | Fewer hops, caching, proximity | Parallelism, batching, bigger pipes |
| Trade-off | Low latency often reduces throughput | High throughput often increases latency |
The Five Sources of Latency
Every millisecond in end-to-end latency can be traced to one of five sources. Understanding each one tells you exactly which lever to pull to fix it.
1. Propagation delay: the time for a signal to travel across a physical medium. Light in fiber optic cable travels at roughly 200,000 km/s (two-thirds the speed of light in vacuum). New York to London is ~5,600 km, so the theoretical minimum one-way propagation delay is about 28 ms. You cannot beat physics; you can only move your servers closer to users.
2. Transmission (serialization) delay: the time needed to push all bits of a packet onto the wire. A 1,500-byte Ethernet frame on a 1 Gbps link takes 12 microseconds. On a slower link (e.g., a 4G modem at 10 Mbps) the same frame takes 1.2 ms. Large payloads accumulate significant transmission delay.
3. Processing / CPU delay: time spent executing application code, such as parsing JSON, running business logic, encrypting a TLS record, or rendering a template. This is the category most directly within the developer's control and the one that profiling tools expose most readily.
4. Storage (disk) delay: time for a read or write to persistent storage. An HDD random seek takes 4–10 ms. A modern NVMe SSD completes the same operation in 100–200 microseconds. An uncached database query that triggers a disk read is almost always the single largest latency contributor in web applications.
5. Queueing delay: time a request spends waiting in line because a resource (CPU, disk, network buffer, thread pool) is busy. Queueing delay is the sneakiest source: at low utilization it is near zero, but as utilization crosses ~70–80% it explodes non-linearly (described by Little's Law and the M/M/1 queue model). This is why systems that are "only 80% utilized" still feel slow under spiky load.
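The arithmetic behind sources 1, 2, and 5 can be reproduced in a few lines. The following is a minimal sketch, not a network model: the function names are made up for illustration, and the queueing function assumes the textbook M/M/1 result for mean time in system, with a 10 ms service time chosen only as an example.

```python
FIBER_SPEED_KM_PER_S = 200_000  # roughly two-thirds of the speed of light in vacuum


def propagation_delay_ms(distance_km, speed_km_per_s=FIBER_SPEED_KM_PER_S):
    """One-way time for a signal to cover distance_km."""
    return distance_km / speed_km_per_s * 1000


def transmission_delay_ms(payload_bytes, link_bits_per_s):
    """Time to push payload_bytes onto a link of the given bit rate."""
    return payload_bytes * 8 / link_bits_per_s * 1000


def mm1_time_in_system_ms(service_time_ms, utilization):
    """Mean time in an M/M/1 system (waiting plus service): S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


if __name__ == "__main__":
    print(f"NY to London: {propagation_delay_ms(5_600):.0f} ms")
    print(f"1,500 B at 1 Gbps: {transmission_delay_ms(1_500, 1e9) * 1000:.0f} us")
    print(f"1,500 B at 10 Mbps: {transmission_delay_ms(1_500, 10e6):.1f} ms")
    # Illustrative 10 ms service time; watch the delay grow as utilization rises.
    for rho in (0.5, 0.8, 0.9, 0.99):
        print(f"utilization {rho:.0%}: {mm1_time_in_system_ms(10, rho):.0f} ms")
```

Under these assumptions the model gives 20 ms at 50% utilization, 50 ms at 80%, 100 ms at 90%, and 1,000 ms at 99%, which is the non-linear blow-up described in source 5.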
Latency Numbers Every Programmer Should Know
Jeff Dean popularized a table of reference latency numbers that engineers should have memorized. These are order-of-magnitude estimates; actual values vary with hardware generation, but the relative proportions have stayed remarkably stable over time. The key insight is the more than eight orders of magnitude between an L1 cache hit (0.5 ns) and a cross-region round trip (150 ms): reaching across a continent is 300 million times slower than reading from the cache closest to the CPU core.
| Operation | Approx. Latency | Relative to L1 cache |
|---|---|---|
| L1 cache hit | 0.5 ns | 1× |
| Branch mispredict | 5 ns | 10× |
| L2 cache hit | 7 ns | 14× |
| Mutex lock/unlock | 25 ns | 50× |
| Main memory (RAM) access | 100 ns | 200× |
| Compress 1 KB (Snappy) | 3,000 ns (3 µs) | 6,000× |
| Read 1 MB sequentially from RAM | 250,000 ns (250 µs) | 500,000× |
| SSD random read (NVMe) | 100–200 µs | ~300,000× |
| HDD random read (seek) | 4–10 ms | ~10,000,000× |
| Same-datacenter round trip | 0.5 ms | 1,000,000× |
| Cross-region round trip (US–EU) | ~150 ms | 300,000,000× |
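The "relative to L1 cache" column is plain division, so it is easy to recompute for any row. A small sketch using three rows from the table (the dictionary and function names are illustrative):

```python
import math

L1_CACHE_NS = 0.5

# A few rows from the table above, converted to nanoseconds.
latencies_ns = {
    "Main memory access": 100,
    "Same-datacenter round trip": 500_000,
    "Cross-region round trip": 150_000_000,
}


def relative_to_l1(latency_ns):
    """How many L1 cache hits fit into one operation of the given latency."""
    return latency_ns / L1_CACHE_NS


if __name__ == "__main__":
    for name, ns in latencies_ns.items():
        ratio = relative_to_l1(ns)
        print(f"{name}: {ratio:,.0f}x L1 (about 10^{math.log10(ratio):.1f})")
```

The base-10 logarithm of the ratio is a convenient way to express the gap as orders of magnitude.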
Measuring Latency Correctly: Percentiles and Tail Latency
Averages lie. If 99 requests finish in 1 ms and one request takes 10,000 ms, the average is ~101 ms, a number that describes none of the actual experiences. The solution is percentiles. The p50 (median) tells you the experience of the typical user. The p95 is the latency that 95% of requests beat, so it describes the slowest 1 in 20. The p99 describes the slowest 1 in 100. In high-traffic systems, if your p99 is 2 seconds and you serve 10 million requests per day, that is 100,000 requests per day taking 2 seconds or longer.
Tail latency is the name for the high-percentile (p99, p99.9) region of the distribution. It matters in microservice architectures because of the fan-out problem: if a single user request triggers calls to 100 downstream services in parallel, the end-to-end response time is the maximum of all 100 calls, not the average. Even if each individual service has a 1% chance of a slow call, the probability of at least one slow call in 100 is 1 - 0.99^100 ≈ 63%. Tail latency from any one service dominates the overall experience.
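A quick way to see the gap between the mean and the percentiles is to compute both on the example above. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring systems may interpolate or use histograms instead.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 99 fast requests and one very slow one, as in the example above.
latencies_ms = [1] * 99 + [10_000]

if __name__ == "__main__":
    print(f"mean = {mean(latencies_ms):.2f} ms")
    print(f"p50  = {percentile(latencies_ms, 50)} ms")
    print(f"p99  = {percentile(latencies_ms, 99)} ms")
    print(f"max  = {percentile(latencies_ms, 100)} ms")
```

With only 100 samples the single outlier sits above the nearest-rank p99 and shows up only at the maximum, which is one reason high percentiles need a large sample to be meaningful.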
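The fan-out probability is easy to tabulate. The sketch below assumes every downstream call is slow independently with the same probability, which real systems only approximate; the function name is illustrative.

```python
def prob_any_slow(p_slow, fan_out):
    """Probability that at least one of fan_out independent calls is slow."""
    return 1 - (1 - p_slow) ** fan_out


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"fan-out {n:>3}: {prob_any_slow(0.01, n):.1%} of requests see a slow call")
```

Even a modest fan-out turns a rare per-service event into a common end-to-end one.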
Common causes of tail latency include: garbage collection pauses, lock contention, disk I/O on the critical path, resource exhaustion causing queueing, and cold-start effects after a server is added to a pool. Techniques like hedged requests (send the same request to two replicas after a brief delay; use whichever responds first) and request timeouts with retries can tame the tail.
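A hedged request can be sketched with `asyncio`. This is an illustration under stated assumptions rather than a production client: `slow_replica` and `fast_replica` are stand-ins that only sleep, the 50 ms hedge delay is arbitrary, and error handling is left out (an exception from whichever call finishes first simply propagates).

```python
import asyncio


async def hedged_request(primary_call, backup_call, hedge_delay_s):
    """Run primary_call; if it is still pending after hedge_delay_s, also start
    backup_call and return the result of whichever finishes first."""
    primary = asyncio.create_task(primary_call())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # primary was fast enough, no hedge sent

    backup = asyncio.create_task(backup_call())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not keep consuming resources
    return done.pop().result()


# Hypothetical replicas: one is having a bad moment, one is healthy.
async def slow_replica():
    await asyncio.sleep(0.5)
    return "slow replica"


async def fast_replica():
    await asyncio.sleep(0.01)
    return "fast replica"


if __name__ == "__main__":
    winner = asyncio.run(hedged_request(slow_replica, fast_replica, hedge_delay_s=0.05))
    print(winner)
```

Choosing the hedge delay near the service's p95 is a common heuristic, because it limits the extra load to roughly the slowest few percent of requests.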
Techniques to Reduce Latency
Armed with knowledge of where time goes, here are the main reduction techniques ordered from most to least impactful for typical web systems:
Caching
Caching is the single most effective latency reduction technique. Storing frequently-read data in memory (Redis, Memcached, or application-local in-process caches) replaces a 4–10 ms disk read with a sub-millisecond memory lookup. Cache hit ratios above 90% can reduce median database latency by 10× or more. The key cache design questions are what to cache (hot reads, pre-computed aggregates), when to invalidate (TTL, write-through, write-behind), and cache stampede prevention (lock-based or probabilistic early expiry).
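The simplest of these designs, an in-process cache with TTL invalidation, might look like the sketch below. The class and method names are hypothetical, the loader stands in for a slow database read, and there is no stampede protection or size limit.

```python
import time


class TTLCache:
    """Minimal in-process read-through cache with per-entry time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: no trip to the backing store
        value = loader(key)  # miss: pay the slow path once
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def load_user(user_id):
        # Stand-in for a slow database read.
        print(f"loading {user_id} from the database")
        return {"id": user_id}

    cache = TTLCache(ttl_seconds=30)
    cache.get_or_load("u1", load_user)  # miss, calls load_user
    cache.get_or_load("u1", load_user)  # hit, served from memory
```

Under concurrent load, several callers can miss at the same moment and all invoke the loader, which is the stampede problem mentioned above.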
CDN and Geographic Proximity
A Content Delivery Network (CDN) caches static assets (JS, CSS, images, videos) at Points of Presence (PoPs) close to end users. A user in Tokyo who reaches a CDN PoP with a 10 ms round trip instead of an origin server in Virginia with a 180 ms round trip saves 170 ms on every round trip. For dynamic content, techniques like edge computing (Cloudflare Workers, AWS Lambda@Edge) allow even personalized responses to be generated close to the user. The bottom line: physics is the enemy; proximity is the weapon.
Connection Reuse and Protocol Optimization
Every new TCP connection costs one full round trip for the three-way handshake before the client can send any data. A TLS handshake adds another 1–2 round trips (TLS 1.2) or 1 round trip (TLS 1.3). HTTP keep-alive, connection pooling at the database and service levels, and upgrading to HTTP/2 (multiplexed streams over one connection) or HTTP/3/QUIC (no transport-level head-of-line blocking between streams, 0-RTT reconnections) all reduce connection overhead. In a microservice mesh with dozens of internal calls per request, poor connection reuse multiplies these costs quickly.
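To get a feel for how handshake overhead adds up, here is a back-of-the-envelope sketch. It assumes requests are issued one after another, each costing one round trip, and it ignores server processing time; `setup_round_trips` is whatever handshake cost applies to your stack, and the value 2 used below is purely illustrative.

```python
def total_latency_ms(num_requests, rtt_ms, setup_round_trips, reuse_connection):
    """Network time for num_requests sequential requests, with or without connection reuse."""
    connections_opened = 1 if reuse_connection else num_requests
    setup_ms = connections_opened * setup_round_trips * rtt_ms
    request_ms = num_requests * rtt_ms  # one round trip per request/response pair
    return setup_ms + request_ms


if __name__ == "__main__":
    # Illustrative numbers: 20 internal calls, 2 ms round trip, 2 round trips of setup.
    fresh = total_latency_ms(20, rtt_ms=2, setup_round_trips=2, reuse_connection=False)
    pooled = total_latency_ms(20, rtt_ms=2, setup_round_trips=2, reuse_connection=True)
    print(f"new connection per call: {fresh} ms, pooled connection: {pooled} ms")
```

With these assumed numbers the handshake work accounts for most of the network time when every call opens a fresh connection.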
Async Processing and Parallelism
If a request requires multiple independent downstream calls, making them in parallel reduces total latency from the sum to the maximum. If a user registration endpoint must call a user service, a preferences service, and send a welcome email, and those calls take, say, 60 ms, 50 ms, and 40 ms, doing them in parallel (async/await with Promise.all, Go goroutines, etc.) cuts response time from 150 ms (the sum) to 60 ms (the slowest call). Separately, offloading non-critical work to background queues (send the welcome email after returning a 201) removes it from the critical path entirely.
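The sum-versus-maximum effect can be demonstrated with `asyncio.gather`. In this sketch the three downstream calls are simulated with sleeps; the service names and the 60/50/40 ms delays are assumptions chosen so that the sequential total is 150 ms and the slowest call is 60 ms.

```python
import asyncio
import time

# (name, simulated latency in seconds); illustrative values only.
CALLS = [("user-service", 0.06), ("preferences-service", 0.05), ("welcome-email", 0.04)]


async def call_service(name, delay_s):
    """Stand-in for a network call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return name


async def run_sequential():
    """Await each call before starting the next: total time is the sum."""
    return [await call_service(name, delay) for name, delay in CALLS]


async def run_parallel():
    """Start all calls at once: total time is roughly the slowest call."""
    return await asyncio.gather(*(call_service(name, delay) for name, delay in CALLS))


def timed(coro_fn):
    """Run an async function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_s = timed(run_sequential)
    _, par_s = timed(run_parallel)
    print(f"sequential: {seq_s * 1000:.0f} ms, parallel: {par_s * 1000:.0f} ms")
```

`asyncio.gather` returns results in the order the calls were passed in, so callers do not need to re-match responses to requests.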
Data Locality
Data locality means placing computation close to the data instead of moving the data to the computation. Read replicas in the same region as your application servers, database sharding strategies that co-locate related rows, and denormalized views that pre-join frequently queried data all reduce the number of round trips and the distance data must travel. The N+1 query problem (fetching a list of 100 users and then making 100 separate queries for their addresses) is a classic data-locality failure solved by a single JOIN or a bulk IN clause.
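The N+1 pattern and its JOIN fix can be shown side by side with an in-memory SQLite database. The schema and function names are invented for the example; SQLite runs in-process, so the sketch counts queries rather than measuring network round trips.

```python
import sqlite3


def make_db(num_users=100):
    """Create an in-memory database with users and one address per user."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE addresses (user_id INTEGER, city TEXT);
        """
    )
    ids = range(1, num_users + 1)
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in ids])
    conn.executemany("INSERT INTO addresses VALUES (?, ?)", [(i, f"city{i}") for i in ids])
    return conn


def cities_n_plus_one(conn):
    """One query for the user list, then one more query per user."""
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    queries = 1
    result = {}
    for user_id, name in users:
        row = conn.execute(
            "SELECT city FROM addresses WHERE user_id = ?", (user_id,)
        ).fetchone()
        queries += 1
        result[name] = row[0]
    return result, queries


def cities_join(conn):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        "SELECT u.name, a.city FROM users u JOIN addresses a ON a.user_id = u.id"
    ).fetchall()
    return dict(rows), 1


if __name__ == "__main__":
    conn = make_db()
    _, slow_queries = cities_n_plus_one(conn)
    _, fast_queries = cities_join(conn)
    print(f"N+1: {slow_queries} queries, JOIN: {fast_queries} query")
```

Against a networked database, each of those extra queries would cost at least one round trip.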
Batching
Batching amortizes per-operation overhead across multiple operations. Instead of writing 1,000 rows one at a time (1,000 round trips to the database × 0.5 ms = 500 ms), a single batched INSERT does the same work in one round trip. The DataLoader pattern (popularized by GraphQL) batches many small reads within a single request tick into one bulk query. Batching trades individual latency for overall efficiency, making it the right tool when throughput matters more than single-operation response time.
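The round-trip arithmetic can be made concrete with a stand-in database client that counts round trips instead of doing network I/O. `CountingDb` is a hypothetical class written for this sketch, and the 0.5 ms round-trip cost is the figure used in the paragraph above.

```python
def chunks(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class CountingDb:
    """Stand-in for a real client: records rows and counts round trips."""

    def __init__(self, rtt_ms=0.5):
        self.rtt_ms = rtt_ms
        self.round_trips = 0
        self.rows = []

    def insert_many(self, rows):
        self.round_trips += 1  # one network round trip per call, whatever the batch size
        self.rows.extend(rows)

    @property
    def network_ms(self):
        return self.round_trips * self.rtt_ms


def write_rows(db, rows, batch_size):
    """Write rows in batches of batch_size; batch_size=1 means row-at-a-time."""
    for batch in chunks(rows, batch_size):
        db.insert_many(batch)


if __name__ == "__main__":
    rows = list(range(1_000))
    for batch_size in (1, 100, 1_000):
        db = CountingDb()
        write_rows(db, rows, batch_size)
        print(f"batch size {batch_size:>5}: {db.round_trips} round trips, {db.network_ms} ms")
```

Intermediate batch sizes are often a practical middle ground, since very large batches delay the first row and can hit statement-size limits.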
| Technique | What it attacks | Typical gain | Trade-off |
|---|---|---|---|
| In-memory caching | Disk I/O latency | 10–100× | Stale data risk, memory cost |
| CDN / Edge | Propagation delay | 2–10× | Cache invalidation complexity |
| Connection pooling | TCP/TLS handshake overhead | 2–5× | Pool configuration tuning |
| HTTP/2 or HTTP/3 | Head-of-line blocking | 1.5–3× | Requires server/client support |
| Parallel async calls | Sequential fan-out | Up to N× for N parallel calls | Error handling complexity |
| Batching | Per-request overhead | 10–1000× for bulk ops | Increased individual latency |
| Data locality / co-location | Cross-DC propagation | 5–50× | Operational complexity |
| Hedged requests | Tail latency | p99 → p50 | Extra server load (~1–5%) |
Frequently Asked Questions
What is the difference between latency and response time?
The terms are often used interchangeably, but there is a technical distinction. Latency strictly refers to the delay introduced by a system, meaning the time a request spends "in flight" or waiting. Response time is the end-to-end duration from the client's perspective, which includes latency plus the service time (actual computation). In practice, engineers say "latency" when they mean the response time measured at the client or load balancer. The key is to be consistent within your team and your monitoring dashboards so you are always comparing the same measurement.
Why does tail latency get worse as systems scale?
As fan-out increases (more microservices, more parallel database shards, more cache nodes), the probability that at least one call lands in a slow percentile grows. If each of 50 services called in parallel has a 1% chance of a 500 ms response, about 39% of requests (1 - 0.99^50) include at least one 500 ms call, so a delay that is a p99 event for each individual service shows up at roughly the p60 of the end-to-end request. Additionally, at high scale, shared resources (CPU caches, network switch buffers, OS scheduler) experience more contention, widening the latency distribution. This is why Google's infrastructure papers emphasize p99 and p99.9 measurement even for internal RPCs.
How should I set latency SLOs for my service?
Start from user research and product context rather than technical capacity. For interactive UIs, 100 ms feels instant, 1 second is the limit of seamless flow, and 10 seconds loses the user's attention (Nielsen's response-time limits; the related Doherty Threshold puts the cutoff for keeping users engaged at about 400 ms). A common pattern is to set your p50 SLO at the user-perceived "instant" threshold, your p95 SLO at "acceptable," and your p99 SLO at the absolute maximum before you consider the request failed. Track error budgets: if your 30-day p99 SLO is 500 ms, every minute in which p99 exceeds 500 ms burns budget. Protect the budget by investing in tail-latency reduction when it starts running low.
Average latency is a lie your metrics tell you. Measure p99. Optimize for the tail. The slowest experience your system delivers is the one users remember.
– alokknight Engineering
