Erasure Coding in System Design: Reed-Solomon, Durability & Storage Overhead (Visualized)
Erasure coding splits data into k fragments plus m parity fragments so the original can be rebuilt from any k of the n pieces, giving replication-level durability at a fraction of the storage cost. This guide covers Reed-Solomon math, reconstruction, durability, and where it is used โ with live animations.
Erasure coding is a data-protection technique that breaks an object into k data fragments, computes m extra parity fragments, and stores all n = k + m fragments across different disks so the original object can be rebuilt from any k of the n fragments. It delivers the fault tolerance of replication while storing far fewer bytes.
The naive way to survive disk failure is replication: keep three full copies of every object. That tolerates two simultaneous failures but costs 3ร the raw storage. Erasure coding reaches the same โ or better โ durability while costing as little as 1.4ร, which is why it underpins almost every large-scale object store, from Amazon S3 to Backblaze.
How Reed-Solomon Coding Works
The most common erasure code in storage is Reed-Solomon. You choose a scheme written as RS(k, m) โ for example RS(6, 3). The object is split into k = 6 equal data fragments. The encoder then treats those fragments as coefficients and computes m = 3 parity fragments by multiplying them with a special matrix (a Vandermonde or Cauchy matrix) over a finite field, usually GF(2^8) so every byte stays a byte.
The key mathematical property is that the n = k + m fragments are linearly independent: any k of them form a solvable system of equations. So losing up to m fragments is always recoverable โ the decoder inverts the relevant rows of the matrix and reproduces the missing pieces exactly. There is no approximation; the rebuilt object is bit-for-bit identical.
Reconstruction From Any k of n
When a disk fails, the missing fragment is rebuilt on demand. The storage system gathers any k surviving fragments โ it does not matter whether they are data or parity โ and runs the decode step: invert the matrix rows corresponding to the fragments it has, then solve for the missing one. As long as at least k of the original n fragments survive, the object is fully recoverable.
Storage Overhead vs Replication
The storage overhead of an erasure code is simply n / k. An RS(6, 3) scheme stores 9 fragments for every 6 of data, an overhead of 1.5ร, yet tolerates 3 simultaneous failures. Compare that to 3ร replication, which costs twice as much raw storage to tolerate only 2 failures. Wider schemes such as RS(10, 4) push overhead down to 1.4ร while still surviving 4 losses.
| 3ร Replication | Erasure Coding | |
|---|---|---|
| Storage overhead | 3.0ร | 1.5ร (RS 6,3) |
| Failures tolerated | 2 | 3 |
| Read path | Read one copy | Read 1 fragment (degraded: read k) |
| Repair cost | Copy one full object | Read k fragments, recompute |
| CPU cost | Negligible | Encode/decode over GF(2^8) |
| Best for | Hot data, small objects | Cold/warm data, large objects |
Durability Math
An RS(k, m) object is lost only if more than m of its n fragments are lost before any can be repaired. Because fragments are placed on independent disks (and ideally independent racks and failure domains), losing a specific set of m + 1 fragments simultaneously is extraordinarily unlikely. This is what lets Amazon S3 advertise eleven nines (99.999999999%) of durability โ the probability of losing an object in a given year is around one in a hundred billion.
# RS(k, m): object survives if at most m of n fragments are lost.
# Rough annual object-loss probability given per-disk annual failure p.
from math import comb
def object_loss_prob(k, m, p):
n = k + m
# P(more than m fragments fail in a year)
return sum(comb(n, i) * p**i * (1 - p)**(n - i)
for i in range(m + 1, n + 1))
p = 0.02 # 2% annual disk failure rate
for scheme in [(6, 3), (10, 4), (12, 4)]:
k, m = scheme
loss = object_loss_prob(k, m, p)
print(f"RS({k},{m}): overhead={ (k+m)/k:.2f}x loss/yr={loss:.2e}")
# RS(6,3): overhead=1.50x loss/yr=2.18e-05
# RS(10,4): overhead=1.40x loss/yr=6.86e-07
# RS(12,4): overhead=1.33x loss/yr=1.52e-06In practice real systems do far better than this simple model because they repair failures within hours, shrinking the window in which a second or third failure can be fatal. The shorter the repair time, the higher the effective durability โ which is exactly why repair cost matters so much.
Where Erasure Coding Is Used
Erasure coding is everywhere at scale. HDFS added erasure coding in Hadoop 3 to cut the storage cost of cold data from 3ร to about 1.5ร. Ceph offers erasure-coded pools for object and block storage. Amazon S3, Azure Blob Storage, and Google Cloud Storage all use erasure coding internally to hit eleven-nines durability cheaply. Backblaze famously open-sourced its Reed-Solomon implementation and uses RS(17, 3) across its storage pods. Facebook's f4 warm-storage tier and Microsoft Azure's Local Reconstruction Codes (LRC) are further refinements that reduce repair traffic.
Trade-offs and Pitfalls
Repair amplification: rebuilding one lost fragment requires reading k fragments from the network, so a single disk failure generates many times its own size in repair traffic โ hard on the network and slow for wide schemes. Local Reconstruction Codes attack this by adding local parity groups so most single failures repair from a handful of nearby fragments. Read latency: a healthy read fetches one fragment, but a degraded read (when a fragment is missing) must fetch and decode k fragments, adding latency โ which is why hot data often stays replicated. CPU cost: encode/decode over a finite field burns CPU, though modern SIMD implementations make it cheap. Erasure coding shines for large, cold or warm objects; for tiny, latency-sensitive, frequently-mutated objects, replication is often still the better choice.
Frequently Asked Questions
What is the difference between erasure coding and RAID?
RAID is a special case of erasure coding applied within a single machine across local disks โ RAID 5 is essentially RS(n-1, 1) and RAID 6 is RS(n-2, 2). General erasure coding generalizes the idea to arbitrary k and m and spreads fragments across many machines, racks, and even data centers, which gives far better fault isolation than disks inside one chassis.
How many failures can RS(k, m) tolerate?
Exactly m simultaneous fragment losses. Because the object reconstructs from any k of the n = k + m fragments, you can afford to lose up to m of them and still recover bit-for-bit. Losing more than m before repair means the object is unrecoverable, so m is the durability knob you tune.
When should I use replication instead of erasure coding?
Use replication for hot, small, or latency-sensitive data where a single-fragment read and cheap repair matter more than storage cost. Use erasure coding for large, cold or warm objects where the 2ร storage savings dominate and the extra CPU and repair traffic are acceptable. Many systems do both: replicate hot data, then erasure-code it once it cools.
Replication buys durability with brute force; erasure coding buys the same durability with math โ and gives you back half your disks.
โ alokknight Engineering
