Erasure Coding in System Design: Reed-Solomon, Durability & Storage Overhead (Visualized)

Erasure coding is a data-protection technique that breaks an object into k data fragments, computes m extra parity fragments, and stores all n = k + m fragments across different disks so the original object can be rebuilt from any k of the n fragments. It delivers the fault tolerance of replication while storing far fewer bytes.

The naive way to survive disk failure is replication: keep three full copies of every object. That tolerates two simultaneous failures but costs 3× the raw storage. Erasure coding reaches the same — or better — durability while costing as little as 1.4×, which is why it underpins almost every large-scale object store, from Amazon S3 to Backblaze.

How Reed-Solomon Coding Works

The most common erasure code in storage is Reed-Solomon. You choose a scheme written as RS(k, m) — for example RS(6, 3). The object is split into k = 6 equal data fragments. The encoder then treats those fragments as coefficients and computes m = 3 parity fragments by multiplying them with a special matrix (a Vandermonde or Cauchy matrix) over a finite field, usually GF(2^8) so every byte stays a byte.

The key mathematical property is that the n = k + m fragments are linearly independent: any k of them form a solvable system of equations. So losing up to m fragments is always recoverable — the decoder inverts the relevant rows of the matrix and reproduces the missing pieces exactly. There is no approximation; the rebuilt object is bit-for-bit identical.

A file split into k data + m parity fragments

An RS(6, 3) scheme: the object is split into 6 data fragments, 3 parity fragments are computed, and all 9 are spread across separate disks.

Reconstruction From Any k of n

When a disk fails, the missing fragment is rebuilt on demand. The storage system gathers any k surviving fragments — it does not matter whether they are data or parity — and runs the decode step: invert the matrix rows corresponding to the fragments it has, then solve for the missing one. As long as at least k of the original n fragments survive, the object is fully recoverable.

A disk fails and the fragment is rebuilt from survivors

Disk 4 fails (red). The system reads any k = 6 surviving fragments, decodes, and writes a fresh replacement fragment — no data lost.

Storage Overhead vs Replication

The storage overhead of an erasure code is simply n / k. An RS(6, 3) scheme stores 9 fragments for every 6 of data, an overhead of 1.5×, yet tolerates 3 simultaneous failures. Compare that to 3× replication, which costs twice as much raw storage to tolerate only 2 failures. Wider schemes such as RS(10, 4) push overhead down to 1.4× while still surviving 4 losses.

Raw storage: 3x replication vs erasure coding

To store the same 6 units of data, 3x replication consumes 18 units of disk; RS(6,3) consumes only 9 — half the cost — and tolerates more failures.

	3× Replication	Erasure Coding
Storage overhead	3.0×	1.5× (RS 6,3)
Failures tolerated	2	3
Read path	Read one copy	Read 1 fragment (degraded: read k)
Repair cost	Copy one full object	Read k fragments, recompute
CPU cost	Negligible	Encode/decode over GF(2^8)
Best for	Hot data, small objects	Cold/warm data, large objects

Durability Math

An RS(k, m) object is lost only if more than m of its n fragments are lost before any can be repaired. Because fragments are placed on independent disks (and ideally independent racks and failure domains), losing a specific set of m + 1 fragments simultaneously is extraordinarily unlikely. This is what lets Amazon S3 advertise eleven nines (99.999999999%) of durability — the probability of losing an object in a given year is around one in a hundred billion.

# RS(k, m): object survives if at most m of n fragments are lost.
# Rough annual object-loss probability given per-disk annual failure p.
from math import comb

def object_loss_prob(k, m, p):
    n = k + m
    # P(more than m fragments fail in a year)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(m + 1, n + 1))

p = 0.02  # 2% annual disk failure rate
for scheme in [(6, 3), (10, 4), (12, 4)]:
    k, m = scheme
    loss = object_loss_prob(k, m, p)
    print(f"RS({k},{m}): overhead={ (k+m)/k:.2f}x  loss/yr={loss:.2e}")
# RS(6,3):  overhead=1.50x  loss/yr=2.18e-05
# RS(10,4): overhead=1.40x  loss/yr=6.86e-07
# RS(12,4): overhead=1.33x  loss/yr=1.52e-06

In practice real systems do far better than this simple model because they repair failures within hours, shrinking the window in which a second or third failure can be fatal. The shorter the repair time, the higher the effective durability — which is exactly why repair cost matters so much.

Where Erasure Coding Is Used

Erasure coding is everywhere at scale. HDFS added erasure coding in Hadoop 3 to cut the storage cost of cold data from 3× to about 1.5×. Ceph offers erasure-coded pools for object and block storage. Amazon S3, Azure Blob Storage, and Google Cloud Storage all use erasure coding internally to hit eleven-nines durability cheaply. Backblaze famously open-sourced its Reed-Solomon implementation and uses RS(17, 3) across its storage pods. Facebook's f4 warm-storage tier and Microsoft Azure's Local Reconstruction Codes (LRC) are further refinements that reduce repair traffic.

Trade-offs and Pitfalls

Repair amplification: rebuilding one lost fragment requires reading k fragments from the network, so a single disk failure generates many times its own size in repair traffic — hard on the network and slow for wide schemes. Local Reconstruction Codes attack this by adding local parity groups so most single failures repair from a handful of nearby fragments. Read latency: a healthy read fetches one fragment, but a degraded read (when a fragment is missing) must fetch and decode k fragments, adding latency — which is why hot data often stays replicated. CPU cost: encode/decode over a finite field burns CPU, though modern SIMD implementations make it cheap. Erasure coding shines for large, cold or warm objects; for tiny, latency-sensitive, frequently-mutated objects, replication is often still the better choice.

Frequently Asked Questions

What is the difference between erasure coding and RAID?

RAID is a special case of erasure coding applied within a single machine across local disks — RAID 5 is essentially RS(n-1, 1) and RAID 6 is RS(n-2, 2). General erasure coding generalizes the idea to arbitrary k and m and spreads fragments across many machines, racks, and even data centers, which gives far better fault isolation than disks inside one chassis.

How many failures can RS(k, m) tolerate?

Exactly m simultaneous fragment losses. Because the object reconstructs from any k of the n = k + m fragments, you can afford to lose up to m of them and still recover bit-for-bit. Losing more than m before repair means the object is unrecoverable, so m is the durability knob you tune.

When should I use replication instead of erasure coding?

Use replication for hot, small, or latency-sensitive data where a single-fragment read and cheap repair matter more than storage cost. Use erasure coding for large, cold or warm objects where the 2× storage savings dominate and the extra CPU and repair traffic are acceptable. Many systems do both: replicate hot data, then erasure-code it once it cools.

Replication buys durability with brute force; erasure coding buys the same durability with math — and gives you back half your disks.
— alokknight Engineering