Consensus in Distributed Systems: Raft & Paxos Explained (Visualized)
Consensus is how a cluster of unreliable machines agrees on a single value despite crashes and network delays. This guide covers quorums and majorities, Raft leader election and log replication, Paxos at a high level, split-brain, the FLP result, and real systems like etcd and ZooKeeper โ with live animations of each idea.
Consensus is the problem of getting a group of independent machines to agree on a single value โ and to keep agreeing on it โ even when some of those machines crash and the network drops or delays messages. It is the bedrock primitive behind leader election, distributed locks, configuration stores, and replicated state machines.
If a single database holds the truth, agreement is trivial โ there is nothing to agree with. The moment you replicate that database across three or five nodes for durability and availability, you face a hard question: when two clients issue conflicting writes, or a node comes back from a network partition with stale data, which value is the real one? Consensus algorithms answer that question with a provable guarantee.
The Problem: One Value, Many Failures
A correct consensus protocol must satisfy three properties under the failure model it assumes. Agreement: no two non-faulty nodes decide different values. Validity: the decided value was actually proposed by some node (you cannot invent a value). Termination: every non-faulty node eventually decides. Raft and Paxos target the crash-fault model โ nodes may stop, restart, or lose messages, but they do not lie. Tolerating malicious or arbitrary behavior is the harder Byzantine problem solved by protocols like PBFT.
Quorums and the Majority Rule
The trick that makes consensus possible is the quorum: a subset of nodes large enough that any two quorums must overlap in at least one node. For a cluster of N nodes, a majority quorum is floor(N/2) + 1. Because two majorities always share a member, and that shared member remembers what it already agreed to, the cluster can never commit two conflicting decisions. Quorum intersection is the entire reason a majority-based system stays consistent across partitions.
Why an Odd Number of Nodes?
You almost always run consensus clusters with an odd node count โ 3, 5, or 7. A 3-node cluster tolerates 1 failure and a 4-node cluster also tolerates only 1, because both need a majority of 3 to survive. The fourth node adds cost and a second machine that can fail without buying you any extra fault tolerance. Odd sizes give you the best fault tolerance per node and avoid even-split ties.
| Cluster size N | Majority quorum | Failures tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 7 | 4 | 3 |
Raft: Consensus You Can Understand
Raft was designed at Stanford explicitly to be easier to understand than Paxos, and it has become the default for new systems. It decomposes consensus into three sub-problems: leader election, log replication, and safety. At any moment a node is a leader, a follower, or a candidate, and all writes flow through a single elected leader. Time is divided into terms, monotonically increasing integers that act as a logical clock and let nodes detect stale leaders.
Raft Leader Election
Followers expect a periodic heartbeat from the leader. Each follower runs a randomized election timeout (typically 150โ300 ms). If that timer fires without a heartbeat, the follower increments its term, becomes a candidate, votes for itself, and sends RequestVote RPCs to everyone. A node grants its vote at most once per term, so only one candidate can collect a majority. The winner becomes leader and immediately starts sending heartbeats to suppress further elections. Randomized timeouts make simultaneous candidacies rare, so elections usually resolve in one round.
Raft Log Replication and Commit
Once a leader exists, every client command becomes a new entry appended to the leader's log. The leader sends AppendEntries RPCs to replicate the entry to followers. An entry is committed the instant the leader knows a majority of nodes have stored it โ at that point it is durable even if the leader dies, because any future leader must contain it. Only then does the leader apply the command to its state machine and reply to the client. Each entry carries the leader's term and an index, and a consistency check ensures followers' logs never diverge silently.
Paxos at a High Level
Paxos, introduced by Leslie Lamport, is the original proven solution to consensus and the intellectual ancestor of Raft. Single-decree Paxos has three roles โ proposers, acceptors, and learners โ and runs in two phases. In Phase 1 (prepare), a proposer picks a unique, increasing proposal number n and asks a majority of acceptors to promise not to accept anything numbered below n; acceptors reply with any value they have already accepted. In Phase 2 (accept), the proposer sends an accept request for either its own value or the highest-numbered value it learned about, and once a majority accepts, the value is chosen.
Plain Paxos decides a single value; real systems need a continuous stream of decisions, so they use Multi-Paxos, which elects a stable leader to skip Phase 1 on the common path โ converging on essentially the same design Raft formalizes. Paxos is famously subtle to implement correctly, which is exactly why Raft was created. The pseudocode below sketches the acceptor's safety logic.
# Paxos acceptor: the heart of the safety guarantee
class Acceptor:
def __init__(self):
self.promised_n = None # highest proposal number promised
self.accepted_n = None # number of the accepted proposal
self.accepted_v = None # the accepted value
def prepare(self, n):
# Phase 1b: promise not to accept anything below n
if self.promised_n is None or n > self.promised_n:
self.promised_n = n
return ('promise', self.accepted_n, self.accepted_v)
return ('reject', self.promised_n)
def accept(self, n, value):
# Phase 2b: accept only if we have not promised a higher n
if self.promised_n is None or n >= self.promised_n:
self.promised_n = n
self.accepted_n = n
self.accepted_v = value
return ('accepted', n, value)
return ('reject', self.promised_n)Split-Brain and How Leaders Prevent It
Split-brain is the nightmare scenario where a network partition leaves two nodes each believing they are the leader, both accepting writes, and the cluster's state forking into two irreconcilable histories. Quorums prevent this by construction: a leader can only commit with a majority, and there is room for at most one majority in the cluster. The side of a partition that lacks a majority cannot commit anything. In Raft, a stale leader on the minority side discovers a higher term the moment connectivity returns and steps down, discarding its uncommitted entries. The term number is the fencing token that makes this safe.
FLP Impossibility and Failure Assumptions
The FLP impossibility result (Fischer, Lynch, and Paterson, 1985) proves that in a fully asynchronous system โ no bound on message delay โ no deterministic protocol can guarantee consensus if even a single node may crash, because a crashed node is indistinguishable from a merely slow one. Raft and Paxos do not break this theorem; they sidestep it. They always preserve safety (no wrong decision) and guarantee liveness (progress) only under a partial synchrony assumption: that the network is eventually well-behaved enough for timeouts to be meaningful. This is why Raft uses randomized election timers โ to ensure that during chaotic periods the system stalls rather than corrupts, and resumes progress once the network calms down.
Consensus in Real Systems
You rarely implement consensus yourself โ you depend on a battle-tested implementation. etcd uses Raft and is the backing store for Kubernetes, holding all cluster state and coordinating leader election for controllers. ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), a primary-backup protocol very similar in spirit to Raft, and underpins Kafka, HBase, and countless coordination tasks. HashiCorp Consul and Nomad use Raft for service discovery and scheduling. Google Spanner runs Paxos groups per shard. The pattern is universal: a small, odd-sized quorum cluster becomes the single source of truth that larger stateless services build on.
| Raft | Paxos (Multi-Paxos) | ZAB | |
|---|---|---|---|
| Leader | Single elected leader | Stable distinguished proposer | Single leader |
| Designed for | Understandability | Theoretical generality | Primary-backup replication |
| Log model | Strongly leader-driven, contiguous | Per-slot decisions | Ordered atomic broadcast |
| Used by | etcd, Consul, TiKV, CockroachDB | Google Chubby, Spanner | ZooKeeper (Kafka, HBase) |
A practical takeaway: consensus is expensive โ every committed write costs a round trip to a majority โ so you keep the consensus cluster small and put as little as possible through it. Store cluster metadata, locks, and leader leases in etcd or ZooKeeper; serve the bulk of your data from systems that lean on those primitives but do not run consensus on the hot path.
Frequently Asked Questions
What is the difference between Raft and Paxos?
They solve the same problem โ agreeing on a replicated log โ and offer the same safety guarantees. Paxos came first and is more general but notoriously hard to implement correctly. Raft was designed for understandability: it enforces a single strong leader, a contiguous append-only log, and an explicit election protocol, which maps more directly to working code. Multi-Paxos with a stable leader ends up looking a lot like Raft.
How many nodes should a consensus cluster have?
Use an odd number โ almost always 3 or 5. Three nodes tolerate one failure and are fine for most deployments; five tolerate two failures and are common for critical infrastructure like etcd in large Kubernetes clusters. Going beyond seven rarely helps: more nodes mean every write must reach a larger majority, increasing latency, while the extra fault tolerance is seldom needed.
Does consensus violate the CAP theorem?
No โ it is a textbook CAP choice. Consensus systems are CP: they choose consistency over availability. During a network partition, the minority side cannot reach a quorum, so it refuses writes rather than risk diverging. Clients on that side see unavailability until the partition heals. This is the correct trade-off for systems of record like configuration stores and lock services, where a wrong answer is far worse than a delayed one.
Consensus is how unreliable machines tell one reliable story. A majority that always overlaps is the whole trick โ everything else is making that trick fast and understandable.
โ alokknight Engineering
