High Availability (HA) in System Design: Redundancy, Failover & Nines (Visualized)
High availability keeps a system running through failures by removing single points of failure and failing over automatically. This guide covers active-active vs active-passive redundancy, health checks, multi-AZ, quorum, data replication, and target nines — with live animations of each.
High availability (HA) is a design property in which a system is engineered to remain operational and reachable for a very high percentage of time by removing single points of failure and recovering from component failures automatically. Where plain availability just measures the fraction of time a system is up, HA is the deliberate architecture — redundancy, health checks, and automatic failover — that produces a high number. It differs from fault tolerance, which masks failures with zero interruption, and from disaster recovery, which restores service after a major outage.
The core idea is simple: assume every individual component will eventually fail, then make sure the failure of any one component does not take down the whole system. You achieve this with redundant copies of each part, a mechanism to detect failure quickly, and a way to shift work to a healthy copy — ideally fast enough that users barely notice.
Eliminating Single Points of Failure
A single point of failure (SPOF) is any component whose failure brings down the entire system — a lone database, one load balancer, a single availability zone, even a shared power supply. The first job of HA design is to enumerate every SPOF and add redundancy so no individual failure is fatal. If a part cannot be made redundant, it must at least be made to fail over quickly to a replacement.
Redundancy is layered: redundant application servers behind a load balancer, redundant load balancers sharing a floating virtual IP, replicated databases, multiple availability zones, and sometimes multiple regions. Each layer removes one class of SPOF, and the weakest layer caps the availability of the whole system.
Redundancy Topologies: Active-Passive vs Active-Active
There are two foundational redundancy patterns. In active-passive (failover) one node serves all traffic while a standby stays in sync and idle, ready to be promoted when the primary dies. In active-active all nodes serve traffic simultaneously, so a failure simply removes capacity rather than causing a switchover. Active-passive is simpler and avoids write conflicts; active-active uses hardware efficiently and tends to fail over faster, but demands conflict handling and careful state management.
The animation above shows the failover sequence: a floating IP routes every request to the primary, the standby continuously replicates state, and the moment the primary's health check fails the standby is promoted and the virtual IP swings to it. The short banner marks the failover window: the actual recovery time, which the design tries to keep within its recovery-time objective (RTO).
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Traffic handling | Primary serves all; standby idle | All nodes serve simultaneously |
| Resource use | Standby capacity sits unused | Full fleet utilized |
| Failover | Promote standby (brief gap) | Drop node; survivors absorb load |
| Complexity | Lower; no write conflicts | Higher; needs conflict & state handling |
| Typical use | Relational primaries, stateful services | Stateless web tiers, caches, DNS |
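The "survivors absorb load" row has a capacity consequence worth spelling out: an active-active fleet only survives a node loss if the remaining nodes have headroom. The following is a minimal sketch of that arithmetic, assuming load is spread evenly and every node has the same capacity; the function names are illustrative and not from any library.

```python
def max_safe_utilization(nodes: int, failures_tolerated: int = 1) -> float:
    """Highest steady-state per-node utilization that still lets the
    survivors absorb the load of `failures_tolerated` lost nodes.
    Assumes identical nodes and evenly spread load."""
    if nodes <= failures_tolerated:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - failures_tolerated) / nodes


def utilization_after_failure(nodes: int, utilization: float, failed: int = 1) -> float:
    """Per-node utilization once `failed` nodes drop out and the same
    total load is redistributed across the survivors."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return utilization * nodes / survivors


if __name__ == "__main__":
    for n in (2, 3, 4, 10):
        print(f"{n} nodes: run each at or below {max_safe_utilization(n):.0%}")
```

Under these assumptions a two-node active-active pair must run each node at or below 50%, which is the same spare capacity an active-passive pair keeps idle; the efficiency advantage of active-active grows with fleet size.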
Health Checks & Automatic Failover
Redundancy is useless without fast, accurate failure detection. A load balancer or orchestrator continuously probes each node — an HTTP GET /health, a TCP connect, or a heartbeat — and after a few consecutive failures marks the node unhealthy and removes it from rotation. The detection threshold is a trade-off: too sensitive and transient blips cause needless flapping; too lax and dead nodes keep receiving traffic. When the node recovers, it rejoins automatically, which is what makes the system self-healing.
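The consecutive-failure threshold described above can be sketched as a small state machine. This is an illustrative model, not the implementation of any particular load balancer; the `HealthTracker` class and its `fall` and `rise` parameters are hypothetical names chosen to mirror common health-check settings.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one node's health using consecutive-result thresholds."""
    fall: int = 3         # consecutive failed probes before marking the node down
    rise: int = 2         # consecutive good probes before marking it up again
    healthy: bool = True
    _streak: int = 0      # consecutive probe results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the node's health afterwards."""
        if probe_ok == self.healthy:
            # The result agrees with the current state, so a transient
            # blip is forgotten and the contrary streak resets.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.rise if probe_ok else self.fall
        if self._streak >= threshold:
            self.healthy = probe_ok   # flip state and start counting afresh
            self._streak = 0
        return self.healthy


if __name__ == "__main__":
    tracker = HealthTracker()
    probes = [False, False, True, False, False, False, True, True]
    print([tracker.record(ok) for ok in probes])
```

Two failures followed by a success do not remove the node, which is how the threshold absorbs transient blips; raising `fall` reduces flapping at the cost of slower detection.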
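Above, a global load balancer spreads traffic across two availability zones running active-active. When Zone B suffers an outage, its nodes fail their health checks and are pulled from the pool, and Zone A absorbs all the traffic, provided it has enough spare capacity. Requests already in flight to Zone B, or sent before detection completes, can still fail unless clients retry; once the zone is out of rotation, users should notice little or nothing.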
Load Balancing Across Replicas
Load balancing is the mechanism that makes redundancy useful day-to-day. By spreading requests across healthy replicas, the balancer both increases throughput and ensures that when one replica disappears the rest carry the load. For HA the balancer itself must not become a SPOF: run it in a redundant pair (active-passive with a floating IP, or active-active behind anycast/DNS) so the thing that protects you isn't the thing that takes you down.
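As a rough illustration of spreading requests across healthy replicas only, here is a minimal round-robin selector. It is a sketch under simplifying assumptions (single-threaded, health state set externally); the `RoundRobinBalancer` class and the replica names are hypothetical.

```python
class RoundRobinBalancer:
    """Cycles through replicas in order, skipping any marked unhealthy."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.healthy = set(self.replicas)   # all replicas start healthy
        self._next = 0                      # index of the next candidate

    def mark(self, replica, is_healthy: bool) -> None:
        """Record a health-check verdict for one replica."""
        if is_healthy:
            self.healthy.add(replica)
        else:
            self.healthy.discard(replica)

    def pick(self):
        """Return the next healthy replica, or raise if none remain."""
        for _ in range(len(self.replicas)):
            candidate = self.replicas[self._next]
            self._next = (self._next + 1) % len(self.replicas)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy replicas")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])
    lb.mark("web2", False)                  # web2 failed its health checks
    print([lb.pick() for _ in range(4)])    # traffic alternates over web1 and web3
```

The balancer object here is itself a single instance, which is exactly the SPOF the paragraph above warns about; a real deployment would run it redundantly.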
Multi-AZ and Multi-Region
Replicas in the same rack share fate — one power or network failure can kill them all. Multi-AZ deployments place replicas in physically separate data centers within a region, each with independent power and networking, so a zone outage costs you capacity but not uptime. Multi-region goes further, surviving an entire region failure, at the cost of cross-region replication lag and far higher complexity. Most systems start multi-AZ and add multi-region only when an SLA or compliance requirement demands it.
Data Replication, Failover & Quorum
Stateful tiers are the hardest part of HA because data must be both redundant and consistent. Synchronous replication waits for a replica to acknowledge each write, giving zero data loss (RPO of zero) but adding latency; asynchronous replication is fast but can lose the most recent writes on failover. When the primary dies, a replica is promoted — automatically via a coordinator, or manually for safety.
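The difference in data loss between the two modes can be shown with a toy model. This is a deliberately simplified sketch with one primary, one replica, and in-memory lists instead of a network; the `ReplicatedLog` class and its methods are hypothetical and do not model latency.

```python
class ReplicatedLog:
    """Toy primary/replica pair illustrating sync vs async replication."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []   # records stored on the primary
        self.replica = []   # records the replica has received
        self.acked = []     # records acknowledged to the client

    def write(self, record) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the replica must have the record before we ack.
            self.replica.append(record)
        # Asynchronous: ack immediately; the replica catches up later.
        self.acked.append(record)

    def replicate_pending(self, limit=None) -> None:
        """Ship up to `limit` outstanding records to the replica."""
        pending = self.primary[len(self.replica):]
        if limit is not None:
            pending = pending[:limit]
        self.replica.extend(pending)

    def fail_over(self):
        """Primary is lost: return acknowledged writes the replica never got."""
        return self.acked[len(self.replica):]


if __name__ == "__main__":
    log = ReplicatedLog(synchronous=False)
    for i in range(5):
        log.write(f"w{i}")
    log.replicate_pending(limit=3)   # replica lags by two writes
    print(log.fail_over())           # acknowledged writes lost on promotion
```

In the asynchronous run the two most recent acknowledged writes are gone after promotion; with `synchronous=True` the same sequence loses nothing, at the price of every write waiting on the replica.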
To promote a new leader safely without two nodes both believing they are primary (split-brain), distributed systems use quorum: a majority of an odd number of nodes must agree before a write commits or a leader is elected. With five replicas, quorum is three; a network partition that isolates a minority of two simply blocks that minority from accepting writes, while the majority keeps serving. This is the backbone of consensus protocols like Raft and Paxos.
As the animation shows, when the partition isolates two replicas the remaining three retain a majority and continue committing writes. The isolated pair cannot reach quorum, so they refuse writes rather than diverge — trading a little availability for consistency, exactly the CAP-theorem choice a CP system makes.
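The majority rule is simple enough to check by hand. The sketch below covers only the counting, not a consensus protocol such as Raft or Paxos; the function names are illustrative.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster must have at least one node")
    return cluster_size // 2 + 1


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if a partition side with `reachable` nodes may accept writes."""
    return reachable >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority still exists."""
    return cluster_size - quorum_size(cluster_size)


if __name__ == "__main__":
    # Five replicas split 3 / 2 by a partition, as in the text above.
    print(has_quorum(3, 5), has_quorum(2, 5))
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Because two disjoint groups cannot both hold a strict majority, at most one side of a partition can elect a leader. The loop also shows why odd sizes are preferred: a four-node cluster tolerates one failure, the same as a three-node cluster.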
The Trade-off: Cost & Complexity
Every nine of availability costs money and complexity. Active-passive doubles your hardware for a tier that mostly sits idle; active-active and multi-region add data-synchronization machinery, conflict resolution, and operational burden. More moving parts also mean more failure modes — an overly aggressive failover system can itself cause outages. The engineering goal is not maximum availability but the right availability for the business: match the target to what downtime actually costs.
Target Nines & SLAs
Availability is quoted in nines, shorthand for the percentage of time the system is up (99.9% is three nines). A service-level agreement (SLA) is the contractual promise; the service-level objective (SLO) is your internal target, usually stricter. Each additional nine cuts allowed downtime tenfold and costs far more to reach, so teams pick the lowest tier that meets user and contract needs.
| Availability | Downtime per year | Downtime per month | Typical use |
|---|---|---|---|
| 99% (two nines) | ~3.65 days | ~7.3 hours | Internal tools |
| 99.9% (three nines) | ~8.77 hours | ~43.8 min | Standard web apps |
| 99.99% (four nines) | ~52.6 min | ~4.4 min | Business-critical SaaS |
| 99.999% (five nines) | ~5.26 min | ~26 sec | Telecom, payments |
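The downtime figures in the table follow directly from the availability percentage. A minimal sketch of the calculation, assuming a 365.25-day year and a month of one twelfth of a year (assumptions chosen because they reproduce the rounded values above):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # assumed average year length
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # assumed average month length


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Downtime budget for a period at the given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * period_seconds


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_seconds(target)
        per_month = allowed_downtime_seconds(target, SECONDS_PER_MONTH)
        print(f"{target}%: {per_year / 60:.1f} min/year, {per_month / 60:.2f} min/month")
```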
A subtle point: availability multiplies across dependencies in series. A service that depends on three components each at 99.9% caps out near 99.7% — which is why HA designs favor redundancy (parallel paths) over long chains of single dependencies.
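The series and parallel cases can be compared with two short functions. This sketch assumes component failures are independent, which real systems only approximate (replicas that share a rack or a bad deploy fail together); the function names are illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one redundant copy must be up: unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Three dependencies in series, each at 99.9%.
    print(f"series:   {series(0.999, 0.999, 0.999):.4%}")
    # Two redundant copies of a 99% component.
    print(f"parallel: {parallel(0.99, 0.99):.4%}")
```

Under the independence assumption, two redundant 99% components reach about 99.99% together, while three 99.9% dependencies in a chain fall to roughly 99.7%.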
# HAProxy: active-active backends with health checks
# A dead node is detected and pulled from rotation automatically.
backend web_tier
    balance roundrobin
    option httpchk GET /health
    # mark down after 3 failed checks (~6s), up after 2 good ones
    default-server inter 2s fall 3 rise 2
    server web1 10.0.1.11:8080 check
    server web2 10.0.2.12:8080 check   # different AZ
    server web3 10.0.3.13:8080 check   # different AZ

Frequently Asked Questions
What is the difference between high availability and fault tolerance?
High availability minimizes downtime by detecting failures and recovering quickly — there may be a brief interruption during failover. Fault tolerance aims for zero interruption by running fully redundant components in lockstep so a failure is masked entirely. Fault tolerance is stricter and far more expensive, so most systems target HA and reserve fault tolerance for the few components that truly cannot blink.
How many nines of availability do I actually need?
Match the target to the cost of downtime. Internal tools are fine at two or three nines; customer-facing SaaS typically targets 99.9%–99.99%; payment and telecom systems chase five nines. Each extra nine costs considerably more than the last, so the cheapest tier that meets your SLA and user expectations is the right one; over-engineering availability wastes budget and adds failure-prone complexity.
Does high availability protect against data loss?
Not by itself. HA keeps the service reachable, but whether you lose recent writes on failover depends on your replication strategy and recovery-point objective (RPO). Synchronous replication gives an RPO near zero; asynchronous replication can lose the last few writes. Protecting against data loss — corruption, accidental deletion, region loss — is the job of backups and disaster recovery, which complement HA rather than replace it.
High availability isn't a feature you bolt on — it's the assumption that everything fails, baked into every layer. Remove the single points of failure, detect the rest fast, and let the system heal itself.
— alokknight Engineering
