LSM Trees Explained: Memtables, SSTables, Bloom Filters & Compaction (Visualized)
An LSM tree is a storage engine that buffers writes in memory and flushes them to immutable, sorted files on disk, turning random writes into fast sequential ones. This guide covers the write path, read path, Bloom filters, and compaction โ with live animations of each โ and why LSM trees beat B-trees for write-heavy workloads.
A Log-Structured Merge tree (LSM tree) is a write-optimized storage engine that buffers incoming writes in an in-memory structure and periodically flushes them, in sorted order, to a series of immutable files on disk. Instead of updating data in place like a B-tree, an LSM tree only ever appends: every write becomes a fast sequential operation, and the messy work of reconciling and reclaiming space is deferred to a background process called compaction.
This design powers many of the databases that handle the highest write volumes in production: RocksDB, LevelDB, Apache Cassandra, ScyllaDB, and HBase all use LSM-based storage engines. If you have ever wondered how these systems ingest millions of writes per second on commodity disks, the answer is almost always an LSM tree.
The Write Path: WAL + Memtable
Every write into an LSM tree does two things. First, it is appended to a write-ahead log (WAL) on disk โ a simple sequential append that survives crashes. Second, it is inserted into the memtable, an in-memory sorted structure (typically a skip list or balanced tree) that keeps keys ordered. Because the memtable lives in RAM and the WAL is a pure append, a write is acknowledged almost instantly, with no random disk seeks.
The memtable cannot grow forever. Once it crosses a size threshold, it is marked immutable and a fresh memtable takes over new writes. The full memtable is then written out to disk as a Sorted String Table (SSTable): a file whose key-value entries are stored in sorted order. Because the memtable was already sorted, this flush is one big sequential write. Once the SSTable is safely on disk, the corresponding WAL segment can be discarded.
Notice what this buys us: updates and deletes never overwrite anything. An update is just a new entry for the same key in a newer SSTable; a delete writes a special marker called a tombstone. The newest value wins. This append-only discipline is the source of an LSM tree's write speed โ and of the read complexity we tackle next.
The Read Path: Newest First, with Bloom Filters
Because a key's history is scattered across the memtable and many SSTables, a read must look in the right order: newest source first. The engine checks the active memtable, then the immutable SSTables from most recent to oldest, and returns the first match it finds (a tombstone counts as a match and means "deleted"). Stopping at the first hit is what makes the newest write authoritative.
The danger is obvious: a key that does not exist forces a scan through every SSTable. The fix is a Bloom filter โ a small probabilistic bit-array stored per SSTable. Before touching the file, the engine asks the Bloom filter "could this key be in here?" A "no" is always correct, so the entire SSTable is skipped without a disk read. A "yes" might be a false positive, so the file is checked. Bloom filters turn most negative lookups into a handful of in-memory bit checks.
Compaction: Merging SSTables
Left alone, SSTables pile up, slowing reads and wasting space on superseded values and tombstones. Compaction is the background process that fixes this: it reads several SSTables, merges them (a merge sort over already-sorted files), keeps only the newest value for each key, drops tombstoned keys, and writes a single new SSTable. The inputs are then deleted, reclaiming space.
Size-Tiered vs Leveled Compaction
There are two dominant compaction strategies. Size-tiered compaction (the default in Cassandra) waits until several SSTables of similar size accumulate, then merges them into one larger table. It is cheap on writes but can leave many overlapping tables, hurting reads and temporarily doubling disk usage. Leveled compaction (used by LevelDB and RocksDB) organizes SSTables into levels where each level holds non-overlapping tables and is roughly ten times larger than the one above. Reads touch at most one table per level, so read and space amplification stay low โ at the cost of more write work.
Write, Read, and Space Amplification
Compaction is not free โ it rewrites the same data multiple times. Write amplification measures how many bytes are physically written per logical byte of user data; leveled compaction trades higher write amplification for lower read and space amplification. Read amplification is how many places a read may have to check (memtable plus several SSTables), which Bloom filters and compaction keep in check. Space amplification is the extra disk used by stale values and tombstones before compaction reclaims them. Tuning an LSM tree is fundamentally about balancing these three.
LSM Tree vs B-Tree
A B-tree (the engine behind PostgreSQL, MySQL/InnoDB, and most relational databases) updates data in place: it finds the right page and overwrites it, typically a random disk write plus a WAL entry. That is excellent for reads and range scans, but every write may seek to a random location. An LSM tree instead turns all writes into sequential appends and flushes, which is dramatically faster on both spinning disks and SSDs, especially under heavy write load. The trade-off is that reads may have to consult several files, and background compaction consumes CPU and I/O.
| LSM Tree | B-Tree | |
|---|---|---|
| Write pattern | Sequential appends + flushes | In-place random page writes |
| Write throughput | Very high (write-optimized) | Moderate (seek per write) |
| Point read | Memtable + N SSTables (Bloom-filtered) | One tree traversal |
| Range scans | Merge across SSTables | Excellent (keys already in order) |
| Space overhead | Stale values until compaction | Fragmentation / partly-full pages |
| Background cost | Compaction (CPU + I/O) | Minimal |
| Used by | RocksDB, Cassandra, LevelDB, ScyllaDB, HBase | PostgreSQL, MySQL/InnoDB, SQLite |
The rule of thumb: reach for an LSM tree when your workload is write-heavy or ingestion-bound (time-series, event logs, metrics, message queues, write-heavy key-value stores), and prefer a B-tree when you need low-latency point reads, strong range-scan performance, and transactional in-place updates. Many modern engines, like RocksDB, expose tuning knobs so you can move along this spectrum.
Frequently Asked Questions
Why are SSTables immutable?
Immutability is what makes the design simple and fast. Because an SSTable is never modified after it is written, writes are pure sequential I/O, reads need no locking, and the file can be cached, memory-mapped, and shared freely across threads. Changes are expressed as new entries in newer SSTables; the old data is reclaimed later by compaction rather than overwritten in place.
What happens to a delete in an LSM tree?
A delete writes a tombstone โ a marker that says "this key is gone" โ rather than removing anything immediately. Reads that hit the tombstone first treat the key as deleted, even if older SSTables still contain its value. The value and the tombstone are physically removed only during compaction, once it is certain no older SSTable still needs to be masked.
Do Bloom filters ever return wrong answers?
Only in one direction. A Bloom filter can produce a false positive ("maybe present" when the key is actually absent), which just causes one unnecessary SSTable read. It never produces a false negative: if it says "not present," the key is guaranteed absent and the file is safely skipped. The false-positive rate is tunable by sizing the bit-array and choosing the number of hash functions.
An LSM tree makes a simple bet: writing fast is worth reading a little harder. Buffer in memory, flush sequentially, and let compaction quietly clean up behind you.
โ alokknight Engineering
