LSM Trees Explained: Memtables, SSTables, Bloom Filters & Compaction (Visualized)

A Log-Structured Merge tree (LSM tree) is a write-optimized storage engine that buffers incoming writes in an in-memory structure and periodically flushes them, in sorted order, to a series of immutable files on disk. Instead of updating data in place like a B-tree, an LSM tree only ever appends: every write becomes a fast sequential operation, and the messy work of reconciling and reclaiming space is deferred to a background process called compaction.

This design powers many of the databases that handle the highest write volumes in production: RocksDB, LevelDB, Apache Cassandra, ScyllaDB, and HBase all use LSM-based storage engines. If you have ever wondered how these systems ingest millions of writes per second on commodity disks, the answer is almost always an LSM tree.

The Write Path: WAL + Memtable

Every write into an LSM tree does two things. First, it is appended to a write-ahead log (WAL) on disk — a simple sequential append that survives crashes. Second, it is inserted into the memtable, an in-memory sorted structure (typically a skip list or balanced tree) that keeps keys ordered. Because the memtable lives in RAM and the WAL is a pure append, a write is acknowledged almost instantly, with no random disk seeks.

The memtable cannot grow forever. Once it crosses a size threshold, it is marked immutable and a fresh memtable takes over new writes. The full memtable is then written out to disk as a Sorted String Table (SSTable): a file whose key-value entries are stored in sorted order. Because the memtable was already sorted, this flush is one big sequential write. Once the SSTable is safely on disk, the corresponding WAL segment can be discarded.

Writes land in the memtable, then flush to an SSTable

Writes append to the WAL and insert into the sorted in-memory memtable. When it fills, it flushes to a new immutable SSTable on disk and a fresh memtable takes over.

Notice what this buys us: updates and deletes never overwrite anything. An update is just a new entry for the same key in a newer SSTable; a delete writes a special marker called a tombstone. The newest value wins. This append-only discipline is the source of an LSM tree's write speed — and of the read complexity we tackle next.

The Read Path: Newest First, with Bloom Filters

Because a key's history is scattered across the memtable and many SSTables, a read must look in the right order: newest source first. The engine checks the active memtable, then the immutable SSTables from most recent to oldest, and returns the first match it finds (a tombstone counts as a match and means "deleted"). Stopping at the first hit is what makes the newest write authoritative.

The danger is obvious: a key that does not exist forces a scan through every SSTable. The fix is a Bloom filter — a small probabilistic bit-array stored per SSTable. Before touching the file, the engine asks the Bloom filter "could this key be in here?" A "no" is always correct, so the entire SSTable is skipped without a disk read. A "yes" might be a false positive, so the file is checked. Bloom filters turn most negative lookups into a handful of in-memory bit checks.

A read scans newest-first; Bloom filters skip SSTables

The lookup checks the memtable, then each SSTable newest-first. A Bloom filter saying 'no' skips a file with zero disk reads; a 'maybe' triggers a read until the key is found.

Compaction: Merging SSTables

Left alone, SSTables pile up, slowing reads and wasting space on superseded values and tombstones. Compaction is the background process that fixes this: it reads several SSTables, merges them (a merge sort over already-sorted files), keeps only the newest value for each key, drops tombstoned keys, and writes a single new SSTable. The inputs are then deleted, reclaiming space.

Compaction merges SSTables and drops dead entries

Several SSTables are merge-sorted into one. For duplicate keys only the newest value survives; tombstoned (deleted) keys are dropped entirely, reclaiming space.

Size-Tiered vs Leveled Compaction

There are two dominant compaction strategies. Size-tiered compaction (the default in Cassandra) waits until several SSTables of similar size accumulate, then merges them into one larger table. It is cheap on writes but can leave many overlapping tables, hurting reads and temporarily doubling disk usage. Leveled compaction (used by LevelDB and RocksDB) organizes SSTables into levels where each level holds non-overlapping tables and is roughly ten times larger than the one above. Reads touch at most one table per level, so read and space amplification stay low — at the cost of more write work.

Write, Read, and Space Amplification

Compaction is not free — it rewrites the same data multiple times. Write amplification measures how many bytes are physically written per logical byte of user data; leveled compaction trades higher write amplification for lower read and space amplification. Read amplification is how many places a read may have to check (memtable plus several SSTables), which Bloom filters and compaction keep in check. Space amplification is the extra disk used by stale values and tombstones before compaction reclaims them. Tuning an LSM tree is fundamentally about balancing these three.

LSM Tree vs B-Tree

A B-tree (the engine behind PostgreSQL, MySQL/InnoDB, and most relational databases) updates data in place: it finds the right page and overwrites it, typically a random disk write plus a WAL entry. That is excellent for reads and range scans, but every write may seek to a random location. An LSM tree instead turns all writes into sequential appends and flushes, which is dramatically faster on both spinning disks and SSDs, especially under heavy write load. The trade-off is that reads may have to consult several files, and background compaction consumes CPU and I/O.

	LSM Tree	B-Tree
Write pattern	Sequential appends + flushes	In-place random page writes
Write throughput	Very high (write-optimized)	Moderate (seek per write)
Point read	Memtable + N SSTables (Bloom-filtered)	One tree traversal
Range scans	Merge across SSTables	Excellent (keys already in order)
Space overhead	Stale values until compaction	Fragmentation / partly-full pages
Background cost	Compaction (CPU + I/O)	Minimal
Used by	RocksDB, Cassandra, LevelDB, ScyllaDB, HBase	PostgreSQL, MySQL/InnoDB, SQLite

The rule of thumb: reach for an LSM tree when your workload is write-heavy or ingestion-bound (time-series, event logs, metrics, message queues, write-heavy key-value stores), and prefer a B-tree when you need low-latency point reads, strong range-scan performance, and transactional in-place updates. Many modern engines, like RocksDB, expose tuning knobs so you can move along this spectrum.

Frequently Asked Questions

Why are SSTables immutable?

Immutability is what makes the design simple and fast. Because an SSTable is never modified after it is written, writes are pure sequential I/O, reads need no locking, and the file can be cached, memory-mapped, and shared freely across threads. Changes are expressed as new entries in newer SSTables; the old data is reclaimed later by compaction rather than overwritten in place.

What happens to a delete in an LSM tree?

A delete writes a tombstone — a marker that says "this key is gone" — rather than removing anything immediately. Reads that hit the tombstone first treat the key as deleted, even if older SSTables still contain its value. The value and the tombstone are physically removed only during compaction, once it is certain no older SSTable still needs to be masked.

Do Bloom filters ever return wrong answers?

Only in one direction. A Bloom filter can produce a false positive ("maybe present" when the key is actually absent), which just causes one unnecessary SSTable read. It never produces a false negative: if it says "not present," the key is guaranteed absent and the file is safely skipped. The false-positive rate is tunable by sizing the bit-array and choosing the number of hash functions.

An LSM tree makes a simple bet: writing fast is worth reading a little harder. Buffer in memory, flush sequentially, and let compaction quietly clean up behind you.
— alokknight Engineering