Bloom Filters Explained: Fast, Memory-Efficient Membership Tests (Visualized)
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can return 'definitely not in the set' or 'possibly in the set' โ it never has false negatives, but allows tunable false positives. Here's how it works, with live animations.
A Bloom filter is a compact, probabilistic data structure that answers one question: is this element possibly in the set, or definitely not? It uses a fixed array of bits and a few hash functions, trading a small, tunable rate of false positives for enormous memory savings over storing every element.
The Problem: 'Have I Seen This Before?'
Imagine checking whether a URL has already been crawled, or whether a key might exist before doing an expensive disk read. A hash set answers this exactly โ until you have billions of items and run out of memory. A Bloom filter answers the same question using a tiny, fixed amount of space, at the cost of occasionally saying "maybe" when the true answer is "no."
How It Works
You keep an array of m bits (all 0 initially) and k independent hash functions. To add an element, hash it k ways and set those k bits to 1. To query an element, hash it the same way and check those bits: if any is 0, the element was definitely never added; if all are 1, it is probably present.
No False Negatives, But False Positives
The defining guarantee: a Bloom filter never returns a false negative. If it says "not present," the element was truly never added. But because different elements can set overlapping bits, it can return a false positive โ "possibly present" for something that was never added. As the filter fills with items, more bits are set, and the false-positive probability rises:
Sizing a Bloom Filter
The false-positive rate depends on the number of bits m, the number of hashes k, and the number of inserted items n. For a target rate, the optimal settings are:
import math
# m = number of bits, n = expected items, p = target false-positive rate
def optimal_m(n, p):
return math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
def optimal_k(m, n):
return max(1, round((m / n) * math.log(2)))
# e.g. 1M items at 1% false positives:
m = optimal_m(1_000_000, 0.01) # ~9.6 million bits (~1.2 MB)
k = optimal_k(m, 1_000_000) # 7 hash functions| Lever | Effect |
|---|---|
| More bits (m) | Fewer false positives, more memory |
| More hashes (k) | A sweet spot exists; too many fills the array faster |
| More items (n) | Array saturates, false-positive rate climbs |
Variants
Counting Bloom filter: replaces each bit with a small counter so you can delete elements (a plain Bloom filter can't). Scalable Bloom filter: chains filters of growing size so it can hold an unknown number of items while keeping the false-positive rate bounded. Cuckoo filter: a modern alternative supporting deletion with comparable space.
Where They're Used
Databases like Cassandra and Google Bigtable use Bloom filters to skip disk reads for keys that can't exist in an SSTable. CDNs use them for "have we cached this?" checks. Chrome historically used one for its Safe Browsing URL blacklist, and Medium uses them to avoid recommending articles you've already read. They're everywhere a fast, cheap "definitely not" answer saves expensive work.
Frequently Asked Questions
Can a Bloom filter have false negatives?
No. A standard Bloom filter never returns a false negative โ if it says an element is not present, it truly was never added. It can only err in the other direction, with false positives.
Can you delete from a Bloom filter?
Not from a standard one โ clearing a bit could break other elements that share it. Use a counting Bloom filter or a cuckoo filter if you need deletion.
How big should a Bloom filter be?
Size it from the expected item count n and your target false-positive rate p using the formulas above. As a rule of thumb, roughly 9.6 bits per element gives about a 1% false-positive rate.
A Bloom filter trades a little accuracy for a lot of memory. 'Definitely no' is often all you need to skip expensive work.
โ alokknight Engineering
