HyperLogLog Explained: Count Billions of Distinct Items with 12 KB
HyperLogLog is a probabilistic algorithm that estimates the number of distinct elements in a data stream using a tiny, fixed amount of memory โ a few kilobytes to count billions of unique values within roughly 2% error. It powers Redis PFCOUNT, Google BigQuery, and every analytics system that needs cardinality at scale.
HyperLogLog is a probabilistic algorithm that estimates the cardinality (number of distinct elements) of a multiset using a fixed, extremely small amount of memory โ typically 12 KB โ regardless of whether the dataset contains one thousand or one trillion unique values. It achieves this by trading exact precision for a tunable, predictably small error rate of around 0.81% to 2%, making it the go-to data structure whenever you need approximate distinct counts at massive scale.
The cardinality problem appears constantly in production systems: how many unique visitors hit your site today? How many distinct IP addresses tried to log in? How many different product SKUs have been viewed this hour? A naive solution โ storing every element in a hash set โ costs memory proportional to the number of unique items. For billions of users that is gigabytes of RAM per query, per time window. HyperLogLog solves this in constant space.
The Core Intuition: Leading Zeros as a Cardinality Signal
HyperLogLog's brilliance comes from a probabilistic observation about hash functions. When you hash a value uniformly at random into a binary string, the probability of seeing at least k leading zeros is 1 / 2^k. So if the maximum number of leading zeros you have ever seen in your stream is 5, then you probably hashed around 2^5 = 32 distinct elements. If it is 20 leading zeros, you have probably seen around a million distinct values.
This single register gives a very rough estimate with very high variance. The key improvement in HyperLogLog is using many independent buckets (called registers) and averaging their estimates. The first few bits of the hash value select which register to update; the remaining bits are used to count leading zeros. With 214 = 16,384 registers, each storing a 6-bit integer, the total memory is just 12 KB and the error drops to 0.81%.
Many Registers, One Accurate Estimate
A single register estimate has catastrophic variance โ one unlucky hash with 20 leading zeros would suggest you have a million distinct values even if you only have three. HyperLogLog fixes this with stochastic averaging: the hash output is split so the first b bits select one of 2^b registers, and the remaining bits are used to measure leading zeros. Each register independently tracks its own maximum, then all registers are combined using the harmonic mean, which is resistant to the high outliers that would distort a simple arithmetic average.
The harmonic mean of m registers gives an estimate proportional to m / sum(2^(-M[j])) where M[j] is the maximum leading-zeros count in register j. This averaging is why HyperLogLog achieves a standard error of 1.04 / sqrt(m) โ with 16,384 registers the error is 0.81%. More registers, more memory, less error.
Memory: Exact Set vs HyperLogLog
The most striking property of HyperLogLog is that its memory footprint is completely independent of the number of distinct elements. An exact hash set needs roughly 50โ100 bytes per unique item in a production implementation (Java HashSet uses around 80 bytes per element due to object overhead and load factor). A Redis HyperLogLog is capped at 12 KB no matter if it has counted 10 or 10 billion distinct values.
To count 1 billion distinct user IDs exactly you would need roughly 80 GB of RAM. HyperLogLog does it in 12 KB โ a compression ratio of over 6,000,000:1 โ with only 0.81% error. For analytics workloads where you need unique visitor counts, this trade-off is almost always the right one.
HyperLogLog in Redis: PFADD and PFCOUNT
Redis ships a production-ready HyperLogLog implementation under the PF prefix (after Philippe Flajolet, who co-invented the algorithm in 2007). It is the simplest way to add approximate cardinality counting to any system. PFADD adds one or more elements; PFCOUNT returns the estimated cardinality. The commands are O(1) for any size dataset.
# Track unique visitors per day
PFADD visitors:2024-04-29 user:1001 user:2034 user:1001 user:5521
# Returns 1 (HLL was modified)
PFCOUNT visitors:2024-04-29
# Returns 3 (user:1001 is deduplicated)
# Merge multiple HLLs (e.g., combine daily visitors into weekly)
PFMERGE visitors:week:17 visitors:2024-04-29 visitors:2024-04-28 visitors:2024-04-27
PFCOUNT visitors:week:17
# Approximate unique visitors across all three daysRedis uses a sparse representation for small cardinalities (under ~3,000 elements) that is much smaller than 12 KB, then automatically promotes to a dense 12 KB representation as the cardinality grows. The PFMERGE command merges multiple HLL structures into one in O(n) time โ critical for use cases like combining per-server counts into a global count. Importantly, HLL merge is perfectly accurate: the result is statistically equivalent to having run a single HLL over all the inputs combined.
When to Use HyperLogLog vs Exact Counting
HyperLogLog is not a universal replacement for sets. It cannot enumerate the distinct elements โ only count them โ and the count has an error margin. The decision comes down to whether you need the exact number or just an accurate estimate, and how much memory you can afford. The table below summarizes the key trade-offs.
| Property | Exact Set (e.g., Redis SET) | HyperLogLog (Redis PF*) |
|---|---|---|
| Memory at 1M items | ~80 MB (80 bytes/item) | 12 KB (fixed) |
| Memory at 1B items | ~80 GB | 12 KB (fixed) |
| Distinct count accuracy | 100% exact | ~99.2% (0.81% std error) |
| Can list all elements? | Yes (SMEMBERS) | No |
| Can check membership? | Yes (SISMEMBER) | No |
| Add element speed | O(1) | O(1) |
| Merge two structures | O(n) union | O(m) register-wise max |
| Best use case | Small sets, exact dedup needed | Massive cardinality, analytics |
Real-World Use Cases
Unique visitor counting: track distinct user IDs per page per day with a single Redis key per page-date pair. A website with 10,000 pages and 365 days would need 3.65 million keys at 12 KB each โ about 44 GB โ versus hundreds of terabytes for exact sets at scale. Distinct query counting: search engines and databases use HLL to estimate how many distinct queries hit a given shard, informing partitioning decisions. Network monitoring: count distinct source IPs in a traffic window to detect DDoS without storing individual IPs. A/B testing: estimate the number of unique users exposed to each variant across distributed frontends, then merge the per-server HLLs.
Google BigQuery, Apache Spark, and Apache Flink all expose HyperLogLog-based approximate distinct count functions. In BigQuery the function is APPROX_COUNT_DISTINCT(column). These systems use HLL sketches internally so that distributed cardinality estimation can happen by exchanging small sketch buffers instead of full datasets across network partitions โ the same merge property that makes PFMERGE powerful.
Under the Hood: The HyperLogLog++ Refinements
The original 2007 Flajolet-Martin algorithm was improved in Google's 2013 HyperLogLog++ paper with three important refinements. First, it switches to a 64-bit hash to avoid collisions at very high cardinalities (the original 32-bit version breaks above ~10^9 items). Second, it uses a more accurate bias correction for small and large cardinality ranges where the harmonic mean formula overestimates. Third, it uses a sparse representation for low cardinalities that is far more memory efficient than the fixed 12 KB dense representation. Redis's implementation incorporates these improvements, which is why Redis HyperLogLog works accurately from 1 to 1018 distinct elements.
Frequently Asked Questions
Can HyperLogLog give a wrong answer that is badly off?
The error is probabilistic but tightly bounded. With Redis's 16,384-register implementation (standard error 0.81%), roughly 65% of queries will be within 0.81% of the true value, and over 99.7% will be within 2.43% (three standard deviations). In practice, for analytics workloads like daily unique visitors, this level of precision is indistinguishable from exact counting for any business purpose. For cases where you genuinely need exact counts on small sets โ billing, deduplicating financial transactions โ use an exact set instead.
Does PFMERGE lose accuracy?
No. Merging two HyperLogLog structures is lossless from a statistical standpoint. Each register in the merged result takes the element-wise maximum of the corresponding registers from each source. Because the registers are independent, this is mathematically equivalent to having added all elements into a single HLL from the start. The resulting error on the merged estimate is the same 0.81% as for a single HLL โ you do not accumulate additional error by merging. This makes HLL ideal for distributed systems where each node maintains its own HLL and periodically ships it to a coordinator.
Why is the harmonic mean used instead of the arithmetic mean?
The arithmetic mean is dominated by outliers. If one register sees 20 leading zeros (suggesting 220 โ 1 million elements) while the other 16,383 registers see 2โ4 leading zeros, the arithmetic average would be hugely biased upward. The harmonic mean โ n / sum(1/x) โ weights small values more heavily and large outliers less heavily, so a single register with an accidentally large value cannot distort the estimate. This property is essential to HyperLogLog's accuracy. The correction constant 0.7213 in the formula accounts for a systematic bias that remains even after the harmonic mean, making the final estimate nearly unbiased.
HyperLogLog is the quintessential system design trade-off: a provably tiny, bounded error buys you a six-million-to-one compression ratio. Once you stop insisting on exactness you don't need, the algorithm pays for itself on the first billion users.
โ alokknight Engineering
