Time Series Databases in System Design: Storage, Compression, Rollups & Retention (Visualized)
A time-series database is purpose-built to store and query data points indexed by time โ from server metrics to IoT sensors to financial ticks. This guide covers append-heavy writes, columnar compression, downsampling, retention policies, and the leading systems (InfluxDB, Prometheus, TimescaleDB) โ with live animations.
A time-series database (TSDB) is a storage engine optimized for sequences of values each stamped with a precise timestamp โ server CPU readings every 10 seconds, temperature sensors every millisecond, stock trades to the microsecond โ where the timestamp is the primary dimension for indexing, querying, and aggregation.
General-purpose relational databases can store time-stamped rows, but they were designed for random reads and writes across arbitrary columns. Time-series workloads are radically different: writes are almost always appends to the present moment, reads typically scan a contiguous time window, and data grows monotonically until a retention policy expires it. Those differences justify a purpose-built engine.
What Makes Time-Series Data Different
Three properties distinguish time-series data from ordinary rows in a relational table:
1. Append-heavy writes. Sensors and monitoring agents produce a constant stream of new readings. Updates or deletes of historical records are rare โ you almost never go back and change last week's CPU reading. This means the write path can be heavily optimized for sequential append rather than random-access update.
2. Time-ordered reads. Queries almost always ask about a range of time: "give me the p99 latency for the last 24 hours, bucketed by minute." Storing data in timestamp order on disk means a range scan is a single sequential I/O pass, not a scattered random-access search.
3. High cardinality tags. Each measurement comes with metadata tags โ host=web-01, region=us-east, service=api โ that identify where it came from. The cross product of tag values creates high cardinality, which blows up naive index structures. TSDBs use inverted tag indexes designed for this pattern.
Streaming Data Points: The Write Path
Incoming data points โ each a tuple of (timestamp, metric_name, tags, value) โ are buffered in an in-memory structure sometimes called a WAL shard or head chunk. Once the chunk fills (typically a few thousand points or a time boundary), it is compressed and flushed to an immutable on-disk block. This mirrors an LSM-tree design: fast sequential writes to memory, then a background compaction that merges and re-compresses blocks into larger, more efficient files.
Columnar Compression of Timestamps and Values
The most impactful design choice in a TSDB is columnar, chunk-based storage. Rather than storing each data point as a complete row, a TSDB stores timestamps in one array and values in a parallel array, then compresses each independently.
For timestamps collected at a regular interval (say every 10 s), the deltas between consecutive timestamps are all the same number. Delta-of-delta encoding (invented at Facebook for Gorilla, adopted by Prometheus and others) represents this constant delta as a single bit. For values, XOR encoding captures the bits that changed between consecutive float64 samples โ if CPU usage barely moved, only a few bits differ, and the result compresses to 1โ2 bytes per point. Real-world systems achieve 1.37 bytes per sample on average (Gorilla paper, 2015), versus 16+ bytes for a raw timestamp+float64 row.
Downsampling and Rollups
Raw 10-second data is invaluable for debugging a production incident, but it is overkill for a dashboard showing the last 90 days. Storing three months of per-second samples for a thousand metrics would require tens of terabytes. Downsampling (also called rollups or continuous aggregates) solves this by periodically computing aggregates over a coarser time bucket and storing only the summary.
A typical tier structure: raw 10 s samples are kept for 7 days; 5-minute averages/min/max are kept for 30 days; 1-hour rollups are kept for 1 year. Queries against a long time range automatically resolve to the appropriate tier, returning answers in milliseconds instead of scanning billions of raw points. InfluxDB calls these continuous queries; TimescaleDB calls them continuous aggregates; Prometheus uses recording rules.
Retention Policies: Automatically Expiring Old Data
Time-series data grows forever unless you set a retention policy โ a rule that automatically deletes data older than a configured duration. In InfluxDB you assign a retention policy (RP) to a database (e.g., DURATION 7d); in Prometheus you pass --storage.tsdb.retention.time=15d; in TimescaleDB you call add_retention_policy() on a hypertable.
Because TSDBs partition data into time-bounded blocks (Prometheus uses 2-hour blocks), expiry is an O(1) directory deletion rather than a row-by-row DELETE scan. The system simply removes the oldest block directory once its end-time falls outside the retention window โ no compaction required. This is why a TSDB can efficiently retain only what you need and stay fast even after years of operation.
Querying by Time Windows and Aggregations
Every TSDB ships with a query language tailored to time-range scans and moving-window aggregations. The three most common examples:
InfluxDB Flux pipes data through a functional chain โ filter by measurement and tags, window into 5-minute buckets, compute mean per bucket.
PromQL (Prometheus) uses an expression language where rate(http_requests_total[5m]) computes the per-second rate over a 5-minute sliding window, and avg by(job)(cpu_usage) aggregates across labels.
TimescaleDB extends PostgreSQL with time-bucket functions, so you can write standard SQL enriched with time_bucket('5 minutes', ts) and benefit from automatic query planner optimizations for hypertable partitions.
-- TimescaleDB: 5-minute average CPU over the past hour
SELECT
time_bucket('5 minutes', time) AS bucket,
host,
AVG(value) AS avg_cpu,
MAX(value) AS peak_cpu,
MIN(value) AS min_cpu
FROM cpu_metrics
WHERE
time > NOW() - INTERVAL '1 hour'
AND metric = 'cpu_usage'
GROUP BY bucket, host
ORDER BY bucket;
-- InfluxDB Flux: same query in functional style
-- from(bucket: "metrics")
-- |> range(start: -1h)
-- |> filter(fn: (r) => r._measurement == "cpu_usage")
-- |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
-- PromQL: rate over 5-minute window
-- rate(http_requests_total{job="api"}[5m])Leading Time-Series Databases Compared
Three systems dominate the open-source TSDB landscape, each making different trade-offs between familiarity, scalability, and operational simplicity.
Prometheus is the de-facto standard for infrastructure monitoring in Kubernetes environments. It uses a pull model (it scrapes metrics from targets), stores data in its own TSDB format with 2-hour blocks, and is designed for single-node reliability without external storage dependencies. Long-term retention requires a remote-write adapter (Thanos, Cortex, Mimir).
InfluxDB targets operational analytics and IoT workloads. Versions 1.x used InfluxQL (SQL-like); version 3.x adopted Apache Arrow + Apache Parquet and a SQL dialect, making it highly compatible with the data engineering ecosystem. It supports both push (line protocol) and pull ingestion and ships with built-in downsampling tasks.
TimescaleDB is a PostgreSQL extension, giving you full SQL compatibility, JOINs with relational tables, foreign keys, and mature tooling โ while adding hypertables (automatic time-based partitioning), continuous aggregates, and compression policies. Teams that already operate PostgreSQL get TSDB capabilities with zero new infrastructure.
| Prometheus | InfluxDB | TimescaleDB | |
|---|---|---|---|
| Query language | PromQL | Flux / SQL (v3) | SQL (PostgreSQL) |
| Ingestion model | Pull (scrape) | Push (line protocol) | Push (SQL INSERT / COPY) |
| Storage engine | Custom TSDB (2-hr blocks) | TSM / Parquet (v3) | PostgreSQL hypertables |
| Compression | Delta-of-delta + XOR | Gorilla-style + columnar | Columnar + dictionary |
| Built-in rollups | Recording rules | Continuous queries / Tasks | Continuous aggregates |
| Retention policy | --retention.time flag | DURATION clause per RP | add_retention_policy() |
| Best for | K8s / infra monitoring | IoT, high-write analytics | Relational + time-series mix |
| Clustering | Via Thanos / Cortex / Mimir | InfluxDB Clustered (v3) | Citus + Timescale Cloud |
High Cardinality: The Silent Killer
High cardinality is the most common production pitfall in TSDB deployments. If you add a label whose value is unique per request โ such as user_id, trace_id, or session_id โ the number of unique time series explodes. Prometheus keeps an in-memory inverted index for all active series; a cardinality explosion fills RAM and causes out-of-memory crashes. The rule of thumb: only label with dimensions you will actually GROUP BY in queries. For high-cardinality tracing and event data, use a tracing backend (Jaeger, Tempo) or a log aggregator (Loki, OpenSearch), not a metrics TSDB.
Frequently Asked Questions
Why not just use PostgreSQL or MySQL for time-series data?
General-purpose relational databases store rows on heap pages designed for random-access reads and in-place updates. A time-series workload is almost entirely sequential appends followed by range scans โ a pattern that causes massive B-tree bloat, table bloat, and autovacuum pressure in Postgres. More critically, timestamp-range queries on a plain table require a full index scan that does not compress well. TSDBs achieve 10โ50x better write throughput and 10โ100x better compression on the same hardware because every design decision โ from the on-disk format to the compaction strategy โ targets this specific access pattern. TimescaleDB is the best of both worlds if you need SQL joins with relational data alongside time-series queries.
How does Prometheus store data differently from InfluxDB?
Prometheus writes incoming samples into a 2-hour in-memory head block backed by a write-ahead log (WAL). When the block fills, it is compacted to disk as an immutable chunk using delta-of-delta timestamp encoding and XOR float encoding. Larger compaction levels merge multiple 2-hour blocks into 6-hour, then 24-hour blocks. InfluxDB originally used its own Time-Structured Merge tree (TSM) format โ a variant of LSM tuned for time data โ but InfluxDB v3 switched to Apache Parquet columns stored in object storage, making it fundamentally a cloud-native columnar store. The practical difference: Prometheus is optimized for fast range scans on a local SSD; InfluxDB v3 is optimized for petabyte-scale storage on cheap object storage (S3) with separate compute.
When should I downsample and when should I keep raw data?
Keep raw data for as long as you need to perform per-sample forensic analysis โ typically 7โ30 days for infrastructure metrics. Beyond that, aggregated rollups (5-minute or 1-hour averages, min, max, p95) preserve the shape of trends and anomalies for dashboards and capacity planning at a fraction of the storage cost. The practical threshold: if your dashboard time range is wider than 2x the raw scrape interval, you are wasting I/O reading raw points. Design your tier structure so that the finest resolution fits comfortably in a week's retention, the medium tier in 30โ90 days, and summary tier for a year or more. Alert rules should always evaluate against raw or near-raw data, never rolled-up data, since aggregation can smooth out the spikes that trigger an alert.
Time series data is the heartbeat of your system. Model it right โ tag what you query, retain what you need, roll up what you don't โ and it will tell you exactly what went wrong and when.
โ alokknight Engineering
