Data Warehouse in System Design: Columnar Storage, Star Schema, ETL & MPP (Visualized)
A data warehouse is a central analytical store optimized for complex queries over massive historical datasets. This guide covers columnar storage, MPP engines, star and snowflake schemas, ETL vs ELT, OLTP vs OLAP, and how modern platforms like Snowflake, BigQuery, and Redshift are built β with live animations.
A data warehouse is a centralized repository that stores large volumes of structured, historical data and is optimized for read-heavy analytical queries rather than for the high-frequency transactional writes of an operational system. Where a production database answers questions like "what did this user order five minutes ago?", a warehouse answers questions like "what were our top-selling product categories by region over the last three years?" β scanning billions of rows to return a single aggregated result.
Warehouses sit downstream of production systems. Raw transactional data is extracted from databases, logs, and third-party APIs; transformed into a clean, consistent shape; and loaded into the warehouse. Business analysts and data scientists then run SQL queries, build dashboards, or train machine learning models on top of it. Because analytical queries touch wide swaths of data rather than individual rows, warehouses are architected very differently from ordinary relational databases.
OLTP vs OLAP: Two Fundamentally Different Workloads
The distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) is the most important concept in data warehouse design. OLTP systems β PostgreSQL, MySQL, MongoDB β are optimized for thousands of concurrent small reads and writes that touch a few rows at a time. OLAP systems are designed for rare but enormous queries that scan millions or billions of rows and return aggregated statistics.
| OLTP | OLAP (Data Warehouse) | |
|---|---|---|
| Primary workload | INSERT / UPDATE / DELETE + point reads | Aggregate SELECT over huge ranges |
| Query pattern | Many small, fast transactions | Few large, long-running scans |
| Data scope per query | 1β100 rows | Millionsβbillions of rows |
| Schema style | 3NF normalized, minimal redundancy | Denormalized star / snowflake schemas |
| Storage layout | Row-oriented (easy full-row access) | Column-oriented (efficient column scans) |
| Typical latency | < 10 ms | Seconds to minutes |
| Examples | PostgreSQL, MySQL, Oracle OLTP | Snowflake, BigQuery, Redshift, ClickHouse |
Columnar Storage: Why Warehouses Store Data Differently
Row-oriented databases (PostgreSQL, MySQL) store each row contiguously on disk: all columns for row 1, then all columns for row 2, and so on. This is great for INSERT and full-row UPDATE but terrible for analytical queries that only need two or three columns out of fifty. An analytical query like SELECT SUM(revenue) FROM orders WHERE year = 2023 must still read every column of every row into memory β most of it wasted.
Columnar storage flips this: all values for column 1 are stored together, then all values for column 2, and so on. The query above now only reads the revenue and year columns off disk, skipping the other 48. That alone can reduce I/O by 90%+. Columnar layout also enables excellent compression β because a column holds one data type and values are often similar (e.g., a country column with 100 distinct values across 10 billion rows), run-length encoding or dictionary compression achieves ratios of 10:1 or better.
Massively Parallel Processing (MPP)
A single machine cannot scan petabytes of data in seconds. Modern warehouses use Massively Parallel Processing (MPP): the query is split into fragments, each fragment runs on a separate node working on its slice of the data, and results are merged. Snowflake calls each compute cluster a virtual warehouse; BigQuery uses a serverless MPP engine called Dremel; Redshift distributes data across slices within nodes. The key insight is that storage and compute are often separated β you can spin up a large compute cluster for a heavy workload and scale it back down, paying only for what you use.
ETL and ELT: Getting Data Into the Warehouse
ETL (Extract, Transform, Load) is the traditional pipeline: data is pulled from source systems, cleaned and reshaped in a staging environment, and then loaded into the warehouse in its final form. This keeps raw messy data out of the warehouse but requires a powerful external processing layer.
ELT (Extract, Load, Transform) has become dominant with cloud warehouses: raw data is loaded first, then transformed inside the warehouse using SQL and tools like dbt. Because modern warehouses have near-unlimited compute, running heavy transformations inside them is often cheaper and faster than maintaining a separate processing cluster. The raw source data is preserved for reprocessing.
Star Schema and Snowflake Schema
Warehouses use denormalized schemas optimized for fast reads rather than the normalized schemas used in OLTP systems. The two dominant patterns are the star schema and the snowflake schema.
In a star schema, a central fact table holds measurable events or metrics (e.g., each row is one order with columns for order_id, amount, quantity, and foreign keys). Surrounding dimension tables hold descriptive context: dim_customer, dim_product, dim_date, dim_store. The fact table has one foreign key per dimension, and each dimension is fully denormalized into a single flat table. The result looks like a star β one center, many arms β and joins are simple: fact joins directly to each dimension.
A snowflake schema normalizes dimension tables further. Instead of one flat dim_product, you might have dim_product referencing dim_category and dim_supplier. This reduces storage redundancy but requires more joins, which can slow analytical queries. Most teams prefer star schemas for their simplicity and query speed.
Data Warehouse vs Data Lake vs Lakehouse
As the volume and variety of data grew, two alternative paradigms emerged alongside the classical warehouse. A data lake (e.g., AWS S3 + Athena, HDFS + Hive) stores raw data in any format β CSV, JSON, Parquet, images, video β at very low cost, with schema applied only at query time (schema-on-read). This is flexible but can become a disorganized "data swamp" without governance. A lakehouse (Databricks Delta Lake, Apache Iceberg, Apache Hudi) blends both: data sits on cheap object storage in open table formats that support ACID transactions, schema enforcement, and fast SQL β giving warehouse-quality queries at data-lake costs.
| Data Warehouse | Data Lake | Lakehouse | |
|---|---|---|---|
| Storage format | Proprietary columnar | Any (raw files) | Open columnar (Parquet + Delta/Iceberg) |
| Schema enforcement | Schema-on-write (strict) | Schema-on-read (flexible) | Both β enforced optionally |
| Data types | Structured only | Structured, semi-, unstructured | Structured + semi-structured |
| ACID transactions | Yes | No (usually) | Yes (Delta Lake, Iceberg) |
| Cost at scale | High (proprietary storage) | Low (object storage) | Low (object storage) |
| Query performance | Very fast | Slower (no indexes) | Fast (statistics + Z-ordering) |
| Best for | BI dashboards, SQL analytics | ML training data, raw archival | Unified analytics + ML platform |
Modern Warehouse Platforms: Snowflake, BigQuery, and Redshift
Snowflake pioneered the separation of storage and compute in cloud warehouses. Data lives in S3 in Snowflake's micro-partition format; compute runs in independently-scalable virtual warehouses (clusters of VMs). You can run multiple workloads β BI dashboards and ETL pipelines β simultaneously on the same data without resource contention, and scale each cluster up or down in seconds.
Google BigQuery uses a serverless model: there are no clusters to manage. Queries are split across thousands of workers by the Dremel engine, billed by bytes scanned. BigQuery's slot model and automatic scaling make it attractive for intermittent workloads. Storage uses Capacitor, Google's proprietary columnar format, with automatic partitioning and clustering to minimize scans.
Amazon Redshift distributes data across compute nodes using a distribution key you choose. Choosing the right DISTKEY and SORTKEY is crucial for performance: if the distribution key matches your most common join or filter column, data moves minimally between nodes. Redshift Serverless removes manual cluster sizing. All three platforms support external tables, semi-structured JSON via SUPER/VARIANT types, and native integration with their cloud ecosystems.
SQL on a Star Schema: A Concrete Example
The query below joins a fact table to three dimension tables and aggregates revenue by quarter and product category β a typical warehouse query pattern. Note how all filtering happens on dimension columns, and the engine uses columnar storage to scan only the required columns in each table.
-- Star schema analytical query: revenue by quarter and category
SELECT
d.year,
d.quarter,
p.category,
SUM(f.revenue) AS total_revenue,
COUNT(DISTINCT f.order_id) AS order_count
FROM
fact_orders f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_customer c ON f.customer_id = c.customer_id
WHERE
d.year = 2023
AND c.segment = 'Enterprise'
GROUP BY
d.year, d.quarter, p.category
ORDER BY
d.quarter, total_revenue DESC;Frequently Asked Questions
What is the difference between a data warehouse and a regular database?
A regular OLTP database (PostgreSQL, MySQL) is optimized for high-frequency, low-latency transactional operations β inserts, updates, and point reads. It uses row-oriented storage, normalized schemas, and row-level locking. A data warehouse is optimized for analytical queries over historical data: it uses columnar storage, denormalized star schemas, and massively parallel query execution. They answer different questions: the OLTP database tells you the current state of one record; the warehouse tells you aggregated trends across billions of records over years.
When should I use ETL vs ELT?
Use ETL when data must be cleaned or masked before it enters your environment β for regulatory reasons, or when raw data is too large/messy to load directly. Use ELT when you have a powerful cloud warehouse (Snowflake, BigQuery, Redshift) and want to preserve raw data for iterative transformation using tools like dbt. ELT has become the modern default because cloud warehouses have the compute to transform data internally at scale, and keeping raw data lets you re-derive derived tables as business logic changes.
Why does columnar storage help analytical queries so much?
An analytical query typically selects 2β5 columns out of 50+ and aggregates them over millions of rows. With row-oriented storage, reading those 2 columns still requires loading all 50 columns of every row into memory β 96% of the I/O is wasted. Columnar storage stores each column separately, so only the needed columns are read from disk. Combined with compression (dictionary encoding, run-length encoding), columnar layouts reduce I/O by 80β95% for typical analytical queries, which translates directly into faster query times and lower compute costs.
A data warehouse is where operational noise becomes analytical signal β columnar storage, MPP, and smart schemas turn raw transaction history into the insights that drive product and business decisions.
β alokknight Engineering
