Data Warehouse in System Design: Columnar Storage, Star Schema, ETL & MPP (Visualized)

A data warehouse is a centralized repository that stores large volumes of structured, historical data and is optimized for read-heavy analytical queries rather than for the high-frequency transactional writes of an operational system. Where a production database answers questions like "what did this user order five minutes ago?", a warehouse answers questions like "what were our top-selling product categories by region over the last three years?" — scanning billions of rows to return a single aggregated result.

Warehouses sit downstream of production systems. Raw transactional data is extracted from databases, logs, and third-party APIs; transformed into a clean, consistent shape; and loaded into the warehouse. Business analysts and data scientists then run SQL queries, build dashboards, or train machine learning models on top of it. Because analytical queries touch wide swaths of data rather than individual rows, warehouses are architected very differently from ordinary relational databases.

OLTP vs OLAP: Two Fundamentally Different Workloads

The distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) is the most important concept in data warehouse design. OLTP systems — PostgreSQL, MySQL, MongoDB — are optimized for thousands of concurrent small reads and writes that touch a few rows at a time. OLAP systems are designed for rare but enormous queries that scan millions or billions of rows and return aggregated statistics.

	OLTP	OLAP (Data Warehouse)
Primary workload	INSERT / UPDATE / DELETE + point reads	Aggregate SELECT over huge ranges
Query pattern	Many small, fast transactions	Few large, long-running scans
Data scope per query	1–100 rows	Millions–billions of rows
Schema style	3NF normalized, minimal redundancy	Denormalized star / snowflake schemas
Storage layout	Row-oriented (easy full-row access)	Column-oriented (efficient column scans)
Typical latency	< 10 ms	Seconds to minutes
Examples	PostgreSQL, MySQL, Oracle OLTP	Snowflake, BigQuery, Redshift, ClickHouse

OLTP vs OLAP: write patterns and query shapes

Watch how OLTP handles many tiny writes while OLAP fires one massive aggregating scan across all rows.

Columnar Storage: Why Warehouses Store Data Differently

Row-oriented databases (PostgreSQL, MySQL) store each row contiguously on disk: all columns for row 1, then all columns for row 2, and so on. This is great for INSERT and full-row UPDATE but terrible for analytical queries that only need two or three columns out of fifty. An analytical query like SELECT SUM(revenue) FROM orders WHERE year = 2023 must still read every column of every row into memory — most of it wasted.

Columnar storage flips this: all values for column 1 are stored together, then all values for column 2, and so on. The query above now only reads the revenue and year columns off disk, skipping the other 48. That alone can reduce I/O by 90%+. Columnar layout also enables excellent compression — because a column holds one data type and values are often similar (e.g., a country column with 100 distinct values across 10 billion rows), run-length encoding or dictionary compression achieves ratios of 10:1 or better.

Massively Parallel Processing (MPP)

A single machine cannot scan petabytes of data in seconds. Modern warehouses use Massively Parallel Processing (MPP): the query is split into fragments, each fragment runs on a separate node working on its slice of the data, and results are merged. Snowflake calls each compute cluster a virtual warehouse; BigQuery uses a serverless MPP engine called Dremel; Redshift distributes data across slices within nodes. The key insight is that storage and compute are often separated — you can spin up a large compute cluster for a heavy workload and scale it back down, paying only for what you use.

ETL and ELT: Getting Data Into the Warehouse

ETL (Extract, Transform, Load) is the traditional pipeline: data is pulled from source systems, cleaned and reshaped in a staging environment, and then loaded into the warehouse in its final form. This keeps raw messy data out of the warehouse but requires a powerful external processing layer.

ELT (Extract, Load, Transform) has become dominant with cloud warehouses: raw data is loaded first, then transformed inside the warehouse using SQL and tools like dbt. Because modern warehouses have near-unlimited compute, running heavy transformations inside them is often cheaper and faster than maintaining a separate processing cluster. The raw source data is preserved for reprocessing.

ETL pipeline: extract, transform, and columnar load

Raw operational data flows through extraction and transformation before landing as clean columnar data in the warehouse — then a column-scan query ignores unneeded columns.

Star Schema and Snowflake Schema

Warehouses use denormalized schemas optimized for fast reads rather than the normalized schemas used in OLTP systems. The two dominant patterns are the star schema and the snowflake schema.

In a star schema, a central fact table holds measurable events or metrics (e.g., each row is one order with columns for order_id, amount, quantity, and foreign keys). Surrounding dimension tables hold descriptive context: dim_customer, dim_product, dim_date, dim_store. The fact table has one foreign key per dimension, and each dimension is fully denormalized into a single flat table. The result looks like a star — one center, many arms — and joins are simple: fact joins directly to each dimension.

A snowflake schema normalizes dimension tables further. Instead of one flat dim_product, you might have dim_product referencing dim_category and dim_supplier. This reduces storage redundancy but requires more joins, which can slow analytical queries. Most teams prefer star schemas for their simplicity and query speed.

Star schema: fact table and dimension tables

A central fact table is surrounded by dimension tables. Each highlighted join path shows how a query assembles context from dimensions.

Data Warehouse vs Data Lake vs Lakehouse

As the volume and variety of data grew, two alternative paradigms emerged alongside the classical warehouse. A data lake (e.g., AWS S3 + Athena, HDFS + Hive) stores raw data in any format — CSV, JSON, Parquet, images, video — at very low cost, with schema applied only at query time (schema-on-read). This is flexible but can become a disorganized "data swamp" without governance. A lakehouse (Databricks Delta Lake, Apache Iceberg, Apache Hudi) blends both: data sits on cheap object storage in open table formats that support ACID transactions, schema enforcement, and fast SQL — giving warehouse-quality queries at data-lake costs.

	Data Warehouse	Data Lake	Lakehouse
Storage format	Proprietary columnar	Any (raw files)	Open columnar (Parquet + Delta/Iceberg)
Schema enforcement	Schema-on-write (strict)	Schema-on-read (flexible)	Both — enforced optionally
Data types	Structured only	Structured, semi-, unstructured	Structured + semi-structured
ACID transactions	Yes	No (usually)	Yes (Delta Lake, Iceberg)
Cost at scale	High (proprietary storage)	Low (object storage)	Low (object storage)
Query performance	Very fast	Slower (no indexes)	Fast (statistics + Z-ordering)
Best for	BI dashboards, SQL analytics	ML training data, raw archival	Unified analytics + ML platform

Modern Warehouse Platforms: Snowflake, BigQuery, and Redshift

Snowflake pioneered the separation of storage and compute in cloud warehouses. Data lives in S3 in Snowflake's micro-partition format; compute runs in independently-scalable virtual warehouses (clusters of VMs). You can run multiple workloads — BI dashboards and ETL pipelines — simultaneously on the same data without resource contention, and scale each cluster up or down in seconds.

Google BigQuery uses a serverless model: there are no clusters to manage. Queries are split across thousands of workers by the Dremel engine, billed by bytes scanned. BigQuery's slot model and automatic scaling make it attractive for intermittent workloads. Storage uses Capacitor, Google's proprietary columnar format, with automatic partitioning and clustering to minimize scans.

Amazon Redshift distributes data across compute nodes using a distribution key you choose. Choosing the right DISTKEY and SORTKEY is crucial for performance: if the distribution key matches your most common join or filter column, data moves minimally between nodes. Redshift Serverless removes manual cluster sizing. All three platforms support external tables, semi-structured JSON via SUPER/VARIANT types, and native integration with their cloud ecosystems.

SQL on a Star Schema: A Concrete Example

The query below joins a fact table to three dimension tables and aggregates revenue by quarter and product category — a typical warehouse query pattern. Note how all filtering happens on dimension columns, and the engine uses columnar storage to scan only the required columns in each table.

-- Star schema analytical query: revenue by quarter and category
SELECT
    d.year,
    d.quarter,
    p.category,
    SUM(f.revenue)        AS total_revenue,
    COUNT(DISTINCT f.order_id) AS order_count
FROM
    fact_orders       f
    JOIN dim_date     d ON f.date_id     = d.date_id
    JOIN dim_product  p ON f.product_id  = p.product_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
WHERE
    d.year     = 2023
    AND c.segment = 'Enterprise'
GROUP BY
    d.year, d.quarter, p.category
ORDER BY
    d.quarter, total_revenue DESC;

Frequently Asked Questions

What is the difference between a data warehouse and a regular database?

A regular OLTP database (PostgreSQL, MySQL) is optimized for high-frequency, low-latency transactional operations — inserts, updates, and point reads. It uses row-oriented storage, normalized schemas, and row-level locking. A data warehouse is optimized for analytical queries over historical data: it uses columnar storage, denormalized star schemas, and massively parallel query execution. They answer different questions: the OLTP database tells you the current state of one record; the warehouse tells you aggregated trends across billions of records over years.

When should I use ETL vs ELT?

Use ETL when data must be cleaned or masked before it enters your environment — for regulatory reasons, or when raw data is too large/messy to load directly. Use ELT when you have a powerful cloud warehouse (Snowflake, BigQuery, Redshift) and want to preserve raw data for iterative transformation using tools like dbt. ELT has become the modern default because cloud warehouses have the compute to transform data internally at scale, and keeping raw data lets you re-derive derived tables as business logic changes.

Why does columnar storage help analytical queries so much?

An analytical query typically selects 2–5 columns out of 50+ and aggregates them over millions of rows. With row-oriented storage, reading those 2 columns still requires loading all 50 columns of every row into memory — 96% of the I/O is wasted. Columnar storage stores each column separately, so only the needed columns are read from disk. Combined with compression (dictionary encoding, run-length encoding), columnar layouts reduce I/O by 80–95% for typical analytical queries, which translates directly into faster query times and lower compute costs.

A data warehouse is where operational noise becomes analytical signal — columnar storage, MPP, and smart schemas turn raw transaction history into the insights that drive product and business decisions.
— alokknight Engineering