ETL Pipelines in System Design: Extract, Transform, Load Explained (Visualized)
ETL β Extract, Transform, Load β is the backbone of every data warehouse. Raw records are pulled from source systems, cleaned and reshaped, then delivered to an analytical store. This guide covers every stage, ETL vs ELT, batch vs incremental loads, orchestration with Airflow and dbt, and idempotency β with live animations of each concept.
ETL (Extract, Transform, Load) is a data integration pattern in which records are pulled from one or more source systems, reshaped into a consistent analytical schema, and written into a target data store such as a warehouse or data lake. It is the foundational pipeline that turns operational data scattered across databases, SaaS APIs, and event streams into the unified, query-ready datasets that power business intelligence and machine learning.
Without ETL, analysts must query production OLTP databases directly β slowing them down, risking data quality issues, and making it impossible to join records across systems. ETL pipelines decouple analytical workloads from operational ones, giving each the resources and schema best suited to its job.
The Three Stages: Extract, Transform, Load
Each stage of an ETL pipeline has a distinct responsibility. Extract connects to source systems β relational databases, REST APIs, event queues, flat files β and reads raw records. The extractor must handle network failures, pagination, authentication, and source rate limits without corrupting the data it collects. Transform is where the real work happens: rows are validated, nulls are filled, currencies converted, duplicates removed, and fields renamed to match the warehouse schema. Load writes the cleaned records into the target β either appending new rows, upserting by key, or replacing entire partitions. Getting the load strategy wrong causes either missing data or expensive full-table scans.
Staging, Cleaning, and Deduplication
Best-practice ETL never writes raw data directly into the warehouse. Instead it lands extracted records in a staging area β a temporary table or object-storage prefix β before transformation begins. This two-step approach provides a checkpoint: if the transform fails, you can re-run it without hitting the source again. Staging also enables auditing and replay.
Data cleaning within the transform stage covers: rejecting rows whose required fields are null, standardising date formats (e.g. DD/MM/YYYY to ISO 8601), normalising phone numbers and country codes, and trimming whitespace from string columns. Deduplication removes rows that appear more than once because the source system fired the same event twice (a common occurrence with at-least-once delivery queues). A typical strategy assigns a deterministic row_hash from the natural key and drops duplicates with a ROW_NUMBER() OVER (PARTITION BY row_hash ORDER BY ingested_at) window function, keeping only the first occurrence.
Schema mapping resolves the mismatch between source column names and warehouse column names β for example, mapping cust_id in the CRM to customer_id in the warehouse, or splitting a full_name column into first_name and last_name. A mapping dictionary (often stored as YAML or in a metadata table) makes this auditable and easy to change without touching pipeline code.
ETL vs ELT: Where Does the Transform Happen?
Classic ETL performs all transformations on a dedicated processing server before writing to the warehouse. This was necessary when warehouses were expensive on-premises systems priced per CPU-hour: you did not want to store or compute raw, messy data inside them. Modern cloud warehouses β BigQuery, Snowflake, Redshift, Databricks β charge per query or per byte scanned and have near-limitless horizontal scaling. This shift gave rise to ELT: Extract, Load raw data first, then Transform inside the warehouse using SQL.
ELT wins when your warehouse is cheap and powerful, your raw data is valuable to preserve (data lake semantics), and your transformations are best expressed in SQL. ETL still wins when transformations require complex imperative code (ML feature engineering, binary parsing), when the warehouse cannot handle raw volume efficiently, or when governance requires masking PII before it ever lands in the warehouse.
| ETL | ELT | |
|---|---|---|
| Transform location | Separate processing server (Spark, Python) | Inside the warehouse (SQL, dbt) |
| Raw data preserved? | No β only cleaned data lands in warehouse | Yes β raw layer always available for replay |
| Best warehouse fit | On-premises or high-cost OLAP stores | Cloud warehouses (BigQuery, Snowflake, Redshift) |
| Latency | Higher β extra hop through transform server | Lower β fewer network hops |
| PII / compliance | Can mask before data enters warehouse | Raw PII lands in warehouse; mask in transform layer |
| Tool examples | Apache Spark, Talend, Informatica | Fivetran + dbt, Airbyte + dbt, BigQuery DTS |
Batch vs Incremental Loads
The first question every ETL architect faces is: how much data do I move on each run? Full batch loads drop and reload the entire table every run. They are simple and safe β no bookkeeping, no edge cases β but cost grows linearly with table size and quickly becomes prohibitive at scale. Incremental loads only move rows that have changed since the last run, identified by a updated_at timestamp, a CDC log offset, or a partition date. They are far cheaper but require careful tracking of the high-water mark and handling of late-arriving data.
A common incremental pattern: store the last successful run's max(updated_at) in a metadata table, then on the next run extract WHERE updated_at > last_watermark. Soft-deleted rows (marked is_deleted=true rather than actually removed) are caught naturally. Hard-deleted rows are not β for those you need CDC (Change Data Capture) via a tool like Debezium that reads the database's binary replication log.
Orchestration: Airflow DAGs, dbt, and Spark
Real pipelines are not a single script β they are graphs of dependent tasks. Apache Airflow is the most widely deployed orchestrator: you define a pipeline as a Python DAG (Directed Acyclic Graph) where each node is an operator (BashOperator, PythonOperator, BigQueryOperator). Airflow schedules DAGs on a cron schedule, retries failed tasks, sends alerts, and provides a UI to inspect every run. A typical DAG for an ETL pipeline looks like: extract_orders >> stage_orders >> deduplicate >> transform >> load_warehouse >> notify_success.
dbt (data build tool) focuses on the T in ELT: it compiles SQL templates into warehouse queries, manages model dependencies, runs tests (not-null, unique, referential integrity), and generates documentation. Each dbt model is a SELECT statement; dbt figures out the execution order from ref() calls. Apache Spark is used when transformations are too complex or too voluminous for SQL β ML feature pipelines, binary parsing, fuzzy deduplication. Spark reads from object storage or Kafka, processes in memory across a cluster, and writes Parquet files back to the data lake.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'data-eng',
'retries': 3,
'retry_delay': timedelta(minutes=5),
'email_on_failure': True,
}
with DAG(
dag_id='orders_etl',
schedule_interval='@hourly',
start_date=datetime(2024, 1, 1),
catchup=False,
default_args=default_args,
) as dag:
extract = PythonOperator(
task_id='extract_orders',
python_callable=extract_from_postgres, # reads since last watermark
)
stage = PythonOperator(
task_id='stage_to_gcs',
python_callable=upload_to_staging_bucket,
)
transform = BigQueryInsertJobOperator(
task_id='transform_and_deduplicate',
configuration={
'query': {
'query': '{% include "sql/transform_orders.sql" %}',
'useLegacySql': False,
}
},
)
load = BigQueryInsertJobOperator(
task_id='load_to_warehouse',
configuration={
'query': {
'query': '{% include "sql/merge_orders.sql" %}',
'useLegacySql': False,
}
},
)
extract >> stage >> transform >> loadIdempotency and Safe Reprocessing
The single most important property of a production ETL pipeline is idempotency: running the same pipeline twice must produce the same result as running it once. Without idempotency, a network blip or a failed task that gets retried silently duplicates records in the warehouse β a correctness bug that is extremely hard to diagnose after the fact.
There are three classic idempotent load strategies. DELETE + INSERT by partition: before inserting rows for date 2024-03-22, delete all existing rows for that date, then insert the new batch. Rerunning a day's pipeline replaces rather than appends. MERGE / UPSERT: match on a natural key and update-if-exists, insert-if-not. Works for slow-changing dimensions and CDC streams. Write-then-swap: write the new data to a temporary table, then atomically rename it to the production table using a DDL swap β zero downtime and fully reversible. Each strategy fits different access patterns; the key is choosing one deliberately rather than relying on append-only loads that silently stack up duplicates.
Frequently Asked Questions
What is the difference between ETL and a data pipeline?
A data pipeline is the general term for any automated flow of data from one place to another β it could be a real-time stream, a simple file copy, or a machine learning training job. ETL is a specific pattern within that space: it always involves an explicit Extract, Transform, and Load stage, and it usually targets an analytical data store. Every ETL is a data pipeline, but not every data pipeline is ETL.
When should I use a streaming pipeline instead of batch ETL?
Use streaming (Kafka + Flink, Spark Structured Streaming, Dataflow) when your business needs data in the warehouse within seconds or low single-digit minutes β dashboards updated in real time, fraud detection on live transactions, or operational alerts. Batch ETL is simpler, cheaper, and more robust for use cases that tolerate hourly or daily latency, which covers the vast majority of BI and reporting workloads. Start with batch ETL and migrate hot paths to streaming only when a concrete SLA demands it.
How do I handle schema changes in ETL pipelines?
Schema drift β a source adding a column, changing a data type, or renaming a field β is one of the most common causes of ETL failures. The safest approach is schema-on-read in the staging layer (store raw JSON or Parquet with all source columns) plus explicit column selection in the transform layer. When the source adds a column, your staging table captures it automatically; your transform model ignores it until you intentionally promote it. Tools like dbt's dbt_utils.get_column_values and BigQuery's schema auto-detection handle many common cases. For critical schemas, add a contract check β assert that required columns are present and have the expected types β so the pipeline fails loudly at extract time instead of silently corrupting downstream tables.
A pipeline that runs twice and gives the same answer is worth ten that run once perfectly. Build idempotency in from day one β retrofitting it after data is corrupted is never fun.
β alokknight Engineering
