Batch Processing in System Design: Throughput, Idempotency & Checkpointing (Visualized)
Batch processing collects data over time and processes it in scheduled, bounded chunks to maximize throughput. This guide covers batch vs stream processing, the throughput-over-latency trade-off, idempotency, retries, checkpointing, windowing, chunking and parallelism, and failure recovery, with live animations of each idea, plus the tools that run it: Spark, Hadoop, Airflow, and cron.
Batch processing is the practice of collecting data over a period of time and then processing it together in scheduled, bounded chunks rather than one record at a time. Instead of reacting to every event the moment it arrives, a batch system lets work accumulate, then runs a job over the whole group on a fixed schedule. It optimizes for throughput (total records processed per hour) at the cost of latency, the delay between when data arrives and when its result is available.
This is the workhorse pattern behind nightly billing runs, data-warehouse ETL, machine-learning feature pipelines, log aggregation, and report generation. When you do not need an answer in the next second, but you do need to process billions of records efficiently, batch wins.
How a Batch Job Works
Every batch pipeline has the same skeleton: (1) Data accumulates in a source: a database table, an object-store bucket, a message queue, or a set of files. (2) A scheduler triggers a job at a defined time or interval. (3) The job reads a bounded slice of that data (a window), splits it into chunks, processes them, and writes results to a sink. The window is what makes the input finite: a batch is never "all data forever," it is "all data for the period that just closed."
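The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the in-memory `source` and `sink` lists, the record fields `ts` and `amount`, and the helper names `daily_window`, `chunked`, and `run_batch` are all assumptions made for the example.

```python
from datetime import date, datetime, timedelta
from itertools import islice

def daily_window(day):
    """Return the half-open interval [start, end) covering one calendar day."""
    start = datetime.combine(day, datetime.min.time())
    return start, start + timedelta(days=1)

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def run_batch(source, sink, day, chunk_size=2):
    start, end = daily_window(day)
    # Bounded read: only records whose timestamp falls inside the window.
    window = [r for r in source if start <= r["ts"] < end]
    for chunk in chunked(window, chunk_size):
        # "Processing" is a per-chunk sum here, standing in for real work.
        sink.append(sum(r["amount"] for r in chunk))
    return len(window)

# Hypothetical source data: three records in the window, one just after it.
source = [
    {"ts": datetime(2024, 1, 1, 9, 0), "amount": 10},
    {"ts": datetime(2024, 1, 1, 17, 30), "amount": 5},
    {"ts": datetime(2024, 1, 1, 23, 59), "amount": 7},
    {"ts": datetime(2024, 1, 2, 0, 0), "amount": 99},  # belongs to the next window
]
sink = []
processed = run_batch(source, sink, date(2024, 1, 1))
```

The half-open interval is a deliberate choice: a record stamped exactly at midnight belongs to one window only, so consecutive windows neither overlap nor leave gaps.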
Batch vs Stream Processing
The defining contrast in data systems is batch vs stream. A stream processor handles each record as it arrives, producing results within milliseconds; it optimizes for latency. A batch processor waits, groups records into a window, and processes them all at once; it optimizes for throughput. The table below compares how the same firehose of incoming records is handled both ways.
| | Batch processing | Stream processing |
|---|---|---|
| Input | Bounded window (a finite chunk) | Unbounded, continuous |
| Optimizes for | Throughput (records/hour) | Latency (ms per record) |
| Result freshness | Minutes to hours old | Sub-second |
| Trigger | Schedule (cron / Airflow) | Per-event arrival |
| Cost per record | Low (amortized over the batch) | Higher (per-event overhead) |
| Typical tools | Spark, Hadoop MapReduce, Airflow | Kafka Streams, Flink, Spark Streaming |
| Use cases | Billing, ETL, reports, ML training | Fraud alerts, live dashboards, monitoring |
The Throughput-Over-Latency Trade-off
Batch wins on throughput because fixed costs are amortized. Opening a database connection, spinning up an executor, loading a model, or paying JVM warm-up costs the same whether you process one record or a million. By grouping work, the batch pays those costs once and spreads them across the whole window. The price is staleness: a record that arrives just after a window closes waits for the next run. Choosing a window size is exactly this dial: smaller windows mean fresher results but higher per-batch overhead; larger windows mean better efficiency but later answers.
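The amortization argument is simple arithmetic, and a rough cost model makes it concrete. The numbers below (2 seconds of fixed setup per run, 1 millisecond of work per record) are assumptions chosen only for illustration, and `cost_per_record` is a hypothetical helper.

```python
def cost_per_record(fixed_cost_s, per_record_s, batch_size):
    """Average seconds spent per record when one fixed cost is shared by a batch."""
    return (fixed_cost_s + per_record_s * batch_size) / batch_size

FIXED = 2.0         # assumed seconds of setup per run (connection, warm-up)
PER_RECORD = 0.001  # assumed seconds of actual work per record

for n in (1, 1_000, 1_000_000):
    print(f"batch of {n:>9,}: {cost_per_record(FIXED, PER_RECORD, n):.6f} s/record")
```

Under these assumptions a batch of one pays roughly two seconds per record, while a batch of a million pays barely more than the one millisecond of real work. The same model shows the diminishing return: once the fixed cost is a small fraction of the total, making the window larger only adds staleness.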
Windowing & Scheduling
A window bounds the input, for example "all orders created between 00:00 and 23:59 yesterday." A scheduler decides when the job runs. The simplest scheduler is cron, which fires a command on a time expression (0 2 * * * = 2 AM daily). For pipelines with dependencies (job B must wait for job A), teams reach for an orchestrator like Apache Airflow, which models the pipeline as a DAG of tasks, handles retries and backfills, and gives you a UI for failed runs.
```python
# A minimal Airflow DAG: extract -> transform -> load, run daily at 02:00
import datetime as dt

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="daily_orders_etl",
    schedule="0 2 * * *",  # cron expression
    start_date=dt.datetime(2024, 1, 1),
    catchup=False,  # don't backfill every missed day on first run
) as dag:

    def transform(ds, **_):
        # ds is the window date Airflow passes in; the batch is bounded to one day
        # read_orders, aggregate, and write_warehouse are placeholders for your own helpers
        rows = read_orders(where=f"order_date = '{ds}'")
        write_warehouse(aggregate(rows))

    extract = PythonOperator(task_id="extract", python_callable=lambda **_: None)
    load = PythonOperator(task_id="transform_load", python_callable=transform)
    extract >> load  # dependency: load runs only after extract succeeds
```
Chunking & Parallelism
A batch of a billion records often does not fit on one machine, or would take too long to process there, so the input is split into chunks (partitions) that are processed in parallel across many workers. This is the core idea of Hadoop MapReduce and Apache Spark: divide the data, run the same function on each chunk independently, then combine the results. Throughput scales with worker count (double the executors, roughly halve the wall-clock time) until a slow chunk (a "straggler") or a shuffle step becomes the bottleneck.
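The divide, process, combine pattern can be shown on a single machine with the standard library. This is only a sketch of the idea, not how Spark or MapReduce are implemented: the helpers `split`, `process_chunk`, and `run_parallel` are hypothetical, and a thread pool stands in for a cluster of workers.

```python
from concurrent.futures import ThreadPoolExecutor

def split(records, n_chunks):
    """Split a list into n_chunks contiguous partitions of near-equal size."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks

def process_chunk(chunk):
    """The 'map' step: the same function runs on every chunk independently."""
    return sum(x * x for x in chunk)

def run_parallel(records, workers=4):
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # The 'reduce' step: combine per-chunk results into one answer.
    return sum(partials)

total = run_parallel(list(range(1_000)), workers=4)
```

Because each chunk is processed without reference to any other, the chunks can run in any order and on any worker. In CPython, threads will not speed up CPU-bound work like this sum of squares; the example shows the structure, while real speedups come from separate processes or machines.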
Idempotency, Retries & Checkpointing
Batch jobs are long-running and will sometimes fail partway through: a node dies, a network blips, a deploy interrupts the run. The discipline that makes failure survivable rests on three ideas. Idempotency: running the same job twice produces the same result, never double-counting. You achieve it with deterministic keys and upserts instead of blind inserts, so a retry overwrites rather than duplicates. Retries: a failed chunk is automatically re-attempted, often with exponential backoff. Checkpointing: the job periodically records how far it got, so on restart it resumes from the last checkpoint instead of redoing everything from record zero.
Together these turn "the 6-hour job crashed at hour 5" from a catastrophe into a 20-minute resume: when a chunk fails mid-run, the job recovers by re-running only that chunk from its last checkpoint, not the whole batch.
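A small sketch shows how retries and checkpointing fit together. Everything here is an assumption made for illustration: the checkpoint is a JSON file of completed chunk ids, `with_retries` uses a short exponential backoff, and the "node failure" is simulated with a flag.

```python
import json
import os
import tempfile
import time

def load_checkpoint(path):
    """Return the set of chunk ids already completed (empty on a first run)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def save_checkpoint(path, done):
    """Write progress to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn(); on failure wait base_delay * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: let the job fail
            time.sleep(base_delay * 2 ** n)

def run_job(chunks, process, checkpoint_path):
    done = load_checkpoint(checkpoint_path)
    for chunk_id, chunk in enumerate(chunks):
        if chunk_id in done:
            continue  # finished in an earlier run; skip it
        with_retries(lambda: process(chunk_id, chunk))
        done.add(chunk_id)
        save_checkpoint(checkpoint_path, done)

# Demo: chunk 2 keeps failing during the first run, then the job is restarted.
processed = []
broken = {"on": True}

def process(chunk_id, chunk):
    if chunk_id == 2 and broken["on"]:
        raise RuntimeError("simulated node failure")
    processed.append(chunk_id)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
try:
    run_job(chunks, process, path)
except RuntimeError:
    pass  # the first run dies at chunk 2 after exhausting its retries
broken["on"] = False
run_job(chunks, process, path)  # resumes at chunk 2, not chunk 0
```

Note that a crash between `process` and `save_checkpoint` would cause that one chunk to run again on restart, which is why checkpointing still depends on each chunk's work being idempotent.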
Failure Recovery & Reprocessing
Because a batch's input is a bounded, named window, batch systems get a superpower streams struggle with: reprocessing. If you discover a bug in yesterday's transformation, you can simply re-run the job for that window, provided the input is still sitting in the source, unchanged. Orchestrators call this a backfill: re-execute a range of past windows to repair or recompute history. This only works safely when the job is idempotent, which is why idempotency is the non-negotiable foundation of good batch design.
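To make the link between backfills and idempotency concrete, here is a minimal sketch using an in-memory SQLite database. The `orders` and `daily_revenue` tables and the `run_window` and `backfill` helpers are hypothetical; the point is that the write is keyed on the window date, so re-running a window overwrites its earlier result instead of adding to it.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount INTEGER)")
conn.execute("CREATE TABLE daily_revenue (order_date TEXT PRIMARY KEY, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-01", 5), ("2024-01-02", 7)],
)

def run_window(conn, day):
    """Recompute one day's total and overwrite any earlier result for that day."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE order_date = ?",
        (day.isoformat(),),
    ).fetchone()
    # INSERT OR REPLACE is keyed on order_date, so a re-run overwrites the row.
    conn.execute(
        "INSERT OR REPLACE INTO daily_revenue (order_date, total) VALUES (?, ?)",
        (day.isoformat(), total),
    )
    conn.commit()

def backfill(conn, start, end):
    """Re-run every daily window from start to end, inclusive."""
    day = start
    while day <= end:
        run_window(conn, day)
        day += timedelta(days=1)

backfill(conn, date(2024, 1, 1), date(2024, 1, 2))
backfill(conn, date(2024, 1, 1), date(2024, 1, 2))  # second run changes nothing
rows = conn.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall()
```

Had `run_window` used a plain `INSERT` into a table without a key, the second backfill would have doubled every day's row count, which is exactly the failure mode idempotent writes prevent.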
Named Tools You Will See
cron triggers simple time-based jobs. Apache Airflow orchestrates multi-step pipelines as DAGs with retries, backfills, and dependency management. Hadoop MapReduce pioneered distributed batch over commodity clusters with HDFS storage. Apache Spark is the modern standard: it keeps intermediate data in memory where it can, is typically much faster than MapReduce, and offers rich APIs (SQL, DataFrames) for ETL and ML at scale. Most large data platforms combine these: Airflow schedules and orchestrates Spark jobs that chunk and process data across a cluster.
Frequently Asked Questions
When should I use batch processing instead of stream processing?
Use batch when you can tolerate result delay (minutes to hours) and want maximum efficiency on large volumes: billing, ETL, reporting, and ML training are classic fits. Use stream when results must be fresh within seconds, such as fraud detection, live dashboards, or alerting. Many systems run both in a hybrid lambda architecture: a fast streaming layer for approximate live results and a batch layer that recomputes accurate numbers later. A kappa architecture, by contrast, keeps a single streaming pipeline and reprocesses history by replaying the event log.
Why is idempotency so important in batch jobs?
Because batch jobs fail and get retried. If running a job twice double-counts records or sends duplicate emails, retries become dangerous and reprocessing becomes unsafe. An idempotent job (one built on deterministic keys, upserts, and overwrite-by-partition writes) produces the same correct result no matter how many times it runs, which is what makes safe retries, checkpoint resumes, and backfills possible.
What is checkpointing and how does it speed up recovery?
Checkpointing periodically records the job's progress (which chunks or offsets are already done) to durable storage. When the job is restarted after a crash, it reads the last checkpoint and resumes from there instead of reprocessing everything from the beginning. For a long-running job, this turns a failure near the end from a full restart into a short resume of only the unfinished work.
Batch processing trades freshness for throughput: let the work pile up, process a bounded window all at once, and make every step idempotent so any failure is just a resume away.
- alokknight Engineering
