Batch Processing in System Design: Throughput, Idempotency & Checkpointing (Visualized)

Batch processing is the practice of collecting data over a period of time and then processing it together in scheduled, bounded chunks rather than one record at a time. Instead of reacting to every event the moment it arrives, a batch system lets work accumulate, then runs a job over the whole group on a fixed schedule. It optimizes for throughput — total records processed per hour — at the cost of latency, the delay between when data arrives and when its result is available.

This is the workhorse pattern behind nightly billing runs, data-warehouse ETL, machine-learning feature pipelines, log aggregation, and report generation. When you do not need an answer in the next second, but you do need to process billions of records efficiently, batch wins.

How a Batch Job Works

Every batch pipeline has the same skeleton: (1) Data accumulates in a source, such as a database table, an object-store bucket, a message queue, or a set of files. (2) A scheduler triggers a job at a defined time or interval. (3) The job reads a bounded slice of that data (a window), splits it into chunks, processes them, and writes results to a sink. The window is what makes the input finite: a batch is never "all data forever"; it is "all data for the period that just closed."
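The skeleton above can be sketched in a few lines of plain Python. This is a toy illustration rather than a real pipeline: the in-memory SOURCE list, the record shape, and the function names read_window, chunked, and run_batch are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical in-memory "source": (event_time, amount) records that have accumulated.
SOURCE = [
    (datetime(2024, 1, 1, 9, 0), 10),
    (datetime(2024, 1, 1, 17, 30), 5),
    (datetime(2024, 1, 2, 8, 15), 7),  # belongs to the next window
]

def read_window(source, start, end):
    """Return the bounded slice of records with start <= event_time < end."""
    return [record for record in source if start <= record[0] < end]

def chunked(records, size):
    """Yield consecutive chunks of at most `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def run_batch(source, start, end, chunk_size=2):
    """Process one closed window chunk by chunk; return the row destined for the sink."""
    total = 0
    for chunk in chunked(read_window(source, start, end), chunk_size):
        total += sum(amount for _, amount in chunk)
    return {"window_start": start.date().isoformat(), "total": total}

result = run_batch(SOURCE, datetime(2024, 1, 1), datetime(2024, 1, 2))
print(result)  # {'window_start': '2024-01-01', 'total': 15}
```

The record stamped January 2 is left alone: it falls outside the window that just closed and will be picked up by the next run.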

Batch vs Stream Processing

The defining contrast in data systems is batch vs stream. A stream processor handles each record as it arrives, producing results within milliseconds; it optimizes for latency. A batch processor waits, groups records into a window, and processes them all at once; it optimizes for throughput. The animation below shows the same firehose of incoming records handled both ways.

Batch vs stream: accumulate-then-process vs one-by-one

Top: a stream processor handles each record the instant it arrives (low latency). Bottom: a batch system accumulates records in a window, then on a schedule processes them all at once (high throughput).

Aspect	Batch processing	Stream processing
Input	Bounded window (a finite chunk)	Unbounded, continuous
Optimizes for	Throughput (records/hour)	Latency (ms per record)
Result freshness	Minutes to hours old	Sub-second
Trigger	Schedule (cron / Airflow)	Per-event arrival
Cost per record	Low (amortized over the batch)	Higher (per-event overhead)
Typical tools	Spark, Hadoop MapReduce, Airflow	Kafka Streams, Flink, Spark Streaming
Use cases	Billing, ETL, reports, ML training	Fraud alerts, live dashboards, monitoring

The Throughput-Over-Latency Trade-off

Batch wins on throughput because fixed costs are amortized. Opening a database connection, spinning up an executor, loading a model, or warming up a JVM costs the same whether you process one record or a million. By grouping work, the batch pays those costs once and spreads them across the whole window. The price is staleness: a record that arrives just after a window closes waits for the next run. Window size is the dial that sets this trade-off: smaller windows mean fresher results but higher overhead per record, while larger windows mean better efficiency but later answers.
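To make the amortization concrete, here is a small back-of-the-envelope calculation. The cost model (one fixed setup cost per batch plus a constant cost per record) and the numbers are illustrative assumptions, not measurements.

```python
def cost_per_record(fixed_cost_s: float, per_record_s: float, batch_size: int) -> float:
    """Average seconds spent per record when one fixed cost is shared by the whole batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_cost_s / batch_size + per_record_s

# Illustrative numbers only: 2 s of setup per batch, 1 ms of work per record.
for batch_size in (1, 1_000, 1_000_000):
    print(batch_size, cost_per_record(2.0, 0.001, batch_size))
```

With these made-up numbers the average cost falls from about two seconds per record at a batch size of 1 toward the 1 ms floor as the batch grows, which is the efficiency side of the window-size dial.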

Windowing & Scheduling

A window bounds the input, for example "all orders created between 00:00 and 23:59 yesterday." A scheduler decides when the job runs. The simplest scheduler is cron, which fires a command on a time expression (0 2 * * * = 2 AM daily). For pipelines with dependencies, where job B must wait for job A, teams reach for an orchestrator like Apache Airflow, which models the pipeline as a DAG of tasks, handles retries and backfills, and gives you a UI for inspecting failed runs.

# A minimal Airflow DAG: extract -> transform and load, run daily at 02:00.
# Assumes Airflow 2.4 or later in the 2.x line (for the `schedule` argument).
# read_orders, aggregate, and write_warehouse are placeholder helpers, not
# Airflow APIs; supply your own implementations.
import datetime as dt

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform_and_load(ds, **_):
    # ds is the window date (YYYY-MM-DD) Airflow passes in; the batch is bounded to one day
    rows = read_orders(where=f"order_date = '{ds}'")
    write_warehouse(aggregate(rows))


with DAG(
    dag_id="daily_orders_etl",
    schedule="0 2 * * *",            # cron expression: 02:00 every day
    start_date=dt.datetime(2024, 1, 1),
    catchup=False,                    # don't backfill every missed day on first run
) as dag:
    # Placeholder extract step that does nothing; replace with a real extraction
    extract = PythonOperator(task_id="extract", python_callable=lambda **_: None)
    transform_load = PythonOperator(task_id="transform_load", python_callable=transform_and_load)

    extract >> transform_load   # dependency: transform_load runs only after extract succeeds
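Outside an orchestrator, the window for a daily job can be computed directly from the run date. The sketch below is one possible way to do it, assuming naive datetimes in a single time zone; the helper name daily_window is made up for illustration. It uses a half-open interval [start, end), so a record stamped 23:59:30 still lands in yesterday's window and no record is counted in two windows.

```python
from datetime import date, datetime, time, timedelta

def daily_window(run_date: date) -> tuple[datetime, datetime]:
    """Return (start, end) covering the calendar day before run_date, as [start, end)."""
    end = datetime.combine(run_date, time.min)  # midnight at the start of run_date
    start = end - timedelta(days=1)             # midnight at the start of the previous day
    return start, end

start, end = daily_window(date(2024, 1, 2))
print(start.isoformat(), end.isoformat())  # 2024-01-01T00:00:00 2024-01-02T00:00:00
```

A query would then filter with `start <= created_at < end` rather than an inclusive "between 00:00 and 23:59".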

Chunking & Parallelism

A batch of a billion records often does not fit on one machine, or would take too long to process there, so the input is split into chunks (partitions) that are processed in parallel across many workers. This is the core idea of Hadoop MapReduce and Apache Spark: divide the data, run the same function on each chunk independently, then combine the results. Throughput scales with worker count (double the executors and you roughly halve the wall-clock time) until a slow chunk (a "straggler") or a shuffle step becomes the bottleneck.

Chunking a batch across parallel workers

The batch is split into chunks (partitions) and distributed across workers that process them in parallel. This is how Spark and MapReduce scale throughput.
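The divide, process, combine pattern can be sketched on a single machine with the standard library. This is a stand-in for a cluster, not how Spark or MapReduce are implemented: a thread pool plays the role of the workers, and the function names are assumptions for the example. In CPython, threads do not speed up CPU-bound work, so a real job would use processes or a cluster; threads are used here only to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_chunk(chunk):
    """The per-chunk step, run independently on each chunk: here, a sum of squares."""
    return sum(x * x for x in chunk)

def run_parallel(items, workers=4):
    """Process all chunks in a pool of workers, then combine the partial results."""
    chunks = split_into_chunks(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)  # the combine step

total = run_parallel(list(range(1, 101)))
print(total)  # 338350
```

Because each chunk is processed without reference to the others, the same per-chunk function could be shipped to separate machines unchanged; only the combine step needs all partial results.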

Idempotency, Retries & Checkpointing

Batch jobs are long-running and will fail partway through — a node dies, a network blips, a deploy interrupts the run. The discipline that makes failure survivable rests on three ideas. Idempotency: running the same job twice produces the same result, never double-counting. You achieve it with deterministic keys and upserts instead of blind inserts, so a retry overwrites rather than duplicates. Retries: a failed chunk is automatically re-attempted, often with exponential backoff. Checkpointing: the job periodically records how far it got, so on restart it resumes from the last checkpoint instead of redoing everything from record zero.

Together these turn "the 6-hour job crashed at hour 5" from a catastrophe into a 20-minute resume. The animation below shows a chunk failing mid-run and the job recovering by re-running only that chunk from its last checkpoint — not the whole batch.

Failure recovery: reprocessing a failed chunk from a checkpoint

Chunks process left to right; a checkpoint is saved after each one. When a chunk fails, the job does NOT restart from zero — it retries just that chunk from the last checkpoint, then continues. Idempotent writes make the retry safe.
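A minimal sketch of that recovery loop follows, assuming a JSON checkpoint file on local disk and a per-chunk function that is safe to re-run. The file layout and the names load_checkpoint, save_checkpoint, and run_job are hypothetical; a production job would usually keep the checkpoint in durable shared storage.

```python
import json
import os
import tempfile
import time

def load_checkpoint(path):
    """Return the index of the next chunk to process (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_chunk"]

def save_checkpoint(path, next_chunk):
    """Write to a temp file, then rename, so the checkpoint is never half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_chunk": next_chunk}, f)
    os.replace(tmp, path)

def run_job(chunks, process, path, max_attempts=3, base_delay=0.01):
    """Resume from the checkpoint; retry each failing chunk with exponential backoff."""
    for i in range(load_checkpoint(path), len(chunks)):
        for attempt in range(1, max_attempts + 1):
            try:
                process(chunks[i])
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; the checkpoint still points at this chunk
                time.sleep(base_delay * 2 ** (attempt - 1))
        save_checkpoint(path, i + 1)  # chunk i is done

# Demo: chunk "c2" fails once, is retried alone, and the job continues.
calls = []
failures_left = {"c2": 1}

def flaky(chunk):
    calls.append(chunk)
    if failures_left.get(chunk, 0) > 0:
        failures_left[chunk] -= 1
        raise RuntimeError("simulated failure")

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
run_job(["c1", "c2", "c3"], flaky, ckpt)
print(calls)  # ['c1', 'c2', 'c2', 'c3']
```

If the process had crashed outright instead of raising, the next invocation of run_job would read the checkpoint and start at the first unfinished chunk. The chunk that was in flight may be processed twice, which is why its writes need to be idempotent.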

Failure Recovery & Reprocessing

Because a batch's input is a bounded, named window, batch systems get a capability that streams struggle with: reprocessing. If you discover a bug in yesterday's transformation, you can re-run the job for that window, provided the input is still retained unchanged in the source. Orchestrators call this a backfill: re-executing a range of past windows to repair or recompute history. This only works safely when the job is idempotent, which is why idempotency is the non-negotiable foundation of good batch design.
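In its simplest form, a backfill is a loop over past windows that calls the same job once per window. The sketch below assumes daily windows and a job whose output is keyed by window date; the dictionary standing in for the warehouse and the names windows_between, backfill, and run_for_window are made up for illustration.

```python
from datetime import date, timedelta

def windows_between(first: date, last: date):
    """Yield each daily window date from first to last, inclusive."""
    day = first
    while day <= last:
        yield day
        day += timedelta(days=1)

def backfill(first, last, run_for_window):
    """Re-run the job once per past window, oldest first; return what each run reported."""
    return [run_for_window(day) for day in windows_between(first, last)]

# Hypothetical job: output is keyed by window date, so a re-run overwrites its own row.
warehouse = {}

def run_for_window(day):
    warehouse[day.isoformat()] = f"recomputed totals for {day.isoformat()}"
    return day.isoformat()

done = backfill(date(2024, 1, 30), date(2024, 2, 1), run_for_window)
print(done)  # ['2024-01-30', '2024-01-31', '2024-02-01']
```

Running the same backfill a second time leaves the warehouse with the same three rows, which is the property that makes repairing history safe.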

Named Tools You Will See

cron triggers simple time-based jobs. Apache Airflow orchestrates multi-step pipelines as DAGs with retries, backfills, and dependency management. Hadoop MapReduce pioneered distributed batch over commodity clusters with HDFS storage. Apache Spark is the common modern choice: it keeps intermediate data in memory where it can, which typically makes it much faster than MapReduce for multi-stage and iterative jobs, and it offers rich APIs (SQL, DataFrames) for ETL and ML at scale. Most large data platforms combine these: Airflow schedules and orchestrates Spark jobs that chunk and process data across a cluster.

Frequently Asked Questions

When should I use batch processing instead of stream processing?

Use batch when you can tolerate result delay (minutes to hours) and want maximum efficiency on large volumes; billing, ETL, reporting, and ML training are classic fits. Use stream when results must be fresh within seconds, such as fraud detection, live dashboards, or alerting. Many systems run both in a hybrid lambda architecture: a fast streaming layer for approximate live results and a batch layer that recomputes accurate numbers later. A kappa architecture, by contrast, drops the separate batch layer and handles reprocessing by replaying the stream.

Why is idempotency so important in batch jobs?

Because batch jobs fail and get retried. If running a job twice double-counts records or sends duplicate emails, retries become dangerous and reprocessing is impossible. An idempotent job — using deterministic keys, upserts, and overwrite-by-partition writes — produces the same correct result no matter how many times it runs, which is what makes safe retries, checkpoint resumes, and backfills possible.
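As one illustration of a deterministic key plus an upsert, the sketch below uses Python's built-in sqlite3 module. The table name, columns, and the write_total helper are assumptions for the example, and the ON CONFLICT ... DO UPDATE syntax requires SQLite 3.24 or newer; other databases offer equivalent upsert or MERGE statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_totals (order_date TEXT PRIMARY KEY, total INTEGER NOT NULL)"
)

def write_total(conn, order_date, total):
    """Upsert keyed on order_date: a retry overwrites the row instead of adding another."""
    conn.execute(
        "INSERT INTO daily_totals (order_date, total) VALUES (?, ?) "
        "ON CONFLICT(order_date) DO UPDATE SET total = excluded.total",
        (order_date, total),
    )
    conn.commit()

write_total(conn, "2024-01-01", 150)
write_total(conn, "2024-01-01", 150)  # simulated retry of the same window

rows = conn.execute("SELECT order_date, total FROM daily_totals").fetchall()
print(rows)  # [('2024-01-01', 150)]
```

A blind INSERT without the primary key would have left two rows for the same day, and any downstream sum would double-count it.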

What is checkpointing and how does it speed up recovery?

Checkpointing periodically records the job's progress — which chunks or offsets are already done — to durable storage. When the job is restarted after a crash, it reads the last checkpoint and resumes from there instead of reprocessing everything from the beginning. For a long-running job, this turns a failure near the end from a full restart into a short resume of only the unfinished work.

Batch processing trades freshness for throughput: let the work pile up, process a bounded window all at once, and make every step idempotent so any failure is just a resume away.
— alokknight Engineering