Event-Driven Architecture: How We Replaced a $50K/Month Monolith

The Problem: A Monolith That Couldn't Scale

Our client had a Django monolith handling order processing. Every order triggered a synchronous chain: validate inventory → charge payment → send confirmation email → update analytics → notify warehouse. If the email service was slow, the entire checkout hung. If analytics was down, orders failed.

The system processed 200 orders/minute during normal hours but crashed during flash sales, when load reached 2,000 orders/minute. The client was paying $50K/month for over-provisioned servers to compensate.

Before: The Synchronous Chain of Pain

# The old way: everything synchronous, everything coupled
def process_order(request):
    order = Order.objects.create(**request.data)
    
    # If ANY of these fail, the order fails
    inventory_service.reserve(order)        # 200ms
    payment_service.charge(order)           # 800ms
    email_service.send_confirmation(order)  # 500ms
    analytics_service.track(order)          # 300ms
    warehouse_service.notify(order)         # 400ms
    
    # Total: 2200ms minimum response time
    # If email is down: order fails
    # If analytics is slow: checkout is slow
    return Response({'order_id': order.id})

The Architecture: Events + Commands

We split the system into two concepts:

Commands (things that MUST succeed): inventory reservation + payment. These stay synchronous because the order depends on them.

Events (things that SHOULD happen): email, analytics, warehouse notification. These are published as events and processed asynchronously. If they fail, they retry — the customer doesn't wait.

After: Event-Driven with Celery + Redis

# The new way: synchronous for critical path, events for the rest
def process_order(request):
    order = Order.objects.create(**request.data)
    
    # Critical path: must succeed (synchronous)
    inventory_service.reserve(order)   # 200ms
    payment_service.charge(order)      # 800ms
    
    # Publish event: fire-and-forget (async)
    publish_event('order.completed', {
        'order_id': order.id,
        'customer_email': order.customer.email,
        'total': str(order.total),
    })
    
    # Total: 1000ms response time
    # Email down? Order still succeeds. Retried later.
    return Response({'order_id': order.id})


# Event handlers (separate workers)
@celery_app.task(bind=True, max_retries=5, default_retry_delay=60)
def handle_order_completed(self, event_data):
    """Each handler is independent and retriable"""
    try:
        email_service.send_confirmation(event_data)
    except Exception as exc:
        # Exponential backoff: 60s, 120s, 240s, ...
        # (an explicit countdown overrides default_retry_delay)
        raise self.retry(exc=exc, countdown=2 ** self.request.retries * 60)
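To see what this retry policy amounts to in wall-clock time, the countdown expression can be evaluated for each attempt. This is a small standalone sketch: `retry_countdowns` is a hypothetical helper, not part of Celery, and it assumes `self.request.retries` is 0 when the first retry is scheduled.

```python
def retry_countdowns(max_retries: int = 5, base_seconds: int = 60) -> list[int]:
    """Delay before each retry, mirroring 2 ** retries * base_seconds."""
    return [2 ** attempt * base_seconds for attempt in range(max_retries)]


countdowns = retry_countdowns()
print(countdowns)       # [60, 120, 240, 480, 960]
print(sum(countdowns))  # 1860 seconds, about 31 minutes in total
```

With these settings an email outage shorter than roughly half an hour would be absorbed by retries alone; a longer one would exhaust `max_retries` and need separate handling.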

The Event Bus Pattern

We built a lightweight event bus using Redis Streams (not just pub/sub — streams give you durability and consumer groups):

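import json
from datetime import datetime, timezone

import redis

class EventBus:
    def __init__(self):
        self.redis = redis.Redis(host='redis', port=6379, db=0)
    
    def publish(self, event_type: str, data: dict):
        """Publish event to Redis Stream"""
        event = {
            'type': event_type,
            'data': json.dumps(data),
            # utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
            'timestamp': datetime.now(timezone.utc).isoformat(),
        }
        self.redis.xadd(f'events:{event_type}', event)
    
    def subscribe(self, event_type: str, group: str, consumer: str):
        """Read events with a consumer group.

        Each event is delivered to one consumer in the group and is
        acknowledged only after the caller has finished with it. Events
        left unacknowledged by a crash stay in the group's pending list;
        this loop does not re-read them.
        """
        stream_key = f'events:{event_type}'
        try:
            self.redis.xgroup_create(stream_key, group, id='0', mkstream=True)
        except redis.ResponseError as exc:
            # Ignore only "group already exists"; re-raise anything else
            if 'BUSYGROUP' not in str(exc):
                raise
        
        while True:
            # '>' asks for messages never delivered to this group;
            # the result is empty when the 5s block times out
            events = self.redis.xreadgroup(
                group, consumer,
                {stream_key: '>'},
                count=10, block=5000
            )
            for stream, messages in events:
                for msg_id, fields in messages:
                    # Field names are bytes because decode_responses is off
                    yield msg_id, json.loads(fields[b'data'])
                    # Runs when the caller asks for the next event
                    self.redis.xack(stream, group, msg_id)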
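Because a message is acknowledged only after the consumer has handled it, a worker that crashes in between leaves the message pending. If pending messages are later reclaimed and reprocessed (for example with the Redis `XAUTOCLAIM` command), the same event can reach a handler twice, so handlers are safer when they tolerate duplicates. The following is a minimal sketch of that idea, not the production implementation: `IdempotentHandler` is a hypothetical wrapper, message IDs are shown as strings for readability, and the in-memory set stands in for what would need to be durable storage.

```python
from typing import Callable


class IdempotentHandler:
    """Wrap a handler so a redelivered message ID is processed only once.

    Hypothetical helper: the in-memory set is for illustration only and
    would not survive a worker restart.
    """

    def __init__(self, handler: Callable[[dict], None]):
        self._handler = handler
        self._seen: set[str] = set()

    def __call__(self, msg_id: str, data: dict) -> bool:
        """Return True if the handler ran, False if msg_id was a duplicate."""
        if msg_id in self._seen:
            return False
        self._handler(data)
        # Record the ID only after the handler succeeds, so a failure
        # leaves the message eligible for another attempt.
        self._seen.add(msg_id)
        return True


sent = []
send_email = IdempotentHandler(lambda data: sent.append(data['order_id']))

send_email('1-0', {'order_id': 42})  # first delivery: handler runs
send_email('1-0', {'order_id': 42})  # redelivery: skipped
send_email('2-0', {'order_id': 43})
print(sent)  # [42, 43]
```

Keying on the stream message ID keeps the check independent of the payload, so it works the same way for the email, analytics, and warehouse handlers.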

Results After 3 Months

Metric	Before	After
Checkout response time	2,200ms	1,000ms
Peak throughput	200 orders/min	5,000 orders/min
Monthly server cost	$50,000	$12,000
Email service outage impact	All orders fail	Orders succeed, emails queued
Deployment risk	All-or-nothing	Deploy services independently
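The relative changes implied by the table can be worked out directly from its figures. This sketch only restates the numbers above; the dictionary keys are illustrative names.

```python
before = {'response_ms': 2200, 'peak_orders_per_min': 200, 'monthly_cost_usd': 50_000}
after = {'response_ms': 1000, 'peak_orders_per_min': 5000, 'monthly_cost_usd': 12_000}

latency_reduction = 1 - after['response_ms'] / before['response_ms']
throughput_gain = after['peak_orders_per_min'] / before['peak_orders_per_min']
monthly_savings = before['monthly_cost_usd'] - after['monthly_cost_usd']
cost_reduction = monthly_savings / before['monthly_cost_usd']

print(f'{latency_reduction:.0%} lower checkout latency')                # 55%
print(f'{throughput_gain:.0f}x peak throughput')                        # 25x
print(f'${monthly_savings:,} saved per month ({cost_reduction:.0%})')   # $38,000 (76%)
```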

When NOT to Use Event-Driven Architecture

Event-driven is not always the answer. Avoid it when:

1. You need strict ordering. Events can arrive out of order. If step B must happen after step A, keep them synchronous.

2. You need immediate consistency. Events are eventually consistent. If the user must see the result immediately, don't use async events.

3. Your team is small. Event-driven adds operational complexity. A team of 2-3 developers is better served by a well-structured monolith.

4. You have fewer than 100 requests/minute. A synchronous monolith handles this fine. Don't over-engineer.
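Point 1 can be made concrete. When two events for the same order arrive out of order, a consumer that applies them blindly ends up holding stale state. The sketch below assumes each event carries a per-order `version` counter (a hypothetical field, not part of the event payload shown earlier) and skips anything older than what it has already applied.

```python
def apply_events(events: list[dict]) -> dict[int, str]:
    """Apply status events, skipping any that are older than the state held."""
    status: dict[int, str] = {}
    latest_version: dict[int, int] = {}
    for event in events:
        order_id, version = event['order_id'], event['version']
        if version <= latest_version.get(order_id, 0):
            continue  # stale or duplicate event: ignore it
        latest_version[order_id] = version
        status[order_id] = event['status']
    return status


# 'shipped' (version 2) arrives before 'paid' (version 1)
arrived = [
    {'order_id': 7, 'version': 2, 'status': 'shipped'},
    {'order_id': 7, 'version': 1, 'status': 'paid'},
]
print(apply_events(arrived))  # {7: 'shipped'}
```

A guard like this only helps when a later event fully supersedes an earlier one. When step B depends on the side effects of step A, keeping both on the synchronous path remains the simpler choice.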

Start with a monolith. Extract events when you feel the pain. The worst architecture is the one built for problems you don't have yet.
— alokknight Engineering
