Event-Driven Architecture: How We Replaced a $50K/Month Monolith
A practical guide to event-driven architecture with real before/after code, showing how we decomposed a tightly coupled order processing system into a resilient, scalable event pipeline.
The Problem: A Monolith That Couldn't Scale
Our client had a Django monolith handling order processing. Every order triggered a synchronous chain: validate inventory → charge payment → send confirmation email → update analytics → notify warehouse. If the email service was slow, the entire checkout hung. If analytics was down, orders failed.
The system processed 200 orders/minute during normal hours but crashed during flash sales at 2,000/minute. They were paying $50K/month in over-provisioned servers to compensate.
Before: The Synchronous Chain of Pain
    # The old way: everything synchronous, everything coupled
    def process_order(request):
        order = Order.objects.create(**request.data)

        # If ANY of these fail, the order fails
        inventory_service.reserve(order)        # 200ms
        payment_service.charge(order)           # 800ms
        email_service.send_confirmation(order)  # 500ms
        analytics_service.track(order)          # 300ms
        warehouse_service.notify(order)         # 400ms

        # Total: 2200ms minimum response time
        # If email is down: order fails
        # If analytics is slow: checkout is slow
        return Response({'order_id': order.id})

The Architecture: Events + Commands
We split the system into two concepts:
Commands (things that MUST succeed): inventory reservation + payment. These stay synchronous because the order depends on them.
Events (things that SHOULD happen): email, analytics, warehouse notification. These are published as events and processed asynchronously. If a handler fails, it is retried in the background, so the customer never waits on it.
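To make the split concrete, the sketch below redoes the arithmetic using the per-call latencies quoted in this article's code comments. The `response_time_ms` and `in_flight` helpers are illustrative additions, as is the use of Little's law (average concurrent requests = arrival rate × time in system). The latencies are treated as fixed best-case values, not measurements from the client system.

```python
# Per-call latencies in milliseconds, taken from the article's code comments.
LATENCY_MS = {
    "inventory": 200,
    "payment": 800,
    "email": 500,
    "analytics": 300,
    "warehouse": 400,
}

# Commands stay on the synchronous path; everything else becomes an event.
COMMANDS = ("inventory", "payment")


def response_time_ms(steps):
    """Best-case response time when the given steps run one after another."""
    return sum(LATENCY_MS[step] for step in steps)


def in_flight(orders_per_minute, latency_ms):
    """Little's law: average concurrent requests = arrival rate * latency."""
    return orders_per_minute / 60 * latency_ms / 1000


sync_ms = response_time_ms(LATENCY_MS)  # all five steps in sequence
split_ms = response_time_ms(COMMANDS)   # critical path only

print(sync_ms, split_ms)                    # 2200 1000
print(round(in_flight(2000, sync_ms), 1))   # 73.3
print(round(in_flight(2000, split_ms), 1))  # 33.3
```

At the flash-sale rate of 2,000 orders/minute, the synchronous chain holds roughly 73 requests open at once even in the best case, versus roughly 33 for the critical path alone. Any slowdown in a non-critical service raises the first number but not the second.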
After: Event-Driven with Celery + Redis
    # The new way: synchronous for critical path, events for the rest
    def process_order(request):
        order = Order.objects.create(**request.data)

        # Critical path: must succeed (synchronous)
        inventory_service.reserve(order)  # 200ms
        payment_service.charge(order)     # 800ms

        # Publish event: fire-and-forget (async)
        publish_event('order.completed', {
            'order_id': order.id,
            'customer_email': order.customer.email,
            'total': str(order.total),
        })

        # Total: 1000ms response time
        # Email down? Order still succeeds. Retried later.
        return Response({'order_id': order.id})


    # Event handlers (separate workers)
    @celery_app.task(bind=True, max_retries=5, default_retry_delay=60)
    def handle_order_completed(self, event_data):
        """Each handler is independent and retriable"""
        try:
            email_service.send_confirmation(event_data)
        except Exception as exc:
            # Exponential backoff: 60s, 120s, 240s, ... between attempts.
            # self.retry() raises Retry itself; `raise` makes that explicit.
            raise self.retry(exc=exc, countdown=2 ** self.request.retries * 60)

The Event Bus Pattern
We built a lightweight event bus using Redis Streams rather than plain pub/sub, because streams retain messages after delivery and support consumer groups:
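    import json
    from datetime import datetime, timezone

    import redis


    class EventBus:
        def __init__(self):
            self.redis = redis.Redis(host='redis', port=6379, db=0)

        def publish(self, event_type: str, data: dict):
            """Publish event to Redis Stream"""
            event = {
                'type': event_type,
                'data': json.dumps(data),
                # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
                'timestamp': datetime.now(timezone.utc).isoformat(),
            }
            self.redis.xadd(f'events:{event_type}', event)

        def subscribe(self, event_type: str, group: str, consumer: str):
            """Read events via a consumer group.

            Each event goes to one consumer in the group. Delivery is
            at-least-once: an event is acked only after the caller has
            processed it and asked for the next one.
            """
            stream_key = f'events:{event_type}'
            try:
                self.redis.xgroup_create(stream_key, group, id='0', mkstream=True)
            except redis.ResponseError as exc:
                # Only swallow "group already exists"; re-raise anything else
                if 'BUSYGROUP' not in str(exc):
                    raise

            while True:
                # '>' requests only messages never delivered to this group
                events = self.redis.xreadgroup(
                    group, consumer,
                    {stream_key: '>'},
                    count=10, block=5000
                )
                for stream, messages in events:
                    for msg_id, fields in messages:
                        # Keys are bytes because decode_responses is off by default
                        yield msg_id, json.loads(fields[b'data'])
                        # Runs when the caller resumes the generator, i.e. after
                        # it has finished handling the yielded event
                        self.redis.xack(stream, group, msg_id)

Results After 3 Months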
| Metric | Before | After |
|---|---|---|
| Checkout response time | 2,200ms | 1,000ms |
| Peak throughput | 200 orders/min | 5,000 orders/min |
| Monthly server cost | $50,000 | $12,000 |
| Email service outage impact | All orders fail | Orders succeed, emails queued |
| Deployment risk | All-or-nothing | Deploy services independently |
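The "emails queued" row holds only as long as the handler's retry budget lasts. As a rough sketch, the function below reproduces the countdown expression from `handle_order_completed` (`2 ** self.request.retries * 60` with `max_retries=5`) in plain Python, with no Celery dependency. The helper name `retry_schedule` is made up for illustration, and the calculation ignores queueing delay and task run time.

```python
def retry_schedule(max_retries=5, base_delay=60):
    """Delays in seconds before each retry, mirroring 2 ** retries * 60."""
    # Celery's self.request.retries is 0 on the first run, so the first
    # retry waits base_delay seconds and each later retry doubles the wait.
    return [2 ** attempt * base_delay for attempt in range(max_retries)]


delays = retry_schedule()
print(delays)            # [60, 120, 240, 480, 960]
print(sum(delays) / 60)  # 31.0 (minutes of waiting before the last retry)
```

Under those settings, an email outage shorter than about half an hour is absorbed by retries. A longer one exhausts them, and the confirmation would be lost unless something else catches the failed task.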
When NOT to Use Event-Driven Architecture
Event-driven is not always the answer. Avoid it when:
1. You need strict ordering. With retries and multiple workers, events can be processed out of order. If step B must happen after step A, keep them synchronous.
2. You need immediate consistency. Events are eventually consistent. If the user must see the result immediately, don't use async events.
3. Your team is small. Event-driven adds operational complexity. A team of 2-3 developers is better served by a well-structured monolith.
4. You have fewer than 100 requests/minute. A synchronous monolith handles this fine. Don't over-engineer.
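Point 1 is easy to underestimate. The toy simulation below uses an in-memory queue rather than Redis or Celery to show how a single retried event ends up processed after an event that was published later. The `process_with_retries` function and the event names are hypothetical and chosen only for illustration.

```python
from collections import deque


def process_with_retries(events, fail_once):
    """Process events from a queue; a failed event is re-queued at the back.

    `fail_once` is the set of events whose first attempt fails.
    Returns events in the order they were successfully processed.
    """
    queue = deque(events)
    already_failed = set()
    processed = []
    while queue:
        event = queue.popleft()
        if event in fail_once and event not in already_failed:
            already_failed.add(event)
            queue.append(event)  # retry later, behind newer events
            continue
        processed.append(event)
    return processed


published = ["order.created", "order.paid", "order.shipped"]
print(process_with_retries(published, fail_once={"order.paid"}))
# ['order.created', 'order.shipped', 'order.paid']
```

A handler for `order.shipped` that assumes payment has already been recorded would misbehave here, which is why steps with a hard ordering requirement belong on the synchronous path.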
Start with a monolith. Extract events when you feel the pain. The worst architecture is the one built for problems you don't have yet.
— alokknight Engineering
