
Transactional Outbox Pattern: Reliable Event Publishing in Distributed Systems

Learn how the Transactional Outbox Pattern solves the dual-write problem in distributed systems, with practical implementations using PostgreSQL, DynamoDB, and CDC tools.

Abstract

The dual-write problem affects nearly every event-driven system I've worked with. When you need to update a database and publish an event atomically, you face an impossible choice: which operation fails when things go wrong? The Transactional Outbox Pattern provides a proven solution by writing both operations to the same database within a single transaction, then using a separate process to reliably publish events. This post covers practical implementations using polling publishers, Change Data Capture (CDC), and AWS serverless patterns.

The Dual-Write Problem

Here's a scenario I've encountered repeatedly: an order service needs to save an order to the database and publish an OrderCreated event. The naive approach looks like this:

```typescript
async function createOrder(orderData: Order) {
  // Step 1: Save to database
  await db('orders').insert(orderData);

  // Step 2: Publish event
  await messageQueue.publish('OrderCreated', orderData);
}
```

What could go wrong? Everything.

Failure Scenario 1: Database succeeds, event publish fails

  • Network timeout to message broker
  • Message broker temporarily down
  • Your service crashes after database write
  • Result: Order exists in database, but inventory service never receives the event. Stock is never reserved.

Failure Scenario 2: Event publish succeeds, database fails

  • Database write violates constraint
  • Transaction rolled back due to deadlock
  • Database connection lost
  • Result: Inventory service receives event and reserves stock, but order doesn't exist. Data inconsistency.
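Failure scenario 1 is easy to reproduce. The sketch below uses hypothetical in-memory stand-ins (`ordersTable`, `publishedEvents`, a `publish` function) rather than a real driver: the database write commits, the publish throws, and the two systems disagree.

```typescript
type Order = { id: string; total: number };

// Hypothetical in-memory stand-ins for a database and a message broker
const ordersTable: Order[] = [];
const publishedEvents: { topic: string; body: Order }[] = [];
const brokerUp = false;  // simulate a broker outage

async function publish(topic: string, body: Order): Promise<void> {
  if (!brokerUp) throw new Error('broker unreachable');
  publishedEvents.push({ topic, body });
}

// The naive dual write: two independent operations, no shared transaction
async function createOrderNaive(order: Order): Promise<void> {
  ordersTable.push(order);               // step 1 commits...
  await publish('OrderCreated', order);  // ...step 2 fails
}

// Reproduce failure scenario 1
async function demo(): Promise<{ orders: number; events: number }> {
  try {
    await createOrderNaive({ id: 'ORDER-123', total: 99 });
  } catch {
    // the service "crashes" or gives up after the database write
  }
  // Inconsistent: the order exists, but the event was never published
  return { orders: ordersTable.length, events: publishedEvents.length };
}
```

Running `demo()` yields one order and zero events: exactly the state in which the inventory service never hears about the order.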

Why Not Use Two-Phase Commit (2PC)?

You might ask: "Can't we use distributed transactions?" Technically yes, but the trade-offs make it impractical:

  • Performance overhead: Coordinating transactions across systems adds significant latency
  • Reduced availability: If any participant is down, the entire operation fails
  • Complexity: Implementing XA transactions correctly is difficult
  • Limited support: Many message brokers don't support 2PC
  • Coupling: Violates microservices independence principles

Working with distributed systems taught me that avoiding distributed transactions is better than trying to make them work reliably.

Understanding the Outbox Pattern

The Transactional Outbox Pattern solves the dual-write problem through a simple insight: instead of writing to two separate systems (database + message broker), write to two tables in the same database within a single ACID transaction.

Core Components

  1. Outbox Table: Stores events to be published, lives in the same database as your business data
  2. Business Transaction: Single ACID transaction writing to both business tables and outbox
  3. Message Relay: Separate process reads outbox and publishes to message broker
  4. Idempotent Consumers: Downstream services handle duplicate events correctly

How It Works

The key insight: either both the business data and event are committed, or neither is. This guarantees atomicity between your state changes and event publishing.
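The four components can be sketched end to end with in-memory stand-ins (hypothetical `orders`, `outbox`, and `broker` arrays, not real infrastructure). The business write and the outbox write happen together; a separate relay drains the outbox to the broker.

```typescript
type OutboxRow = { id: number; eventType: string; payload: unknown; published: boolean };

// In-memory stand-ins for the database tables and the broker
const orders: { id: string }[] = [];
const outbox: OutboxRow[] = [];
const broker: { eventType: string; payload: unknown }[] = [];
let nextId = 1;

// "Business transaction": both writes happen together or not at all.
// A real implementation would use db.transaction(); a synchronous block
// is atomic enough for illustration.
function createOrder(orderId: string): void {
  orders.push({ id: orderId });
  outbox.push({
    id: nextId++,
    eventType: 'OrderCreated',
    payload: { orderId },
    published: false,
  });
}

// Message relay: reads unpublished rows and forwards them to the broker
function relayOnce(): void {
  for (const row of outbox.filter((r) => !r.published)) {
    broker.push({ eventType: row.eventType, payload: row.payload });
    row.published = true;  // only marked after a successful publish
  }
}
```

After `createOrder('ORDER-1')` and one relay pass, the order is stored, the event is on the broker, and the outbox row is marked published.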

Implementation Approach 1: Polling Publisher

The simplest approach polls the outbox table periodically. Here's what works in practice:

Basic Implementation

```sql
-- 1. Outbox table schema (PostgreSQL)
CREATE TABLE outbox (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_type VARCHAR(100) NOT NULL,
  aggregate_id VARCHAR(100) NOT NULL,
  event_type VARCHAR(100) NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  published BOOLEAN DEFAULT FALSE
);

-- Critical index for efficient polling
CREATE INDEX idx_outbox_unpublished
ON outbox (created_at)
WHERE published = false;
```

Producer: Write to Outbox

```typescript
async function createOrder(orderData: Order) {
  await db.transaction(async (trx) => {
    // Insert order
    const order = await trx('orders').insert({
      id: orderData.id,
      customer_id: orderData.customerId,
      total: orderData.total,
      status: 'PENDING'
    }).returning('*');

    // Insert event to outbox IN SAME TRANSACTION
    await trx('outbox').insert({
      id: uuid(),
      aggregate_type: 'Order',
      aggregate_id: order[0].id,
      event_type: 'OrderCreated',
      payload: {
        orderId: order[0].id,
        customerId: orderData.customerId,
        total: orderData.total,
        items: orderData.items
      },
      created_at: new Date()
    });

    // Both succeed or both fail - atomicity guaranteed
  });
}
```

Publisher: Poll and Publish

```typescript
async function publishOutboxEvents() {
  // The SELECT ... FOR UPDATE and the status update must run in the same
  // transaction; otherwise the row locks are released immediately and
  // SKIP LOCKED does nothing.
  await db.transaction(async (trx) => {
    // FOR UPDATE SKIP LOCKED prevents concurrent publisher instances
    // from processing the same events
    const events = await trx.raw(`
      SELECT * FROM outbox
      WHERE published = false
      ORDER BY created_at
      LIMIT 100
      FOR UPDATE SKIP LOCKED
    `);

    for (const event of events.rows) {
      try {
        // Publish to message broker
        await messageQueue.publish(event.event_type, {
          messageId: event.id,  // Important for deduplication
          aggregateId: event.aggregate_id,
          payload: event.payload
        });

        // Mark as published
        await trx('outbox')
          .where('id', event.id)
          .update({ published: true });
      } catch (error) {
        console.error('Failed to publish event:', error);
        // Will retry on next poll - at-least-once delivery
      }
    }
  });
}

// Run the publisher every 5 seconds
setInterval(publishOutboxEvents, 5000);
```

The FOR UPDATE SKIP LOCKED clause is critical: it prevents multiple publisher instances from processing the same events, enabling horizontal scaling.
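The effect of SKIP LOCKED can be illustrated with a toy model (not real lock semantics): two workers claim rows from a shared array, each skipping rows the other has already claimed, so no event is processed twice in a single pass.

```typescript
type Row = { id: number; claimedBy: string | null };

// Each worker takes up to `limit` unclaimed rows and skips claimed ones,
// mimicking SELECT ... FOR UPDATE SKIP LOCKED across publisher instances.
function claimBatch(rows: Row[], worker: string, limit: number): Row[] {
  const batch: Row[] = [];
  for (const row of rows) {
    if (batch.length === limit) break;
    if (row.claimedBy !== null) continue;  // "locked" by another worker: skip
    row.claimedBy = worker;
    batch.push(row);
  }
  return batch;
}

const rows: Row[] = Array.from({ length: 10 }, (_, i) => ({ id: i, claimedBy: null }));
const a = claimBatch(rows, 'publisher-A', 5);
const b = claimBatch(rows, 'publisher-B', 5);
// The two batches never overlap, so the publishers can run side by side
```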

When to Use Polling

Pros:

  • Simple to implement and understand
  • No additional infrastructure required
  • Works with any database
  • Easy to debug with SQL queries

Cons:

  • Polling adds database load
  • Latency depends on poll interval (5-10 seconds typical)
  • Less efficient than CDC for high volumes

Use polling when:

  • Low to medium event volumes (< 1000 events/minute)
  • Getting started quickly
  • Simple architectures
  • Your database doesn't support CDC

Implementation Approach 2: Change Data Capture (CDC)

For production systems at scale, CDC eliminates polling overhead by monitoring the database transaction log directly.

How CDC Works

Instead of polling the outbox table, CDC tools like Debezium monitor the database's Write-Ahead Log (PostgreSQL) or Binary Log (MySQL). When an outbox event is written, the CDC tool detects it and publishes to your message broker automatically.
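Conceptually, Debezium's outbox EventRouter turns each change record for the outbox table into a routed message: topic derived from a routing field, message key from `aggregate_id` (so per-aggregate ordering survives), body from `payload`. A rough functional model, my simplification rather than Debezium's actual code:

```typescript
type OutboxChange = {
  aggregate_type: string;
  aggregate_id: string;
  event_type: string;
  payload: string;  // JSON text, as stored in the JSONB column
};

type RoutedMessage = { topic: string; key: string; value: string };

// Simplified model of the EventRouter transform: route by aggregate type,
// key by aggregate id so ordering per aggregate is preserved.
function routeOutboxEvent(change: OutboxChange): RoutedMessage {
  return {
    topic: `outbox.event.${change.aggregate_type}`,
    key: change.aggregate_id,
    value: change.payload,
  };
}
```

The `outbox.event.` topic prefix mirrors Debezium's default naming; the connector configuration below controls which outbox columns feed each field.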

PostgreSQL + Debezium Setup

```sql
-- 1. Enable logical replication
ALTER SYSTEM SET wal_level = 'logical';
-- Note: PostgreSQL must be restarted for wal_level to take effect

-- 2. Create a publication for the outbox table
CREATE PUBLICATION outbox_publication FOR TABLE outbox;

-- 3. Grant replication rights to the Debezium user
ALTER USER debezium_user WITH REPLICATION;
```

Debezium Configuration

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.com",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "${DB_PASSWORD}",
    "database.dbname": "orders_db",
    "database.server.name": "orders",
    "table.include.list": "public.outbox",
    "plugin.name": "pgoutput",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.type": "event_type",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.payload": "payload"
  }
}
```

Producer Code (Identical to Polling)

The beauty of CDC: your application code doesn't change. You still write to the outbox table in the same transaction. Debezium handles the publishing.

```typescript
// Same code as the polling approach - no changes needed
async function createOrder(orderData: Order) {
  await db.transaction(async (trx) => {
    await trx('orders').insert(orderData);
    await trx('outbox').insert({
      aggregate_type: 'Order',
      aggregate_id: orderData.id,
      event_type: 'OrderCreated',
      payload: orderData
    });
  });
  // Debezium automatically detects the new outbox row and publishes
}
```

When to Use CDC

Pros:

  • Near real-time event publishing (< 1 second)
  • Minimal database overhead (reads WAL, not tables)
  • Scales to high volumes (100K+ events/sec)
  • Preserves event order per partition

Cons:

  • Complex infrastructure (Kafka Connect, Debezium)
  • Requires operational expertise
  • Database-specific setup (WAL configuration)
  • More expensive than serverless options

Use CDC when:

  • High event volumes (> 1000 events/minute)
  • Low latency requirements (< 1 second)
  • Production systems at scale
  • Already using Kafka ecosystem

AWS Implementation: DynamoDB + EventBridge Pipes

AWS provides a serverless outbox implementation using DynamoDB Streams and EventBridge Pipes. This is my preferred approach for AWS-native architectures.

Architecture

Implementation

```typescript
// Write both items in a single transaction.
// Note: DynamoDB transactions are limited to 100 items and 4 MB total.
async function createOrder(orderData: Order) {
  await dynamodb.transactWrite({
    TransactItems: [
      {
        Put: {
          TableName: 'Orders',
          Item: {
            orderId: { S: orderData.id },
            customerId: { S: orderData.customerId },
            total: { N: orderData.total.toString() },
            status: { S: 'PENDING' }
          }
        }
      },
      {
        Put: {
          TableName: 'Outbox',
          Item: {
            eventId: { S: uuid() },
            aggregateType: { S: 'Order' },
            aggregateId: { S: orderData.id },
            eventType: { S: 'OrderCreated' },
            payload: { S: JSON.stringify(orderData) },
            timestamp: { N: Date.now().toString() }
          }
        }
      }
    ]
  }).promise();
}
```

Infrastructure as Code (AWS CDK)

```typescript
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as pipes from 'aws-cdk-lib/aws-pipes';
import * as events from 'aws-cdk-lib/aws-events';

// 1. Create outbox table with streams enabled
const outboxTable = new dynamodb.Table(this, 'OutboxTable', {
  partitionKey: { name: 'eventId', type: dynamodb.AttributeType.STRING },
  stream: dynamodb.StreamViewType.NEW_IMAGE,  // Critical: stream new items
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  removalPolicy: cdk.RemovalPolicy.DESTROY
});

// 2. Create event bus
const eventBus = new events.EventBus(this, 'OrderEventBus', {
  eventBusName: 'order-events'
});

// 3. Create EventBridge Pipe (no Lambda needed)
// pipeRole and dlqQueue are assumed to be defined elsewhere in the stack
new pipes.CfnPipe(this, 'OutboxPipe', {
  source: outboxTable.tableStreamArn!,
  target: eventBus.eventBusArn,
  roleArn: pipeRole.roleArn,
  sourceParameters: {
    dynamoDbStreamParameters: {
      startingPosition: 'LATEST',
      batchSize: 10,
      maximumRetryAttempts: 3,  // Note: default is -1 (infinite retry)
      deadLetterConfig: {
        arn: dlqQueue.queueArn
      }
    }
  },
  targetParameters: {
    eventBridgeEventBusParameters: {
      detailType: 'OutboxEvent',
      source: 'outbox.publisher'
    }
  }
});
```

Why This Approach Works

No Lambda code for publishing: EventBridge Pipes automatically reads DynamoDB Streams and publishes to EventBridge. This eliminates:

  • Cold start latency
  • Lambda billing for publisher
  • Code to maintain for the relay

Built-in reliability: Pipes include retry logic, dead-letter queues, and monitoring out of the box.

Cost efficiency: You only pay for events processed, not for idle publisher infrastructure.

Cost Analysis

Based on a system processing 10 million events per month:

  • DynamoDB Streams: Free (included with DynamoDB)
  • EventBridge Pipes: $0.40 per million events = $4.00/month
  • EventBridge Event Bus: $1.00 per million events = $10.00/month
  • Total: ~$14/month for 10M events
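The arithmetic above can be captured in a small helper (rates as quoted above; actual AWS pricing varies by region and changes over time):

```typescript
// Monthly cost of the serverless outbox path, in USD.
// Per-million-event rates from the analysis above: EventBridge Pipes $0.40,
// EventBridge event bus $1.00; DynamoDB Streams adds no separate charge here.
function monthlyOutboxCost(eventsPerMonth: number): number {
  const millions = eventsPerMonth / 1_000_000;
  const pipes = millions * 0.40;
  const bus = millions * 1.00;
  return pipes + bus;
}

// 10 million events/month works out to about $14
```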

Compare to Lambda polling approach:

  • Lambda invocations: 43,200/month (every minute) = ~$0.01
  • Lambda duration: 100ms avg × 43,200 = ~$0.50
  • RDS queries: Adds load to database
  • Total: Similar cost but higher operational complexity

Handling Ordering and Idempotency

Ordering Guarantees

The outbox pattern preserves ordering per partition, not globally across all events.

```typescript
// Ensure events for the same aggregate are ordered
await kafka.producer.send({
  topic: 'order-events',
  messages: [{
    key: event.aggregateId,  // All events for ORDER-123 go to the same partition
    value: JSON.stringify(event.payload)
  }]
});
```

For DynamoDB Streams, use the aggregate ID as the partition key:

```typescript
await dynamodb.put({
  TableName: 'Outbox',
  Item: {
    aggregateId: 'ORDER-123',  // Partition key - ensures ordering
    eventId: uuid(),           // Sort key
    eventType: 'OrderCreated',
    timestamp: Date.now()
  }
});
```

The Inbox Pattern: Consumer-Side Idempotency

The outbox pattern guarantees at-least-once delivery, which means events may be delivered multiple times. Consumers must handle duplicates.

The Inbox Pattern provides idempotent processing:

```typescript
async function handleOrderCreatedEvent(event: OrderCreatedEvent) {
  await db.transaction(async (trx) => {
    // 1. Check if already processed
    const existing = await trx('inbox')
      .where('message_id', event.messageId)
      .first();

    if (existing) {
      console.log('Duplicate message, skipping:', event.messageId);
      return;  // Idempotent - safe to skip
    }

    // 2. Process the event (your business logic)
    await trx('inventory')
      .where('product_id', event.productId)
      .decrement('quantity', event.quantity);

    // 3. Record as processed IN SAME TRANSACTION
    await trx('inbox').insert({
      message_id: event.messageId,
      event_type: event.type,
      processed_at: new Date()
    });

    // Either all three operations succeed, or all fail
  });

  // ACK to the message broker only after a successful commit
  await messageQueue.ack(event.messageId);
}
```

Inbox table schema:

```sql
CREATE TABLE inbox (
  message_id UUID PRIMARY KEY,
  event_type VARCHAR(100),
  processed_at TIMESTAMP DEFAULT NOW(),
  payload JSONB  -- Optional: for debugging
);

-- Cleanup old processed messages (run daily)
DELETE FROM inbox
WHERE processed_at < NOW() - INTERVAL '7 days';
```

Complete Pattern: Outbox + Inbox

Putting the two halves together: the producer writes business data and the outbox event in one transaction, the relay delivers each event at least once, and every consumer records processed message IDs in its inbox so duplicates are discarded.

Performance Considerations

Database Performance

Outbox table growth: Without cleanup, the outbox table grows indefinitely. I've seen this cause significant performance degradation.

```sql
-- Strategy 1: Delete immediately after publish
DELETE FROM outbox WHERE id = $1 AND published = true;

-- Strategy 2: Batch cleanup (run daily via cron)
DELETE FROM outbox
WHERE published = true
  AND created_at < NOW() - INTERVAL '7 days';

-- Strategy 3: Table partitioning (PostgreSQL 10+)
CREATE TABLE outbox_2025_12 PARTITION OF outbox
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Drop old partitions (much faster than DELETE)
DROP TABLE outbox_2025_11;
```

Index optimization: The partial index only indexes unpublished events, saving space:

```sql
CREATE INDEX idx_outbox_unpublished
ON outbox (created_at)
WHERE published = false;
```

Polling Publisher Tuning

Poll interval trade-offs:

  • 1 second: Low latency, high database load
  • 5 seconds: Balanced (recommended for most cases)
  • 10+ seconds: Low overhead, higher latency
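The trade-off is easy to quantify with a simple model (events arrive uniformly, so the average wait before the next poll is half the interval):

```typescript
// For a given poll interval, estimate average publish latency and the
// number of poll queries per day. Assumes uniform event arrival.
function pollTradeoff(intervalSeconds: number) {
  return {
    avgLatencySeconds: intervalSeconds / 2,
    queriesPerDay: Math.round(86_400 / intervalSeconds),
  };
}

// The recommended 5-second interval: ~2.5 s average latency,
// 17,280 poll queries per day
```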

Batch size:

```typescript
// Too small: many queries, inefficient
const batchSize = 10;

// Too large: long transactions, lock contention
const batchSize = 10000;

// Optimal: balance efficiency and transaction length
const batchSize = 100;  // Recommended starting point
```

CDC Performance

Monitor replication lag to ensure Debezium keeps up:

```sql
-- Check replication slot lag
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

If lag grows, your WAL files accumulate and can fill disk. This is a real operational concern I've dealt with.

Common Pitfalls and Solutions

Pitfall 1: Unbounded Table Growth

Problem: Outbox table grows indefinitely, queries slow down.

Solution: Implement automatic cleanup in your publisher:

```typescript
let cleanupCounter = 0;

async function publishAndCleanup() {
  // Publish events
  await publishOutboxEvents();

  // Cleanup old published events (every 100 iterations)
  if (cleanupCounter++ % 100 === 0) {
    await db('outbox')
      .where('published', true)
      .where('created_at', '<', db.raw("NOW() - INTERVAL '7 days'"))
      .delete();
  }
}
```

Pitfall 2: Message Relay Failure Goes Unnoticed

Problem: Publisher crashes, events pile up unpublished.

Solution: Monitor outbox age metrics:

```typescript
async function checkOutboxHealth() {
  const result = await db('outbox')
    .where('published', false)
    .min('created_at as oldest')
    .first();

  if (!result.oldest) return;  // No unpublished events

  const ageMs = Date.now() - new Date(result.oldest).getTime();
  const ageMinutes = ageMs / 60000;

  if (ageMinutes > 5) {
    alerting.trigger('OUTBOX_LAG_HIGH', {
      ageMinutes,
      message: 'Outbox events not being published'
    });
  }
}

// Run the health check every minute
setInterval(checkOutboxHealth, 60000);
```

Pitfall 3: CDC Replication Slot Filling Disk

Problem: Debezium connector goes down, PostgreSQL WAL accumulates.

Solution: Monitor replication slots and set retention limits:

```sql
-- Cap how much WAL a replication slot may retain (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

-- Monitor slot status
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
```

Alert if a slot is inactive for more than 5 minutes, indicating a publisher failure.

Comparison with Other Patterns

Outbox vs. Event Sourcing

| Aspect         | Outbox Pattern                      | Event Sourcing             |
|----------------|-------------------------------------|----------------------------|
| Purpose        | Reliable event publishing           | Events as source of truth  |
| Event Lifetime | Short-lived (deleted after publish) | Permanent append-only log  |
| State Storage  | Current state in tables             | Derived from events        |
| Complexity     | Low                                 | High                       |
| Query Model    | Direct database queries             | Requires projections/CQRS  |
| Best For       | E-commerce orders, workflows        | Banking, audit systems     |

Key difference: In event sourcing, events are the permanent record. In outbox, events are a communication mechanism.

Outbox vs. Saga Pattern

The outbox pattern complements the saga pattern. Use outbox within each service participating in a saga:

```typescript
// Order Service publishes OrderCreated via its outbox
await db.transaction(async (trx) => {
  await trx('orders').insert(order);
  await trx('outbox').insert({ event_type: 'OrderCreated', payload: order });
});

// Saga Orchestrator receives OrderCreated and publishes commands via its own outbox
await db.transaction(async (trx) => {
  await trx('saga_state').insert({ saga_id: orderId, step: 'INVENTORY_PENDING' });
  await trx('outbox').insert({ event_type: 'ReserveInventory', payload: { orderId } });
});
```

Decision Framework

Use this framework to choose the right implementation:

Choose Polling when:

  • Event volume < 1000/minute
  • Getting started quickly
  • Simple architecture preferred
  • Database doesn't support CDC

Choose CDC when:

  • Event volume > 1000/minute
  • Need < 1 second latency
  • Production system at scale
  • Already using Kafka

Choose DynamoDB + EventBridge when:

  • Building on AWS
  • Want serverless architecture
  • Minimal operational overhead desired
  • Cost-effective for moderate volumes

Production Readiness Checklist

Before deploying the outbox pattern to production:

  • Cleanup strategy: Automated deletion of published events
  • Monitoring: Outbox age, backlog size, publisher health
  • Alerting: Lag exceeds threshold, publisher failures
  • Idempotency: Inbox pattern or idempotency keys implemented
  • Ordering: Partition key strategy for event ordering
  • Dead Letter Queue: Failed events routed for investigation
  • Schema versioning: Event payload versioning strategy
  • Load testing: Verified at expected throughput
  • Runbook: Documented recovery procedures
  • Backup strategy: For outbox and inbox tables

Key Takeaways

Working with the outbox pattern across multiple systems taught me these lessons:

  1. Start simple: Begin with polling publishers. Move to CDC only when you need the performance.

  2. Monitor lag aggressively: The time between event creation and publishing is your most important metric. If this grows, your system is degrading.

  3. Idempotency is non-negotiable: At-least-once delivery means duplicates will happen. Design for it from day one.

  4. Clean up ruthlessly: Outbox tables that grow unbounded will eventually cause production issues. Automate cleanup.

  5. Partition wisely: Event ordering within a partition is guaranteed. Use aggregate IDs as partition keys.

  6. AWS makes it easier: DynamoDB + EventBridge Pipes provides a production-ready outbox with minimal code.

The outbox pattern isn't just theory; it's a battle-tested solution to the dual-write problem that I've relied on for building reliable event-driven systems. The implementations shown here are production-ready patterns you can adapt to your specific requirements.
