Skip to content

Saga Pattern for Distributed Transactions: Maintaining Consistency Without ACID

A comprehensive guide to implementing the Saga pattern for managing distributed transactions across microservices with AWS Step Functions and EventBridge, including idempotency, compensation logic, and production-ready patterns.

Abstract

The Saga pattern solves one of the most challenging problems in microservices architectures: maintaining data consistency across services without traditional ACID transactions. In this guide, I'll share practical patterns for implementing sagas using AWS Step Functions orchestration and EventBridge choreography, designing effective compensation logic, ensuring idempotency, and handling the isolation challenges that arise in distributed systems. You'll learn when to choose orchestration versus choreography, how to implement semantic locking to prevent concurrent saga conflicts, and the critical observability patterns needed for production environments.

The Distributed Transaction Problem

When building microservices, you quickly encounter a fundamental challenge: coordinating multi-step transactions across independent services. Traditional ACID transactions don't work across service boundaries, and two-phase commit (2PC) creates tight coupling and single points of failure.

Consider a typical e-commerce order flow: creating an order, reserving inventory, processing payment, and confirming shipment. Each step involves a different microservice with its own database. If payment processing fails after inventory has been reserved, you need a reliable way to roll back the inventory reservation. Without proper patterns, services become inconsistent; orders created, payments failed, inventory not restored.

The Saga pattern provides a structured approach to this problem through local transactions and compensating transactions. Instead of a distributed transaction, you execute a sequence of local transactions where each step can be undone by a compensating transaction if a later step fails.

Saga Pattern Fundamentals

A saga is a sequence of local transactions where:

  1. Each transaction updates data within a single service
  2. Each transaction publishes an event or message to trigger the next transaction
  3. If a transaction fails, the saga executes compensating transactions to undo completed steps
  4. The system achieves eventual consistency instead of immediate ACID consistency

Key characteristics that make sagas work:

  • Eventual Consistency: The system becomes consistent over time, not immediately
  • Local Transactions: Each service manages its own database with local ACID guarantees
  • Compensating Transactions: Explicit rollback logic for each forward step
  • Idempotency: All saga steps must be safe to retry
  • Observability: Critical for debugging distributed flows

When to use the Saga pattern:

  • Microservices architecture with multiple databases
  • Business processes spanning multiple services
  • Cannot use distributed transactions due to performance or coupling concerns
  • Acceptable to have temporary inconsistency
  • Typically 3-5 steps maximum (complexity increases rapidly beyond this)

Orchestration vs Choreography

There are two main approaches to coordinating sagas: orchestration and choreography. Understanding when to use each is critical for successful implementation.

Choreography: Event-Driven Coordination

In choreography, services coordinate through domain events without a central coordinator. Each service knows what to do when it receives an event.

Advantages:

  • Loose coupling between services
  • No single point of failure
  • Scales well for independent services
  • Natural fit for event-driven architecture

Disadvantages:

  • Control flow not visible in one place
  • Difficult to understand complete saga flow
  • Debugging complexity (distributed logic)
  • Harder to track saga state
  • Risk of cyclic dependencies

Best for:

  • 3-4 services maximum
  • Independent, autonomous services
  • Event-driven architecture already in place
  • Simple linear flows

Orchestration: Centralized Coordination

In orchestration, a central coordinator (typically AWS Step Functions) manages the saga flow, telling each service what to do.

Advantages:

  • Clear control flow visualization
  • Easier debugging and monitoring
  • Centralized error handling
  • State management built-in
  • Better for complex flows

Disadvantages:

  • Orchestrator is coordination point
  • Services coupled to orchestrator
  • Orchestrator must know all services

Best for:

  • Complex multi-step workflows
  • Need visibility into saga state
  • Human approval steps
  • More than 4 services involved
  • Strict ordering requirements

Decision Framework

Use this decision logic when choosing your approach:

Orchestration Implementation with AWS Step Functions

Let me show you a production-ready implementation of an e-commerce order saga using AWS Step Functions orchestration with AWS CDK.

Infrastructure Setup

typescript
import * as cdk from 'aws-cdk-lib';import * as sfn from 'aws-cdk-lib/aws-stepfunctions';import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';import * as lambda from 'aws-cdk-lib/aws-lambda';import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';import { Construct } from 'constructs';
export class OrderSagaStack extends cdk.Stack {  constructor(scope: Construct, id: string) {    super(scope, id);
    // DynamoDB tables for saga state persistence    const ordersTable = new dynamodb.Table(this, 'Orders', {      partitionKey: { name: 'orderId', type: dynamodb.AttributeType.STRING },      sortKey: { name: 'transactionId', type: dynamodb.AttributeType.STRING },      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST    });
    const inventoryTable = new dynamodb.Table(this, 'Inventory', {      partitionKey: { name: 'productId', type: dynamodb.AttributeType.STRING },      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST    });
    const paymentsTable = new dynamodb.Table(this, 'Payments', {      partitionKey: { name: 'paymentId', type: dynamodb.AttributeType.STRING },      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST    });
    // Forward transaction Lambdas    const createOrder = new lambda.Function(this, 'CreateOrder', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'createOrder.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        ORDERS_TABLE: ordersTable.tableName      }    });
    const reserveInventory = new lambda.Function(this, 'ReserveInventory', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'reserveInventory.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        INVENTORY_TABLE: inventoryTable.tableName      }    });
    const processPayment = new lambda.Function(this, 'ProcessPayment', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'processPayment.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        PAYMENTS_TABLE: paymentsTable.tableName      }    });
    const confirmOrder = new lambda.Function(this, 'ConfirmOrder', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'confirmOrder.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        ORDERS_TABLE: ordersTable.tableName      }    });
    // Compensating transaction Lambdas (rollback)    const cancelOrder = new lambda.Function(this, 'CancelOrder', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'cancelOrder.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        ORDERS_TABLE: ordersTable.tableName      }    });
    const releaseInventory = new lambda.Function(this, 'ReleaseInventory', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'releaseInventory.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        INVENTORY_TABLE: inventoryTable.tableName      }    });
    const refundPayment = new lambda.Function(this, 'RefundPayment', {      runtime: lambda.Runtime.NODEJS_20_X,      handler: 'refundPayment.handler',      code: lambda.Code.fromAsset('lambda'),      environment: {        PAYMENTS_TABLE: paymentsTable.tableName      }    });
    // Grant table permissions    ordersTable.grantReadWriteData(createOrder);    ordersTable.grantReadWriteData(confirmOrder);    ordersTable.grantReadWriteData(cancelOrder);    inventoryTable.grantReadWriteData(reserveInventory);    inventoryTable.grantReadWriteData(releaseInventory);    paymentsTable.grantReadWriteData(processPayment);    paymentsTable.grantReadWriteData(refundPayment);
    // Step Functions tasks with retry configuration    const createOrderTask = new tasks.LambdaInvoke(this, 'CreateOrderTask', {      lambdaFunction: createOrder,      outputPath: '$.Payload',      retryOnServiceExceptions: true    }).addRetry({      errors: ['ThrottlingException', 'ServiceUnavailable'],      interval: cdk.Duration.seconds(2),      maxAttempts: 3,      backoffRate: 2.0    });
    const reserveInventoryTask = new tasks.LambdaInvoke(this, 'ReserveInventoryTask', {      lambdaFunction: reserveInventory,      outputPath: '$.Payload',      retryOnServiceExceptions: true    }).addRetry({      errors: ['ThrottlingException'],      interval: cdk.Duration.seconds(1),      maxAttempts: 3,      backoffRate: 1.5    });
    const processPaymentTask = new tasks.LambdaInvoke(this, 'ProcessPaymentTask', {      lambdaFunction: processPayment,      outputPath: '$.Payload',      retryOnServiceExceptions: true,      timeout: cdk.Duration.seconds(30)    }).addRetry({      errors: ['PaymentProcessingException'],      interval: cdk.Duration.seconds(3),      maxAttempts: 2,      backoffRate: 2.0    });
    const confirmOrderTask = new tasks.LambdaInvoke(this, 'ConfirmOrderTask', {      lambdaFunction: confirmOrder,      outputPath: '$.Payload'    });
    // Compensation tasks    const cancelOrderTask = new tasks.LambdaInvoke(this, 'CancelOrderTask', {      lambdaFunction: cancelOrder,      outputPath: '$.Payload'    });
    const releaseInventoryTask = new tasks.LambdaInvoke(this, 'ReleaseInventoryTask', {      lambdaFunction: releaseInventory,      outputPath: '$.Payload'    });
    const refundPaymentTask = new tasks.LambdaInvoke(this, 'RefundPaymentTask', {      lambdaFunction: refundPayment,      outputPath: '$.Payload'    });
    // Saga orchestration with compensation logic    // Build compensation chain backwards    const compensatePayment = refundPaymentTask      .next(releaseInventoryTask)      .next(cancelOrderTask)      .next(new sfn.Fail(this, 'OrderFailed', {        error: 'OrderProcessingFailed',        cause: 'Payment processing failed, all steps compensated'      }));
    const compensateInventory = releaseInventoryTask      .next(cancelOrderTask)      .next(new sfn.Fail(this, 'InventoryFailed', {        error: 'InventoryReservationFailed',        cause: 'Inventory reservation failed, order cancelled'      }));
    // Build forward flow with catch blocks for compensation    const sagaDefinition = createOrderTask      .next(reserveInventoryTask        .addCatch(compensateInventory, {          errors: ['States.ALL'],          resultPath: '$.inventoryError'        })      )      .next(processPaymentTask        .addCatch(compensatePayment, {          errors: ['States.ALL'],          resultPath: '$.paymentError'        })      )      .next(confirmOrderTask)      .next(new sfn.Succeed(this, 'OrderSuccess'));
    // Create saga state machine    const orderSaga = new sfn.StateMachine(this, 'OrderSaga', {      definition: sagaDefinition,      stateMachineType: sfn.StateMachineType.STANDARD,      timeout: cdk.Duration.minutes(5),      tracingEnabled: true    });
    new cdk.CfnOutput(this, 'OrderSagaArn', {      value: orderSaga.stateMachineArn,      description: 'Order Saga State Machine ARN'    });  }}

This infrastructure sets up a complete order processing saga with proper compensation chains. Notice how each step has retry configuration for transient errors, and compensation flows are built in reverse order; this is critical for properly undoing completed steps.

Implementing Idempotent Operations

Idempotency is non-negotiable in sagas. Steps may execute multiple times due to retries, failures, or network issues. Here's how to implement properly idempotent operations:

typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';import { DynamoDBDocumentClient, UpdateCommand, GetCommand } from '@aws-sdk/lib-dynamodb';
const client = new DynamoDBClient({});const docClient = DynamoDBDocumentClient.from(client);
interface ReserveInventoryEvent {  orderId: string;  productId: string;  quantity: number;  transactionId: string; // Idempotency key}
export const handler = async (event: ReserveInventoryEvent) => {  console.log('Reserving inventory', { orderId: event.orderId, productId: event.productId });
  try {    // Idempotent check: Has this transaction already completed?    const existingReservation = await docClient.send(      new GetCommand({        TableName: process.env.INVENTORY_TABLE!,        Key: { productId: event.productId }      })    );
    const reservation = existingReservation.Item?.reservations?.[event.transactionId];    if (reservation?.status === 'RESERVED') {      console.log('Inventory already reserved for this transaction', {        transactionId: event.transactionId      });      return {        success: true,        message: 'Idempotent: Already reserved',        reservationId: event.transactionId      };    }
    // Atomic update with conditional expression (semantic lock)    const result = await docClient.send(      new UpdateCommand({        TableName: process.env.INVENTORY_TABLE!,        Key: { productId: event.productId },        UpdateExpression:          'SET availableQuantity = availableQuantity - :qty, ' +          'reservedQuantity = reservedQuantity + :qty, ' +          'reservations.#txId = :reservation',        ConditionExpression: 'availableQuantity >= :qty',        ExpressionAttributeNames: {          '#txId': event.transactionId        },        ExpressionAttributeValues: {          ':qty': event.quantity,          ':reservation': {            orderId: event.orderId,            quantity: event.quantity,            status: 'RESERVED',            timestamp: Date.now(),            expiresAt: Date.now() + (15 * 60 * 1000) // 15 min timeout          }        },        ReturnValues: 'ALL_NEW'      })    );
    console.log('Inventory reserved successfully', {      orderId: event.orderId,      reservationId: event.transactionId    });
    return {      success: true,      reservationId: event.transactionId,      availableQuantity: result.Attributes?.availableQuantity    };
  } catch (error: any) {    if (error.name === 'ConditionalCheckFailedException') {      console.error('Insufficient inventory', {        productId: event.productId,        requested: event.quantity      });      throw new Error('InsufficientInventory');    }
    console.error('Failed to reserve inventory', error);    throw error;  }};

The idempotency check at the beginning ensures that if this function executes multiple times with the same transactionId, it returns the same result without side effects. The conditional expression provides an atomic "semantic lock"; only reserving inventory if sufficient quantity is available.

Compensation: Releasing Inventory

typescript
export const releaseInventoryHandler = async (event: ReserveInventoryEvent) => {  console.log('Releasing inventory reservation', {    orderId: event.orderId,    transactionId: event.transactionId  });
  try {    const reservation = await docClient.send(      new GetCommand({        TableName: process.env.INVENTORY_TABLE!,        Key: { productId: event.productId }      })    );
    const existingReservation = reservation.Item?.reservations?.[event.transactionId];
    if (!existingReservation) {      console.log('Idempotent: Reservation already released or never existed', {        transactionId: event.transactionId      });      return { success: true, message: 'Already released' };    }
    // Atomic release with conditional check    await docClient.send(      new UpdateCommand({        TableName: process.env.INVENTORY_TABLE!,        Key: { productId: event.productId },        UpdateExpression:          'SET availableQuantity = availableQuantity + :qty, ' +          'reservedQuantity = reservedQuantity - :qty ' +          'REMOVE reservations.#txId',        ConditionExpression: 'attribute_exists(reservations.#txId)',        ExpressionAttributeNames: {          '#txId': event.transactionId        },        ExpressionAttributeValues: {          ':qty': existingReservation.quantity        }      })    );
    console.log('Inventory released successfully', { transactionId: event.transactionId });    return { success: true };
  } catch (error: any) {    if (error.name === 'ConditionalCheckFailedException') {      // Reservation already released - idempotent success      return { success: true, message: 'Already released' };    }    throw error;  }};

Notice how the compensation is also idempotent. If the reservation doesn't exist, we return success; the desired end state is achieved.

Payment Processing with Idempotency

typescript
interface ProcessPaymentEvent {  orderId: string;  customerId: string;  amount: number;  currency: string;  paymentMethodId: string;  idempotencyKey: string;}
export const processPaymentHandler = async (event: ProcessPaymentEvent) => {  console.log('Processing payment', {    orderId: event.orderId,    amount: event.amount  });
  try {    // Check idempotency: Has this payment already been processed?    const existingPayment = await docClient.send(      new GetCommand({        TableName: process.env.PAYMENTS_TABLE!,        Key: { paymentId: event.idempotencyKey }      })    );
    if (existingPayment.Item?.status === 'COMPLETED') {      console.log('Idempotent: Payment already processed', {        paymentId: event.idempotencyKey      });      return {        success: true,        paymentId: event.idempotencyKey,        transactionId: existingPayment.Item.transactionId,        message: 'Already processed'      };    }
    // Create payment record with PENDING status first    const paymentId = event.idempotencyKey;    await docClient.send(      new UpdateCommand({        TableName: process.env.PAYMENTS_TABLE!,        Key: { paymentId },        UpdateExpression:          'SET #status = :pending, orderId = :orderId, ' +          'amount = :amount, createdAt = :timestamp',        ConditionExpression: 'attribute_not_exists(paymentId)',        ExpressionAttributeNames: { '#status': 'status' },        ExpressionAttributeValues: {          ':pending': 'PENDING',          ':orderId': event.orderId,          ':amount': event.amount,          ':timestamp': Date.now()        }      })    );
    // Call external payment provider (Stripe, etc.)    const paymentResult = await callPaymentProvider({      amount: event.amount,      currency: event.currency,      customerId: event.customerId,      paymentMethodId: event.paymentMethodId,      idempotencyKey: event.idempotencyKey    });
    if (!paymentResult.success) {      await docClient.send(        new UpdateCommand({          TableName: process.env.PAYMENTS_TABLE!,          Key: { paymentId },          UpdateExpression:            'SET #status = :failed, failureReason = :reason, updatedAt = :timestamp',          ExpressionAttributeNames: { '#status': 'status' },          ExpressionAttributeValues: {            ':failed': 'FAILED',            ':reason': paymentResult.error,            ':timestamp': Date.now()          }        })      );
      throw new Error(`PaymentFailed: ${paymentResult.error}`);    }
    // Update to COMPLETED status    await docClient.send(      new UpdateCommand({        TableName: process.env.PAYMENTS_TABLE!,        Key: { paymentId },        UpdateExpression:          'SET #status = :completed, transactionId = :txId, updatedAt = :timestamp',        ExpressionAttributeNames: { '#status': 'status' },        ExpressionAttributeValues: {          ':completed': 'COMPLETED',          ':txId': paymentResult.transactionId,          ':timestamp': Date.now()        }      })    );
    return {      success: true,      paymentId,      transactionId: paymentResult.transactionId    };
  } catch (error: any) {    if (error.name === 'ConditionalCheckFailedException') {      // Another execution already created the payment record      throw new Error('ConcurrentPaymentAttempt');    }    throw error;  }};
// Simulated payment provider callasync function callPaymentProvider(params: any) {  // In production: Stripe, Adyen, etc. with their idempotency keys  return {    success: Math.random() > 0.1, // 90% success rate    transactionId: `txn_${Date.now()}`,    error: 'CardDeclined'  };}

This payment implementation shows three-phase idempotency: check for existing completion, create PENDING record, then update to final state. This pattern ensures we never double-charge a customer even if the function executes multiple times.

Choreography Implementation with EventBridge

For simpler flows, choreography can provide better decoupling. Here's how to implement event-driven saga coordination:

typescript
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';
// Order Service: Creates order and publishes OrderCreated eventexport const createOrderHandler = async (event: any) => {  const orderId = `ord_${Date.now()}`;
  // Create order in database  await saveOrder({    orderId,    customerId: event.customerId,    items: event.items,    status: 'PENDING'  });
  // Publish OrderCreated event  const eventBridge = new EventBridgeClient({});  await eventBridge.send(    new PutEventsCommand({      Entries: [{        Source: 'order.service',        DetailType: 'OrderCreated',        Detail: JSON.stringify({          orderId,          customerId: event.customerId,          items: event.items,          totalAmount: event.totalAmount,          timestamp: Date.now()        })      }]    })  );
  return { orderId, status: 'PENDING' };};
// Inventory Service: Listens to OrderCreated, reserves inventoryexport const reserveInventoryOnOrderHandler = async (event: any) => {  const { orderId, items } = event.detail;
  try {    const reservationResult = await reserveInventoryForOrder(orderId, items);
    const eventBridge = new EventBridgeClient({});    await eventBridge.send(      new PutEventsCommand({        Entries: [{          Source: 'inventory.service',          DetailType: 'InventoryReserved',          Detail: JSON.stringify({            orderId,            reservationId: reservationResult.reservationId,            items,            timestamp: Date.now()          })        }]      })    );
  } catch (error: any) {    // Publish failure event    const eventBridge = new EventBridgeClient({});    await eventBridge.send(      new PutEventsCommand({        Entries: [{          Source: 'inventory.service',          DetailType: 'InventoryReservationFailed',          Detail: JSON.stringify({            orderId,            reason: error.message,            timestamp: Date.now()          })        }]      })    );  }};
// Payment Service: Listens to InventoryReserved, processes paymentexport const processPaymentOnInventoryReservedHandler = async (event: any) => {  const { orderId } = event.detail;
  try {    const order = await getOrder(orderId);    const paymentResult = await processPayment({      orderId,      customerId: order.customerId,      amount: order.totalAmount    });
    const eventBridge = new EventBridgeClient({});    await eventBridge.send(      new PutEventsCommand({        Entries: [{          Source: 'payment.service',          DetailType: 'PaymentCompleted',          Detail: JSON.stringify({            orderId,            paymentId: paymentResult.paymentId,            transactionId: paymentResult.transactionId,            timestamp: Date.now()          })        }]      })    );
  } catch (error: any) {    // Publish failure event - triggers compensation    const eventBridge = new EventBridgeClient({});    await eventBridge.send(      new PutEventsCommand({        Entries: [{          Source: 'payment.service',          DetailType: 'PaymentFailed',          Detail: JSON.stringify({            orderId,            reason: error.message,            timestamp: Date.now()          })        }]      })    );  }};
// Inventory Service: Listens to PaymentFailed, releases inventory (compensation)export const releaseInventoryOnPaymentFailedHandler = async (event: any) => {  const { orderId } = event.detail;
  console.log('Compensating: Releasing inventory due to payment failure', { orderId });
  await releaseInventory(orderId);
  const eventBridge = new EventBridgeClient({});  await eventBridge.send(    new PutEventsCommand({      Entries: [{        Source: 'inventory.service',        DetailType: 'InventoryReleased',        Detail: JSON.stringify({          orderId,          reason: 'PaymentFailed',          timestamp: Date.now()        })      }]    })  );};

In choreography, each service is responsible for listening to relevant events and publishing new events. Compensation happens through the same event mechanism; when a service publishes a failure event, other services react by executing their compensating transactions.

Semantic Locking for Isolation

Sagas lack traditional transaction isolation, which can lead to concurrent saga conflicts. Semantic locking provides application-level isolation:

typescript
interface SagaLock {  sagaId: string;  lockedAt: number;  expiresAt: number;}
export const acquireSagaLock = async (orderId: string, sagaId: string) => {  const lockDuration = 5 * 60 * 1000; // 5 minutes  const now = Date.now();
  try {    await docClient.send(      new UpdateCommand({        TableName: process.env.ORDERS_TABLE!,        Key: { orderId },        UpdateExpression:          'SET sagaLock = :lock, #status = :processing',        ConditionExpression:          'attribute_not_exists(sagaLock) OR sagaLock.expiresAt < :now',        ExpressionAttributeNames: {          '#status': 'status'        },        ExpressionAttributeValues: {          ':lock': {            sagaId,            lockedAt: now,            expiresAt: now + lockDuration          },          ':processing': 'PROCESSING',          ':now': now        }      })    );
    console.log('Saga lock acquired', { orderId, sagaId });    return true;
  } catch (error: any) {    if (error.name === 'ConditionalCheckFailedException') {      console.warn('Saga lock already held by another saga', { orderId });      throw new Error('SagaLockConflict');    }    throw error;  }};
export const releaseSagaLock = async (orderId: string, sagaId: string) => {  await docClient.send(    new UpdateCommand({      TableName: process.env.ORDERS_TABLE!,      Key: { orderId },      UpdateExpression: 'REMOVE sagaLock',      ConditionExpression: 'sagaLock.sagaId = :sagaId',      ExpressionAttributeValues: {        ':sagaId': sagaId      }    })  );
  console.log('Saga lock released', { orderId, sagaId });};

This semantic lock prevents two sagas from modifying the same order concurrently. The lock includes an expiration time to handle cases where a saga crashes without releasing the lock.

Cost Analysis and Trade-offs

Understanding the cost implications helps you choose the right approach.

Step Functions Orchestration Costs

For 100,000 orders per month with an 8-step saga:

  • Total state transitions: 800,000
  • Cost: (800,000 / 1,000) × 0.025=0.025 = **20/month**
  • Failed sagas (5% with 4-step compensation): ~20,000 transitions = +$0.50/month
  • Total: ~$20.50/month

EventBridge Choreography Costs

For 100,000 orders per month with 4 events per order:

  • Total events: 400,000
  • Cost: (400,000 / 1,000,000) × 1.00=1.00 = **0.40/month**
  • Failed orders with compensation: +$0.03/month
  • Total: ~$0.43/month

The Real Trade-off

Choreography is significantly cheaper but comes with higher development and debugging complexity. Orchestration costs more but provides better visibility and easier troubleshooting. For most production systems, the improved observability of orchestration is worth the additional cost.

Common Pitfalls and Solutions

Pitfall 1: Non-Idempotent Operations

Problem: Payment charged multiple times on retry.

Solution: Always implement idempotency checks and use provider idempotency keys.

Pitfall 2: Incomplete Compensation Chains

Problem: Only compensating the last step, leaving earlier steps in inconsistent state.

Solution: Chain all compensations in reverse order. Each catch block must compensate all previous steps.

Pitfall 3: Ignoring Compensation Failures

Problem: Compensation fails, saga hangs.

Solution: Aggressive retry for compensations (10+ attempts) with dead-letter queue for manual intervention.

Pitfall 4: Timeout Too Short

Problem: Timeout triggers compensation, but operation actually succeeded.

Solution: Set realistic timeouts with buffer. Verify actual state before compensating.

Pitfall 5: No Saga State Tracking in Choreography

Problem: Can't determine which orders are in compensation.

Solution: Persist saga state even in choreography for observability.

Key Takeaways

  1. Choose orchestration for complex flows (>4 services), choreography for simple linear flows
  2. Idempotency is critical; every saga step must be idempotent with proper keys
  3. Compensate in reverse order; make compensations idempotent and retryable
  4. Use semantic locking to prevent concurrent saga conflicts
  5. Set realistic timeouts and verify state before compensating
  6. Implement observability from day one; structured logging, metrics, tracing
  7. Keep sagas simple (3-5 steps); chain multiple sagas for complex workflows
  8. Persist saga state even in choreography for debugging
  9. Retry compensations aggressively with DLQ for failures requiring manual intervention
  10. Consider cost vs complexity; orchestration provides better visibility but costs more

The Saga pattern provides a robust solution for distributed transactions in microservices. By understanding the trade-offs between orchestration and choreography, implementing proper idempotency and compensation, and establishing strong observability, you can build reliable distributed systems that maintain consistency across service boundaries.

Helper Functions

typescript
// Simplified helper functions referenced in examplesimport { PutCommand } from '@aws-sdk/lib-dynamodb';
async function saveOrder(order: any) {  await docClient.send(    new PutCommand({      TableName: process.env.ORDERS_TABLE!,      Item: order    })  );}
async function getOrder(orderId: string) {  const result = await docClient.send(    new GetCommand({      TableName: process.env.ORDERS_TABLE!,      Key: { orderId }    })  );  return result.Item as any;}
async function reserveInventoryForOrder(orderId: string, items: any[]) {  return { reservationId: `res_${Date.now()}` };}
async function releaseInventory(orderId: string) {  // Release logic}
async function processPayment(params: any) {  return {    paymentId: `pay_${Date.now()}`,    transactionId: `txn_${Date.now()}`  };}
async function updateOrderStatus(orderId: string, status: string) {  await docClient.send(    new UpdateCommand({      TableName: process.env.ORDERS_TABLE!,      Key: { orderId },      UpdateExpression: 'SET #status = :status',      ExpressionAttributeNames: { '#status': 'status' },      ExpressionAttributeValues: { ':status': status }    })  );}

References

Related Posts