
Circuit Breaker Pattern: Building Resilient Microservices That Don't Cascade Failures

Real-world implementation of the Circuit Breaker pattern with proven strategies for preventing cascading failures in distributed systems

When a payment service fails slowly rather than quickly, it can take down an entire platform. Each request taking 30 seconds to timeout creates a traffic jam that backs up through other services. This cascading failure pattern is common in distributed systems. Here's how the Circuit Breaker pattern addresses this problem, with lessons learned from working through these incidents.

The Problem: When Slow is Worse Than Dead

Picture this: Your payment provider's API starts responding slowly. Not down, just taking 20-30 seconds per request instead of the usual 200ms. Your service dutifully waits. Meanwhile, incoming requests pile up. Thread pools exhaust. Memory consumption spikes. Eventually, your healthy service becomes unhealthy, and the infection spreads upstream.

This pattern can kill entire platforms. The challenging part? Monitoring shows all services are "up" - they're just not responding.

Circuit Breaker: Your System's Safety Valve

The Circuit Breaker pattern acts like an electrical circuit breaker in your house. When things go wrong, it trips, preventing damage from spreading. But unlike your home's breaker, this one is smart - it can test if the problem is fixed and automatically recover.

The Three States

typescript
enum CircuitState {
  CLOSED = 'CLOSED',        // Normal operation, requests flow through
  OPEN = 'OPEN',            // Circuit tripped, requests fail immediately
  HALF_OPEN = 'HALF_OPEN'   // Testing if service recovered
}

Think of it like a bouncer at a club:

  • CLOSED: "Come on in, everything's fine"
  • OPEN: "Nobody gets in, there's a problem inside"
  • HALF_OPEN: "Let me check with one person if it's safe now"

Real Implementation: What Actually Works

Here's a circuit breaker implementation that addresses these challenges. This pattern has proven reliable across services handling high request volumes:

typescript
interface CircuitBreakerConfig {
  failureThreshold: number;         // Failures before opening
  successThreshold: number;         // Successes to close from half-open
  timeout: number;                  // Request timeout in ms
  resetTimeout: number;             // Time before trying half-open
  volumeThreshold: number;          // Min requests before evaluating
  errorThresholdPercentage: number; // Error % to trip
}

class CircuitBreaker<T> {
  private _state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime?: Date;
  private window = new RollingWindow(10000); // 10 second window

  constructor(private readonly config: CircuitBreakerConfig) {}

  // Read-only state, for monitoring, tests, and gateway checks
  get state(): CircuitState {
    return this._state;
  }

  async execute(fn: () => Promise<T>): Promise<T> {
    // Check if we should attempt half-open
    if (this._state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this._state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private async executeWithTimeout(fn: () => Promise<T>): Promise<T> {
    return Promise.race([
      fn(),
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new TimeoutError()), this.config.timeout)
      )
    ]);
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.window.recordSuccess();

    if (this._state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        this._state = CircuitState.CLOSED;
        this.successCount = 0;
      }
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    this.window.recordFailure();

    if (this._state === CircuitState.HALF_OPEN) {
      // A single failure during the trial sends us straight back to OPEN
      this._state = CircuitState.OPEN;
      this.successCount = 0;
      return;
    }

    // Check both absolute and percentage thresholds
    const stats = this.window.getStats();
    if (stats.totalRequests >= this.config.volumeThreshold) {
      const errorRate = (stats.failures / stats.totalRequests) * 100;
      if (errorRate >= this.config.errorThresholdPercentage ||
          this.failureCount >= this.config.failureThreshold) {
        this._state = CircuitState.OPEN;
      }
    }
  }

  private shouldAttemptReset(): boolean {
    return this.lastFailureTime !== undefined &&
      Date.now() - this.lastFailureTime.getTime() >= this.config.resetTimeout;
  }
}
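
The `RollingWindow`, `CircuitOpenError`, and `TimeoutError` helpers referenced above aren't shown; here's a minimal sketch of what they need to provide (our stand-ins, not the original implementations):

typescript
class CircuitOpenError extends Error {}
class TimeoutError extends Error {}

// Counts successes and failures, forgetting anything older than windowMs.
class RollingWindow {
  private events: { time: number; ok: boolean }[] = [];

  constructor(private readonly windowMs: number) {}

  recordSuccess(): void { this.record(true); }
  recordFailure(): void { this.record(false); }

  getStats(): { totalRequests: number; failures: number } {
    this.evict();
    const failures = this.events.filter(e => !e.ok).length;
    return { totalRequests: this.events.length, failures };
  }

  private record(ok: boolean): void {
    this.evict();
    this.events.push({ time: Date.now(), ok });
  }

  private evict(): void {
    const cutoff = Date.now() - this.windowMs;
    this.events = this.events.filter(e => e.time >= cutoff);
  }
}

// Usage sketch (paymentConfig and paymentApi are hypothetical stand-ins):
const breaker = new CircuitBreaker<ChargeResponse>(paymentConfig);
const receipt = await breaker.execute(() => paymentApi.charge(order));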

Lessons from Production: What the Tutorials Don't Tell You

1. Timeout is Your Most Important Setting

Analysis of incident patterns shows most failures (around 70%) are caused by slow responses, not complete failures. Setting timeouts aggressively helps:

typescript
const config = {
  timeout: 3000,  // 3 seconds - our P99 is 1.2s, so this catches problems
  // NOT 30000! This killed us: waiting 30s per request = thread exhaustion
};

Example timing from a payment service:

  • Normal P50: 180ms
  • Normal P99: 1.2s
  • Circuit breaker timeout: 3s
  • Result: Significant reduction in cascading failures
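
The same arithmetic generalizes: derive the timeout from measured latency rather than guessing. A small sketch (the helper is ours, not a library function):

typescript
// 2-3x P99 clears normal tail latency but still catches genuine hangs.
function timeoutFromP99(p99Ms: number, multiplier = 2.5): number {
  return Math.ceil(p99Ms * multiplier);
}

timeoutFromP99(1200); // => 3000 ms, the payment service setting above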

2. The Half-Open State Gotcha

Early on, we'd trip to half-open, send one request, succeed, close the circuit, then immediately fail again with full traffic. The fix: require multiple successes before closing.

typescript
// Don't do this
if (testRequest.succeeded) {
  this.state = CircuitState.CLOSED; // Boom! Full traffic returns
}

// Do this instead
if (++this.successCount >= this.config.successThreshold) {
  this.state = CircuitState.CLOSED; // Gradual recovery
}

3. Combine with Retry Logic (But Carefully)

Circuit breakers and retries can create feedback loops. Here's a reliable combination:

typescript
class ResilientClient {
  private circuitBreaker: CircuitBreaker<any>;

  async callWithResilience(request: Request): Promise<Response> {
    // Circuit breaker wraps retry logic, not vice versa
    return this.circuitBreaker.execute(async () => {
      return await this.retryWithBackoff(request, {
        maxAttempts: 3,
        backoffMs: [100, 200, 400],
        shouldRetry: (error) => {
          // Don't retry circuit breaker errors
          if (error instanceof CircuitOpenError) return false;
          // Don't retry client errors
          if (error.statusCode >= 400 && error.statusCode < 500) return false;
          return true;
        }
      });
    });
  }
}

4. Monitor the Right Metrics

What to track (in order of importance):

  1. Circuit state changes - Alert immediately on OPEN
  2. Reset attempt results - Failed resets = ongoing problem
  3. Request rejection rate - Business impact metric
  4. Time in OPEN state - Helps tune reset timeout

The custom metrics feeding our CloudWatch dashboard:

typescript
// Custom metrics we push (AWS SDK v2; .promise() makes the request awaitable)
await cloudwatch.putMetricData({
  Namespace: 'CircuitBreakers',
  MetricData: [
    {
      MetricName: 'StateChange',
      Value: 1,
      Unit: 'Count',
      Dimensions: [
        { Name: 'ServiceName', Value: this.serviceName },
        { Name: 'FromState', Value: oldState },
        { Name: 'ToState', Value: newState }
      ]
    },
    {
      MetricName: 'RejectedRequests',
      Value: rejectedCount,
      Unit: 'Count',
      Dimensions: [{ Name: 'ServiceName', Value: this.serviceName }]
    }
  ]
}).promise();
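
One way to satisfy "alert immediately on OPEN" is to alarm on the rejection metric, which only climbs while the circuit is OPEN. CloudWatch matches dimensions exactly, so this sketch (again assuming the SDK v2 client) uses the same dimension set as the RejectedRequests metric above:

typescript
// Page when the breaker starts rejecting traffic, i.e. the circuit is OPEN.
await cloudwatch.putMetricAlarm({
  AlarmName: `circuit-rejections-${this.serviceName}`,
  Namespace: 'CircuitBreakers',
  MetricName: 'RejectedRequests',
  Dimensions: [{ Name: 'ServiceName', Value: this.serviceName }],
  Statistic: 'Sum',
  Period: 60,                      // evaluate one-minute sums
  EvaluationPeriods: 1,
  Threshold: 1,
  ComparisonOperator: 'GreaterThanOrEqualToThreshold',
  TreatMissingData: 'notBreaching' // no rejections = healthy, not missing
}).promise();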

Advanced Patterns: Beyond Basic Circuit Breaking

Bulkheading: Isolated Circuit Breakers

Don't use one circuit breaker for an entire service. Isolate critical paths:

typescript
class PaymentService {
  private readonly chargeBreaker = new CircuitBreaker(chargeConfig);
  private readonly refundBreaker = new CircuitBreaker(refundConfig);
  private readonly queryBreaker = new CircuitBreaker(queryConfig);

  async chargeCard(request: ChargeRequest): Promise<ChargeResponse> {
    // Charging failures don't affect refunds
    return this.chargeBreaker.execute(() => this.api.charge(request));
  }

  async refundPayment(request: RefundRequest): Promise<RefundResponse> {
    // Refunds stay available even if charges are failing
    return this.refundBreaker.execute(() => this.api.refund(request));
  }
}

This pattern proves valuable during high-traffic periods: when one endpoint becomes overwhelmed, the others remain available.

Fallback Strategies

Not all failures are equal. Sometimes you can degrade gracefully:

typescript
async getProductRecommendations(userId: string): Promise<Product[]> {
  try {
    return await this.recommendationBreaker.execute(
      () => this.mlService.getRecommendations(userId)
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Fallback to simple popularity-based recommendations
      return this.getPopularProducts();
    }
    throw error;
  }
}

Circuit Breaker Inheritance

For microservices calling other microservices, inherit circuit state:

typescript
// API Gateway
if (paymentServiceBreaker.state === CircuitState.OPEN) {
  // Don't even try to call the order service, which depends on payment
  return { error: 'Payment service unavailable', status: 503 };
}

Real-World Configuration Examples

Here's what actually works in production for different service types:

typescript
// External API (payment providers, third-party services)
const externalAPIConfig: CircuitBreakerConfig = {
  failureThreshold: 5,          // 5 consecutive failures
  successThreshold: 2,          // 2 successes to recover
  timeout: 5000,                // 5 second timeout
  resetTimeout: 30000,          // Try recovery after 30s
  volumeThreshold: 10,          // Need 10 requests minimum
  errorThresholdPercentage: 50  // 50% error rate trips
};

// Internal microservice
const internalServiceConfig: CircuitBreakerConfig = {
  failureThreshold: 10,         // More tolerant of individual failures
  successThreshold: 3,
  timeout: 3000,                // Faster timeout
  resetTimeout: 10000,          // Faster recovery attempts
  volumeThreshold: 20,
  errorThresholdPercentage: 30  // More sensitive to error rates
};

// Database connections
const databaseConfig: CircuitBreakerConfig = {
  failureThreshold: 3,          // Quick to trip
  successThreshold: 5,          // Slow to recover
  timeout: 1000,                // Very fast timeout
  resetTimeout: 5000,           // Quick retry
  volumeThreshold: 5,
  errorThresholdPercentage: 20  // Very sensitive
};

Testing Circuit Breakers: Chaos Engineering

You can't trust a circuit breaker you haven't tested. Here's our chaos testing approach:

typescript
describe('Circuit Breaker Chaos Tests', () => {
  it('should handle gradual degradation', async () => {
    const scenarios = [
      { latency: 100, errorRate: 0 },    // Normal
      { latency: 500, errorRate: 0.1 },  // Slight degradation
      { latency: 2000, errorRate: 0.3 }, // Major degradation
      { latency: 5000, errorRate: 0.7 }, // Near failure
    ];

    for (const scenario of scenarios) {
      mockService.setScenario(scenario);
      await runLoadTest(1000); // 1000 requests

      const metrics = await breaker.getMetrics();
      if (scenario.errorRate > 0.5) {
        expect(breaker.state).toBe(CircuitState.OPEN);
      }
    }
  });
});

In production, we use AWS Fault Injection Simulator to randomly inject failures and verify our circuit breakers respond correctly.

The Mistakes That Cost Us

Mistake 1: Client-Side Only Circuit Breaking

We initially implemented circuit breakers only in clients. When the server itself had issues, it couldn't protect itself:

typescript
// Bad: Client protects itself but server still overwhelmed
class Client {
  private breaker = new CircuitBreaker();
  async call() { return this.breaker.execute(() => fetch('/api')); }
}

// Good: Server also protects itself
class Server {
  private downstreamBreaker = new CircuitBreaker();

  async handleRequest(req, res) {
    try {
      const data = await this.downstreamBreaker.execute(() =>
        this.database.query(req.query)
      );
      res.json(data);
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        res.status(503).json({ error: 'Service temporarily unavailable' });
      } else {
        res.status(500).json({ error: 'Internal error' });
      }
    }
  }
}

Mistake 2: Sharing Circuit Breakers Across Unrelated Operations

We had one circuit breaker for "database operations". When writes failed, reads were also blocked:

typescript
// Bad: One breaker for everything
class UserService {
  private dbBreaker = new CircuitBreaker();

  async getUser(id) {
    return this.dbBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.dbBreaker.execute(() => db.query('INSERT...'));
  }
}

// Good: Separate breakers for different operations
class UserService {
  private readBreaker = new CircuitBreaker(readConfig);
  private writeBreaker = new CircuitBreaker(writeConfig);

  async getUser(id) {
    return this.readBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.writeBreaker.execute(() => db.query('INSERT...'));
  }
}

Mistake 3: Not Considering Business Impact

We treated all services equally. Then we blocked payment processing while letting metrics collection through. Learned that lesson quickly.
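
What fixing it looked like, roughly: fold business priority into the breaker configuration so revenue-critical paths are the last to be cut off. A sketch with hypothetical config names:

typescript
// Critical paths tolerate more failures and probe for recovery quickly;
// best-effort paths (metrics, recommendations) shed load first.
const criticalPathConfig: CircuitBreakerConfig = {
  failureThreshold: 10,
  successThreshold: 2,
  timeout: 5000,
  resetTimeout: 5000,
  volumeThreshold: 20,
  errorThresholdPercentage: 60
};

const bestEffortConfig: CircuitBreakerConfig = {
  failureThreshold: 3,
  successThreshold: 5,
  timeout: 1000,
  resetTimeout: 30000,
  volumeThreshold: 5,
  errorThresholdPercentage: 20
};

const paymentBreaker = new CircuitBreaker<ChargeResponse>(criticalPathConfig);
const metricsBreaker = new CircuitBreaker<void>(bestEffortConfig);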

The Implementation Checklist

When implementing circuit breakers, here's a useful checklist:

  • Set timeout to 2-3x your P99 latency
  • Require multiple successes before closing from half-open
  • Implement separate breakers for read/write operations
  • Add fallback behavior for business-critical paths
  • Export metrics for state changes and rejections
  • Test with chaos engineering before production
  • Document timeout and threshold choices
  • Alert on circuit OPEN, not on individual failures
  • Consider business priority in configuration
  • Implement gradual recovery, not instant

Final Thoughts: It's About Failing Fast

A key insight: sometimes the best thing a service can do is fail immediately. A 503 response in 10ms is far better than a timeout after 30 seconds: users can retry quickly, callers can fall back, and the system never reaches the thread exhaustion that turns a slow dependency into a full outage.

Circuit breakers aren't about preventing failures - they're about preventing failures from spreading. They're about maintaining enough system health that when the problem is fixed, you can actually recover.

Implementing circuit breakers before you encounter problems makes crisis response much smoother.
