Building AWS Serverless with TypeScript: Hard-Won Lessons from Lambda at Scale

Why I moved from Express.js to Lambda, the costly mistakes I made along the way, and the TypeScript patterns that saved my team thousands in AWS bills.

I was running a traditional Express.js API on EC2 instances. Fixed costs, predictable scaling, 99.9% uptime. Life was good. Then our biggest client asked for a feature that needed to process 50,000 webhooks in under 10 minutes, once per month.

Keeping EC2 instances running 24/7 for a 10-minute monthly spike felt wasteful. That's when I dove headfirst into AWS Lambda. Here's what I learned from building production Lambda functions, making every serverless mistake possible, and spending way too much on AWS bills.

Why I Finally Embraced Serverless (After Years of Resistance)

I used to be that guy who called serverless "vendor lock-in with extra steps." Coming from a background of managing Kubernetes clusters and fine-tuning JVM garbage collectors, Lambda felt like giving up control. But three incidents changed my mind:

The Unexpected Traffic Spike (June 2022)

Our Express API got featured on Hacker News at 2 AM. Traffic went from 100 req/min to 5,000 req/min. Our auto-scaling group took 8 minutes to spin up new instances. By then, we'd experienced significant payment processing failures and our Redis cache was overwhelmed.

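Lambda would have absorbed that spike in seconds rather than minutes, because it adds execution environments per request instead of waiting for instances to boot. This incident showed me the value of scaling that needs no warm-up.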

The Webhook Processing Challenge (August 2022)

A client needed to process Stripe webhooks that could arrive in bursts of 10,000+ events. With EC2, we had two bad options:

  1. Over-provision for peak load (expensive)
  2. Use queues and risk webhook timeouts (unreliable)

Lambda's automatic concurrency scaling solved this elegantly. Each webhook got its own function instance. No queues, no timeouts, no over-provisioning.

The Compute Utilization Analysis (October 2022)

Analyzing our actual compute utilization revealed that our API servers were idle 87% of the time, yet we paid for 100% capacity. The monthly costs for unused resources added up significantly.

Lambda's pay-per-millisecond model addressed this inefficiency directly.

The Stack That Actually Works in Production

After burning through multiple approaches, here's what we settled on:

```typescript
// Our production CDK stack - refined through pain
import { Stack, StackProps, Duration, RemovalPolicy } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { RestApi, LambdaIntegration, Cors, MethodLoggingLevel } from 'aws-cdk-lib/aws-apigateway';
import { Table, AttributeType, BillingMode } from 'aws-cdk-lib/aws-dynamodb';
import { Runtime, Tracing } from 'aws-cdk-lib/aws-lambda';

export class ProductionServerlessStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // DynamoDB table - learned to use single-table design the hard way
    const dataTable = new Table(this, 'DataTable', {
      partitionKey: { name: 'PK', type: AttributeType.STRING },
      sortKey: { name: 'SK', type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,  // On-demand pricing saved us during spikes
      // Point-in-time recovery saved us from a junior dev's DELETE mistake
      pointInTimeRecovery: true,
      removalPolicy: RemovalPolicy.RETAIN,  // Never accidentally delete prod data
    });

    // Add GSI for querying by different access patterns
    dataTable.addGlobalSecondaryIndex({
      indexName: 'GSI1',
      partitionKey: { name: 'GSI1PK', type: AttributeType.STRING },
      sortKey: { name: 'GSI1SK', type: AttributeType.STRING },
    });

    // Lambda function with production-ready settings
    const apiHandler = new NodejsFunction(this, 'ApiHandler', {
      entry: 'src/handlers/api.ts',
      runtime: Runtime.NODEJS_20_X,
      // Memory sizing based on actual profiling, not guesses
      memorySize: 1024,  // Sweet spot for our JSON processing workload
      timeout: Duration.seconds(28),  // Just under API Gateway's 29s limit
      environment: {
        TABLE_NAME: dataTable.tableName,
        NODE_ENV: 'production',
        // Enable connection reuse for DynamoDB
        AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
        // Custom env vars
        LOG_LEVEL: 'info',
        ENABLE_X_RAY: 'true',
      },
      bundling: {
        minify: true,
        target: 'node20',
        // Exclude aws-sdk from bundle - Lambda runtime provides it
        externalModules: ['@aws-sdk/*'],
        // Tree shaking: esbuild does this by default when bundling
        // Source maps for debugging prod issues
        sourceMap: true,
        // Define for dead code elimination
        define: {
          'process.env.NODE_ENV': '"production"',
        },
      },
      // Enable X-Ray tracing for debugging
      tracing: Tracing.ACTIVE,
      // Reserved concurrency to prevent Lambda from consuming entire account limit
      reservedConcurrentExecutions: 100,
    });

    // Grant DynamoDB permissions
    dataTable.grantReadWriteData(apiHandler);

    // API Gateway with proper CORS and throttling
    const api = new RestApi(this, 'ServerlessApi', {
      restApiName: 'production-serverless-api',
      description: 'Production serverless API with proper error handling',
      defaultCorsPreflightOptions: {
        allowOrigins: process.env.NODE_ENV === 'production'
          ? ['https://yourdomain.com']
          : Cors.ALL_ORIGINS,
        allowMethods: Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization', 'X-Amz-Date'],
      },
      deployOptions: {
        // Stage-specific throttling
        throttlingRateLimit: 1000,
        throttlingBurstLimit: 2000,
        // Enable detailed CloudWatch metrics
        metricsEnabled: true,
        loggingLevel: MethodLoggingLevel.INFO,
        // Enable X-Ray tracing
        tracingEnabled: true,
      },
    });

    // Add resource with proper integration
    const items = api.root.addResource('items');
    items.addMethod('GET', new LambdaIntegration(apiHandler));
    items.addMethod('POST', new LambdaIntegration(apiHandler));

    const singleItem = items.addResource('{id}');
    singleItem.addMethod('GET', new LambdaIntegration(apiHandler));
    singleItem.addMethod('PUT', new LambdaIntegration(apiHandler));
    singleItem.addMethod('DELETE', new LambdaIntegration(apiHandler));
  }
}
```

The Lambda Handler That Handles Reality

Here's our production Lambda handler, complete with all the error handling and optimizations learned from countless production incidents:

```typescript
// src/handlers/api.ts
import { APIGatewayProxyHandler, APIGatewayProxyResult } from 'aws-lambda';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand, PutCommand, QueryCommand } from '@aws-sdk/lib-dynamodb';

// Create DynamoDB client outside handler for connection reuse
const dynamoClient = new DynamoDBClient({
  region: process.env.AWS_REGION,
  // Retry and timeout settings; part of the tuning that reduced our costs by 15%
  maxAttempts: 3,
  requestHandler: {
    connectionTimeout: 1000,
    socketTimeout: 1000,
  },
});

const docClient = DynamoDBDocumentClient.from(dynamoClient, {
  marshallOptions: {
    removeUndefinedValues: true,  // Prevents DynamoDB validation errors
    convertEmptyValues: false,
  },
});

interface Item {
  id: string;
  name: string;
  description?: string;
  createdAt: string;
  updatedAt: string;
}

// The handler that processes high-volume requests
export const handler: APIGatewayProxyHandler = async (event): Promise<APIGatewayProxyResult> => {
  // Performance optimization: parse once, use everywhere
  const { httpMethod, pathParameters, body, requestContext } = event;
  const requestId = requestContext.requestId;

  // Structured logging that actually helps during incidents
  console.log('Request received', {
    requestId,
    method: httpMethod,
    path: event.path,
    pathParams: pathParameters,
    userAgent: event.headers['User-Agent'],
    sourceIp: event.requestContext.identity.sourceIp,
  });

  try {
    switch (httpMethod) {
      case 'GET':
        return await handleGet(pathParameters?.id, requestId);
      case 'POST':
        return await handlePost(body, requestId);
      // handlePut and handleDelete are not shown in this post
      case 'PUT':
        return await handlePut(pathParameters?.id, body, requestId);
      case 'DELETE':
        return await handleDelete(pathParameters?.id, requestId);
      default:
        return createResponse(405, { error: 'Method not allowed' });
    }
  } catch (error) {
    // Catch variables are `unknown` under strict settings, so narrow before use
    const err = error instanceof Error ? error : new Error(String(error));

    // Error handling that survived production incidents
    console.error('Handler error', {
      requestId,
      error: err.message,
      stack: err.stack,
      // Sanitized request data (never log sensitive info)
      method: httpMethod,
      path: event.path,
    });

    // Different error responses based on error type
    if (err.name === 'ValidationException') {
      return createResponse(400, { error: 'Invalid request data' });
    }

    if (err.name === 'ConditionalCheckFailedException') {
      return createResponse(409, { error: 'Resource conflict' });
    }

    if (err.name === 'ResourceNotFoundException') {
      return createResponse(404, { error: 'Resource not found' });
    }

    // Generic server error for unexpected issues
    return createResponse(500, {
      error: 'Internal server error',
      requestId,  // Include for support tickets
    });
  }
};

async function handleGet(id: string | undefined, requestId: string): Promise<APIGatewayProxyResult> {
  if (!id) {
    // List items: first page only (this Query does not follow LastEvaluatedKey)
    const result = await docClient.send(new QueryCommand({
      TableName: process.env.TABLE_NAME!,
      KeyConditionExpression: 'PK = :pk',
      ExpressionAttributeValues: {
        ':pk': 'ITEM',
      },
      Limit: 50,  // Cap the page size so large reads don't time out
    }));

    const items = result.Items?.map(item => ({
      id: item.SK.replace('ITEM#', ''),
      name: item.name,
      description: item.description,
      createdAt: item.createdAt,
      updatedAt: item.updatedAt,
    })) || [];

    return createResponse(200, { items, count: items.length, requestId });
  }

  // Get single item
  const result = await docClient.send(new GetCommand({
    TableName: process.env.TABLE_NAME!,
    Key: {
      PK: 'ITEM',
      SK: `ITEM#${id}`,
    },
  }));

  if (!result.Item) {
    return createResponse(404, { error: 'Item not found', requestId });
  }

  const item: Item = {
    id: result.Item.SK.replace('ITEM#', ''),
    name: result.Item.name,
    description: result.Item.description,
    createdAt: result.Item.createdAt,
    updatedAt: result.Item.updatedAt,
  };

  return createResponse(200, { item, requestId });
}

async function handlePost(body: string | null, requestId: string): Promise<APIGatewayProxyResult> {
  if (!body) {
    return createResponse(400, { error: 'Request body is required', requestId });
  }

  let data: Partial<Item>;
  try {
    data = JSON.parse(body);
  } catch (error) {
    return createResponse(400, { error: 'Invalid JSON', requestId });
  }

  // Validation that prevented many production bugs
  if (!data.name || typeof data.name !== 'string' || data.name.trim().length === 0) {
    return createResponse(400, { error: 'Name is required and must be a non-empty string', requestId });
  }

  if (data.name.length > 100) {
    return createResponse(400, { error: 'Name must be 100 characters or less', requestId });
  }

  const id = generateId();  // Custom ID generation
  const now = new Date().toISOString();

  const item: Item = {
    id,
    name: data.name.trim(),
    description: data.description?.trim() || undefined,
    createdAt: now,
    updatedAt: now,
  };

  // Single-table design with composite keys
  await docClient.send(new PutCommand({
    TableName: process.env.TABLE_NAME!,
    Item: {
      PK: 'ITEM',
      SK: `ITEM#${id}`,
      ...item,
      // GSI keys for alternative access patterns
      GSI1PK: 'ITEMS_BY_NAME',
      GSI1SK: item.name.toLowerCase(),
    },
    // Prevent overwriting existing items
    ConditionExpression: 'attribute_not_exists(PK)',
  }));

  console.log('Item created', { requestId, itemId: id });

  return createResponse(201, { item, requestId });
}

// Utility function for consistent responses
function createResponse(statusCode: number, body: Record<string, unknown>): APIGatewayProxyResult {
  return {
    statusCode,
    headers: {
      'Content-Type': 'application/json',
      'Access-Control-Allow-Origin': '*',  // Adjust for production
      'Access-Control-Allow-Headers': 'Content-Type,Authorization',
      'X-Request-ID': typeof body.requestId === 'string' ? body.requestId : 'unknown',
    },
    body: JSON.stringify(body),
  };
}

// Generate URL-safe unique IDs (slice replaces the deprecated substr)
function generateId(): string {
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 11)}`;
}
```
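The list branch of the handler returns only the first 50 items. DynamoDB reports where a query stopped in `LastEvaluatedKey`, and passing that value back as `ExclusiveStartKey` resumes from there. One way to expose this to API clients is an opaque cursor. The sketch below is an assumption about how that could look, not the handler we run; `encodeCursor` and `decodeCursor` are hypothetical names.

```typescript
type DynamoKey = Record<string, unknown>;

/** Turn a LastEvaluatedKey into an opaque, URL-safe string (undefined on the last page). */
function encodeCursor(lastEvaluatedKey: DynamoKey | undefined): string | undefined {
  if (!lastEvaluatedKey) return undefined;
  return Buffer.from(JSON.stringify(lastEvaluatedKey), 'utf8').toString('base64url');
}

/** Turn a client-supplied cursor back into an ExclusiveStartKey; reject anything malformed. */
function decodeCursor(cursor: string | undefined): DynamoKey | undefined {
  if (!cursor) return undefined;
  try {
    const parsed: unknown = JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'));
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return parsed as DynamoKey;
    }
  } catch {
    // Not valid JSON: fall through to the error below
  }
  throw new Error('Invalid pagination cursor');
}
```

In the handler, the query would take `ExclusiveStartKey: decodeCursor(event.queryStringParameters?.cursor)` and the response would include `nextCursor: encodeCursor(result.LastEvaluatedKey)`. Because the cursor comes from the client, a rejected cursor should map to a 400 rather than a 500.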

Cost Optimization Lessons That Saved Thousands

1. Memory vs. CPU Trade-offs

I spent weeks optimizing our Lambda memory settings. Here's what I learned:

```typescript
// Memory profiling revealed surprising insights
// Note: These are example calculations based on typical workloads - your costs may vary
const memoryConfigs = [
  { memory: 512, avgDuration: 850, avgCost: 0.0012 },   // CPU-bound
  { memory: 1024, avgDuration: 420, avgCost: 0.0009 },  // Sweet spot
  { memory: 1536, avgDuration: 380, avgCost: 0.0011 },  // Diminishing returns
  { memory: 3008, avgDuration: 360, avgCost: 0.0021 },  // Overprovisioned
];
```

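1024 MB was our sweet spot. Lambda allocates CPU in proportion to memory, so more memory means faster execution, and because billing is duration multiplied by memory, a faster run can cost less overall, up to a point.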
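The trade-off follows from how Lambda bills: duration in seconds times configured memory in GB, at a per-GB-second rate, plus a per-request fee. Here is a rough sketch of that arithmetic, assuming the published x86 us-east-1 rates of $0.0000166667 per GB-second and $0.20 per million requests (check current pricing for your region). The function name is illustrative, and the absolute figures are not meant to reproduce the `avgCost` column; what matters is the ranking.

```typescript
// Assumed rates: x86, us-east-1. Verify against the current AWS price list.
const PRICE_PER_GB_SECOND = 0.0000166667;
const PRICE_PER_REQUEST = 0.20 / 1_000_000;

/** Estimated cost in USD of a single invocation. */
function invocationCost(memoryMb: number, durationMs: number): number {
  const gbSeconds = (memoryMb / 1024) * (durationMs / 1000);
  return gbSeconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST;
}

// Memory and duration pairs taken from the profiling results in this section.
const profiled = [
  { memoryMb: 512, durationMs: 850 },
  { memoryMb: 1024, durationMs: 420 },
  { memoryMb: 1536, durationMs: 380 },
  { memoryMb: 3008, durationMs: 360 },
];

for (const { memoryMb, durationMs } of profiled) {
  const perMillion = invocationCost(memoryMb, durationMs) * 1_000_000;
  console.log(`${memoryMb} MB @ ${durationMs} ms: $${perMillion.toFixed(2)} per million invocations`);
}
```

Under these assumptions, doubling memory from 512 MB to 1024 MB roughly halves the duration, so the GB-seconds stay about flat while latency drops. Past that point duration barely moves and the cost climbs with memory.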

2. Connection Reuse Saved 15% on AWS Bills

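```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// Before: a new client, and new TCP/TLS connections, on every invocation = expensive
// export const handler = async () => {
//   const dynamoClient = new DynamoDBClient({ region: 'us-east-1' });
//   ...
// };

// After: one client at module scope, reused across warm invocations = 15% cost reduction
const dynamoClient = new DynamoDBClient({
  region: 'us-east-1',
  maxAttempts: 3,
  requestHandler: {
    connectionTimeout: 1000,
    socketTimeout: 1000,
  },
});

// HTTP keep-alive is on by default in AWS SDK v3.
// AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 only matters for SDK v2; set it in the
// function's environment (as in the CDK stack) rather than in code.
```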

3. Bundle Size Optimization

```typescript
// CDK bundling config that reduced cold starts by 40%
bundling: {
  minify: true,
  target: 'node20',
  externalModules: [
    '@aws-sdk/*',  // Use Lambda runtime version
    'aws-lambda',  // Type definitions only, nothing to ship
  ],
  // Tree shaking: esbuild does this by default when bundling
  sourceMap: process.env.NODE_ENV !== 'production',  // Debug info only in dev
  define: {
    'process.env.NODE_ENV': '"production"',
  },
  banner: '/* Production Lambda bundle */',
  // Critical: keep large dependencies out of the bundle.
  // `nodeModules` takes a list of package names to install as-is instead of
  // bundling, so it cannot select individual functions. Import only what you
  // use, e.g. `import throttle from 'lodash/throttle'`, and esbuild drops the rest.
},
```

The Monitoring Setup That Actually Alerts on Real Issues

After too many false alarms, here's our production monitoring:

```typescript
// CloudWatch alarms that don't cry wolf
import { Duration } from 'aws-cdk-lib';
import { Alarm, MathExpression, Metric, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { IFunction } from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class ServerlessMonitoring extends Construct {
  constructor(scope: Construct, id: string, props: { lambdaFunction: IFunction }) {
    super(scope, id);
    const fn = props.lambdaFunction;

    // Error rate alarm - 5% error rate over 5 minutes.
    // Errors divided by Invocations gives an actual rate; a raw error count does not.
    const errorAlarm = new Alarm(this, 'HighErrorRate', {
      metric: new MathExpression({
        expression: 'errors / invocations',
        usingMetrics: {
          errors: fn.metricErrors({ statistic: 'Sum', period: Duration.minutes(5) }),
          invocations: fn.metricInvocations({ statistic: 'Sum', period: Duration.minutes(5) }),
        },
        period: Duration.minutes(5),
      }),
      threshold: 0.05,  // 5% error rate
      evaluationPeriods: 2,
      // Quiet periods with no data should not trigger the alarm
      treatMissingData: TreatMissingData.NOT_BREACHING,
    });

    // Duration alarm - 95th percentile over 5 seconds
    const durationAlarm = new Alarm(this, 'SlowRequests', {
      metric: fn.metricDuration({
        statistic: 'p95',
        period: Duration.minutes(5),
      }),
      threshold: 5000,  // 5 seconds
      evaluationPeriods: 3,
    });

    // Throttle alarm - any throttling is bad
    const throttleAlarm = new Alarm(this, 'ThrottledRequests', {
      metric: fn.metricThrottles({
        statistic: 'Sum',
        period: Duration.minutes(1),
      }),
      threshold: 1,
      evaluationPeriods: 1,
    });

    // Custom metric for business logic errors
    const businessErrorAlarm = new Alarm(this, 'BusinessLogicErrors', {
      metric: new Metric({
        namespace: 'MyApp/Lambda',
        metricName: 'BusinessErrors',
        statistic: 'Sum',
      }),
      threshold: 10,
      evaluationPeriods: 2,
    });
  }
}
```
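The `BusinessErrors` alarm only fires if something publishes that metric, and this post does not show that part. One common option in Lambda is the CloudWatch Embedded Metric Format: a JSON log line with an `_aws` block that CloudWatch turns into a metric without an extra API call. A minimal sketch, assuming that approach; `buildBusinessErrorMetric` is a hypothetical helper.

```typescript
interface EmfRecord {
  _aws: {
    Timestamp: number;
    CloudWatchMetrics: Array<{
      Namespace: string;
      Dimensions: string[][];
      Metrics: Array<{ Name: string; Unit: string }>;
    }>;
  };
  [key: string]: unknown;
}

/** Build an Embedded Metric Format record for the BusinessErrors metric. */
function buildBusinessErrorMetric(
  count: number,
  dimensions: Record<string, string> = {},
  now: number = Date.now(),
): EmfRecord {
  return {
    _aws: {
      Timestamp: now,
      CloudWatchMetrics: [{
        Namespace: 'MyApp/Lambda',           // Same namespace the alarm watches
        Dimensions: [Object.keys(dimensions)], // One dimension set, possibly empty
        Metrics: [{ Name: 'BusinessErrors', Unit: 'Count' }],
      }],
    },
    ...dimensions,          // Dimension values live at the top level
    BusinessErrors: count,  // So does the metric value
  };
}

// Inside a handler, emitting the metric is a single log line
console.log(JSON.stringify(buildBusinessErrorMetric(1)));
```

If you pass dimensions here, the alarm's `Metric` needs the same ones in `dimensionsMap`, otherwise it watches a different time series and stays silent.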

The Mistakes That Cost Me Sleep (and Money)

1. The Concurrent Execution Limit Issue

During a high-traffic event, our webhook processing Lambda consumed all 1,000 concurrent executions in our AWS account. Our main API experienced downtime because it couldn't get any Lambda capacity.

Fix: Set reserved concurrency on critical functions:

```typescript
reservedConcurrentExecutions: 100,  // Guarantee capacity (this is also the function's ceiling)
```
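Reserved concurrency is carved out of the account's regional limit: the reserved amount is guaranteed to that function, it is also the most that function can use, and every other function shares what is left. AWS additionally requires that at least 100 units stay unreserved. A small sketch of that budget, with a hypothetical helper name and example reservations:

```typescript
// AWS keeps at least this much concurrency unreserved per account and region
const MIN_UNRESERVED = 100;

/** Concurrency left for all functions without a reservation. */
function unreservedConcurrency(accountLimit: number, reserved: Record<string, number>): number {
  const totalReserved = Object.values(reserved).reduce((sum, n) => sum + n, 0);
  const remaining = accountLimit - totalReserved;
  if (remaining < MIN_UNRESERVED) {
    throw new Error(`Only ${remaining} unreserved; at least ${MIN_UNRESERVED} must remain`);
  }
  return remaining;
}

// Example: default limit of 1,000 with two reservations (values are illustrative)
console.log(unreservedConcurrency(1000, { api: 100, webhooks: 300 }));
```

Reserving capacity for the API protects it; putting a reservation on the bursty webhook function as well caps how much of the pool it can take in the first place.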

2. The DynamoDB Hot Partition Problem

Sequential IDs for DynamoDB partition keys caused all traffic to hit one partition. Read/write throttling significantly degraded performance.

Fix: Distributed partition keys:

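```typescript
import * as uuid from 'uuid';

declare const sequentialId: number;  // Placeholder for an auto-incrementing ID

// Bad: Sequential IDs create hot partitions
const badKey = { PK: `USER#${sequentialId}` };

// Good: UUID or timestamp + random
const goodKey = { PK: `USER#${uuid.v4()}` };

// Or: Use current hour + random for time-based access
// (slice(2) drops the leading "0." that Math.random().toString(36) produces)
const timeKey = { PK: `USER#${new Date().getHours()}-${Math.random().toString(36).slice(2)}` };
```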
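A random suffix spreads writes, but it cannot be recomputed later, so you can only find the item again if you stored the full key somewhere. When many items have to share one logical partition, a variation is to derive the suffix from a hash of the item's ID, so the same ID always maps to the same shard. This is a sketch of that idea rather than what we deployed; `shardedKey`, `allShardKeys`, and the shard count are hypothetical.

```typescript
/** 32-bit FNV-1a hash over the string's UTF-16 code units. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;  // Multiply by the FNV prime, keep unsigned
  }
  return hash;
}

/** Partition key for one item: deterministic, so reads can recompute it. */
function shardedKey(prefix: string, id: string, shardCount: number): string {
  return `${prefix}#${fnv1a(id) % shardCount}`;
}

/** All partition keys for a prefix: listing everything means querying each shard. */
function allShardKeys(prefix: string, shardCount: number): string[] {
  return Array.from({ length: shardCount }, (_, shard) => `${prefix}#${shard}`);
}
```

The trade-off is on the read side: a single-item lookup stays one request, while a full listing becomes one query per shard whose results you merge in the application.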

3. The 15-Minute Timeout Discovery

Lambda functions were timing out after exactly 15 minutes. I initially suspected a memory leak, then discovered that Lambda enforces a 15-minute maximum execution time. We were processing large batches synchronously in a single invocation.

Fix: Batch processing with pagination:

```typescript
// Process in smaller chunks
const BATCH_SIZE = 100;
const MAX_EXECUTION_TIME = 14 * 60 * 1000; // 14 minutes
const startTime = Date.now();

for (let i = 0; i < items.length; i += BATCH_SIZE) {
  if (Date.now() - startTime > MAX_EXECUTION_TIME) {
    // Schedule continuation via SQS
    await scheduleRemainingWork(items.slice(i));
    break;
  }

  const batch = items.slice(i, i + BATCH_SIZE);
  await processBatch(batch);
}
```
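A fixed 14-minute budget assumes the function's timeout is set to the full 15 minutes. The Lambda context object exposes `getRemainingTimeInMillis()`, which follows whatever timeout is actually configured. Here is a sketch of the same loop built on that call; every other name (`processWithDeadline`, the batch size, the safety margin) is hypothetical.

```typescript
// The only part of the Lambda context this sketch relies on
interface DeadlineContext {
  getRemainingTimeInMillis(): number;
}

/**
 * Process items in batches until time runs short.
 * Returns the items that were NOT processed so the caller can re-queue them.
 */
async function processWithDeadline<T>(
  items: T[],
  context: DeadlineContext,
  processBatch: (batch: T[]) => Promise<void>,
  batchSize = 100,
  safetyMarginMs = 60_000,
): Promise<T[]> {
  for (let i = 0; i < items.length; i += batchSize) {
    if (context.getRemainingTimeInMillis() < safetyMarginMs) {
      return items.slice(i);  // Hand the rest back, e.g. to send to SQS
    }
    await processBatch(items.slice(i, i + batchSize));
  }
  return [];
}
```

The safety margin should exceed the slowest batch you expect, since the check only happens between batches.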

TypeScript Patterns That Saved My Sanity

1. Strict Event Type Definitions

```typescript
import type { APIGatewayProxyEvent } from 'aws-lambda';

// Custom type definitions for better IntelliSense
interface StrictAPIGatewayEvent extends APIGatewayProxyEvent {
  pathParameters: { [key: string]: string };  // Never null in our setup
  body: string;  // Always present for POST/PUT
}

// Type guards for runtime safety
function isValidItemData(data: any): data is Partial<Item> {
  return typeof data === 'object' &&
         data !== null &&
         (data.name === undefined || typeof data.name === 'string');
}
```

2. Environment Variable Validation

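```typescript
// Validate environment at startup, not runtime
interface Environment {
  TABLE_NAME: string;
  LOG_LEVEL: 'debug' | 'info' | 'warn' | 'error';
  NODE_ENV: 'development' | 'production';
}

const LOG_LEVELS: readonly Environment['LOG_LEVEL'][] = ['debug', 'info', 'warn', 'error'];
const NODE_ENVS: readonly Environment['NODE_ENV'][] = ['development', 'production'];

// Return the value if it is one of the allowed options, otherwise the fallback.
// This replaces the `as any` casts, which let invalid values through unchecked.
function oneOf<T extends string>(value: string | undefined, allowed: readonly T[], fallback: T): T {
  return allowed.includes(value as T) ? (value as T) : fallback;
}

function validateEnvironment(): Environment {
  const env = process.env;

  if (!env.TABLE_NAME) {
    throw new Error('TABLE_NAME environment variable is required');
  }

  return {
    TABLE_NAME: env.TABLE_NAME,
    LOG_LEVEL: oneOf(env.LOG_LEVEL, LOG_LEVELS, 'info'),
    NODE_ENV: oneOf(env.NODE_ENV, NODE_ENVS, 'development'),
  };
}

// Validate once at module load
const ENV = validateEnvironment();
```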

3. Result Types for Error Handling

```typescript
// Rust-inspired Result type for clean error handling
type Result<T, E = Error> =
  | { success: true; data: T }
  | { success: false; error: E };

// transformDynamoItem (not shown) maps a raw DynamoDB record to an Item
async function getItem(id: string): Promise<Result<Item, string>> {
  try {
    const result = await docClient.send(new GetCommand({
      TableName: ENV.TABLE_NAME,
      Key: { PK: 'ITEM', SK: `ITEM#${id}` },
    }));

    if (!result.Item) {
      return { success: false, error: 'Item not found' };
    }

    return { success: true, data: transformDynamoItem(result.Item) };
  } catch (error) {
    // Catch variables are `unknown` under strict settings, so narrow before reading .message
    return { success: false, error: error instanceof Error ? error.message : String(error) };
  }
}

// Usage (inside a handler)
const result = await getItem(id);
if (!result.success) {
  return createResponse(404, { error: result.error });
}
// TypeScript knows result.data is Item
const item = result.data;
```
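One limitation of a plain string error is that the caller cannot tell a missing item from a failed DynamoDB call, so the usage snippet answers 404 in both cases. A possible refinement, sketched here with hypothetical names (`AppError`, `notFound`, `internalError`, `statusFor`), is to make the error a small discriminated union and derive the status code from it.

```typescript
type AppError =
  | { kind: 'not_found'; message: string }
  | { kind: 'internal'; message: string };

type Result<T, E = AppError> =
  | { success: true; data: T }
  | { success: false; error: E };

// One place that decides which HTTP status each error kind maps to
const STATUS_BY_KIND: Record<AppError['kind'], number> = {
  not_found: 404,
  internal: 500,
};

function notFound(message: string): AppError {
  return { kind: 'not_found', message };
}

/** Wrap anything thrown (Error or not) as an internal error. */
function internalError(error: unknown): AppError {
  return { kind: 'internal', message: error instanceof Error ? error.message : String(error) };
}

function statusFor(error: AppError): number {
  return STATUS_BY_KIND[error.kind];
}

// Example: a lookup that distinguishes "missing" from "failed"
function lookup(store: Map<string, string>, id: string): Result<string> {
  const value = store.get(id);
  return value === undefined
    ? { success: false, error: notFound(`No item with id ${id}`) }
    : { success: true, data: value };
}
```

With this shape, `getItem` would return `notFound(...)` when `result.Item` is empty and `internalError(error)` from its catch block, and the handler would call `createResponse(statusFor(result.error), ...)`.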

Performance Insights from Production Data

After 18 months in production with detailed monitoring:

Cold Start Analysis

  • Average cold start: 850ms
  • P95 cold start: 1,200ms
  • Bundle size impact: 10MB bundle = +400ms cold start
  • Memory impact: 1024MB vs 512MB = -200ms cold start
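Numbers like these depend on telling cold invocations apart from warm ones. This post does not show how we tagged them; a common approach relies on the fact that module scope runs once per execution environment, so a module-level flag is true only for the first invocation. A minimal sketch with hypothetical names:

```typescript
// Module scope: evaluated once, when the execution environment starts
let coldStart = true;

/** Returns true on the first call in this execution environment, false afterwards. */
function consumeColdStartFlag(): boolean {
  const wasCold = coldStart;
  coldStart = false;
  return wasCold;
}

// Example handler: tag every log line so cold and warm latencies can be split later
async function exampleHandler(): Promise<{ statusCode: number }> {
  const startedAt = Date.now();
  const cold = consumeColdStartFlag();
  // ... real work would go here ...
  console.log(JSON.stringify({ coldStart: cold, durationMs: Date.now() - startedAt }));
  return { statusCode: 200 };
}
```

This measures handler time only. The initialization phase before the first invocation shows up separately, as the init duration in the function's REPORT log line.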

Cost Breakdown (Monthly)

  • Lambda execution: $89/month (8M invocations)
  • API Gateway: $28/month (8M requests)
  • DynamoDB: $67/month (pay-per-request)
  • CloudWatch logs: $12/month
  • Total: $196/month (vs. $800/month for EC2 equivalent)

Reliability Metrics

  • Uptime: 99.97% (vs. 99.9% on EC2)
  • Error rate: 0.02% (mostly client errors)
  • P95 response time: 180ms

When NOT to Use Serverless

Serverless isn't always the answer. Here's when I stick with containers:

  1. Long-running processes - Video encoding, large batch jobs
  2. WebSocket-heavy apps - Real-time gaming, chat apps
  3. Legacy applications - Complex deployment requirements
  4. Stateful workloads - In-memory caches, sessions
  5. Cold start sensitive - Sub-100ms response requirements

The Deployment Pipeline That Doesn't Break

```typescript
// CDK pipeline for zero-downtime deployments
import { Stack } from 'aws-cdk-lib';
import { CodePipeline, CodePipelineSource, ManualApprovalStep, ShellStep } from 'aws-cdk-lib/pipelines';
import { Construct } from 'constructs';
// ServerlessStage is our own Stage subclass (not shown here; the import path is a placeholder).
// It wraps the stack and exposes the API URL as a CfnOutput named apiUrl.
import { ServerlessStage } from './serverless-stage';

export class ServerlessPipeline extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const pipeline = new CodePipeline(this, 'Pipeline', {
      synth: new ShellStep('Synth', {
        input: CodePipelineSource.gitHub('yourorg/repo', 'main'),
        commands: [
          'npm ci',
          'npm run build',
          'npm run test',
          'npx cdk synth',
        ],
      }),
    });

    // Staged deployments: test first, then prod behind a manual approval
    const testStage = new ServerlessStage(this, 'Test', {
      stageName: 'test',
    });

    const prodStage = new ServerlessStage(this, 'Prod', {
      stageName: 'prod',
    });

    pipeline.addStage(testStage, {
      post: [
        new ShellStep('IntegrationTests', {
          commands: [
            'npm run test:integration',
          ],
          envFromCfnOutput: {
            API_URL: testStage.apiUrl,
          },
        }),
      ],
    });

    pipeline.addStage(prodStage, {
      pre: [
        new ManualApprovalStep('PromoteToProd'),
      ],
      post: [
        new ShellStep('SmokeTests', {
          commands: [
            'npm run test:smoke',
          ],
        }),
      ],
    });
  }
}
```

Final Thoughts

Serverless with TypeScript transformed how our team ships features. We went from weekly deployments to daily deployments. Our monthly AWS bill came in at $196, against roughly $800 for the EC2 equivalent. Our uptime improved from 99.9% to 99.97%.

The biggest benefit? Reduced operational overhead. Fewer emergency calls about server crashes, minimal capacity planning, and no operating system patching.

The serverless learning curve is steep, but the productivity gains are measurable. Start small, implement comprehensive monitoring from day one, and expect to make mistakes during the learning process.

Ready to dive in? Start with a simple CRUD API and build incrementally as you learn the platform's characteristics.
