Building a Scalable User Notification System: Architecture and Database Design

A notification feature starts as "send an email when X happens" and turns into a multi-channel delivery problem within a quarter: email, SMS, push, and in-app, each with its own delivery guarantees, retry semantics, and user-preference surface. The architectural mistake is treating this as a template-plus-send problem; the actual problem is a router that has to decide (per user, per channel, per event) whether to fan out, coalesce, suppress, or defer, and has to keep that decision auditable for compliance and support.

This post is part 1 of a series on building a production notification system. It covers the event-driven architecture (producers, router, dispatcher), the database schema for events, preferences, and delivery state, the channel router pattern, and the observability surface that makes failures debuggable before they reach the user.

The Hidden Complexity of "Simple" Notifications

Here's what I thought notifications were when I was younger: trigger event → send message → done. Here's what they are: complex orchestration of user preferences, delivery channels, rate limiting, retry logic, template management, analytics tracking, and regulatory compliance.

The wake-up call usually comes during your first major product launch. You've got 10,000 users suddenly getting welcome emails, password resets, and activity notifications all at once. Your email service starts throttling, your database connection pool maxes out, and users start complaining about duplicate notifications. Sound familiar?

System Architecture: Learning From Production Pain

Let me walk you through the architecture that's served me well across different scales and industries. This isn't theoretical - every component here exists because something broke in production.

Event-Driven Architecture

The first lesson I learned: notifications are not request-response operations. They're fire-and-forget events that need to be processed asynchronously. Here's the event structure that's worked across multiple systems:

typescript

interface NotificationEvent {  id: string;  userId: string;  type: NotificationType;  templateId?: string;  data: Record<string, any>;  priority: 'low' | 'normal' | 'high' | 'critical';  scheduledAt?: Date;  expiresAt?: Date;  metadata: {    source: string;    correlationId: string;    retryCount: number;    maxRetries: number;  };}
enum NotificationType {  PROJECT_UPDATE = 'project_update',  SECURITY_ALERT = 'security_alert',   FEATURE_ANNOUNCEMENT = 'feature_announcement',  SYSTEM_MAINTENANCE = 'system_maintenance',  USER_ACTIVITY = 'user_activity',  INTEGRATION_UPDATE = 'integration_update'}

interface NotificationEvent {  id: string;  userId: string;  type: NotificationType;  templateId?: string;  data: Record<string, any>;  priority: 'low' | 'normal' | 'high' | 'critical';  scheduledAt?: Date;  expiresAt?: Date;  metadata: {    source: string;    correlationId: string;    retryCount: number;    maxRetries: number;  };}
enum NotificationType {  PROJECT_UPDATE = 'project_update',  SECURITY_ALERT = 'security_alert',   FEATURE_ANNOUNCEMENT = 'feature_announcement',  SYSTEM_MAINTENANCE = 'system_maintenance',  USER_ACTIVITY = 'user_activity',  INTEGRATION_UPDATE = 'integration_update'}

The metadata section is crucial. That correlation ID has saved me countless debugging hours when tracing notification flows across distributed systems.

The Notification Engine: Heart of the System

The notification engine is where most of the complexity lives. Here's what I've learned after building several iterations:

typescript

class NotificationEngine {  constructor(    private eventBus: EventBus,    private templateService: TemplateService,    private preferenceManager: PreferenceManager,    private rateLimiter: RateLimiter,    private channelRouter: ChannelRouter,    private analytics: AnalyticsService  ) {}
  async processEvent(event: NotificationEvent): Promise<void> {    try {      // Check if user exists and is active      const user = await this.getUserWithPreferences(event.userId);      if (!user?.isActive) {        await this.analytics.trackSkipped(event.id, 'user_inactive');        return;      }
      // Apply user preferences filtering      const enabledChannels = await this.preferenceManager        .getEnabledChannels(event.userId, event.type);            if (enabledChannels.length === 0) {        await this.analytics.trackSkipped(event.id, 'all_channels_disabled');        return;      }
      // Rate limiting check      const rateLimitResult = await this.rateLimiter        .checkLimits(event.userId, event.type);            if (!rateLimitResult.allowed) {        await this.scheduleRetry(event, rateLimitResult.retryAfter);        return;      }
      // Process each enabled channel      const deliveryPromises = enabledChannels.map(channel =>         this.processChannel(event, channel, user)      );
      const results = await Promise.allSettled(deliveryPromises);      await this.analytics.trackDeliveryResults(event.id, results);
    } catch (error) {      await this.handleProcessingError(event, error);    }  }
  private async processChannel(    event: NotificationEvent,    channel: NotificationChannel,    user: User  ): Promise<DeliveryResult> {    // Template rendering with user data    const template = await this.templateService.getTemplate(      event.type,      channel,      user.locale    );
    const renderedContent = await this.templateService.render(      template,      { ...event.data, user }    );
    // Route to appropriate channel handler    return await this.channelRouter.deliver(      channel,      user,      renderedContent,      event.metadata    );  }}

class NotificationEngine {  constructor(    private eventBus: EventBus,    private templateService: TemplateService,    private preferenceManager: PreferenceManager,    private rateLimiter: RateLimiter,    private channelRouter: ChannelRouter,    private analytics: AnalyticsService  ) {}
  async processEvent(event: NotificationEvent): Promise<void> {    try {      // Check if user exists and is active      const user = await this.getUserWithPreferences(event.userId);      if (!user?.isActive) {        await this.analytics.trackSkipped(event.id, 'user_inactive');        return;      }
      // Apply user preferences filtering      const enabledChannels = await this.preferenceManager        .getEnabledChannels(event.userId, event.type);            if (enabledChannels.length === 0) {        await this.analytics.trackSkipped(event.id, 'all_channels_disabled');        return;      }
      // Rate limiting check      const rateLimitResult = await this.rateLimiter        .checkLimits(event.userId, event.type);            if (!rateLimitResult.allowed) {        await this.scheduleRetry(event, rateLimitResult.retryAfter);        return;      }
      // Process each enabled channel      const deliveryPromises = enabledChannels.map(channel =>         this.processChannel(event, channel, user)      );
      const results = await Promise.allSettled(deliveryPromises);      await this.analytics.trackDeliveryResults(event.id, results);
    } catch (error) {      await this.handleProcessingError(event, error);    }  }
  private async processChannel(    event: NotificationEvent,    channel: NotificationChannel,    user: User  ): Promise<DeliveryResult> {    // Template rendering with user data    const template = await this.templateService.getTemplate(      event.type,      channel,      user.locale    );
    const renderedContent = await this.templateService.render(      template,      { ...event.data, user }    );
    // Route to appropriate channel handler    return await this.channelRouter.deliver(      channel,      user,      renderedContent,      event.metadata    );  }}

The key insight here: every operation can fail, and you need to handle failures gracefully while maintaining visibility into what's happening.

Database Design: The Foundation That Makes or Breaks You

I've redesigned notification databases three times across different companies. Each time, I learned something new about what actually matters in production. Here's the schema that's stood the test of time and scale:

Core Tables

sql

-- Users table (assuming it exists)CREATE TABLE users (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    email VARCHAR(255) UNIQUE NOT NULL,    phone VARCHAR(20),    locale VARCHAR(10) DEFAULT 'en',    timezone VARCHAR(50) DEFAULT 'UTC',    is_active BOOLEAN DEFAULT true,    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW());
-- The preference system - this gets complex fastCREATE TABLE notification_preferences (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    user_id UUID REFERENCES users(id) ON DELETE CASCADE,    notification_type VARCHAR(100) NOT NULL,    channel VARCHAR(50) NOT NULL,    enabled BOOLEAN DEFAULT true,    frequency VARCHAR(20) DEFAULT 'immediate', -- immediate, daily, weekly    quiet_hours_start TIME DEFAULT '22:00:00',    quiet_hours_end TIME DEFAULT '08:00:00',    metadata JSONB DEFAULT '{}', -- for channel-specific settings    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    UNIQUE(user_id, notification_type, channel));
-- Template management - localization is crucialCREATE TABLE notification_templates (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    name VARCHAR(255) NOT NULL,    notification_type VARCHAR(100) NOT NULL,    channel VARCHAR(50) NOT NULL,    locale VARCHAR(10) DEFAULT 'en',    subject VARCHAR(500),    body TEXT NOT NULL,    variables JSONB DEFAULT '{}', -- expected variables    is_active BOOLEAN DEFAULT true,    version INTEGER DEFAULT 1,    created_by UUID REFERENCES users(id),    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    UNIQUE(notification_type, channel, locale, version));

-- Users table (assuming it exists)CREATE TABLE users (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    email VARCHAR(255) UNIQUE NOT NULL,    phone VARCHAR(20),    locale VARCHAR(10) DEFAULT 'en',    timezone VARCHAR(50) DEFAULT 'UTC',    is_active BOOLEAN DEFAULT true,    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW());
-- The preference system - this gets complex fastCREATE TABLE notification_preferences (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    user_id UUID REFERENCES users(id) ON DELETE CASCADE,    notification_type VARCHAR(100) NOT NULL,    channel VARCHAR(50) NOT NULL,    enabled BOOLEAN DEFAULT true,    frequency VARCHAR(20) DEFAULT 'immediate', -- immediate, daily, weekly    quiet_hours_start TIME DEFAULT '22:00:00',    quiet_hours_end TIME DEFAULT '08:00:00',    metadata JSONB DEFAULT '{}', -- for channel-specific settings    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    UNIQUE(user_id, notification_type, channel));
-- Template management - localization is crucialCREATE TABLE notification_templates (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    name VARCHAR(255) NOT NULL,    notification_type VARCHAR(100) NOT NULL,    channel VARCHAR(50) NOT NULL,    locale VARCHAR(10) DEFAULT 'en',    subject VARCHAR(500),    body TEXT NOT NULL,    variables JSONB DEFAULT '{}', -- expected variables    is_active BOOLEAN DEFAULT true,    version INTEGER DEFAULT 1,    created_by UUID REFERENCES users(id),    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    UNIQUE(notification_type, channel, locale, version));

Event Storage and Tracking

The event storage design is where I've made my biggest mistakes. Here's what I learned:

sql

-- Main event table - this gets HUGECREATE TABLE notification_events (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    user_id UUID REFERENCES users(id) ON DELETE CASCADE,    notification_type VARCHAR(100) NOT NULL,    template_id UUID REFERENCES notification_templates(id),    priority VARCHAR(20) DEFAULT 'normal',    data JSONB DEFAULT '{}',    scheduled_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    expires_at TIMESTAMP WITH TIME ZONE,    status VARCHAR(20) DEFAULT 'pending',    processed_at TIMESTAMP WITH TIME ZONE,    correlation_id VARCHAR(255), -- for tracing    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    INDEX idx_notification_events_user_created (user_id, created_at DESC),    INDEX idx_notification_events_status (status, scheduled_at),    INDEX idx_notification_events_correlation (correlation_id));
-- Delivery tracking - separate for performanceCREATE TABLE notification_deliveries (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    event_id UUID REFERENCES notification_events(id) ON DELETE CASCADE,    channel VARCHAR(50) NOT NULL,    status VARCHAR(20) DEFAULT 'pending', -- pending, sent, delivered, failed, bounced    attempt_count INTEGER DEFAULT 0,    max_attempts INTEGER DEFAULT 3,    next_retry_at TIMESTAMP WITH TIME ZONE,    sent_at TIMESTAMP WITH TIME ZONE,    delivered_at TIMESTAMP WITH TIME ZONE,    failed_at TIMESTAMP WITH TIME ZONE,    error_code VARCHAR(50),    error_message TEXT,    provider_id VARCHAR(255), -- external provider message ID    provider_response JSONB,    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    INDEX idx_deliveries_event_channel (event_id, channel),    INDEX idx_deliveries_retry (status, next_retry_at) WHERE status = 'pending');
-- Analytics aggregation table - learned this the hard wayCREATE TABLE notification_metrics (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    date DATE NOT NULL,    hour SMALLINT NOT NULL, -- 0-23    notification_type VARCHAR(100) NOT NULL,    channel VARCHAR(50) NOT NULL,    status VARCHAR(20) NOT NULL,    count INTEGER DEFAULT 1,    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    UNIQUE(date, hour, notification_type, channel, status));

-- Main event table - this gets HUGECREATE TABLE notification_events (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    user_id UUID REFERENCES users(id) ON DELETE CASCADE,    notification_type VARCHAR(100) NOT NULL,    template_id UUID REFERENCES notification_templates(id),    priority VARCHAR(20) DEFAULT 'normal',    data JSONB DEFAULT '{}',    scheduled_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    expires_at TIMESTAMP WITH TIME ZONE,    status VARCHAR(20) DEFAULT 'pending',    processed_at TIMESTAMP WITH TIME ZONE,    correlation_id VARCHAR(255), -- for tracing    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    INDEX idx_notification_events_user_created (user_id, created_at DESC),    INDEX idx_notification_events_status (status, scheduled_at),    INDEX idx_notification_events_correlation (correlation_id));
-- Delivery tracking - separate for performanceCREATE TABLE notification_deliveries (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    event_id UUID REFERENCES notification_events(id) ON DELETE CASCADE,    channel VARCHAR(50) NOT NULL,    status VARCHAR(20) DEFAULT 'pending', -- pending, sent, delivered, failed, bounced    attempt_count INTEGER DEFAULT 0,    max_attempts INTEGER DEFAULT 3,    next_retry_at TIMESTAMP WITH TIME ZONE,    sent_at TIMESTAMP WITH TIME ZONE,    delivered_at TIMESTAMP WITH TIME ZONE,    failed_at TIMESTAMP WITH TIME ZONE,    error_code VARCHAR(50),    error_message TEXT,    provider_id VARCHAR(255), -- external provider message ID    provider_response JSONB,    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    INDEX idx_deliveries_event_channel (event_id, channel),    INDEX idx_deliveries_retry (status, next_retry_at) WHERE status = 'pending');
-- Analytics aggregation table - learned this the hard wayCREATE TABLE notification_metrics (    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),    date DATE NOT NULL,    hour SMALLINT NOT NULL, -- 0-23    notification_type VARCHAR(100) NOT NULL,    channel VARCHAR(50) NOT NULL,    status VARCHAR(20) NOT NULL,    count INTEGER DEFAULT 1,    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),    UNIQUE(date, hour, notification_type, channel, status));

Performance Lessons from Production

Here are the indexing strategies that actually matter when you're processing millions of notifications:

sql

-- Critical indexes based on query patternsCREATE INDEX idx_events_user_type_created ON notification_events(user_id, notification_type, created_at DESC);
CREATE INDEX idx_events_processing_queue ON notification_events(status, scheduled_at) WHERE status IN ('pending', 'retry');
CREATE INDEX idx_deliveries_retry_queue ON notification_deliveries(next_retry_at, status) WHERE status = 'pending' AND next_retry_at IS NOT NULL;
-- Partial indexes for common queriesCREATE INDEX idx_events_recent_active ON notification_events(created_at DESC) WHERE created_at > NOW() - INTERVAL '7 days';
-- For analytics queriesCREATE INDEX idx_metrics_time_type ON notification_metrics(date, hour, notification_type);

-- Critical indexes based on query patternsCREATE INDEX idx_events_user_type_created ON notification_events(user_id, notification_type, created_at DESC);
CREATE INDEX idx_events_processing_queue ON notification_events(status, scheduled_at) WHERE status IN ('pending', 'retry');
CREATE INDEX idx_deliveries_retry_queue ON notification_deliveries(next_retry_at, status) WHERE status = 'pending' AND next_retry_at IS NOT NULL;
-- Partial indexes for common queriesCREATE INDEX idx_events_recent_active ON notification_events(created_at DESC) WHERE created_at > NOW() - INTERVAL '7 days';
-- For analytics queriesCREATE INDEX idx_metrics_time_type ON notification_metrics(date, hour, notification_type);

The partial indexes are crucial. Without them, your analytics queries will start timing out when you hit millions of events.

User Preferences: More Complex Than You Think

User preferences seem straightforward until you hit edge cases. Here's the preference manager that's handled real-world complexity:

typescript

class PreferenceManager {  async getEnabledChannels(    userId: string,     notificationType: string  ): Promise<NotificationChannel[]> {        // Check global user preferences    const userPrefs = await this.db.query(`      SELECT np.channel, np.enabled, np.frequency,              np.quiet_hours_start, np.quiet_hours_end,             u.timezone      FROM notification_preferences np      JOIN users u ON u.id = np.user_id      WHERE np.user_id = $1 AND np.notification_type = $2    `, [userId, notificationType]);
    if (userPrefs.length === 0) {      // Use default preferences for this notification type      return this.getDefaultChannels(notificationType);    }
    const currentTime = new Date();    const enabledChannels: NotificationChannel[] = [];
    for (const pref of userPrefs) {      if (!pref.enabled) continue;
      // Check quiet hours      if (this.isInQuietHours(currentTime, pref)) {        // Check if this is a critical notification that overrides quiet hours        if (!this.isCriticalNotification(notificationType)) {          continue;        }      }
      // Check frequency preferences      if (!this.shouldSendBasedOnFrequency(userId, pref.frequency, notificationType)) {        continue;      }
      enabledChannels.push(pref.channel as NotificationChannel);    }
    return enabledChannels;  }
  private isInQuietHours(currentTime: Date, pref: any): boolean {    // Convert current time to user's timezone    const userTime = moment(currentTime)      .tz(pref.timezone || 'UTC')      .format('HH:mm:ss');
    const quietStart = pref.quiet_hours_start;    const quietEnd = pref.quiet_hours_end;
    // Handle quiet hours that cross midnight    if (quietStart > quietEnd) {      return userTime >= quietStart || userTime <= quietEnd;    }
    return userTime >= quietStart && userTime <= quietEnd;  }}

class PreferenceManager {  async getEnabledChannels(    userId: string,     notificationType: string  ): Promise<NotificationChannel[]> {        // Check global user preferences    const userPrefs = await this.db.query(`      SELECT np.channel, np.enabled, np.frequency,              np.quiet_hours_start, np.quiet_hours_end,             u.timezone      FROM notification_preferences np      JOIN users u ON u.id = np.user_id      WHERE np.user_id = $1 AND np.notification_type = $2    `, [userId, notificationType]);
    if (userPrefs.length === 0) {      // Use default preferences for this notification type      return this.getDefaultChannels(notificationType);    }
    const currentTime = new Date();    const enabledChannels: NotificationChannel[] = [];
    for (const pref of userPrefs) {      if (!pref.enabled) continue;
      // Check quiet hours      if (this.isInQuietHours(currentTime, pref)) {        // Check if this is a critical notification that overrides quiet hours        if (!this.isCriticalNotification(notificationType)) {          continue;        }      }
      // Check frequency preferences      if (!this.shouldSendBasedOnFrequency(userId, pref.frequency, notificationType)) {        continue;      }
      enabledChannels.push(pref.channel as NotificationChannel);    }
    return enabledChannels;  }
  private isInQuietHours(currentTime: Date, pref: any): boolean {    // Convert current time to user's timezone    const userTime = moment(currentTime)      .tz(pref.timezone || 'UTC')      .format('HH:mm:ss');
    const quietStart = pref.quiet_hours_start;    const quietEnd = pref.quiet_hours_end;
    // Handle quiet hours that cross midnight    if (quietStart > quietEnd) {      return userTime >= quietStart || userTime <= quietEnd;    }
    return userTime >= quietStart && userTime <= quietEnd;  }}

The timezone handling alone took me three iterations to get right. Don't underestimate how complex user preferences become in a global application.

Template System: Localization and Personalization

Templates are where the rubber meets the road for user experience. Here's the template service that handles localization, personalization, and A/B testing:

typescript

interface Template {  id: string;  name: string;  type: string;  channel: string;  locale: string;  subject?: string;  body: string;  variables: Record<string, TemplateVariable>;  abTest?: ABTestConfig;}
class TemplateService {  async getTemplate(    notificationType: string,    channel: NotificationChannel,    locale: string = 'en'  ): Promise<Template> {        // Try to get localized template first    let template = await this.db.findTemplate({      type: notificationType,      channel,      locale,      isActive: true    });
    // Fallback to English if no localized version    if (!template && locale !== 'en') {      template = await this.db.findTemplate({        type: notificationType,        channel,        locale: 'en',        isActive: true      });    }
    if (!template) {      throw new Error(`No template found for ${notificationType}/${channel}/${locale}`);    }
    return template;  }
  async render(template: Template, data: Record<string, any>): Promise<RenderedContent> {    try {      // Validate required variables      await this.validateTemplateData(template, data);
      // Process template with Handlebars or similar      const subject = template.subject         ? await this.renderString(template.subject, data)        : undefined;
      const body = await this.renderString(template.body, data);
      return {        subject,        body,        templateId: template.id,        locale: template.locale      };
    } catch (error) {      // Log template rendering errors for debugging      await this.logger.error('Template rendering failed', {        templateId: template.id,        error: error.message,        data: this.sanitizeDataForLogging(data)      });            throw new TemplateRenderError(`Failed to render template ${template.id}`, error);    }  }}

interface Template {  id: string;  name: string;  type: string;  channel: string;  locale: string;  subject?: string;  body: string;  variables: Record<string, TemplateVariable>;  abTest?: ABTestConfig;}
class TemplateService {  async getTemplate(    notificationType: string,    channel: NotificationChannel,    locale: string = 'en'  ): Promise<Template> {        // Try to get localized template first    let template = await this.db.findTemplate({      type: notificationType,      channel,      locale,      isActive: true    });
    // Fallback to English if no localized version    if (!template && locale !== 'en') {      template = await this.db.findTemplate({        type: notificationType,        channel,        locale: 'en',        isActive: true      });    }
    if (!template) {      throw new Error(`No template found for ${notificationType}/${channel}/${locale}`);    }
    return template;  }
  async render(template: Template, data: Record<string, any>): Promise<RenderedContent> {    try {      // Validate required variables      await this.validateTemplateData(template, data);
      // Process template with Handlebars or similar      const subject = template.subject         ? await this.renderString(template.subject, data)        : undefined;
      const body = await this.renderString(template.body, data);
      return {        subject,        body,        templateId: template.id,        locale: template.locale      };
    } catch (error) {      // Log template rendering errors for debugging      await this.logger.error('Template rendering failed', {        templateId: template.id,        error: error.message,        data: this.sanitizeDataForLogging(data)      });            throw new TemplateRenderError(`Failed to render template ${template.id}`, error);    }  }}

Rate Limiting: Protecting Users and Providers

Rate limiting is where you balance user experience with system stability. Here's what I've learned about implementing effective rate limiting:

typescript

interface RateLimitConfig {  notificationType: string;  channel: string;  limits: {    perMinute: number;    perHour: number;    perDay: number;  };  burstAllowance: number;}
class RateLimiter {  constructor(private redis: Redis, private configs: RateLimitConfig[]) {}
  async checkLimits(    userId: string,     notificationType: string  ): Promise<RateLimitResult> {        const config = this.getConfig(notificationType);    if (!config) {      return { allowed: true, remainingToday: Infinity };    }
    const now = Date.now();    const keys = {      minute: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 60000)}`,      hour: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 3600000)}`,      day: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 86400000)}`    };
    // Use Redis pipeline for atomic checks    const pipeline = this.redis.pipeline();    pipeline.incr(keys.minute);    pipeline.expire(keys.minute, 60);    pipeline.incr(keys.hour);    pipeline.expire(keys.hour, 3600);    pipeline.incr(keys.day);    pipeline.expire(keys.day, 86400);
    const results = await pipeline.exec();    const counts = {      minute: results[0][1] as number,      hour: results[2][1] as number,      day: results[4][1] as number    };
    // Check against limits    if (counts.minute > config.limits.perMinute ||        counts.hour > config.limits.perHour ||        counts.day > config.limits.perDay) {            return {        allowed: false,        retryAfter: this.calculateRetryAfter(counts, config),        remainingToday: Math.max(0, config.limits.perDay - counts.day)      };    }
    return {      allowed: true,      remainingToday: config.limits.perDay - counts.day    };  }}

interface RateLimitConfig {  notificationType: string;  channel: string;  limits: {    perMinute: number;    perHour: number;    perDay: number;  };  burstAllowance: number;}
class RateLimiter {  constructor(private redis: Redis, private configs: RateLimitConfig[]) {}
  async checkLimits(    userId: string,     notificationType: string  ): Promise<RateLimitResult> {        const config = this.getConfig(notificationType);    if (!config) {      return { allowed: true, remainingToday: Infinity };    }
    const now = Date.now();    const keys = {      minute: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 60000)}`,      hour: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 3600000)}`,      day: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 86400000)}`    };
    // Use Redis pipeline for atomic checks    const pipeline = this.redis.pipeline();    pipeline.incr(keys.minute);    pipeline.expire(keys.minute, 60);    pipeline.incr(keys.hour);    pipeline.expire(keys.hour, 3600);    pipeline.incr(keys.day);    pipeline.expire(keys.day, 86400);
    const results = await pipeline.exec();    const counts = {      minute: results[0][1] as number,      hour: results[2][1] as number,      day: results[4][1] as number    };
    // Check against limits    if (counts.minute > config.limits.perMinute ||        counts.hour > config.limits.perHour ||        counts.day > config.limits.perDay) {            return {        allowed: false,        retryAfter: this.calculateRetryAfter(counts, config),        remainingToday: Math.max(0, config.limits.perDay - counts.day)      };    }
    return {      allowed: true,      remainingToday: config.limits.perDay - counts.day    };  }}

What I Wish I'd Known Starting Out

After building notification systems that handle millions of messages daily, here are the lessons that would have saved me months of refactoring:

Start with idempotency: Every notification operation should be idempotent. Users will complain about duplicates more than missing notifications.
Design for observability: You'll spend more time debugging delivery issues than building features. Correlation IDs and detailed logging aren't optional.
Separate concerns early: Don't let your notification engine become a monolith. Each channel should be independently deployable and scalable.
Plan for data retention: Notification data grows fast. Have a retention and archiving strategy from day one.
User preferences are complex: What seems like a simple on/off switch becomes timezone-aware, frequency-based, channel-specific preferences with quiet hours and emergency overrides.

In the next part of this series, we'll dive into the real-time delivery mechanisms - WebSocket connections, push notifications, and the channel-specific implementations that make it all work. We'll also cover the production incidents that taught me why retry logic and circuit breakers aren't just nice-to-have features.

The foundation we've built here might seem over-engineered for a simple notification system, but trust me - when you're debugging why 50,000 users didn't get their password reset emails during a product launch, you'll be grateful for every piece of observability and resilience we've baked in.

References

typescriptlang.org - TypeScript Handbook and language reference.
github.com - TypeScript project wiki (FAQ and design notes).
postgresql.org - PostgreSQL official documentation.
martinfowler.com - Martin Fowler on software architecture (index).
developer.mozilla.org - MDN Web Docs (web platform reference).
semver.org - Semantic Versioning specification.

Building a Scalable User Notification System: Architecture and Database Design

The Hidden Complexity of "Simple" Notifications

System Architecture: Learning From Production Pain

Event-Driven Architecture

The Notification Engine: Heart of the System

Database Design: The Foundation That Makes or Breaks You

Core Tables

Event Storage and Tracking

Performance Lessons from Production

User Preferences: More Complex Than You Think

Template System: Localization and Personalization

Rate Limiting: Protecting Users and Providers

What I Wish I'd Known Starting Out

References