
From RFC to Production: What They Don't Tell You About Implementation

An honest take on the gap between beautiful RFC designs and messy production reality, featuring real-world lessons from implementing notification systems at scale

Abstract

RFCs rarely survive contact with production unchanged, and that's not necessarily a problem. Through examining notification system implementations, we can learn how elegant designs evolve when they meet organizational constraints, timeline pressures, and unexpected requirements. This exploration reveals patterns that help bridge the gap between theoretical design and practical implementation.

Situation: The Beautiful RFC vs. Production Reality

You know that feeling when you're reading through a beautifully crafted RFC, nodding along to the elegant architecture diagrams, and thinking "This is it, this is the design that will finally work perfectly"? Then six months later you're knee-deep in production issues, the timeline has doubled, and that pristine database schema looks like it went through a blender?

This pattern emerges repeatedly across system implementations. The gap between RFC and production isn't a bug - it's a feature of building complex systems with teams under business pressures. Understanding this gap helps us plan more effectively and set realistic expectations.

Note: The following examples are adapted from multiple notification system implementations across different organizations. While specific details may vary, the patterns and challenges described are representative of common experiences in this domain.

Task: Building a Notification System from RFC to Reality

The task seemed straightforward from the RFC perspective. A comprehensive notification system with clean architecture diagrams, well-planned database schemas, and phased rollout strategies. The specifications looked thorough and the timeline appeared conservative:

```typescript
// The RFC specifications
interface NotificationSystemGoals {
  deliveryTime: '<100ms for in-app, <5s for email';
  throughput: '10,000+ notifications per second';
  uptime: '99.9% availability';
  timeline: '12 weeks with 2 developers';
  budget: '$120,000-180,000';
}

// What emerged in production
interface ProductionReality {
  deliveryTime: '2-3s for in-app on good days, 30s+ during peaks';
  throughput: 'Started at 500/sec, took 6 months to reach 5,000/sec';
  uptime: '97% first quarter, 99% after year one';
  timeline: '8 months with 4 developers plus 2 contractors';
  budget: '$400,000+ and still counting maintenance costs';
}
```

The RFC appeared comprehensive, covering rate limiting, deduplication, preference management, and user experience considerations like quiet hours. The phased approach seemed reasonable - core infrastructure in 4 weeks felt achievable.
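
Of those features, deduplication is a good example of something that looks trivial on paper: it usually reduces to a keyed time-window check. A minimal sketch, assuming a dedup key of user plus notification type (the key scheme and window are illustrative, not from the RFC):

```typescript
// Sketch: time-window deduplication keyed by user + notification type.
// The key scheme and window length are illustrative assumptions.
class NotificationDeduper {
  private lastSent = new Map<string, number>();

  constructor(private windowMs: number) {}

  // Returns true if the notification should go out, false if it is a
  // duplicate within the window. Records the send time as a side effect.
  shouldSend(userId: string, type: string, now: number = Date.now()): boolean {
    const key = `${userId}:${type}`;
    const last = this.lastSent.get(key);
    if (last !== undefined && now - last < this.windowMs) {
      return false; // suppressed as a duplicate
    }
    this.lastSent.set(key, now);
    return true;
  }
}
```

A production version would also need an eviction policy for the map and shared state across instances (e.g. Redis), which is where the complexity hides.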

Action: Implementation Challenges and Adaptations

Database Schema Evolution

The initial database schema design emphasized clean normalization with proper foreign keys and constraints:

```sql
-- Initial RFC schema design
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    template_id UUID REFERENCES notification_templates(id),
    data JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending',
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);
```

Three months into production, the schema had evolved significantly:

```sql
-- Schema after production adaptations
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID,                        -- Foreign key removed due to performance issues
    notification_type VARCHAR(100),
    notification_type_v2 VARCHAR(255),   -- Migration in progress
    template_id UUID,
    template_id_v2 BIGINT,               -- Different team used different ID type
    data JSONB DEFAULT '{}',
    data_compressed BYTEA,               -- Added when JSONB got too large
    status VARCHAR(20) DEFAULT 'pending',
    status_v2 VARCHAR(50),               -- More statuses than expected
    priority INTEGER DEFAULT 0,          -- Not in RFC, critical for production
    retry_count INTEGER DEFAULT 0,       -- Not in RFC, essential for debugging
    channel VARCHAR(50),                 -- Denormalized for query performance
    correlation_id UUID,                 -- Added for distributed tracing
    partition_key INTEGER,               -- Added for sharding
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    failed_at TIMESTAMP,                 -- Not in RFC, very much needed
    expires_at TIMESTAMP,                -- Not in RFC, prevented infinite growth
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()   -- Added after debugging nightmares
);

-- Plus 15 indexes we didn't anticipate
CREATE INDEX CONCURRENTLY idx_notification_events_user_created
    ON notification_events(user_id, created_at DESC)
    WHERE status != 'deleted';
CREATE INDEX CONCURRENTLY idx_notification_events_correlation
    ON notification_events(correlation_id)
    WHERE correlation_id IS NOT NULL;
-- ... and 13 more
```

Each schema change addressed production incidents, performance bottlenecks, or requirements that emerged during implementation. These adaptations reflect the natural evolution from theoretical design to operational system.

WebSocket Connection Management Complexity

The RFC specified WebSocket-based delivery for optimal performance. The initial implementation approach was straightforward:

```typescript
// RFC's WebSocket implementation
class NotificationWebSocketManager {
  private connections: Map<string, WebSocket> = new Map();

  async sendNotification(userId: string, notification: NotificationEvent) {
    const connection = this.connections.get(userId);
    if (connection && connection.readyState === WebSocket.OPEN) {
      connection.send(JSON.stringify({
        type: 'notification',
        data: notification
      }));
    }
  }
}
```

Production requirements revealed additional complexity. After addressing connection management challenges during mobile app deployments, the implementation evolved:

```typescript
// Production implementation addressing edge cases
class NotificationWebSocketManager {
  private connections: Map<string, Set<WebSocketConnection>> = new Map();
  private connectionMetadata: Map<string, ConnectionMetadata> = new Map();
  private healthChecks: Map<string, NodeJS.Timeout> = new Map();
  private rateLimiters: Map<string, RateLimiter> = new Map();
  private deadLetterQueue: Queue<FailedNotification>;
  private circuit: CircuitBreaker;

  async sendNotification(userId: string, notification: NotificationEvent) {
    // 200+ lines of defensive programming
    const connections = this.connections.get(userId);
    if (!connections || connections.size === 0) {
      await this.queueForLaterDelivery(userId, notification);
      return;
    }

    // Handle multiple connections per user (mobile + web + tablet)
    const results = await Promise.allSettled(
      Array.from(connections).map(async (conn) => {
        try {
          // Check connection health
          if (!this.isConnectionHealthy(conn)) {
            await this.reconnectOrEvict(conn);
            throw new Error('Unhealthy connection');
          }

          // Rate limiting per connection
          const limiter = this.getRateLimiter(conn.id);
          if (!await limiter.tryAcquire()) {
            await this.backpressure(conn, notification);
            return;
          }

          // Circuit breaker for cascading failures
          return await this.circuit.fire(async () => {
            // Message size validation (learned this the hard way)
            const message = this.serializeNotification(notification);
            if (message.length > MAX_MESSAGE_SIZE) {
              const chunks = this.chunkMessage(message);
              for (const chunk of chunks) {
                await this.sendChunk(conn, chunk);
              }
            } else {
              await this.sendMessage(conn, message);
            }
          });
        } catch (error) {
          await this.handleDeliveryFailure(conn, notification, error);
        }
      })
    );

    // Track delivery metrics
    await this.recordDeliveryMetrics(userId, notification, results);
  }

  // Plus 50+ other methods for handling edge cases
}
```

Each addition addressed specific production challenges: circuit breakers for cascading failures, message chunking for large payloads, and sophisticated rate limiting for notification storms. These patterns emerge consistently when simple designs meet complex operational requirements.
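
Of those patterns, message chunking is the most self-contained. A minimal sketch of a chunking helper in the spirit of the `chunkMessage` call above; the frame-size limit and the chunk envelope shape are illustrative assumptions, not values from the original system:

```typescript
// Sketch: split an oversized serialized notification into ordered chunks
// that the client can reassemble by messageId. All constants are assumed.
const MAX_MESSAGE_SIZE = 64 * 1024; // assumed per-frame limit (64 KB)

interface MessageChunk {
  messageId: string; // lets the client group chunks of one message
  index: number;     // position of this chunk
  total: number;     // total chunk count, so the client knows when it's done
  payload: string;   // slice of the serialized notification
}

function chunkMessage(
  messageId: string,
  serialized: string,
  maxSize: number = MAX_MESSAGE_SIZE
): MessageChunk[] {
  const total = Math.max(1, Math.ceil(serialized.length / maxSize));
  const chunks: MessageChunk[] = [];
  for (let i = 0; i < total; i++) {
    chunks.push({
      messageId,
      index: i,
      total,
      payload: serialized.slice(i * maxSize, (i + 1) * maxSize),
    });
  }
  return chunks;
}
```

The receiving side then buffers chunks per `messageId` and concatenates payloads once `total` of them have arrived, discarding incomplete sets after a timeout.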

Timeline and Scope Evolution

The RFC outlined a structured development approach:

  • Phase 1 (Weeks 1-4): Core Infrastructure
  • Phase 2 (Weeks 5-8): Advanced Features
  • Phase 3 (Weeks 9-12): Integration & Optimization

The implementation timeline revealed different patterns:

Weeks 1-4: Infrastructure Foundation Challenges

Environment setup and capacity planning consumed more time than anticipated. Database throughput requirements exceeded initial assumptions, and competing production priorities affected team availability.

Weeks 5-12: Scope Expansion

Early demonstrations generated enthusiasm and additional requirements. Channel diversity expanded beyond initial specifications as business needs emerged during development.

```typescript
// Original scope
const originalChannels = ['in_app', 'email', 'push'];

// Month 3 scope
const actualChannels = [
  'in_app',
  'email',
  'push',
  'sms',        // Added week 6
  'slack',      // Added week 8
  'teams',      // Added week 10
  'webhook',    // Added week 11
  'discord',    // Added week 14 (yes, we were already late)
  'voice_call'  // Added week 20 (for critical security alerts)
];
```

Months 4-6: Integration Complexity

The clean API design assumed consistent authentication patterns across services. Production revealed three different authentication systems requiring unified notification support.

```typescript
// RFC assumption
interface AuthContext {
  userId: string;
  token: string;
}

// Production reality
type AuthContext =
  | { type: 'jwt'; userId: string; token: string; claims: JWTClaims }
  | { type: 'oauth2'; userId: string; accessToken: string; refreshToken: string; expiresAt: Date }
  | { type: 'legacy'; sessionId: string; userId?: string; cookieData: LegacyCookie }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string; ipAddress: string };

// Each authentication pattern required specialized handling:
// rate limiting, security validation, and audit requirements
```

Months 7-8: Performance Optimization

While functional, the system required significant performance work to meet throughput requirements. Template rendering emerged as an unexpected bottleneck, with personalization features requiring multiple API calls per notification.

Team Scaling and Organizational Changes

The RFC specified "2 developers for 12 weeks." The implementation team evolved differently:

  • 2 senior engineers (supposed to be full-time, averaged 60% due to production support)
  • 1 junior engineer (added month 2, spent month 3 learning the codebase)
  • 2 contractors (added month 4 for "quick wins," spent month 5 fixing their code)
  • 1 DevOps engineer (supposedly "consulting," became full-time by month 3)
  • 1 database expert (brought in month 5 for performance crisis)
  • Product manager (changed twice during the project)
  • 3 different engineering managers (reorg happened in month 6)

Team changes introduced context transfer challenges and architectural reviews. Contractor contributions required additional integration work, and organizational restructuring prompted design reassessment that affected project momentum.

Monitoring Requirements Discovery

The RFC monitoring section covered standard metrics: delivery rate, response time, and error rate. Production operation revealed additional observability requirements:

```typescript
// RFC monitoring plan
const plannedMetrics = [
  'delivery_rate',
  'response_time',
  'error_rate',
  'throughput'
];

// What we actually monitor
const productionMetrics = [
  // Basic metrics (from RFC)
  'delivery_rate_by_channel_by_priority_by_user_segment',
  'response_time_p50_p95_p99_p999',
  'error_rate_by_type_by_service_by_retry_count',

  // The metrics that actually matter
  'template_render_time_by_template_by_variables_count',
  'database_connection_pool_wait_time',
  'redis_operation_time_by_operation_type',
  'webhook_retry_backoff_effectiveness',
  'notification_staleness_at_delivery',
  'user_preference_cache_hit_rate',
  'deduplication_effectiveness_by_time_window',
  'rate_limit_rejection_by_reason',
  'circuit_breaker_state_transitions',
  'message_size_distribution_by_channel',
  'websocket_reconnection_storms',
  'push_token_invalidation_rate',
  'email_bounce_classification',
  'notification_feedback_loop_latency',
  'cost_per_notification_by_channel',
  'regulatory_compliance_audit_completeness',

  // The weird ones we needed after specific incidents
  'mobile_app_version_vs_notification_compatibility',
  'timezone_calculation_accuracy',
  'emoji_rendering_failures_by_client',
  'notification_delivery_during_database_failover',
  'memory_leak_in_template_cache',
  'thundering_herd_detection'
];
```

Each additional metric addresses specific operational challenges that emerged during production use, highlighting the difference between design-time and runtime observability needs.

Technical Debt Accumulation Patterns

Technical debt considerations weren't explicit in the RFC. By month 8, several patterns had emerged:

Template System Complexity

Multiple template engines emerged to support different team requirements, creating a hybrid system that required ongoing maintenance.

```typescript
// Multi-engine template management complexity
class NotificationTemplateManager {
  private mustacheTemplates: Map<string, MustacheTemplate>;     // Original system
  private handlebarsTemplates: Map<string, HandlebarsTemplate>; // Added for marketing
  private reactEmailTemplates: Map<string, ReactEmailTemplate>; // Added for pretty emails

  async render(templateId: string, data: any): Promise<string> {
    // 150 lines of logic to figure out which template engine to use,
    // handle edge cases, maintain backwards compatibility,
    // and work around bugs we can't fix without breaking production

    // This comment has been here since month 4:
    // TODO: Unify template systems (estimated: 2 weeks)
    // Actual estimate after investigation: 3 months + migration plan
  }
}
```

Schema Migration Challenges

The evolution from initial to optimized schema required careful migration planning. Running parallel schemas during transition introduced synchronization complexity.

```sql
-- The migration nightmare
BEGIN;
  -- Step 1 of 47 in the migration plan
  INSERT INTO notification_events_v2
  SELECT
    id,
    user_id,
    -- 50 lines of complex transformation logic
    CASE
      WHEN notification_type IN ('old_type_1', 'old_type_2') THEN 'new_type_1'
      WHEN notification_type LIKE 'legacy_%' THEN REPLACE(notification_type, 'legacy_', 'classic_')
      -- 20 more WHEN clauses
    END AS notification_type_v2
    -- More transformations...
  FROM notification_events
  WHERE created_at > NOW() - INTERVAL '1 hour'
    AND status != 'migrated'
    AND NOT EXISTS (
      SELECT 1 FROM notification_events_v2
      WHERE notification_events_v2.id = notification_events.id
    );

  -- Update migration status
  UPDATE migration_status
  SET last_run = NOW(),
      records_migrated = records_migrated + row_count,
      estimated_completion = NOW() + (remaining_records / current_rate * INTERVAL '1 second')
  WHERE migration_name = 'notification_schema_v2';

  -- Check for conflicts
  -- Handle rollback scenarios
  -- Update monitoring metrics
  -- 100 more lines...
COMMIT;
```

Result: Lessons from Implementation Experience

The RFC specified technical success criteria: 99.9% uptime, sub-100ms delivery, and 10,000 notifications per second. Chasing those targets revealed that user and business metrics mattered just as much, and in some cases more.

What actually mattered:

  • User happiness: We had 99% delivery rate but users hated the notifications because they were poorly timed
  • Developer productivity: Other teams couldn't integrate with our "clean" API without extensive hand-holding
  • Operational burden: The system required constant babysitting despite all our automation
  • Business value: Marketing couldn't use half the features because they were too complex

```typescript
// What we optimized for (from RFC)
const technicalMetrics = {
  uptime: 99.9,
  deliveryTime: 95,  // ms
  throughput: 10000, // per second
  errorRate: 0.1     // percent
};

// What actually mattered
const businessMetrics = {
  userNotificationDisableRate: 45, // percent - way too high
  developerIntegrationTime: 3,     // weeks - should be hours
  supportTicketsPerWeek: 150,      // related to notifications
  marketingCampaignSetupTime: 2,   // days - should be minutes
  monthlyOperationalCost: 25000,   // dollars - 5x the estimate
  engineersPagedPerWeek: 12        // times - unsustainable
};
```

Key Implementation Insights

Several patterns emerge consistently across notification system implementations:

1. RFCs as Starting Hypotheses

Treating RFCs as initial hypotheses rather than fixed specifications enables better adaptation. Documents should evolve with implementation learning rather than remaining static reference points.

2. Planning for Emergent Requirements

Significant buffer allocation for unexpected requirements reflects implementation reality. Doubling estimates and adding contingency helps accommodate discovery during development.

3. Evolution-Ready Design

Systems inevitably require migration, versioning, and compatibility features. Building these capabilities early reduces future technical debt and operational complexity.
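
One way to bake this in from day one is a versioned payload with explicit upgrade steps, so messages written by old code can still be read after a schema change. A minimal sketch; the field names, versions, and default values are illustrative, not from the RFC:

```typescript
// Sketch: versioned notification payloads with a single upgrade path,
// so queued messages survive a schema change. All shapes are assumed.
interface NotificationV1 {
  version: 1;
  userId: string;
  body: string;
}

interface NotificationV2 {
  version: 2;
  userId: string;
  body: string;
  channel: string; // field added in v2
}

type AnyNotification = NotificationV1 | NotificationV2;

// Normalize any stored payload to the current version before use.
function toCurrent(n: AnyNotification): NotificationV2 {
  switch (n.version) {
    case 1:
      // Upgrade step 1 -> 2: assume legacy messages were in-app only
      return { ...n, version: 2, channel: 'in_app' };
    case 2:
      return n;
  }
}
```

The payoff is that consumers only ever handle the latest shape, and adding a v3 means writing one more upgrade step rather than auditing every reader.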

4. Edge Cases as Core Requirements

Scenarios discussed during design reviews typically manifest in production. Planning for these cases during initial implementation proves more efficient than reactive fixes.

5. Organizational Context Integration

Technical design success depends on organizational alignment. Team changes, restructuring, and varying stakeholder priorities affect implementation more than architectural elegance.

6. Operational Observability Focus

Effective monitoring addresses incident response needs rather than design documentation requirements. Business impact, user experience, and operational detail provide more valuable debugging information.
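
For instance, "notification staleness at delivery" from the earlier list is a user-facing measurement rather than a server one. A sketch of how it might be computed and bucketed for alerting; the bucket thresholds are assumptions, not values from any real deployment:

```typescript
// Sketch: how stale was a notification when it actually reached the user?
// Bucket thresholds below are illustrative assumptions.
function stalenessMs(createdAt: Date, deliveredAt: Date): number {
  return deliveredAt.getTime() - createdAt.getTime();
}

function stalenessBucket(ms: number): 'fresh' | 'delayed' | 'stale' {
  if (ms < 5_000) return 'fresh';    // under 5s: feels live to the user
  if (ms < 60_000) return 'delayed'; // under a minute: fine for email
  return 'stale';                    // older: may not be worth sending at all
}
```

Emitting the bucket as a tag on a counter (rather than raw latency) is what makes questions like "what fraction of delivered notifications were already stale?" answerable during an incident.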

Bridging Design and Implementation

Several strategies help minimize the RFC-to-production gap:

Progressive Feature Development

Starting with well-executed core functionality enables better iteration than comprehensive initial implementation. Perfect email notifications provide a stronger foundation than basic multi-channel support.

Adaptability Over Optimization

Systems designed for graceful evolution handle changing requirements better than those optimized for predicted scenarios. Flexibility often proves more valuable than initial perfection.

Developer Experience Investment

Easy integration and operation drive adoption more effectively than raw performance. API usability often determines system success more than technical specifications.

Documentation Evolution

Maintaining documentation as living artifacts rather than historical records improves team understanding. Sections for original design, current implementation, and learned insights provide comprehensive context.

Comprehensive Feedback Integration

Feedback loops across user experience, operational metrics, and developer workflow enable rapid iteration. Quick learning cycles accelerate problem identification and resolution.

Conclusion: Embracing Implementation Reality

Learning to work with implementation evolution rather than against it improves outcomes. Pristine RFCs naturally become complex as they address user needs. Beautiful architectures develop practical extensions. Clean codebases accumulate necessary technical debt. This represents successful problem-solving rather than design failure.

The RFC-to-production gap requires management rather than elimination. Effective engineering adapts to emerging reality while maintaining system coherence and user value.

Across notification system implementations, final systems rarely match initial designs. They're typically more complex and take longer to build, but they're also more capable and solve problems that weren't apparent during initial planning.

When writing RFCs, remember: you're starting a conversation with implementation reality rather than defining fixed specifications. This perspective enables better planning and more realistic expectations.
