Kafka or Event Bus? Signals That Push You Off SNS/SQS/EventBridge
Named signals that justify a Kafka migration from a managed event bus, and a four-phase outbox-anchored playbook to move without rip-and-replace.
Most teams running on SNS+SQS, EventBridge, Pub/Sub, or Service Bus eventually hit a wall and ask whether they should move to Kafka. The hard part is that "it depends" answers never enumerate which conditions actually matter, so the decision drifts toward whichever vendor doc was read last. Kafka solves durability, replay, ordering, and sustained throughput; stay on the managed event bus until two or more concrete signals fire, then migrate incrementally with a transactional outbox rather than a rip-and-replace.
This post is written for backend engineers and cloud architects running an event bus in production today. It names seven signals, the counter-signals that should keep you on the bus, and a four-phase migration playbook anchored on the outbox pattern.
Signals That Push You to Kafka
Each signal follows the same shape: what you observe, why the managed bus can't solve it, and what Kafka gives you instead. One signal is usually a patch. Two or more is a migration.
Signal 1: You need replay or time-travel
The trigger looks like one of these: debugging a bug that consumed events without recording them, onboarding a new consumer mid-stream and needing the last seven days of history, or a regulatory request to reprocess a window of events with the current code path.
SQS retains unacknowledged messages for up to 14 days when configured to the maximum (default 4 days), but the moment a consumer acknowledges and deletes a message it is gone; there is no replayable immutable log to seek back through. SNS does not store messages at all. EventBridge has Archive and Replay, which goes some of the way: archives support long-term, configurable retention (set RetentionDays=0 for indefinite retention), and replays let you push them back through your rules. The catch is documented by AWS: replays are multi-threaded and events may be delivered in a different order than they were originally archived. Replay-with-ordering is not the same product as Kafka's offset-based seek.
Kafka models events as an append-only log per topic with tunable retention. Consumers seek by offset or by timestamp; ordering is preserved within a partition. If you need to reprocess seven days of events for a new consumer and you need them in the original order, the bus answer is "rebuild it on top with archives, S3, and orchestration code." The Kafka answer is seek(offset).
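The semantics are easy to see in a toy model. The sketch below is not Kafka client code (a real consumer would use, for example, `seek` and `offsets_for_times` in confluent-kafka-python); it models one partition as an offset-addressed, append-only list to show what offset- and timestamp-based replay mean:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Record:
    offset: int
    timestamp: int  # epoch milliseconds, as Kafka stamps records
    value: bytes

class PartitionLog:
    """Toy model of one partition: an append-only, offset-addressed log."""

    def __init__(self):
        self.records = []

    def append(self, timestamp, value):
        self.records.append(Record(len(self.records), timestamp, value))

    def read_from(self, offset):
        # seek(offset): re-read everything from that position, in original order.
        return self.records[offset:]

    def offset_for_time(self, timestamp):
        # offsets_for_times(): first offset whose timestamp is >= the target.
        i = bisect.bisect_left([r.timestamp for r in self.records], timestamp)
        return self.records[i].offset if i < len(self.records) else None

log = PartitionLog()
for ts, v in [(1_000, b"a"), (2_000, b"b"), (3_000, b"c")]:
    log.append(ts, v)

start = log.offset_for_time(2_000)  # offset of the first record at or after t=2000
replayed = [r.value for r in log.read_from(start)]
print(replayed)  # [b'b', b'c']
```

Nothing is deleted on read: a second consumer can call `read_from(0)` and get the full history in order, which is exactly what the bus cannot give you after acknowledgement.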
Signal 2: Strict per-key ordering at sustained throughput
The trigger is contention between two requirements: events for the same entity (account, order, device) must be processed in source order, and your throughput is climbing.
SQS FIFO defaults to 300 API calls per second per action, which means 300 messages per second without batching and 3,000 with maximum batching (10 messages per call). High-throughput mode for FIFO raises this to a regional ceiling: 70,000 TPS in us-east-1, us-west-2, and eu-west-1; 19,000 in us-east-2 and eu-central-1; 9,000 across the four Asia Pacific regions; 4,500 in London and São Paulo; and 2,400 in every other region. The mode is enabled via a console toggle that also sets the deduplication scope to "Message group" and the throughput limit to "Per message group ID." For many workloads this is enough. For others it is not, and there is no further dial.
EventBridge is more direct: the bus itself does not provide message ordering guarantees and delivers events to targets in an arbitrary order. EventBridge Pipes is a different integration primitive: for ordered sources like Kinesis or DynamoDB Streams it can preserve source ordering, but the bus-rule fan-out path cannot.
Kafka preserves order per partition, provided the producer is configured for it: enable.idempotence=true (default since Kafka 3.0) and max.in.flight.requests.per.connection ≤ 5 (also the default). Without idempotence, in-flight retries can reorder messages within a partition. Partitions scale horizontally within a topic, and per-key ordering holds as long as the key always hashes to the same partition. Per-key ordering at five-figure-per-second sustained throughput is what the partitioning model is for.
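As a concrete reference, here is a sketch of the settings in question, using librdkafka option names as consumed by confluent-kafka-python; the broker address is a placeholder for your environment, and you should confirm defaults against your client version:

```python
# Ordering-safe producer settings (librdkafka option names, as passed to
# confluent_kafka.Producer). The broker address is a placeholder.
ordering_safe_config = {
    "bootstrap.servers": "broker-1:9092",        # assumption: your cluster address
    "enable.idempotence": True,                  # broker rejects reordered retries
    "acks": "all",                               # required by idempotence
    "max.in.flight.requests.per.connection": 5,  # largest value that preserves order
}
# Usage: producer = confluent_kafka.Producer(ordering_safe_config)
# Same key -> same partition -> per-key ordering holds across retries:
#   producer.produce("orders", key=b"account-42", value=payload)
```

The key choice carries the ordering guarantee: per-key order holds only because the same key always hashes to the same partition, so never repartition a topic mid-stream without accounting for it.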
This is mechanically different from SQS FIFO. SQS FIFO serializes in-flight delivery within a MessageGroupId: while message C is held under its visibility timeout, messages D and E in the same group are not delivered. Kafka has no queue-style in-flight serialization. The broker stores messages in produce order; consumers read each partition's log sequentially, and a slow consumer on message C only blocks its own offset progress, not other consumers' view of the partition.
The two corner cases where Kafka does enforce a gap are not delivery serialization. With idempotent producers (enable.idempotence=true), the broker rejects out-of-order producer retries with OutOfOrderSequenceException. With isolation.level=read_committed, the consumer's Last Stable Offset hides messages from later-committed transactions until earlier transactions commit or abort. Both are producer-side or transactional guarantees, not the per-key head-of-line blocking SQS FIFO gives you.
Signal 3: Many independent consumers reading at different paces
When the same event stream feeds several teams or downstream systems, and one of them is slow or restarts often, you need each consumer to track its own progress without backpressuring the others.
The SQS pattern works: SNS fans out to multiple SQS queues, each consumer owns a queue, and a slow consumer just builds up its own backlog. The cost shape changes the calculus at scale: per-message pricing scales linearly with consumer count, because each additional queue the same event fans out to costs another SQS message.
Kafka consumer groups read independently from the same log. Adding a fourth consumer does not duplicate storage; it adds another offset pointer to the same partition data. Each consumer still incurs fetch, broker egress, and network cost, but you avoid the per-message delivery charge that compounds linearly with consumer count on per-message-priced buses. When you have five or more independent consumers on the same high-volume stream, the price gap widens.
Signal 4: Sustained throughput crossover on per-message pricing
Per-message buses charge per put and per delivery. Kafka charges per broker-hour (MSK Provisioned), per partition-hour and per GB (MSK Serverless), or per eCKU and per GB (Confluent Cloud). Below some sustained rate, the bus wins on price. Above it, Kafka wins.
The exact crossover depends on retention, replication factor, partition count, payload size, consumer count, and how aggressively you fan out. Recompute it from current AWS and Confluent pricing pages whenever you make this decision; benchmarks that are a year old are not evidence. Teams running well below 1,000 messages per second sustained almost never reach the crossover. Teams running tens of thousands of messages per second sustained usually have, though the threshold shifts with all of the variables above.
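The crossover arithmetic is simple enough to keep in a script. Every number below is invented for illustration only; the request price, broker count, and broker-hour rate must be replaced with figures from the current pricing pages, and the sketch ignores storage, egress, and request-size tiers entirely:

```python
SECONDS_PER_MONTH = 30 * 86_400  # 2,592,000

def bus_monthly(rate_per_sec, consumers, usd_per_million_requests):
    """Per-message bus: every event costs one put plus one delivery per consumer."""
    requests = rate_per_sec * SECONDS_PER_MONTH * (1 + consumers)
    return requests / 1e6 * usd_per_million_requests

def kafka_monthly(brokers, usd_per_broker_hour):
    """Broker-hour Kafka: flat in message rate. Add storage and egress back in."""
    return brokers * 730 * usd_per_broker_hour

def crossover_rate(consumers, usd_per_million_requests, kafka_fixed_monthly):
    """Sustained msgs/sec at which the two monthly bills are equal."""
    usd_per_event = (1 + consumers) * usd_per_million_requests / 1e6
    return kafka_fixed_monthly / (usd_per_event * SECONDS_PER_MONTH)

# Placeholder prices, for shape only. Recompute with real ones at decision time.
fixed = kafka_monthly(brokers=3, usd_per_broker_hour=0.50)
rate = crossover_rate(consumers=5, usd_per_million_requests=0.40,
                      kafka_fixed_monthly=fixed)
print(f"crossover at these placeholder prices: {rate:.0f} msg/s sustained")
```

The useful property of writing it down is that the consumer count sits in the denominator: fan-out moves the crossover toward the bus being expensive, which is Signal 3 and Signal 4 interacting.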
Signal 5: You need stream processing, not just stream delivery
The trigger is a requirement for windowed joins, sessionization, or materialized views computed from the event stream. Examples: a five-minute rolling average per device, a join between order events and shipment events over a one-hour window, or a continuously updated count of active users.
Event bus consumers cannot do windowed joins natively. You can build them on top with state stores in DynamoDB or Redis, but at that point you are reimplementing a stream processor. Kafka Streams, ksqlDB, and Apache Flink (via Amazon Managed Service for Apache Flink, formerly Kinesis Data Analytics) are designed for this. EventBridge Pipes covers basic transformation and filtering but does not provide windowed stateful operators.
If you are writing custom state-machine code to compute things that look like rolling windows, that is the signal. One caveat: Kafka's exactly-once semantics are limited in scope. They cover producer-to-broker writes and within-topology stream operators, but downstream side effects (database writes, third-party API calls, email sends) still require application-level idempotency.
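To make the shape of the requirement concrete, here is a batch sketch of a five-minute windowed average per device in plain Python. A real stream processor (Kafka Streams, Flink) computes the same thing incrementally against a state store and emits results as windows close; this only shows the windowing arithmetic you would otherwise be hand-rolling:

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # five-minute tumbling windows

def window_start(ts_ms):
    return ts_ms - ts_ms % WINDOW_MS

def windowed_averages(events):
    """Average reading per (device, five-minute window).
    events: iterable of (device_id, timestamp_ms, reading) tuples."""
    acc = defaultdict(lambda: [0.0, 0])  # (device, window) -> [sum, count]
    for device, ts, reading in events:
        bucket = acc[(device, window_start(ts))]
        bucket[0] += reading
        bucket[1] += 1
    return {key: total / n for key, (total, n) in acc.items()}

events = [
    ("dev-1", 0, 10.0),       # window starting at t=0
    ("dev-1", 60_000, 20.0),  # same window
    ("dev-1", 300_000, 9.0),  # next window
]
print(windowed_averages(events))
# {('dev-1', 0): 15.0, ('dev-1', 300000): 9.0}
```

The hard parts a stream processor handles and this sketch does not are late-arriving events, window retention, and fault-tolerant state, which is why "reimplementing a stream processor" on DynamoDB or Redis is more work than it first appears.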
Signal 6: Schema evolution under a registry, with enforcement
Producer-consumer contract drift is one of the most common production failure modes in event-driven systems. A field gets renamed, an enum value gets added, a required field becomes optional. Consumers that fail to upgrade break in subtle ways.
Confluent Schema Registry and AWS Glue Schema Registry both offer compatibility modes (BACKWARD, FORWARD, FULL) enforced at produce time. EventBridge has its own Schema Registry, but it is oriented toward discovery and code-binding generation rather than contract enforcement at produce time.
The signal is not "we want schemas." It is "schema drift is causing production incidents and we need produce-time rejection of incompatible changes."
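What produce-time enforcement actually checks can be sketched in miniature. The function below models only one rule of BACKWARD compatibility (a new required field with no default breaks readers of old data), with schemas reduced to a field-name-to-has-default mapping; real registries apply the full Avro, Protobuf, or JSON Schema resolution rules:

```python
def backward_compatible(new_schema, old_schema):
    """BACKWARD: a reader on new_schema must decode data written with old_schema.
    Schemas modeled as {field_name: has_default}; this is a toy subset of what
    Confluent or Glue Schema Registry enforces."""
    for field, has_default in new_schema.items():
        if field not in old_schema and not has_default:
            # Old data carries no value for this field and no default fills it.
            return False
    return True

old = {"order_id": False, "amount": False}
ok  = {"order_id": False, "amount": False, "currency": True}   # added with default
bad = {"order_id": False, "amount": False, "currency": False}  # added, required

print(backward_compatible(ok, old))   # True
print(backward_compatible(bad, old))  # False
```

The registry runs a check like this at produce (or registration) time and rejects the incompatible schema before any consumer sees a message it cannot decode; that rejection is the enforcement the signal is about.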
Signal 7: Your event source is a database
When the system of record is a relational database (Postgres, MySQL) or DynamoDB and you want the database's state changes to become events, change data capture is the canonical pipeline. Debezium reads the Postgres or MySQL write-ahead log and emits one event per row change into Kafka. The result is a faithful, ordered stream of row-level changes, and it avoids the classic application-level dual-write problem on the producer side.
EventBridge integrates well with DynamoDB Streams via Pipes, but native Postgres CDC into EventBridge is not a first-class path. If your event source is a relational database and your volume is non-trivial, Debezium into Kafka is the well-paved road.
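For reference, a Debezium Postgres source connector is a small piece of Kafka Connect configuration. Property names below follow the Debezium Postgres connector documentation; every value is a placeholder for your environment:

```python
# Debezium Postgres source connector, registered via Kafka Connect's REST API.
# All values are placeholders; property names are Debezium's.
debezium_postgres_connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",           # logical decoding, built into PG 10+
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "orders",
        "topic.prefix": "orders",            # Debezium 2.x; 1.x used database.server.name
        "table.include.list": "public.orders,public.shipments",
    },
}
# POST this JSON to the Kafka Connect REST endpoint (default port 8083)
# to start streaming WAL changes into Kafka topics.
```

One connector definition like this replaces the bespoke poll-the-table or dual-write code that a bus-based pipeline would need to approximate the same stream.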
Counter-Signals: When Kafka Is the Wrong Answer
Five conditions that should keep you on the managed bus, even if Kafka is in the air:
- Low volume. Below roughly 1,000 messages per second sustained, per-message pricing is cheap and the broker tax of even managed Kafka is not worth paying.
- Ephemeral events. If there is no replay need, no aggregation, and no per-key ordering requirement, the bus is the simpler substrate.
- Lambda-only stack with no stateful operators. Teams running serverless-only do not want to acquire Kafka operational skills as a side effect of a substrate change. Stay on the bus until the case is overwhelming.
- No SRE bandwidth for broker concerns. Even managed Kafka adds partition planning, consumer-group debugging, offset management, and rebalance investigation. If no one on the team owns that, the bus is safer.
- "Streaming UI" is the real requirement. Real-time updates to a user interface are a WebSocket or Server-Sent Events problem. Kafka does not deliver to browsers; it delivers to backends.
How You Migrate Without Rip-and-Replace
Once two or more signals fire and the decision is made, the migration is its own engineering project. Rip-and-replace cutovers fail in well-documented ways: dual systems get out of sync, rollback is impossible, and the team loses confidence in the migration mid-flight. The pattern below keeps the bus running throughout.
Phase 0: Run the bus and Kafka in parallel. Stand up the Kafka cluster, the topics, and the schema registry before any producer writes to it. Verify connectivity, ACLs, and observability. The bus stays as the source of truth.
Phase 1: Outbox plus dual-write from producers. This is the load-bearing phase. Dual-writing directly from application code to both the bus and Kafka is the canonical distributed-systems anti-pattern: when one of the two writes fails, the systems diverge and you have no way to reconcile. The transactional outbox pattern, documented by Chris Richardson on microservices.io, solves this. The application writes a domain row and an outbox row in the same database transaction; a relay reads the outbox and publishes to the bus and to Kafka, retrying on failure.
A minimal outbox table carries an event id, the aggregate type and id, the event type, the serialized payload, a created_at timestamp, and a nullable published_at.
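A sketch of that table, together with the same-transaction insert, using SQLite in place of Postgres so the example is self-contained (column names follow the common outbox convention; adapt types for your database):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # stands in for your Postgres connection
conn.executescript("""
CREATE TABLE orders (
    id          TEXT PRIMARY KEY,
    total_cents INTEGER NOT NULL
);
CREATE TABLE outbox (
    id             TEXT PRIMARY KEY,  -- event id, e.g. a UUID
    aggregate_type TEXT NOT NULL,     -- routing hint: which topic or bus source
    aggregate_id   TEXT NOT NULL,     -- becomes the Kafka partition key
    event_type     TEXT NOT NULL,
    payload        TEXT NOT NULL,     -- serialized event body
    created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    published_at   TEXT               -- NULL until the relay publishes the row
);
""")

order_id = str(uuid.uuid4())
with conn:  # one transaction: both rows commit, or neither does
    conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)",
                 (order_id, 4_200))
    conn.execute(
        "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)"
        " VALUES (?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), "order", order_id, "order_placed",
         json.dumps({"order_id": order_id, "total_cents": 4_200})),
    )
```

The atomicity lives entirely in the database transaction; no coordination with the bus or Kafka happens on this code path at all.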
The application inserts into outbox inside the same transaction as the business write. A relay (an app-level poller or Debezium reading the Postgres WAL via the Outbox Event Router transform) drains rows, publishes to both substrates, and stamps published_at. Either both writes happen or neither does, because they are both consequences of the same committed transaction.
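A minimal relay pass over such an outbox table can be sketched with SQLite and stub publishers standing in for the real bus and Kafka clients. Note the at-least-once window: a crash between publish and stamp re-delivers the row on the next pass, so consumers must be idempotent:

```python
import sqlite3
import uuid

def drain_outbox(conn, publish_bus, publish_kafka, batch=100):
    """One relay pass: publish unpublished rows to both substrates, then stamp
    published_at. publish_* are injected callables; in real code they would
    wrap put_events/sns.publish and producer.produce respectively."""
    rows = conn.execute(
        "SELECT id, event_type, aggregate_id, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at LIMIT ?", (batch,)
    ).fetchall()
    for row_id, event_type, key, payload in rows:
        publish_bus(event_type, payload)
        publish_kafka(event_type, key, payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (row_id,))
    return len(rows)

# Demo with an in-memory table and list-backed stub publishers.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, aggregate_id TEXT, event_type TEXT, payload TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP, published_at TEXT)""")
with conn:
    conn.execute("INSERT INTO outbox (id, aggregate_id, event_type, payload) "
                 "VALUES (?, ?, ?, ?)",
                 (str(uuid.uuid4()), "order-1", "order_placed",
                  '{"total_cents": 4200}'))

bus_log, kafka_log = [], []
drained = drain_outbox(conn,
                       lambda et, p: bus_log.append((et, p)),
                       lambda et, k, p: kafka_log.append((et, k, p)))
print(drained, len(bus_log), len(kafka_log))  # 1 1 1
```

A Debezium-based relay replaces the polling SELECT with WAL tailing and the Outbox Event Router transform, but the contract is the same: rows leave the outbox for both substrates, exactly as committed.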
Phase 2: Migrate consumers one event type at a time, behind a flag. Pick one event type, the lowest-risk one. Stand up the new consumer reading from Kafka. Verify equivalence against the existing SQS or EventBridge consumer: message counts match, sampled payloads diff clean, downstream side effects are idempotent. Flip the flag, monitor, then move on to the next event type. Do not move two event types simultaneously.
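The equivalence gate is worth automating. The sketch below assumes your harness can dump each substrate's consumed events into a mapping from a stable event id to payload bytes (captured via a shadow table, an S3 dump, or similar, which is up to you):

```python
import hashlib
import random

def check_equivalence(bus_events, kafka_events, sample=100, seed=0):
    """Phase 2 gate for one event type: counts must match and a deterministic
    payload sample must diff clean. Inputs map event id -> payload bytes."""
    report = {
        "count_match": len(bus_events) == len(kafka_events),
        "mismatched_ids": [],
        "only_in_one": sorted(set(bus_events) ^ set(kafka_events)),
    }
    shared = sorted(set(bus_events) & set(kafka_events))
    random.Random(seed).shuffle(shared)  # seeded, so reruns sample the same ids
    for event_id in shared[:sample]:
        if (hashlib.sha256(bus_events[event_id]).hexdigest()
                != hashlib.sha256(kafka_events[event_id]).hexdigest()):
            report["mismatched_ids"].append(event_id)
    return report

bus = {"e1": b'{"a":1}', "e2": b'{"a":2}'}
kafka = {"e1": b'{"a":1}', "e2": b'{"a":2}'}
print(check_equivalence(bus, kafka))
# {'count_match': True, 'mismatched_ids': [], 'only_in_one': []}
```

A clean report is the precondition for flipping the flag; any non-empty `mismatched_ids` or `only_in_one` means the new consumer is not yet a drop-in replacement for that event type.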
Phase 3: Cut over producers per event type. For each event type whose consumers have been fully migrated, stop publishing it to the bus. The outbox now writes only to Kafka for that type. Bus traffic for the migrated types drops to zero.
Phase 4: Decommission. Remove the bus subscription, the bus topic or rule, and finally the relay's bus-publishing branch for the migrated event type. The migration is then done for that type.
On managed versus self-hosted: default to managed Kafka on the first migration. The operational learning curve and the migration risk compound badly. MSK Serverless removes broker sizing entirely; MSK Provisioned gives more control; Confluent Cloud, Aiven, and Redpanda Cloud trade lock-in surface for managed convenience. Self-hosting on Kubernetes via Strimzi is a separate decision, justified only once you understand exactly what you would change.
Closing
If two or more of the seven signals describe your system, plan the migration: outbox plus dual-write, per-consumer cutover, managed Kafka first. If only one signal fires, patch the bus first; EventBridge Archive, SQS FIFO high-throughput mode, and DynamoDB-based per-key ordering shims exist for a reason. If none fire, the bus is the right answer; managed Kafka still carries operational overhead that does not pay back without the signals to justify it.
The honest first step before any architecture discussion is to list which of the seven signals your system actually fires today, with evidence.
References
- Apache Kafka Documentation: Design - Primary source on log retention, partitioning, and consumer-group ordering guarantees.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction - Canonical essay on why Kafka models a log, not a queue.
- AWS: High throughput for FIFO queues in Amazon SQS - Documents the regional high-throughput FIFO ceilings (70,000 TPS in three regions, lower elsewhere) and deduplication scope semantics.
- AWS: Amazon SQS message quotas - Source for the 300 and 3,000 messages per second FIFO defaults.
- AWS: Amazon EventBridge Archive and Replay - Documents configurable archive retention, 90-day replay deletion, and the unordered-replay caveat.
- AWS: Amazon EventBridge quotas - PutEvents rate and account-level limits.
- AWS: Amazon SQS, Amazon SNS, or EventBridge? Decision Guide - AWS-side framing of the managed-bus boundary.
- Chris Richardson, Pattern: Transactional outbox - Canonical reference for the migration's Phase 1 pattern.
- Debezium: Outbox Event Router - CDC-based outbox implementation for Postgres or MySQL.
- AWS: Amazon MSK Serverless features - Per-cluster and per-partition throughput numbers for the managed Kafka option.
- AWS: Amazon MSK pricing - Pricing model for MSK Provisioned and Serverless; recompute the crossover at decision time.
- Confluent Cloud pricing - eCKU pricing model for the managed Confluent option.
- AWS: AWS Glue Schema Registry - Compatibility modes for schema evolution on the AWS-native stack.
- AWS Lambda: Apache Kafka event poller scaling modes - Provisioned Mode, per-partition poller limits, and partition-bound ordering.