This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Kafka core concepts — append-only log, partitions, offsets, consumer groups, pull vs push, replay, and log compaction. This IS the distributed log interview. |
| Arch 50 | Add replication, leader election, producer batching, and retention vs compaction policy selection. |
| Arch 75 | Staff: discuss controller failover, unclean leader election trade-offs, and when compaction beats retention. |
Interview Prompt
Design a distributed message broker (like Apache Kafka) that stores messages in an append-only log, supports multiple consumer groups reading independently, allows replay from any offset, and provides both time-based retention and log compaction for keyed topics.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Pull or push consumption model? | Pull (Kafka) gives consumer-controlled pace; push (RabbitMQ) gives broker-controlled delivery. |
| Retention policy — time-based, size-based, or compaction? | Event log (7-day retention) vs changelog (compacted, keep latest per key) — different storage engines. |
| Ordering guarantees needed? | Per-partition ordering is free; global ordering requires single partition (bottleneck). |
| Expected throughput and retention period? | 1M msg/sec × 7 days drives disk sizing and partition count. |
Scope
In scope
- Append-only commit log architecture
- Partitions and offsets
- Consumer groups with independent offset tracking
- Pull-based consumption
- Message replay from arbitrary offset
- Log compaction for keyed topics
Out of scope (state explicitly)
- Kafka Streams / ksqlDB processing
- Full ZooKeeper/KRaft controller implementation details
- Exactly-once distributed transactions across systems
Assumptions
- 1M messages/sec peak, 7-day retention default
- Per-partition ordering required
- Average message size: 2 KB
- Mixed event streams (retention) and config changelogs (compaction)
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Publish: Producers publish messages to named topics.
- Subscribe: Consumers subscribe to topics and receive messages in strict partition-level order.
- Persistence: Messages durably stored on disk for a configurable retention period (time or size-based).
- Consumer Groups: Multiple consumers in a group share processing load; each message goes to exactly one group member.
- Ordering: Messages within a partition are strictly ordered (FIFO) via monotonically increasing offsets.
- Delivery Semantics: Support at-least-once, at-most-once, and exactly-once delivery guarantees.
- Replay: Consumers can re-read historic messages by seeking to any offset or timestamp.
- Partitioning: Topics split into partitions for parallelism, load distribution, and horizontal scaling.
- Log Compaction: Retain only the latest value per key, essential for changelogs and CDC use cases.
- Schema Evolution: Support schema validation and evolutionary compatibility checks via an external Schema Registry.
- High Throughput: 1M+ messages/sec per broker (multiple GBs per second across the cluster).
- Low Latency: End-to-end latency < 10 ms (producer to consumer write-to-read path).
- Durability: Zero message loss during leader or node broker failures (using
acks=alland ISR consensus). - Horizontal Scalability: Dynamically add brokers to scale partitions, storage, and throughput linearly.
- Fault Tolerance: Survive broker, rack, and even Availability Zone (AZ) failures with rack-aware replication.
- Backpressure: Bounded pull model naturally protects slow consumers from getting overwhelmed.
- Multi-Tenancy: Quotas (bandwidth, rate limits) per client-id to prevent noisy-neighbor congestion.
| Metric | Calculation | Value |
|---|---|---|
| Messages / sec (system-wide) | 10 Million | |
| Avg message size | 1 KB | |
| Throughput (write) | 10M × 1 KB | 10 GB/sec |
| Throughput (read) | 3 consumer groups × 10 GB/s | 30 GB/sec |
| Retention Period | 7 Days | |
| Storage per day (raw) | 10 GB/s × 86400 | ~864 TB |
| Storage for 7 days | 864 TB × 7 | ~6 PB |
| Replication factor 3 | 6 PB × 3 | ~18 PB total storage |
| Brokers (12 TB usable each) | 18 PB / 12 TB | ~1500 brokers |
| Topics count | 10,000 | |
| Partitions (total) | 500,000 | |
| Network per broker | (write + replicate + read) ÷ 1500 | ~40 MB/s |
Producer network: 10 GB/s ingress across cluster Replication network: 10 GB/s × 2 (RF=3 means 2 copies) = 20 GB/s intra-cluster Consumer network: 30 GB/s egress (multiple consumer groups) Total cluster I/O: ~60 GB/s → requires 25GbE or 100GbE NICs Per broker (1500 brokers): Write: ~7 MB/s per broker (easily handled) Replicate: ~14 MB/s Serve reads: ~20 MB/s Disk: sequential writes at ~200 MB/s per SSD → comfortable headroom
The Kafka HLD decouples producers, brokers, and consumers. Producers bypass API gateways and publish directly to partition leaders. Metadata is managed via KRaft (Kafka Raft consensus) instead of ZooKeeper.
1. Topics and Partitions ⭐
Topics are divided into partitions for parallelism. Partitions enable horizontal scale-out as different partitions are spread across different brokers.
Topic: user-events (3 partitions) Partition 0: [msg0, msg1, msg2, msg3, ...] → Offset 0, 1, 2, 3 Partition 1: [msg0, msg1, msg2, ...] → Offset 0, 1, 2 Partition 2: [msg0, msg1, ...] → Offset 0, 1 Key insight: ordering is ONLY within a partition, NOT across partitions. - All events for user_id=123 go to same partition → ordered. - Events across user_id=123 and user_id=456 have NO ordering guarantee.
2. Broker Storage Engine: Why Append-Only is Fast ⭐
Kafka's core innovation is treating log data as an append-only ring buffer of segment files.
- Sequential I/O: Sequential writes (600 MB/s on SSD) are 10-100x faster than random I/O.
- OS Page Cache: The JVM heap is kept small (~6 GB). The rest of RAM acts as the OS Page Cache. Read paths hit memory instead of disk.
- Zero-Copy (sendfile): Transfers data directly from kernel space to network cards, bypassing user-space copies (saves 2 copies and CPU cycles).
- Batched I/O: Groups multiple messages before flushing to amortize fsync overhead.
/data/user-events-0/ 00000000000000000000.log (segment 1: offsets 0 – 999,999) 00000000000000000000.index (sparse index: maps offset → byte position) 00000000000000000000.timeindex 00000000000001000000.log (segment 2: offsets 1,000,000 – 1,999,999) 00000000000001000000.index leader-epoch-checkpoint (tracks leader changes for truncation)
Message Lookup by Offset Walkthrough
Consumer requests offset 1,500,042: 1. Binary search segment files by name → find segment starting at 1,000,000 2. Open 00000000000001000000.index → binary search for ≤ 1,500,042 → Index entry: offset 1,500,000 → file position 483,200 3. Seek to position 483,200 in .log file 4. Scan forward 42 messages → found offset 1,500,042 Total: ~2 disk seeks + short scan → < 1ms from page cache
3. ZooKeeper vs KRaft (Metadata Management) ⭐
Metadata coordination has migrated from separate ZooKeeper ensembles to integrated Kafka Raft (KRaft) control nodes.
ZooKeeper (legacy, being removed): - Stores: broker registry, topic configs, partition-to-leader mapping, ACLs - Handles leader election for partitions - Bottleneck for large clusters (100K+ partitions) and separate operational burden KRaft (Kafka Raft — production-ready since Kafka 3.3): - Kafka's own Raft-based consensus layer for metadata - Metadata stored in internal topic: __cluster_metadata - Controller quorum: 3 or 5 brokers elected as controllers Active controller = Raft leader → handles all metadata changes Standby controllers = Raft followers → take over on failure Benefits over ZooKeeper: 1. No external dependency → simpler operations 2. Metadata changes are faster (single round-trip vs ZK multi-step) 3. Scales to millions of partitions (ZK struggled at ~200K) 4. Metadata is a Kafka log → can snapshot + replay (familiar model)
4. Replication Consensus (In-Sync Replicas) ⭐
Replication ensures high availability and zero data loss.
- Leader: Serves all producer writes and consumer reads.
- Followers: Continuously pull (fetch) from leader to replicate logs.
- ISR (In-Sync Replicas): Followers caught up within
replica.lag.time.max.ms(default 30s). - High Watermark (HW): The offset up to which all ISR replicas have matched. Consumers only see committed data (up to HW).
ISR Commit Offset Example
Partition 0 (Topic: order-events):
Leader: Broker 1 (Rack A) offset: 5,000,150 ← Serves all clients
Follower: Broker 4 (Rack B) offset: 5,000,148 ← In ISR (2 behind, within threshold)
Follower: Broker 7 (Rack C) offset: 5,000,150 ← In ISR (fully caught up)
ISR = {Broker 1, Broker 4, Broker 7}
High Watermark (HW) = min(ISR offsets) = 5,000,148
→ Consumers can only read up to HW (5,000,148)
→ Messages 5,000,149 and 5,000,150 are "uncommitted" (not yet replicated everywhere)5. Rack-Aware Replication ⭐
Rack-aware replica allocation protects against datacenter and rack-level network or power failures.
broker.rack=us-east-1a (Broker 1, 4) broker.rack=us-east-1b (Broker 2, 5) broker.rack=us-east-1c (Broker 3, 6) Partition 0 replicas placed on: Broker 1 (1a), Broker 2 (1b), Broker 3 (1c) → Survives any single AZ failure Without rack-awareness: replicas might all land on Broker 1, 4 (same rack) → Rack failure = ALL replicas lost → unrecoverable data loss
6. Producer Write Path ⭐
- acks=0: Fire-and-forget (fastest, high data-loss risk).
- acks=1: Await leader confirmation (vulnerable if leader dies before replicating).
- acks=all: Await all ISR confirmations (safest, zero data loss with
min.insync.replicas=2).
Producer Partitioner Strategies
1. Key-based (default): hash(key) % numPartitions ✅ Same key → same partition → ordering guaranteed ❌ Hot key → hot partition (e.g. viral user's writes saturate one partition) 2. Round-robin (no key): distribute events evenly across partitions ✅ Even load distribution ❌ No ordering guarantees for any entity 3. Sticky partitioner (Kafka 2.4+): batch to a random partition, stick until batch is full ✅ Maximizes batch size → better compression, fewer network requests ❌ No ordering guarantees 4. Custom partitioner: application-specific routing logic - Example: route by tenant_id to dedicated isolated partitions to enforce SLA bounds.
7. Consumer Read Path ⭐
Consumers poll brokers for data and maintain offsets in the internal __consumer_offsets topic.
8. Consumer Group Rebalancing ⭐
- Eager Rebalance (Legacy): Pauses all consumers in the group, revokes all partitions, and assigns them from scratch ("Stop the World" pause).
- Cooperative Incremental (Modern): Only revokes/assigns affected partitions. Unaffected consumers process uninterrupted.
- Static Membership (group.instance.id): Tells coordinator pod is in Kubernetes; brief pod restarts do not trigger rebalance.
Producer API Example (Java)
// Synchronous send (wait for ack)
ProducerRecord<String, String> record =
new ProducerRecord<>("order-events", orderId, orderJson);
RecordMetadata meta = producer.send(record).get(); // blocks until ack
log.info("Sent to partition {} at offset {}", meta.partition(), meta.offset());
// Asynchronous send (non-blocking, callback on completion)
producer.send(record, (metadata, exception) -> {
if (exception != null) {
log.error("Send failed: {}", exception.getMessage());
// Retry logic or dead-letter handling
}
});
// Transactional send (exactly-once across topics)
producer.beginTransaction();
producer.send(new ProducerRecord<>("orders", key, value));
producer.send(new ProducerRecord<>("inventory", key, inventoryUpdate));
producer.sendOffsetsToTransaction(offsets, consumerGroupId); // atomic with consume
producer.commitTransaction();Consumer API Example (Java)
consumer.subscribe(List.of("order-events")); // subscribe by topic name
// OR: consumer.assign(List.of(new TopicPartition("orders", 0))); // manual partition
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
processOrder(record.key(), record.value(), record.offset());
}
consumer.commitSync(); // commit after successful processing
}
// Seek to specific offset (replay from a point)
consumer.seek(new TopicPartition("orders", 0), 5000000L);
// Seek to timestamp
consumer.offsetsForTimes(Map.of(tp, targetTimestamp));Admin API Example
// Create topic with rack-aware placement
admin.createTopics(List.of(
new NewTopic("order-events", 12, (short) 3) // 12 partitions, RF=3
.configs(Map.of(
"retention.ms", "604800000", // 7 days
"min.insync.replicas", "2",
"compression.type", "lz4"
))
));
// Reassign partitions (for broker decommission / rebalancing)
admin.alterPartitionReassignments(reassignments);
// Describe cluster
admin.describeCluster(); // brokers, controller, cluster IDRecord Batch Format (On-Disk Segment Layout)
Producers serialize and pack records into a compact batch layout to maximize disk and network compression.
Sparse Segment Index Layout
Offset Position (file byte offset) 0 0 4096 32768 ← Entry every ~4 KB of log data 8192 65536 12288 98304 Lookup offset 10000: 1. Binary search index → nearest ≤ entry = 8192 at position 65536 2. Seek to 65536 in log, scan forward to offset 10000.
Consumer Offset Storage
Topic: __consumer_offsets (50 partitions, compacted)
Partition key: hash(group_id) % 50
Key: {group_id: "order-processor", topic: "orders", partition: 7}
Value: {offset: 5000150, metadata: "", commit_timestamp: 1710403200000}
Compacted: only latest offset per (group, topic, partition) is retained in database segments.| Failure Case | System Solution Design |
|---|---|
| Broker Leader Crash | KRaft elects a new leader from the partition's ISR list; client metadata is refreshed. |
| Message Loss | Enforce acks=all + min.insync.replicas=2 + replication.factor=3 for durability. |
| Consumer Failure | Group Coordinator triggers partition rebalancing to re-assign tasks to active consumers. |
| AZ or Rack Failure | Rack-aware partition replica allocation (broker.rack) distributes data copies across physical racks/AZs. |
| Split-Brain Controllers | KRaft consensus maintains single leader metadata log with strict epoch fencing. |
| Duplicate Retries | Idempotent producer validates Producer ID (PID) + Sequence Number to filter duplicates. |
| Disk Failure | Data replicated on 2 other brokers in different racks; replace disk and re-replicate from surviving replicas. |
| Network Partition | ISR set automatically shrinks: followers removed if they can't catch up within time threshold. Prevents stale writes/reads. |
| Unclean Leader Election | Set unclean.leader.election.enable=false to prevent out-of-sync replicas from becoming leader. |
| Slow Consumer | Event lag grows but does not impact producers or brokers since consumer pulls at its own pace. |
1. Exactly-Once Semantics (EOS) Framework ⭐
Kafka supports three primary delivery guarantees: At-Most-Once, At-Least-Once, and Exactly-Once (EOS).
- At-Most-Once:
acks=0, no producer retries, or consumer committing offset BEFORE processing. Crash results in lost messages. - At-Least-Once:
acks=all, infinite producer retries, and consumer committing offset AFTER processing. Crash results in potential duplicates. - Exactly-Once: Combines an Idempotent Producer (sequence numbers and PID checks) with Transactional consume-transform-produce loops. Downstream consumers must configure
isolation.level=read_committed.
2. Race Condition: Stale Leader split-brain (Fencing via Epochs) ⭐
If an isolated leader continues accepting writes after being replaced, divergence occurs. Kafka prevents this using Leader Epochs.
T=0: B1 is leader, epoch=5. All writes include epoch=5.
T=10s: Network isolates B1. Controller elects B2 as leader, epoch=6.
T=11s: B1's network recovers.
- Producer attempts to send message to B1 with epoch=5.
- B1 checks request: my epoch < current → REJECT.
- B1 learns new epoch=6, truncates uncommitted log entries, and steps down to follower.3. Race Condition: Consumer Offset Commit Gap ⭐
If a consumer crashes after processing but before committing offsets, it will reprocess on restart, violating exactly-once guarantees.
Solutions:
- Idempotent Consumer: Make downstream actions idempotent (e.g. UPSERT with unique constraint:
INSERT INTO results (id, data) VALUES ('orders-7-50', ...) ON CONFLICT DO NOTHING). - Kafka Transactions (consume-transform-produce): Commit offsets and write output atomically in a single transaction.
4. ISR Shrinkage & The Data Loss Window ⭐
ISR (In-Sync Replicas) are followers caught up with the leader within replica.lag.time.max.ms (default 30s).
Config: acks=all, min.insync.replicas=2, replication.factor=3
T=0: All healthy. ISR = {B1, B2, B3}. acks=all means all 3 ACK.
T=10s: Broker3 hits GC pause, falls behind > 30 seconds.
T=10s: Controller removes B3 from ISR. ISR = {B1, B2}.
T=10s: acks=all now means acks={B1, B2} only (reduced redundancy, still safe).
T=15s: Broker1 (leader) crashes. B2 has all committed data → becomes leader.
T=15s: B3 recovers → truncates uncommitted data to High Watermark and re-fetches.
BUT if min.insync.replicas=1 (DANGEROUS CONFIG):
- ISR shrinks to {B1} (just the leader).
- acks=all effectively behaves as acks=1.
- If B1 crashes, un-replicated messages are lost forever.
Durability Recommendation:
Set: acks=all + min.insync.replicas=2 + replication.factor=35. Follower Fetch Protocol ⭐
Followers pull data from the partition leader using a continuous loop:
while (true) {
FetchRequest to leader: {partition: 0, fetch_offset: 5000148, max_bytes: 1MB}
Leader responds: {messages from offset 5000148..5000200, leader_epoch: 5, HW: 5000145}
Follower appends messages to local log
Follower advances local High Watermark to min(leader_HW, local_log_end)
Repeat immediately (no delay for real-time replication)
}
Why Pull over Push?
1. Follower controls pace → natural backpressure
2. Simpler: leader doesn't track each follower's state
3. Follower crash → just stops fetching → leader doesn't care
4. Follower recovery → starts fetching from last offset → catches up automatically
Specs: typical same-DC replication lag is 1-5ms (sub-millisecond when colocated).6. Controlled Shutdown vs Uncontrolled Shutdown
- Controlled: Admin requests shutdown. Leader moves active partition leadership off the broker BEFORE stopping. Zero downtime.
- Uncontrolled: Broker crashes. Controller takes up to 10s to detect heartbeat timeout, causing temporary partition unavailability.
7. Broker Decommission and Partition Reassignment
To safely remove a broker (e.g., hardware refresh):
- Generate a partition replica reassignment plan via the Admin tool.
- Execute reassignment: New replicas fetch full logs from partition leaders in the background.
- Apply bandwidth throttling (
limit.bytes.per.second) to prevent cluster network starvation during transfer. - Once caught up, the new node enters the ISR set, and the decommissioned broker's replicas are removed.
- Safely shut down the old broker (now containing 0 active replicas).
1. Schema Registry & Compatibility Modes ⭐
Decoupling producer and consumer deployments requires strict schema validation to prevent structural breaking changes.
| Compatibility Mode | Evolution Rule | Deployment Order |
|---|---|---|
| BACKWARD (Default) | New schema can read data written by old schemas. Can add optional fields, delete fields. | Upgrade Consumers first, then upgrade Producers. |
| FORWARD | Old schema can read data written by new schemas. Can add fields, delete optional fields. | Upgrade Producers first, then upgrade Consumers. |
| FULL | Schemas are fully bi-directional compatible. Can only add/delete optional fields with defaults. | Deploy independently in any order. |
Serialization Format Comparison
Avro vs Protobuf vs JSON Schema: - Avro: Compact binary, schema evolution built-in, Kafka's native industry choice. - Protobuf: Compact binary, strong typing, popular in gRPC microservices ecosystems. - JSON Schema: Human-readable, larger payload footprint, easiest for visual debugging. Recommendation: Use Avro for general Kafka environments due to Confluent's extensive native tooling, or Protobuf if already using a gRPC microservices ecosystem.
Wire Message Binary Layout
2. Tiered Storage (Solving Cost at Petabyte Scale)
Holding petabytes of historical logs on fast local NVMe SSDs is cost-prohibitive. Tiered storage separates computing and storage.
- Hot Tier (Local Disk): Holds last 4-24 hours for real-time consumers (reads hit page cache / SSDs).
- Cold Tier (S3 / Object Store): Automatically uploads closed segment files to cheap cloud storage (S3/GCS) for long-term retention. Saves ~75% storage costs.
3. Kafka Connect: CDC and Database Integrations
Instead of writing custom services to move data in/out of Kafka, Kafka Connect provides a robust, distributed connector runtime.
Supported Connectors catalog out-of-the-box: - Source Connectors: * Debezium (MySQL/PostgreSQL CDC log tailing to Kafka) ⭐ * JDBC Source (poll-based databases query streaming) * S3 Source, Kinesis Source, ActiveMQ/RabbitMQ Source - Sink Connectors: * Elasticsearch/OpenSearch Sink (instant search indexing) * S3 / GCS / Azure Blob Sinks (long-term data lake archival) * Snowflake, BigQuery, and Redshift Data Warehouse Sinks * JDBC Relational DB Sinks Why not just write a custom consumer? - Connect handles: offset management, task parallelism, worker load balancing, fault tolerance, monitoring, schema registry auto-integration, and transactional exactly-once flows. - Debezium (CDC) connector takes 10 minutes to configure; writing a bulletproof CDC from scratch takes months.
4. Kafka Streams vs. Apache Flink
For real-time transformations, aggregations, and streaming joins, choose the right processor engine.
StreamsBuilder builder = new StreamsBuilder();
builder.stream("orders")
.filter((k, v) -> v.amount > 100)
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.count()
.toStream()
.to("high-value-order-counts");
// Key Benefits:
// - No separate cluster needed — just a Java library in your app
// - Exactly-once semantics (built on Kafka transactions)
// - RocksDB-backed state stores with changelog topic recovery
// - Simple deployment (just deploy your application jar)- Kafka Streams: Run as a lightweight Java library in your app. Excellent for pure Kafka-to-Kafka streaming and simple microservice transforms.
- Apache Flink: Separate dedicated compute cluster. Required for multi-source joins, out-of-order events (watermarks), and complex event processing at scale.
5. Monitoring & Alerting
Keep a close eye on the health of your message broker using these alerts:
- Critical Alerts:
- Under-replicated partitions > 0 (Data redundancy at risk)
- Offline partitions > 0 (Topic partition unavailable)
- ISR shrink rate > threshold (Replicas falling out of sync)
- Consumer lag > threshold (Consumers falling behind)
- Warning Alerts:
- Disk usage > 80% on any broker node
- p99 request latency > 100ms
- Request handler idle ratio < 30% (Broker CPU overloaded)
- Unclean leader election count > 0
6. Performance Tuning Cheat Sheet (Staff Level)
Producer tuning: batch.size=65536 (64 KB) → larger batches = higher throughput linger.ms=10 → wait 10ms to fill batch compression.type=lz4 → best throughput/compression trade-off buffer.memory=67108864 (64MB) → buffer for async sends acks=all → durability (always for important data) Consumer tuning: fetch.min.bytes=1048576 (1 MB) → wait for 1 MB before returning fetch.max.wait.ms=500 → max wait for min.bytes max.poll.records=500 → messages per poll() call max.poll.interval.ms=300000 (5min) → max time between polls before considered dead Broker tuning: num.io.threads=16 → disk I/O threads (match # of disks) num.network.threads=8 → network threads (match # of NICs) socket.send.buffer.bytes=1048576 → 1 MB socket buffer log.segment.bytes=1073741824 (1GB) → segment size log.retention.hours=168 (7 days) → retention period num.partitions=12 → default partitions per topic Golden rules: - Partitions per broker < 4000 (beyond = high metadata overhead) - Partition count = max(producer_throughput / single_partition_throughput, consumer_count)
Interview Walkthrough
- Start with the log abstraction: an append-only, partitioned commit log — producers append, consumers read at their own offset.
- Explain partitions: key-based routing preserves ordering per key; more partitions = higher parallelism but more consumer coordination.
- Cover consumer groups: each partition assigned to exactly one consumer in the group — rebalance on consumer join/leave.
- Compare delivery semantics: at-most-once (commit offset before process), at-least-once (default), exactly-once (idempotent producer + transactional consumer).
- Discuss retention and replay: brokers retain messages for days — consumers can rewind offsets to reprocess historical events.
- Address replication: leader-follower per partition, ISR (in-sync replicas), min.insync.replicas for durability vs availability trade-off.
- Common pitfall: assuming global ordering across all messages — only per-partition ordering is guaranteed; cross-partition order requires a single partition or application-level sequencing.
1. Broker Technology: Kafka vs RabbitMQ vs Pulsar
Choose the broker model that matches your processing pipeline:
| Dimension | Kafka | RabbitMQ | Apache Pulsar |
|---|---|---|---|
| Storage Model | Immutable append-only log | Durable transient queue (delete on ACK) | Tiered architecture (BookKeeper + S3) |
| Consumer Model | Pull (consumer tracks offsets) | Push (broker tracks delivery state) | Hybrid push/pull (cursor-per-consumer) |
| Message Replay | ✅ Yes (offset seek) | ❌ No (deleted on ACK) | ✅ Yes (cursor reset) |
| Multi-tenancy | Weak (quotas per client) | Fair (virtual hosts) | Strong (native namespaces + quotas) |
| Geo-replication | MirrorMaker2 (external tool) | Shovel/Federation plugins | Built-in (native out of the box) |
| Latency p99 | Low (5-15ms) | Very low (1-3ms) | Low (5-15ms) |
2. Consumer Group Rebalancing: The Throughput Killer
Rebalancing triggers Stop-The-World pauses for consumer groups, creating major latency spikes.
EAGER REBALANCING (legacy — "Stop the World"):
T=0: Consumer B crashes
T=0.1s: Coordinator detects missed heartbeat (session.timeout.ms=10s default)
T=10s: Coordinator marks B as dead → triggers rebalance
T=10s: ALL consumers (A and C too!) STOP processing + revoke ALL partitions
T=10.5s: A and C re-join group, coordinator reassigns
T=11s: A → {P0,P1,P2}, C → {P3,P4,P5}
T=11s: Consumers resume
⚠️ TOTAL PAUSE: ~11 seconds. ALL partitions stopped. 5M messages delayed during rebalance!
COOPERATIVE INCREMENTAL REBALANCING (modern — preferred):
T=0: Consumer B crashes
T=10s: Coordinator detects B is dead
T=10s: Coordinator sends: "revoke ONLY B's partitions (P2, P3)"
T=10s: Consumer A keeps processing P0, P1 (NEVER STOPPED!)
T=10s: Consumer C keeps processing P4, P5 (NEVER STOPPED!)
T=10.5s: B's partitions reassigned: A → {P0,P1,P2}, C → {P3,P4,P5}
⚠️ Only B's 2 partitions paused for ~0.5 seconds. A and C NEVER stopped.
Config: partition.assignment.strategy=CooperativeStickyAssignor
STATIC GROUP MEMBERSHIP (group.instance.id) ⭐:
Consumer B restarts (rolling deploy) with same instance ID
→ Coordinator recognizes it → assigns SAME partitions → NO rebalance at all
→ Perfect for Kubernetes rolling deployments3. Choosing the Right Partition Count ⭐
Partition count is an irreversible decision (you can increase, but never decrease partitions without breaking partition-key hash ordering).
partitions = max( target_throughput / throughput_per_partition, max_consumer_count_per_group ) Example: Target: 100 MB/s for topic. Single partition: ~10 MB/s throughput. → 10 partitions minimum for throughput. But we want 20 consumers for processing → need 20 partitions. → Choose 20 partitions.
- Too Few: Limits consumer scaling and throughput; creates single broker hotspots.
- Too Many: Exhausts file descriptors (each segment has .log, .index, .timeindex files), delays controller elections on crash, and slows down consumer rebalances.
4. Message Ordering Guarantees: Per-Partition vs Global
Kafka guarantees ordering strictly within a partition.
- Global Ordering: Configure a topic with exactly 1 partition. Ensures total order but limits consumer scale and throughput to a single worker's capacity (~100 MB/s).
- Per-Key Ordering (Default): Partition by entity ID (e.g.,
user_id). All events for a specific user land in the same partition and are processed in order. - Common Pitfalls:
- 1. Producer Retries without Idempotency:
If the producer sends a message but the network ACK is lost, retrying will write a duplicate record. E.g., message 1 fails network ACK → producer retries → message 2 already written. Result: [msg2, msg1] written out of order. Solution: Enforce
enable.idempotence=truewhere the broker uses Producer ID (PID) + sequence numbers to automatically deduplicate. - 2. Multiple Producers writing to the same partition:
If Producer A writes an event at T=1, and Producer B writes an event at T=0 but it arrives late (at T=2) due to transit congestion, the log stores them out of chronological order. Solution: Designate a single-writer manager per partition key or implement event-timestamp recovery sorting at the consumer aggregation level.
- 3. Concurrent Consumer processing (multi-threading):
If a single consumer thread pulls messages 1, 2, and 3, but delegates processing to a thread pool, thread 3 may finish before thread 1. This violates sequential ordering guarantees. Solution: Enforce strict single-threaded processing per partition within the consumer execution loop, or wrap downstream actions in highly idempotent databases.
- 1. Producer Retries without Idempotency:
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Single broker, basic log
Single Kafka broker, topics with 3 partitions. Time-based retention (7 days). Single consumer group. No replication (dev/staging only).
Key components: Single broker · Append-only log segments · Offset tracking · Time retention · Consumer groups
Move to next phase when: Production requires durability — broker failure loses data
Phase 2 — Cluster with replication
3+ broker cluster, RF=3, min.insync.replicas=2. Multiple consumer groups. Compacted topics for config/CDC. Schema registry for Avro/Protobuf evolution.
Key components: 3-broker cluster · RF=3 replication · ISR · Log compaction · Schema registry
Move to next phase when: 1M msg/sec exceeds single broker; need partition scaling and cross-region
Phase 3 — Hyperscale multi-region
100+ broker cluster, 10K+ partitions. Tiered storage (hot SSD + cold S3). MirrorMaker 2 for cross-region replication. KRaft mode (no ZooKeeper). Proactive rebalancing (Cruise Control).
Key components: 100+ brokers · Tiered storage · MirrorMaker 2 · KRaft · Cruise Control
Move to next phase when: Storage cost dominates — 7-day retention at 1M msg/sec = petabytes
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Producer ack latency (acks=all) | < 10ms p99 | Downstream services block on produce |
| Broker availability | 99.95% | Cluster survives single broker loss with RF=3 |
| Under-replicated partitions | 0 sustained | Under-replicated = data loss risk on next failure |
| Consumer fetch latency | < 50ms p99 | Includes poll interval + network + disk read |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Broker disk full — stops accepting writes | Log dir disk usage > 95%; produce requests fail with KafkaStorageException | Reduce retention.ms temporarily; add disks; enable tiered storage offload; delete unused topics; alert at 80% threshold |
| Hot partition — single partition at 10× traffic of others | Per-partition byte rate metric; one partition leader CPU 100% | Fix producer key distribution (avoid constant key); increase partitions and migrate (complex); add leader for hot partition on dedicated broker |
| Consumer group stuck — offset not advancing | Consumer lag metric flat at high value; no rebalance events | Check max.poll.interval.ms vs processing time; restart stuck consumer; skip poison offset to DLQ; verify broker availability for offset commit |
Cost Drivers (Staff lens)
- Disk storage: throughput × message_size × retention = dominant (1M msg/sec × 2KB × 7 days ≈ 1.2 PB)
- Cross-AZ replication bandwidth: 2× write traffic (leader → followers)
- Broker compute: modest compared to storage — network and disk I/O bound
Multi-Region & DR
MirrorMaker 2 for active-passive replication. Primary region serves producers and consumers; DR region has replica topics. Failover: redirect producers to DR (manual or automated). Active-active possible but requires conflict resolution for compacted topics. Offset mapping not automatic across regions — consumer must track separately.