Design a Distributed Message Broker (Kafka-style) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Kafka core concepts — append-only log, partitions, offsets, consumer groups, pull vs push, replay, and log compaction. This IS the distributed log interview.
Arch 50	Add replication, leader election, producer batching, and retention vs compaction policy selection.
Arch 75	Staff: discuss controller failover, unclean leader election trade-offs, and when compaction beats retention.

Interview Prompt

Design a distributed message broker (like Apache Kafka) that stores messages in an append-only log, supports multiple consumer groups reading independently, allows replay from any offset, and provides both time-based retention and log compaction for keyed topics.

Clarifying Questions (ask before designing)

Question	Why it matters
Pull or push consumption model?	Pull (Kafka) gives consumer-controlled pace; push (RabbitMQ) gives broker-controlled delivery.
Retention policy: time-based, size-based, or compaction?	Event log (7-day retention) vs changelog (compacted, keep latest per key): same log structure, different cleanup policy (cleanup.policy=delete vs compact).
Ordering guarantees needed?	Per-partition ordering is free; global ordering requires single partition (bottleneck).
Expected throughput and retention period?	1M msg/sec × 7 days drives disk sizing and partition count.

Scope

In scope

Append-only commit log architecture
Partitions and offsets
Consumer groups with independent offset tracking
Pull-based consumption
Message replay from arbitrary offset
Log compaction for keyed topics

Out of scope (state explicitly)

Kafka Streams / ksqlDB processing
Full ZooKeeper/KRaft controller implementation details
Exactly-once distributed transactions across systems

Assumptions

1M messages/sec peak, 7-day retention default
Per-partition ordering required
Average message size: 2 KB
Mixed event streams (retention) and config changelogs (compaction)

Publish: Producers publish messages to named topics.
Subscribe: Consumers subscribe to topics and receive messages in strict partition-level order.
Persistence: Messages durably stored on disk for a configurable retention period (time or size-based).
Consumer Groups: Multiple consumers in a group share processing load; each message goes to exactly one group member.
Ordering: Messages within a partition are strictly ordered (FIFO) via monotonically increasing offsets.
Delivery Semantics: Support at-least-once, at-most-once, and exactly-once delivery guarantees.
Replay: Consumers can re-read historic messages by seeking to any offset or timestamp.
Partitioning: Topics split into partitions for parallelism, load distribution, and horizontal scaling.
Log Compaction: Retain only the latest value per key, essential for changelogs and CDC use cases.
Schema Evolution: Support schema validation and evolutionary compatibility checks via an external Schema Registry.
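
The Log Compaction requirement above can be illustrated with a small in-memory sketch. It is not Kafka's log cleaner: the `LogCompactor` and `Entry` names are hypothetical, a null value is assumed to act as a tombstone, and real brokers also keep tombstones around for a grace period and never compact the active segment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LogCompactor {
    /** One keyed record; a null value is treated as a tombstone that deletes the key. */
    static final class Entry {
        final long offset;
        final String key;
        final String value;

        Entry(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    /** Keeps only the newest record per key, drops tombstoned keys, preserves offset order. */
    static List<Entry> compact(List<Entry> log) {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry e : log) {
            latest.remove(e.key);                 // re-insert so iteration follows the newest offset
            if (e.value != null) {
                latest.put(e.key, e);
            }
        }
        return new ArrayList<>(latest.values());
    }
}
```

Surviving records keep their original offsets, so a compacted log has gaps in its offset sequence but never reorders records.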

Metric	Calculation	Value
Messages / sec (system-wide)		10 Million
Avg message size		1 KB
Throughput (write)	10M × 1 KB	10 GB/sec
Throughput (read)	3 consumer groups × 10 GB/s	30 GB/sec
Retention Period		7 Days
Storage per day (raw)	10 GB/s × 86400	~864 TB
Storage for 7 days	864 TB × 7	~6 PB
Replication factor 3	6 PB × 3	~18 PB total storage
Brokers (12 TB usable each)	18 PB / 12 TB	~1500 brokers
Topics count		10,000
Partitions (total)		500,000
Network per broker	(write + replicate + read) ÷ 1500	~40 MB/s

Producer network: 10 GB/s ingress across cluster
Replication network: 10 GB/s × 2 (RF=3 means 2 copies) = 20 GB/s intra-cluster
Consumer network: 30 GB/s egress (multiple consumer groups)
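Total cluster I/O: ~60 GB/s aggregate. Spread over 1500 brokers this averages only ~40 MB/s each, so NIC sizing (10GbE or 25GbE) is driven by headroom for traffic skew and replica catch-up rather than by the average.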

Per broker (1500 brokers):
  Write: ~7 MB/s per broker (easily handled)
  Replicate: ~13 MB/s
  Serve reads: ~20 MB/s
  Disk: sequential writes at ~200 MB/s per SSD → comfortable headroom
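
The arithmetic in the table above can be reproduced with a short sketch. The class and method names are hypothetical, and decimal units are assumed (1 GB = 1,000,000 KB, 1 TB = 1,000 GB). The exact results are 18,144 TB and 1,512 brokers, which the table rounds to ~18 PB and ~1500.

```java
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Write throughput in GB/s from message rate and average size in KB. */
    static double writeGBps(long messagesPerSec, double avgMessageKB) {
        return messagesPerSec * avgMessageKB / 1_000_000.0;
    }

    /** Total replicated storage in TB for the retention window. */
    static double storageTB(double writeGBps, int retentionDays, int replicationFactor) {
        return writeGBps * SECONDS_PER_DAY * retentionDays * replicationFactor / 1_000.0;
    }

    /** Brokers needed to hold the data, rounding up. */
    static long brokersNeeded(double storageTB, double usableTBPerBroker) {
        return (long) Math.ceil(storageTB / usableTBPerBroker);
    }

    /** Average network load per broker in MB/s: writes + follower copies + consumer reads. */
    static double perBrokerMBps(double writeGBps, int replicationFactor, int consumerGroups, long brokers) {
        double clusterGBps = writeGBps * (1 + (replicationFactor - 1) + consumerGroups);
        return clusterGBps * 1_000.0 / brokers;
    }
}
```

Calling `perBrokerMBps(10.0, 3, 3, 1500)` gives the ~40 MB/s per broker quoted in the table.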

The Kafka HLD decouples producers, brokers, and consumers. Producers bypass API gateways and publish directly to partition leaders. Metadata is managed via KRaft (Kafka Raft consensus) instead of ZooKeeper.


1. Topics and Partitions ⭐

Topics are divided into partitions for parallelism. Partitions enable horizontal scale-out as different partitions are spread across different brokers.

Topic: user-events (3 partitions)

Partition 0: [msg0, msg1, msg2, msg3, ...]  → Offset 0, 1, 2, 3
Partition 1: [msg0, msg1, msg2, ...]         → Offset 0, 1, 2
Partition 2: [msg0, msg1, ...]               → Offset 0, 1

Key insight: ordering is ONLY within a partition, NOT across partitions.
- All events for user_id=123 go to same partition → ordered.
- Events across user_id=123 and user_id=456 have NO ordering guarantee.
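
A minimal in-memory sketch of the same idea: each partition is its own append-only list with its own offset sequence. `PartitionedLog` is a hypothetical name, and `String.hashCode` is only a stand-in for the hash a real client would use.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for one topic: each partition is an append-only list. */
final class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** The same key always maps to the same partition. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends to the key's partition and returns the offset assigned within that partition. */
    long append(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    /** Reads one partition from the given offset onward, in append order. */
    List<String> readFrom(int partition, int offset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList(offset, log.size()));
    }
}
```

Offsets are only comparable inside one partition: offset 1 in partition 0 says nothing about its order relative to offset 1 in partition 2.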

2. Broker Storage Engine: Why Append-Only is Fast ⭐

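Kafka's core innovation is treating log data as an append-only sequence of segment files: records are only ever appended to the active segment, and space is reclaimed by deleting or compacting whole old segments.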

Sequential I/O: Sequential writes (600 MB/s on SSD) are 10-100x faster than random I/O.
OS Page Cache: The JVM heap is kept small (~6 GB). The rest of RAM acts as the OS Page Cache. Read paths hit memory instead of disk.
Zero-Copy (sendfile): Transfers data directly from kernel space to network cards, bypassing user-space copies (saves 2 copies and CPU cycles).
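Batched I/O: Groups multiple messages into batches to amortize per-request and disk-write overhead (by default Kafka leaves flushing to the OS rather than calling fsync per message).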

/data/user-events-0/
  00000000000000000000.log      (segment 1: offsets 0 – 999,999)
  00000000000000000000.index    (sparse index: maps offset → byte position)
  00000000000000000000.timeindex
  00000000000001000000.log      (segment 2: offsets 1,000,000 – 1,999,999)
  00000000000001000000.index
  leader-epoch-checkpoint        (tracks leader changes for truncation)

Message Lookup by Offset Walkthrough

Consumer requests offset 1,500,042:
1. Binary search segment files by name → find segment starting at 1,000,000
2. Open 00000000000001000000.index → binary search for ≤ 1,500,042
   → Index entry: offset 1,500,000 → file position 483,200
3. Seek to position 483,200 in .log file
4. Scan forward 42 messages → found offset 1,500,042
Total: ~2 disk seeks + short scan → < 1ms from page cache
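
The two floor lookups in this walkthrough can be sketched with sorted maps. This is a simplification under stated assumptions: `OffsetLookup` is a hypothetical class, the index here stores absolute offsets, and one record per offset is assumed for the forward scan.

```java
import java.util.Map;
import java.util.TreeMap;

final class OffsetLookup {
    // Base offset of each segment -> that segment's sparse index (offset -> byte position).
    private final TreeMap<Long, TreeMap<Long, Long>> segments = new TreeMap<>();

    void addIndexEntry(long segmentBaseOffset, long offset, long filePosition) {
        segments.computeIfAbsent(segmentBaseOffset, k -> new TreeMap<>()).put(offset, filePosition);
    }

    /** Returns {segmentBaseOffset, filePositionToSeekTo, recordsToScanForward}. */
    long[] locate(long targetOffset) {
        // Step 1: the segment whose base offset is the greatest one <= target.
        Map.Entry<Long, TreeMap<Long, Long>> segment = segments.floorEntry(targetOffset);
        if (segment == null) {
            throw new IllegalArgumentException("offset below log start: " + targetOffset);
        }
        // Step 2: the greatest index entry <= target inside that segment.
        Map.Entry<Long, Long> indexEntry = segment.getValue().floorEntry(targetOffset);
        if (indexEntry == null) {
            return new long[] {segment.getKey(), 0L, targetOffset - segment.getKey()};
        }
        // Steps 3 and 4: seek to the indexed position, then scan forward the remaining records.
        return new long[] {segment.getKey(), indexEntry.getValue(), targetOffset - indexEntry.getKey()};
    }
}
```

With the index entries from the walkthrough loaded, `locate(1_500_042L)` yields segment 1,000,000, file position 483,200, and a forward scan of 42 records.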

3. ZooKeeper vs KRaft (Metadata Management) ⭐

Metadata coordination has migrated from separate ZooKeeper ensembles to integrated Kafka Raft (KRaft) control nodes.

ZooKeeper (legacy, being removed):
- Stores: broker registry, topic configs, partition-to-leader mapping, ACLs
- The controller broker (itself elected through ZooKeeper) handles leader election for partitions
- Bottleneck for large clusters (100K+ partitions) and separate operational burden

KRaft (Kafka Raft — production-ready since Kafka 3.3):
- Kafka's own Raft-based consensus layer for metadata
- Metadata stored in internal topic: __cluster_metadata
- Controller quorum: 3 or 5 nodes configured with the controller role (process.roles)
  Active controller = Raft leader → handles all metadata changes
  Standby controllers = Raft followers → take over on failure
  
Benefits over ZooKeeper:
  1. No external dependency → simpler operations
  2. Metadata changes are faster (single round-trip vs ZK multi-step)
  3. Scales to millions of partitions (ZK struggled at ~200K)
  4. Metadata is a Kafka log → can snapshot + replay (familiar model)

4. Replication Consensus (In-Sync Replicas) ⭐

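Replication provides high availability and, when combined with acks=all and min.insync.replicas ≥ 2, protects acknowledged writes against the loss of a single broker.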

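Leader: Serves all producer writes and, by default, all consumer reads (consumers can opt into reading from a nearby follower since Kafka 2.4).
Followers: Continuously pull (fetch) from leader to replicate logs.
ISR (In-Sync Replicas): The leader plus every follower that has caught up within replica.lag.time.max.ms (default 30s).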
High Watermark (HW): The offset up to which all ISR replicas have matched. Consumers only see committed data (up to HW).

ISR Commit Offset Example

Partition 0 (Topic: order-events):
  Leader:    Broker 1 (Rack A)  offset: 5,000,150   ← Serves all clients
  Follower:  Broker 4 (Rack B)  offset: 5,000,148   ← In ISR (2 behind, within threshold)
  Follower:  Broker 7 (Rack C)  offset: 5,000,150   ← In ISR (fully caught up)

  ISR = {Broker 1, Broker 4, Broker 7}

  High Watermark (HW) = min(ISR offsets) = 5,000,148
  → Consumers can only read up to HW (5,000,148)
  → Messages 5,000,149 and 5,000,150 are "uncommitted" (not yet replicated everywhere)
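
The high watermark rule in this example is just a minimum over the ISR. The sketch below follows the example's convention that each replica reports the last offset it holds; `HighWatermark` is a hypothetical helper, not a Kafka class.

```java
import java.util.Collections;
import java.util.Map;

final class HighWatermark {
    /** HW = the smallest replicated offset among the in-sync replicas. */
    static long of(Map<String, Long> isrOffsets) {
        if (isrOffsets.isEmpty()) {
            throw new IllegalArgumentException("ISR must contain at least the leader");
        }
        return Collections.min(isrOffsets.values());
    }

    /** Number of records on the leader that consumers cannot see yet. */
    static long uncommittedCount(long leaderOffset, long highWatermark) {
        return leaderOffset - highWatermark;
    }
}
```

For the three brokers above this returns 5,000,148, leaving 2 uncommitted records on the leader.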

5. Rack-Aware Replication ⭐

Rack-aware replica allocation protects against datacenter and rack-level network or power failures.

broker.rack=us-east-1a  (Broker 1, 4)
broker.rack=us-east-1b  (Broker 2, 5)
broker.rack=us-east-1c  (Broker 3, 6)

Partition 0 replicas placed on: Broker 1 (1a), Broker 2 (1b), Broker 3 (1c)
→ Survives any single AZ failure

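Without rack-awareness: two of the three replicas might land on Broker 1 and 4 (same rack)
→ Rack failure takes out two replicas at once and drops the partition below min.insync.replicas=2; in a cluster with three or more brokers per rack, ALL replicas could share a rack → unrecoverable data loss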
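
One way to picture rack-aware placement is to walk the racks in rotation and take one broker from each. This is an illustrative sketch only, not Kafka's actual assignment algorithm; `RackAwarePlacement` is a hypothetical name, and the caller is assumed to pass an ordered map (for example a `LinkedHashMap`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RackAwarePlacement {
    /** Picks replicationFactor brokers for a partition, taking consecutive replicas from different racks. */
    static List<Integer> assign(Map<String, List<Integer>> brokersByRack, int partition, int replicationFactor) {
        List<String> racks = new ArrayList<>(brokersByRack.keySet());
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            // Rotate the starting rack by partition number so leaders spread across racks.
            String rack = racks.get((partition + i) % racks.size());
            List<Integer> brokers = brokersByRack.get(rack);
            // Rotate within the rack so partitions also spread across its brokers.
            int index = (partition / racks.size() + i / racks.size()) % brokers.size();
            replicas.add(brokers.get(index));
        }
        return replicas;
    }
}
```

With the three racks above, partition 0 lands on brokers 1, 2, 3 and partition 3 on brokers 4, 5, 6, so each partition has exactly one replica per AZ.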

6. Producer Write Path ⭐


acks=0: Fire-and-forget (fastest, high data-loss risk).
acks=1: Await leader confirmation (vulnerable if leader dies before replicating).
acks=all: Await all ISR confirmations (safest; with min.insync.replicas=2 an acknowledged write survives the loss of any single broker).

Producer Partitioner Strategies

1. Key-based (default): hash(key) % numPartitions
   ✅ Same key → same partition → ordering guaranteed
   ❌ Hot key → hot partition (e.g. viral user's writes saturate one partition)

2. Round-robin (no key): distribute events evenly across partitions
   ✅ Even load distribution
   ❌ No ordering guarantees for any entity

3. Sticky partitioner (Kafka 2.4+): batch to a random partition, stick until batch is full
   ✅ Maximizes batch size → better compression, fewer network requests
   ❌ No ordering guarantees

4. Custom partitioner: application-specific routing logic
   - Example: route by tenant_id to dedicated isolated partitions to enforce SLA bounds.
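
The key-based and sticky strategies can be contrasted in a few lines. This sketch makes two simplifying assumptions that differ from the real client: batches are counted in records rather than bytes, and the sticky partition advances in rotation instead of being picked at random. `StickyChooser` is a hypothetical name.

```java
final class StickyChooser {
    private final int numPartitions;
    private final int recordsPerBatch;
    private int current = 0;   // partition the current unkeyed batch sticks to
    private int inBatch = 0;   // records already placed in the current batch

    StickyChooser(int numPartitions, int recordsPerBatch) {
        this.numPartitions = numPartitions;
        this.recordsPerBatch = recordsPerBatch;
    }

    /** Keyed records hash to a fixed partition; unkeyed records stick to one partition per batch. */
    int choose(String key) {
        if (key != null) {
            return Math.floorMod(key.hashCode(), numPartitions);
        }
        if (inBatch == recordsPerBatch) {   // batch full: move on to the next partition
            current = (current + 1) % numPartitions;
            inBatch = 0;
        }
        inBatch++;
        return current;
    }
}
```

Keyed records never touch the sticky state, which is why per-key ordering holds regardless of how unkeyed traffic is batched.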

7. Consumer Read Path ⭐


Consumers poll brokers for data and maintain offsets in the internal __consumer_offsets topic.

8. Consumer Group Rebalancing ⭐

Eager Rebalance (Legacy): Pauses all consumers in the group, revokes all partitions, and assigns them from scratch ("Stop the World" pause).
Cooperative Incremental (Modern): Only revokes/assigns affected partitions. Unaffected consumers process uninterrupted.
Static Membership (group.instance.id): Gives each consumer a stable identity that survives restarts (for example, one ID per Kubernetes pod); a restart that completes within session.timeout.ms does not trigger a rebalance.

Producer API Example (Java)

JAVA

// Synchronous send (wait for ack)
ProducerRecord<String, String> record = 
    new ProducerRecord<>("order-events", orderId, orderJson);
RecordMetadata meta = producer.send(record).get();  // blocks until ack
log.info("Sent to partition {} at offset {}", meta.partition(), meta.offset());

// Asynchronous send (non-blocking, callback on completion)
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        log.error("Send failed: {}", exception.getMessage());
        // Retry logic or dead-letter handling
    }
});

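// Transactional send (atomic writes across topics; requires transactional.id to be set)
producer.initTransactions();  // once, at startup, before the first transaction
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("orders", key, value));
    producer.send(new ProducerRecord<>("inventory", key, inventoryUpdate));
    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());  // atomic with consume
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();  // another instance with the same transactional.id took over
} catch (KafkaException e) {
    producer.abortTransaction();  // roll back, then retry the batch
}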

Consumer API Example (Java)

JAVA

consumer.subscribe(List.of("order-events"));  // subscribe by topic name
// OR: consumer.assign(List.of(new TopicPartition("orders", 0)));  // manual partition

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processOrder(record.key(), record.value(), record.offset());
    }
    consumer.commitSync();  // commit after successful processing
}

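// Seek to specific offset (replay from a point); the partition must already be assigned
TopicPartition tp = new TopicPartition("order-events", 0);
consumer.seek(tp, 5000000L);

// Seek to timestamp: offsetsForTimes only looks up the offset, so seek to the result
OffsetAndTimestamp hit = consumer.offsetsForTimes(Map.of(tp, targetTimestamp)).get(tp);
if (hit != null) consumer.seek(tp, hit.offset());  // null when no record is at or after the timestamp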

Admin API Example

JAVA

// Create topic (replicas are spread across racks automatically when brokers set broker.rack)
admin.createTopics(List.of(
    new NewTopic("order-events", 12, (short) 3)  // 12 partitions, RF=3
        .configs(Map.of(
            "retention.ms", "604800000",      // 7 days
            "min.insync.replicas", "2",
            "compression.type", "lz4"
        ))
));

// Reassign partitions (for broker decommission / rebalancing)
admin.alterPartitionReassignments(reassignments);

// Describe cluster
admin.describeCluster();  // brokers, controller, cluster ID

Failure Case	System Solution Design
Broker Leader Crash	The KRaft controller elects a new leader from the partition's ISR list; client metadata is refreshed.
Message Loss	Enforce acks=all + min.insync.replicas=2 + replication.factor=3 for durability.
Consumer Failure	Group Coordinator triggers partition rebalancing to re-assign tasks to active consumers.
AZ or Rack Failure	Rack-aware partition replica allocation (broker.rack) distributes data copies across physical racks/AZs.
Split-Brain Controllers	KRaft consensus maintains single leader metadata log with strict epoch fencing.
Duplicate Retries	Idempotent producer validates Producer ID (PID) + Sequence Number to filter duplicates.
Disk Failure	Data replicated on 2 other brokers in different racks; replace disk and re-replicate from surviving replicas.
Network Partition	ISR set automatically shrinks: followers removed if they can't catch up within time threshold. Prevents stale writes/reads.
Unclean Leader Election	Set unclean.leader.election.enable=false to prevent out-of-sync replicas from becoming leader.
Slow Consumer	Event lag grows but does not impact producers or brokers since consumer pulls at its own pace.

1. Exactly-Once Semantics (EOS) Framework ⭐

Kafka supports three primary delivery guarantees: At-Most-Once, At-Least-Once, and Exactly-Once (EOS).

At-Most-Once: acks=0, no producer retries, or consumer committing offset BEFORE processing. Crash results in lost messages.
At-Least-Once: acks=all, infinite producer retries, and consumer committing offset AFTER processing. Crash results in potential duplicates.
Exactly-Once: Combines an Idempotent Producer (sequence numbers and PID checks) with Transactional consume-transform-produce loops. Downstream consumers must configure isolation.level=read_committed.
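
The idempotent-producer half of exactly-once can be sketched as a broker-side sequence check. This is a deliberate simplification: a real broker tracks sequences per producer and partition, per batch, and also checks a producer epoch. `SequenceDeduplicator` and its `Result` values are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class SequenceDeduplicator {
    enum Result { APPENDED, DUPLICATE, OUT_OF_ORDER }

    // Last sequence number accepted from each producer ID.
    private final Map<Long, Integer> lastSequenceByProducer = new HashMap<>();

    /** Accepts only the next expected sequence number for the given producer. */
    Result tryAppend(long producerId, int sequence) {
        int last = lastSequenceByProducer.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return Result.DUPLICATE;        // a retry of something already written
        }
        if (sequence != last + 1) {
            return Result.OUT_OF_ORDER;     // a gap: an earlier send is still missing
        }
        lastSequenceByProducer.put(producerId, sequence);
        return Result.APPENDED;
    }
}
```

A retried send therefore becomes a no-op on the broker instead of a second record in the log.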

2. Race Condition: Stale Leader split-brain (Fencing via Epochs) ⭐

If an isolated leader continues accepting writes after being replaced, divergence occurs. Kafka prevents this using Leader Epochs.

T=0:   B1 is leader, epoch=5. All writes include epoch=5.
T=10s: Network isolates B1. Controller elects B2 as leader, epoch=6.
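T=11s: B1's network recovers.
       - B1 receives the controller's metadata: leader=B2, epoch=6, which is newer than its own epoch=5.
       - B1 steps down to follower and rejects produce requests (NOT_LEADER_OR_FOLLOWER); producers refresh metadata and retry against B2.
       - B1 truncates any log entries that diverge from B2's log and re-fetches from B2.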
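
Epoch fencing reduces to two checks on each replica: am I still the leader, and does the request carry the epoch I know about? The sketch assumes, as in the timeline above, that requests are tagged with the sender's view of the leader epoch; `EpochFencedReplica` and `Outcome` are hypothetical names.

```java
final class EpochFencedReplica {
    enum Outcome { ACCEPTED, NOT_LEADER, FENCED }

    private int knownEpoch;
    private boolean leader;

    EpochFencedReplica(int epoch, boolean leader) {
        this.knownEpoch = epoch;
        this.leader = leader;
    }

    /** Metadata from the controller: only a strictly newer epoch changes this replica's role. */
    void onMetadata(int epoch, boolean thisReplicaIsLeader) {
        if (epoch <= knownEpoch) {
            return;   // stale or repeated update: ignore
        }
        knownEpoch = epoch;
        leader = thisReplicaIsLeader;
    }

    /** Accepts a request only while leading, and only for the current epoch. */
    Outcome handle(int requestEpoch) {
        if (!leader) {
            return Outcome.NOT_LEADER;
        }
        return requestEpoch == knownEpoch ? Outcome.ACCEPTED : Outcome.FENCED;
    }
}
```

Once the old leader has processed the epoch-6 update, every request sent to it is refused, which is what closes the split-brain window.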

3. Race Condition: Consumer Offset Commit Gap ⭐

If a consumer crashes after processing but before committing offsets, it will reprocess on restart, violating exactly-once guarantees.

Solutions:

Idempotent Consumer: Make downstream actions idempotent (e.g. UPSERT with unique constraint: INSERT INTO results (id, data) VALUES ('orders-7-50', ...) ON CONFLICT DO NOTHING).
Kafka Transactions (consume-transform-produce): Commit offsets and write output atomically in a single transaction.
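
The idempotent-consumer option can be sketched in memory, with a set standing in for the database's unique constraint. `IdempotentSink` is a hypothetical name, and the ID format follows the `orders-7-50` example above (topic, partition, offset).

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentSink {
    private final Set<String> applied = new HashSet<>();   // stands in for a unique index
    private long total = 0;

    /** Applies the record once; a redelivery of the same topic-partition-offset is a no-op. */
    boolean apply(String topic, int partition, long offset, long amount) {
        String id = topic + "-" + partition + "-" + offset;   // e.g. "orders-7-50"
        if (!applied.add(id)) {
            return false;   // already processed before the crash
        }
        total += amount;
        return true;
    }

    long total() {
        return total;
    }
}
```

Reprocessing after a crash then costs a lookup rather than a double-applied side effect.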

4. ISR Shrinkage & The Data Loss Window ⭐

ISR (In-Sync Replicas) are followers caught up with the leader within replica.lag.time.max.ms (default 30s).

Config: acks=all, min.insync.replicas=2, replication.factor=3

T=0:   All healthy. ISR = {B1, B2, B3}. acks=all means all 3 ACK.
T=10s: Broker3 hits a long GC pause and stops fetching.
T=40s: B3 has been behind for more than replica.lag.time.max.ms (30s). Leader B1 shrinks the ISR (recorded by the controller). ISR = {B1, B2}.
T=40s: acks=all now means acks={B1, B2} only (reduced redundancy, still safe).
T=45s: Broker1 (leader) crashes. B2 has all committed data → becomes leader.
T=45s: B3 recovers → truncates uncommitted data to High Watermark and re-fetches from B2.

BUT if min.insync.replicas=1 (DANGEROUS CONFIG):
- ISR shrinks to {B1} (just the leader).
- acks=all effectively behaves as acks=1.
- If B1 crashes, un-replicated messages are lost forever.

Durability Recommendation:
Set: acks=all + min.insync.replicas=2 + replication.factor=3
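
How acks and min.insync.replicas interact can be written as a small decision function. This is a sketch of the rule described above, not broker code; `AckPolicy` and its `Decision` values are hypothetical names.

```java
final class AckPolicy {
    enum Decision { ACK_IMMEDIATELY, ACK_AFTER_LEADER, ACK_AFTER_FULL_ISR, REJECT_NOT_ENOUGH_REPLICAS }

    /** Decides how a produce request is acknowledged for the given acks setting and ISR size. */
    static Decision decide(String acks, int isrSize, int minInsyncReplicas) {
        switch (acks) {
            case "0":
                return Decision.ACK_IMMEDIATELY;
            case "1":
                return Decision.ACK_AFTER_LEADER;
            case "all":
            case "-1":
                // min.insync.replicas is only enforced for acks=all.
                return isrSize < minInsyncReplicas
                        ? Decision.REJECT_NOT_ENOUGH_REPLICAS
                        : Decision.ACK_AFTER_FULL_ISR;
            default:
                throw new IllegalArgumentException("unknown acks value: " + acks);
        }
    }
}
```

With `min.insync.replicas=1` and an ISR of one, `acks=all` is accepted after the leader alone has the data, which is the dangerous configuration described above.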

5. Follower Fetch Protocol ⭐

Followers pull data from the partition leader using a continuous loop:

while (true) {
  FetchRequest to leader: {partition: 0, fetch_offset: 5000148, max_bytes: 1MB}
  Leader responds: {messages from offset 5000148..5000200, leader_epoch: 5, HW: 5000145}
  Follower appends messages to local log
  Follower advances local High Watermark to min(leader_HW, local_log_end)
  Repeat immediately (no delay for real-time replication)
}

Why Pull over Push?
1. Follower controls pace → natural backpressure
2. Simpler: leader only records each follower's fetch position; it keeps no per-follower send queues or retry state
3. Follower crash → just stops fetching → leader doesn't care
4. Follower recovery → starts fetching from last offset → catches up automatically

Specs: typical same-DC replication lag is 1-5ms (sub-millisecond when colocated).
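
The fetch loop above can be simulated in memory. In this sketch the log end offset is the next offset to be written and the high watermark is exclusive; `FollowerReplica` and `serveFetch` are hypothetical names, and record counts stand in for `max_bytes`.

```java
import java.util.ArrayList;
import java.util.List;

final class FollowerReplica {
    private final List<String> log = new ArrayList<>();
    private long highWatermark = 0;

    /** The follower always asks for the offset right after its last local record. */
    long nextFetchOffset() {
        return log.size();
    }

    /** Applies one fetch response: append the records, then advance HW to min(leader HW, local log end). */
    void onFetchResponse(List<String> records, long leaderHighWatermark) {
        log.addAll(records);
        highWatermark = Math.max(highWatermark, Math.min(leaderHighWatermark, log.size()));
    }

    long highWatermark() {
        return highWatermark;
    }

    /** Leader side of one fetch: records starting at fetchOffset, capped at maxRecords. */
    static List<String> serveFetch(List<String> leaderLog, long fetchOffset, int maxRecords) {
        int from = (int) fetchOffset;
        int to = Math.min(leaderLog.size(), from + maxRecords);
        return new ArrayList<>(leaderLog.subList(from, to));
    }
}
```

Because the follower derives the next request from its own log size, recovery after a crash needs no special path: it simply resumes fetching from where its log ends.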

6. Controlled Shutdown vs Uncontrolled Shutdown

Controlled: Admin requests shutdown. The controller moves partition leadership off the broker BEFORE it stops, so clients see only a brief metadata refresh.
Uncontrolled: Broker crashes. Controller takes up to 10s to detect heartbeat timeout, causing temporary partition unavailability.

7. Broker Decommission and Partition Reassignment

To safely remove a broker (e.g., hardware refresh):

Generate a partition replica reassignment plan via the Admin tool.
Execute reassignment: New replicas fetch full logs from partition leaders in the background.
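Apply bandwidth throttling (the --throttle option of kafka-reassign-partitions.sh, which sets leader.replication.throttled.rate and follower.replication.throttled.rate) to prevent cluster network starvation during transfer.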
Once caught up, the new node enters the ISR set, and the decommissioned broker's replicas are removed.
Safely shut down the old broker (now containing 0 active replicas).

1. Schema Registry & Compatibility Modes ⭐

Decoupling producer and consumer deployments requires strict schema validation to prevent structural breaking changes.

Compatibility Mode	Evolution Rule	Deployment Order
BACKWARD (Default)	New schema can read data written by old schemas. Can add optional fields, delete fields.	Upgrade Consumers first, then upgrade Producers.
FORWARD	Old schema can read data written by new schemas. Can add fields, delete optional fields.	Upgrade Producers first, then upgrade Consumers.
FULL	Schemas are fully bi-directional compatible. Can only add/delete optional fields with defaults.	Deploy independently in any order.

Serialization Format Comparison

Avro vs Protobuf vs JSON Schema:
- Avro: Compact binary, schema evolution built-in, the most common choice in the Kafka ecosystem.
- Protobuf: Compact binary, strong typing, popular in gRPC microservices ecosystems.
- JSON Schema: Human-readable, larger payload footprint, easiest for visual debugging.

Recommendation:
Use Avro for general Kafka environments due to Confluent's extensive native tooling, or Protobuf if already using a gRPC microservices ecosystem.

Wire Message Binary Layout


2. Tiered Storage (Solving Cost at Petabyte Scale)

Holding petabytes of historical logs on fast local NVMe SSDs is cost-prohibitive. Tiered storage decouples storage capacity from broker compute.

Hot Tier (Local Disk): Holds last 4-24 hours for real-time consumers (reads hit page cache / SSDs).
Cold Tier (S3 / Object Store): Automatically uploads closed segment files to cheap cloud storage (S3/GCS) for long-term retention. Saves ~75% storage costs.

3. Kafka Connect: CDC and Database Integrations

Instead of writing custom services to move data in/out of Kafka, Kafka Connect provides a robust, distributed connector runtime.

Commonly used connectors (most are installed as plugins rather than shipped with Kafka):
- Source Connectors:
  * Debezium (MySQL/PostgreSQL CDC log tailing to Kafka) ⭐
  * JDBC Source (poll-based query streaming from relational databases)
  * S3 Source, Kinesis Source, ActiveMQ/RabbitMQ Source
- Sink Connectors:
  * Elasticsearch/OpenSearch Sink (instant search indexing)
  * S3 / GCS / Azure Blob Sinks (long-term data lake archival)
  * Snowflake, BigQuery, and Redshift Data Warehouse Sinks
  * JDBC Relational DB Sinks

Why not just write a custom consumer?
- Connect handles: offset management, task parallelism, worker load balancing, fault tolerance, monitoring,
  schema registry auto-integration, and transactional exactly-once flows.
- Debezium (CDC) connector takes 10 minutes to configure; writing a bulletproof CDC from scratch takes months.

4. Kafka Streams vs. Apache Flink

For real-time transformations, aggregations, and streaming joins, choose the right processor engine.

JAVA

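StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");  // Order: application value type with an `amount` field
orders.filter((k, v) -> v.amount > 100)
      .groupByKey()
      .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))  // TimeWindows.of(...) is deprecated since Kafka 3.0
      .count()
      .toStream((windowedKey, count) -> windowedKey.key())  // unwrap Windowed<String> so plain serdes apply
      .to("high-value-order-counts", Produced.with(Serdes.String(), Serdes.Long()));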

// Key Benefits:
// - No separate cluster needed — just a Java library in your app
// - Exactly-once semantics (built on Kafka transactions)
// - RocksDB-backed state stores with changelog topic recovery
// - Simple deployment (just deploy your application jar)

Kafka Streams: Run as a lightweight Java library in your app. Excellent for pure Kafka-to-Kafka streaming and simple microservice transforms.
Apache Flink: Separate dedicated compute cluster. Better suited to multi-source joins, heavily out-of-order events (watermarks), and complex event processing at scale.

5. Monitoring & Alerting

Keep a close eye on the health of your message broker using these alerts:

Critical Alerts:
- Under-replicated partitions > 0 (Data redundancy at risk)
- Offline partitions > 0 (Topic partition unavailable)
- ISR shrink rate > threshold (Replicas falling out of sync)
- Consumer lag > threshold (Consumers falling behind)
Warning Alerts:
- Disk usage > 80% on any broker node
- p99 request latency > 100ms
- Request handler idle ratio < 30% (Broker CPU overloaded)
- Unclean leader election count > 0

6. Performance Tuning Cheat Sheet (Staff Level)

Producer tuning:
  batch.size=65536 (64 KB)     → larger batches = higher throughput
  linger.ms=10                  → wait 10ms to fill batch
  compression.type=lz4          → best throughput/compression trade-off
  buffer.memory=67108864 (64MB) → buffer for async sends
  acks=all                      → durability (always for important data)

Consumer tuning:
  fetch.min.bytes=1048576 (1 MB)     → wait for 1 MB before returning
  fetch.max.wait.ms=500              → max wait for min.bytes
  max.poll.records=500               → messages per poll() call
  max.poll.interval.ms=300000 (5min) → max time between polls before considered dead

Broker tuning:
  num.io.threads=16                  → request handler (disk I/O) threads; size to disks and cores
  num.network.threads=8              → network threads; size to connection count and request rate
  socket.send.buffer.bytes=1048576   → 1 MB socket buffer
  log.segment.bytes=1073741824 (1GB) → segment size
  log.retention.hours=168 (7 days)   → retention period
  num.partitions=12                  → default partitions per topic

Golden rules:
- Partitions per broker < 4000 (beyond = high metadata overhead)
- Partition count = max(producer_throughput / single_partition_throughput, consumer_count)
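
How batch.size and linger.ms interact can be shown with a small accumulator: a batch is sent when it is full or when it has waited long enough, whichever comes first. This is a sketch, not the producer's internal accumulator; `BatchAccumulator` is a hypothetical name and the clock is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchAccumulator {
    private final int batchSizeBytes;   // plays the role of batch.size
    private final long lingerMs;        // plays the role of linger.ms
    private final List<byte[]> pending = new ArrayList<>();
    private int pendingBytes = 0;
    private long firstAppendMs = -1;

    BatchAccumulator(int batchSizeBytes, long lingerMs) {
        this.batchSizeBytes = batchSizeBytes;
        this.lingerMs = lingerMs;
    }

    /** Adds a record; returns a full batch to send, or an empty list while the batch is still filling. */
    List<byte[]> append(byte[] record, long nowMs) {
        if (pending.isEmpty()) {
            firstAppendMs = nowMs;
        }
        pending.add(record);
        pendingBytes += record.length;
        return pendingBytes >= batchSizeBytes ? drain() : new ArrayList<>();
    }

    /** Called periodically: flushes a partial batch once the linger time has elapsed. */
    List<byte[]> poll(long nowMs) {
        boolean expired = !pending.isEmpty() && nowMs - firstAppendMs >= lingerMs;
        return expired ? drain() : new ArrayList<>();
    }

    private List<byte[]> drain() {
        List<byte[]> batch = new ArrayList<>(pending);
        pending.clear();
        pendingBytes = 0;
        return batch;
    }
}
```

Raising the linger time trades a bounded amount of latency for fuller batches, which is why the two settings are usually tuned together.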

Interview Walkthrough

Start with the log abstraction: an append-only, partitioned commit log — producers append, consumers read at their own offset.
Explain partitions: key-based routing preserves ordering per key; more partitions = higher parallelism but more consumer coordination.
Cover consumer groups: each partition assigned to exactly one consumer in the group — rebalance on consumer join/leave.
Compare delivery semantics: at-most-once (commit offset before process), at-least-once (default), exactly-once (idempotent producer + transactions, with consumers on isolation.level=read_committed).
Discuss retention and replay: brokers retain messages for days — consumers can rewind offsets to reprocess historical events.
Address replication: leader-follower per partition, ISR (in-sync replicas), min.insync.replicas for durability vs availability trade-off.
Common pitfall: assuming global ordering across all messages — only per-partition ordering is guaranteed; cross-partition order requires a single partition or application-level sequencing.

1. Broker Technology: Kafka vs RabbitMQ vs Pulsar

Choose the broker model that matches your processing pipeline:

Dimension	Kafka	RabbitMQ	Apache Pulsar
Storage Model	Immutable append-only log	Queue, durable or transient (messages deleted on ACK)	Tiered architecture (BookKeeper + S3)
Consumer Model	Pull (consumer tracks offsets)	Push (broker tracks delivery state)	Hybrid push/pull (cursor-per-consumer)
Message Replay	✅ Yes (offset seek)	❌ No (deleted on ACK)	✅ Yes (cursor reset)
Multi-tenancy	Weak (quotas per client)	Fair (virtual hosts)	Strong (native namespaces + quotas)
Geo-replication	MirrorMaker2 (external tool)	Shovel/Federation plugins	Built-in (native out of the box)
Latency p99	Low (5-15ms)	Very low (1-3ms)	Low (5-15ms)

2. Consumer Group Rebalancing: The Throughput Killer

Eager rebalancing triggers Stop-The-World pauses for the whole consumer group, creating major latency spikes; the newer protocols below shrink or avoid the pause.

EAGER REBALANCING (legacy — "Stop the World"):
  T=0:    Consumer B crashes
  T=0-10s: B sends no heartbeats; coordinator waits out session.timeout.ms (10s in this example; the default is 45s since Kafka 3.0)
  T=10s:  Session expires; coordinator marks B as dead → triggers rebalance
  T=10s:  ALL consumers (A and C too!) STOP processing + revoke ALL partitions
  T=10.5s: A and C re-join group, coordinator reassigns
  T=11s:  A → {P0,P1,P2}, C → {P3,P4,P5}
  T=11s:  Consumers resume
  ⚠️ B's partitions stall for ~11 seconds (detection + rebalance); ALL other partitions also stop for the ~1 second rebalance itself.

COOPERATIVE INCREMENTAL REBALANCING (modern — preferred):
  T=0:    Consumer B crashes
  T=10s:  Coordinator detects B is dead
  T=10s:  Coordinator sends: "revoke ONLY B's partitions (P2, P3)"
  T=10s:  Consumer A keeps processing P0, P1 (NEVER STOPPED!)
  T=10s:  Consumer C keeps processing P4, P5 (NEVER STOPPED!)
  T=10.5s: B's partitions reassigned: A → {P0,P1,P2}, C → {P3,P4,P5}
  ⚠️ Only B's 2 partitions wait (~10.5 seconds, dominated by failure detection). A and C NEVER stopped.
  Config: partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

STATIC GROUP MEMBERSHIP (group.instance.id) ⭐:
  Consumer B restarts (rolling deploy) with same instance ID
  → If it rejoins within session.timeout.ms, the coordinator recognizes it → assigns SAME partitions → NO rebalance at all
  → Perfect for Kubernetes rolling deployments
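
The difference between the eager and cooperative timelines above comes down to which partitions are revoked. The sketch below is an illustration only: `RebalanceSketch` is a hypothetical name, and orphaned partitions are handed out round-robin rather than by a real assignor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RebalanceSketch {
    /** Cooperative: survivors keep what they own; only the dead member's partitions are handed out. */
    static Map<String, List<Integer>> cooperative(Map<String, List<Integer>> current, String deadMember) {
        Map<String, List<Integer>> next = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : current.entrySet()) {
            if (!e.getKey().equals(deadMember)) {
                next.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        List<String> survivors = new ArrayList<>(next.keySet());
        if (survivors.isEmpty()) {
            return next;   // nobody left to take over
        }
        List<Integer> orphans = current.getOrDefault(deadMember, new ArrayList<>());
        int i = 0;
        for (int orphan : orphans) {
            next.get(survivors.get(i++ % survivors.size())).add(orphan);
        }
        return next;
    }

    /** Partitions that stop being processed during the rebalance itself. */
    static int pausedPartitions(Map<String, List<Integer>> current, String deadMember, boolean eager) {
        int total = 0;
        int orphaned = 0;
        for (Map.Entry<String, List<Integer>> e : current.entrySet()) {
            total += e.getValue().size();
            if (e.getKey().equals(deadMember)) {
                orphaned += e.getValue().size();
            }
        }
        return eager ? total : orphaned;
    }
}
```

For three consumers with two partitions each, an eager rebalance pauses all 6 partitions while a cooperative one touches only the 2 that belonged to the failed consumer.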

3. Choosing the Right Partition Count ⭐

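Partition count is hard to change later: you can increase it, but doing so changes the key-to-partition mapping (hash(key) % numPartitions) and so breaks per-key ordering for existing keys; decreasing it is not supported at all without recreating the topic.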

partitions = max(
  target_throughput / throughput_per_partition,
  max_consumer_count_per_group
)

Example:
  Target: 100 MB/s for topic. Single partition: ~10 MB/s throughput.
  → 10 partitions minimum for throughput.
  But we want 20 consumers for processing → need 20 partitions.
  → Choose 20 partitions.
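
The sizing rule is small enough to write as a function. `PartitionCount.recommend` is a hypothetical helper; the per-partition throughput is an input you would measure on your own hardware, not a constant.

```java
final class PartitionCount {
    /** partitions = max(ceil(target / per-partition throughput), max consumers per group). */
    static int recommend(double targetMBps, double perPartitionMBps, int maxConsumersPerGroup) {
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(forThroughput, maxConsumersPerGroup);
    }
}
```

`recommend(100, 10, 20)` returns 20, matching the example: throughput alone needs 10 partitions, but 20 consumers need 20.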

Too Few: Limits consumer scaling and throughput; creates single broker hotspots.
Too Many: Exhausts file descriptors (each segment has .log, .index, .timeindex files), delays controller elections on crash, and slows down consumer rebalances.

4. Message Ordering Guarantees: Per-Partition vs Global

Kafka guarantees ordering strictly within a partition.

Global Ordering: Configure a topic with exactly 1 partition. Ensures total order but limits consumer scale and throughput to a single worker's capacity (~100 MB/s).
Per-Key Ordering (Default): Partition by entity ID (e.g., user_id). All events for a specific user land in the same partition and are processed in order.
Common Pitfalls:
- 1. Producer Retries without Idempotency:
  If the broker writes a message but the ACK is lost in the network, the retry writes a duplicate record. Retries can also reorder: with several requests in flight, message 1 fails while message 2 succeeds, and the retry of message 1 lands after it. Result: [msg2, msg1] written out of order. Solution: Enforce enable.idempotence=true (the default since Kafka 3.0), where the broker uses Producer ID (PID) + sequence numbers to reject duplicates and out-of-sequence batches.
- 2. Multiple Producers writing to the same partition:
  If Producer A writes an event at T=1, and Producer B writes an event at T=0 but it arrives late (at T=2) due to network congestion, the log stores them out of chronological order. Solution: Designate a single writer per partition key, or reorder by event timestamp at the consumer before aggregating.
- 3. Concurrent Consumer processing (multi-threading):
  If a single consumer thread pulls messages 1, 2, and 3, but delegates processing to a thread pool, message 3 may finish before message 1. This violates sequential ordering guarantees. Solution: Process each partition on a single thread within the consumer loop, or make downstream writes idempotent and order-independent.

SLOs & Error Budgets

Metric	Target	Rationale
Producer ack latency (acks=all)	< 10ms p99	Downstream services block on produce
Broker availability	99.95%	Cluster survives single broker loss with RF=3
Under-replicated partitions	0 sustained	Under-replicated = data loss risk on next failure
Consumer fetch latency	< 50ms p99	Includes poll interval + network + disk read

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Broker disk full — stops accepting writes	Log dir disk usage > 95%; produce requests fail with KafkaStorageException	Reduce retention.ms temporarily; add disks; enable tiered storage offload; delete unused topics; alert at 80% threshold
Hot partition: single partition at 10× traffic of others	Per-partition byte rate metric; one partition leader CPU 100%	Fix producer key distribution (avoid constant key); increase partitions and migrate (complex); move the hot partition's leader to a dedicated broker
Consumer group stuck — offset not advancing	Consumer lag metric flat at high value; no rebalance events	Check max.poll.interval.ms vs processing time; restart stuck consumer; skip poison offset to DLQ; verify broker availability for offset commit

Cost Drivers (Staff lens)

Disk storage: message rate × message size × retention = dominant (1M msg/sec × 2 KB × 7 days ≈ 1.2 PB before replication, ≈ 3.6 PB at RF=3)
Cross-AZ replication bandwidth: 2× write traffic (leader → followers)
Broker compute: modest compared to storage — network and disk I/O bound

Multi-Region & DR

MirrorMaker 2 for active-passive replication. Primary region serves producers and consumers; DR region has replica topics. Failover: redirect producers to DR (manual or automated). Active-active is possible but requires conflict resolution for compacted topics. Offsets are not identical across regions: MirrorMaker 2 emits checkpoints that translate committed offsets to the DR cluster, but translation lags the source, so consumers should expect some reprocessing after failover.