This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Kafka-adjacent problem — interviewers expect delivery semantics (at-most/at-least/exactly-once), consumer groups, offset management, ISR, DLQ, and backpressure. Draw the produce→consume flow. |
| Arch 50 | Add partition rebalancing, consumer lag monitoring, idempotent producers, and transactional consume-process-produce. |
| Arch 75 | Staff: compare with SQS/RabbitMQ, discuss min.insync.replicas trade-offs, and rebalance storm during deploy. |
Interview Prompt
Design a distributed worker queue system where producers enqueue tasks and worker pools consume and process them. Support consumer groups for parallel processing, configurable delivery guarantees, dead letter queues for poison messages, and backpressure when consumers fall behind.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| At-most-once, at-least-once, or exactly-once delivery? | Determines offset commit timing, idempotency requirements, and complexity. |
| Message ordering — global, per-partition, or none? | Ordering requires single consumer per partition; relax ordering for higher parallelism. |
| What's the message size and throughput? | 1 KB × 1M/sec vs 1 MB × 1K/sec drives partition count and storage design. |
| Processing time per message — ms or hours? | Long processing needs visibility timeout (SQS) or pause/resume (Kafka consumer pause). |
Scope
In scope
- Consumer groups and partition assignment
- Delivery semantics (at-least-once default)
- Offset tracking and commit strategies
- In-Sync Replicas (ISR) for durability
- Dead Letter Queue (DLQ) for failed messages
- Backpressure when consumer lag grows
Out of scope (state explicitly)
- Full Kafka broker internals (log segments, controller election)
- Stream processing (Flink) — mention as downstream
- Building a new consensus protocol
Assumptions
- 10M messages/day, 50K messages/sec peak
- At-least-once delivery with idempotent consumers
- Per-partition ordering required
- Max message size: 1 MB; typical: 2 KB
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Enqueue messages: Producers publish messages to named queues via exchanges/routers.
- Dequeue & process: Workers receive messages, process them, and send an ACK; the message is deleted permanently after ACK.
- At-least-once delivery: Unacknowledged messages are automatically redelivered to prevent data loss.
- Visibility timeout: When a worker retrieves a message, it becomes hidden from other workers for a configurable window; if no ACK is received before timeout, it becomes visible again.
- Dead Letter Queue (DLQ): Messages that fail processing N times are moved to a DLQ for offline analysis and debugging.
- Delay queues: Messages can be scheduled to become visible to consumers only after a configurable delay (up to 15 mins).
- Priority queues: Higher-priority messages (0-9 range) are processed before lower-priority ones.
- Message TTL: Messages automatically expire and are discarded if not consumed within a specified Time-To-Live window.
- Fan-out (pub/sub): One published message can be replicated to multiple bound queues via exchange routing.
- Flexible Routing: Support content-based routing (direct routing keys, wildcard topics, headers, or fan-out broadcasts).
- Request-Reply (RPC): Support synchronous-over-async communication using correlation IDs and temporary reply queues.
- Strict FIFO ordering: Optional strict ordering within a message group (using group IDs) with automatic deduplication.
- Durability: Messages must survive broker node crashes (persisted to disk WAL and replicated across a quorum).
- High Availability: 99.99% availability: queue blocks or downtime disrupts entire application processing pipelines.
- Throughput: 50K–100K msg/sec per node for active-memory queues (RabbitMQ) and virtually infinite with auto-sharding (SQS).
- Low Latency: Sub-5 ms p99 latency for in-memory broker operations; sub-20 ms for managed REST-based queues.
- Horizontal Scalability: Dynamically add nodes to increase storage capacity, message routing, and concurrent worker processing limits.
- Message Deduplication: Near-exactly-once within a deduplication window (at-least-once transport + dedup IDs on consumers).
- Backpressure Flow Control: Credit-based flow control to prevent workers from being overwhelmed (push model) and broker protection limits.
| Metric | Value |
|---|---|
| Messages / day | 5 Billion |
| Messages / sec (avg) | ~58,000 / sec |
| Messages / sec (peak) | 200,000 / sec |
| Average message size | 2 KB |
| Network throughput (avg) | 116 MB/s |
| Network throughput (peak) | 400 MB/s |
| Average time-in-queue | 50 ms (real-time) to 5 min (batch queues) |
| Active distinct queues | 50,000 |
| Workers / consumers | 200,000 |
| In-flight messages (unACKed) | 58K × 30s = ~1.74 Million |
| Storage (1-day backlog worst-case) | 5B × 2 KB = 10 TB (Transient Storage) |
Key Design Distinction from Kafka: Storage in worker queues is TRANSIENT. Messages are deleted immediately upon successful ACK. Thus, steady-state storage requirements are extremely small (just pending + in-flight messages), never scaling to petabytes of historical logs.
The HLD comprises four major layers: Producers, the Exchange/Router Layer, the Queue Store (Broker Cluster), and the Worker Pool. Unlike a log, the Broker actively tracks the state of each message.
1. Exchange / Router: Message Routing (RabbitMQ model) ⭐
Producers never send messages directly to a queue. Instead, they publish messages to an Exchange, which routes the messages to one or more queues based on Bindings and a Routing Key.
| Exchange Type | Routing Mechanics | Common Use Case |
|---|---|---|
| Direct | Routes to queues whose binding key exactly matches the message routing key (1-to-1). | routing_key='order.created' → queue 'order-processing' |
| Fanout | Routes to ALL bound queues, ignoring the routing key (broadcast / pub-sub). | order.created → queues 'email', 'sms-notification', and 'analytics-pipeline' |
| Topic | Matches routing keys using wildcards: * (matches exactly one word) and # (matches zero or more words). | binding 'order.*.shipped' matches 'order.us.shipped'; 'order.#' matches 'order.eu.west.processed' |
| Headers | Routes based on message header attributes rather than routing keys. Support 'x-match=all' or 'any'. | Route pdf parsing tasks based on header key 'x-type=pdf' directly to specialized workers. |
2. Queue Lifecycle & Internals ⭐
Unlike Kafka's append-only model, a worker queue manages individual state for each message. The broker maintains several internal structures per queue:
- Ready List: A FIFO list (or priority binary heap) of messages that are waiting to be dispatched to workers.
- Unacked Map: A hash map tracking active worker leases:
Map<delivery_tag → {message, consumer, expiry_time}>. - Delayed Set: A sorted set ordered by
visible_aftertimestamp, managed via a timer wheel or min-heap. - DLQ: A separate fallback queue for messages that have failed processing more than the threshold limit.
3. Storage Engine: Handling Random Deletes
Because messages are deleted instantly upon ACK, storage engines must handle random deletes efficiently. This prevents append-only performance degradation and database fragmentation.
- RabbitMQ (Classic Queues): Keep messages in memory. If memory usage gets too high, the broker pages messages to disk. ACKs trigger lazy deletes on disk, causing file fragmentation that requires garbage collection.
- RabbitMQ (Quorum Queues) ⭐: Raft consensus-based replicated WAL. Messages are appended to the WAL. ACKs are also appended as tombstone records. Segments are compacted and deleted once all messages inside a segment are ACKed.
- SQS-style (Distributed Hash + Storage): Standard distributed database layout where messages are entries.
| Message ID | Queue ID | Payload Body | Visible After (Unix TS) | Receive Count |
|---|---|---|---|---|
| uuid-1 | orders | {'{}'...'}'} | 1710000000 (Ready) | 0 |
| uuid-2 | orders | {'{}'...'}'} | 1710000030 (In-Flight) | 1 |
| uuid-3 | orders | {'{}'...'}'} | 1710000000 (Ready - Poison) | 3 |
4. Visibility Timeout: The Core Mechanism ⭐
This mechanism is the core protocol that achieves at-least-once delivery without heavy distributed transactions.
- A consumer calls
ReceiveMessage. The broker scans and retrieves a message (e.g.,M1). - The broker updates
M1.visible_after = now() + visibility_timeout(e.g., 30 seconds) and moves it to the Unacked Map. - The message is hidden from all other concurrent consumers during this 30-second lease.
- If the worker processes successfully, it calls
DeleteMessage(M1). The broker deletes the message from storage and memory. - If the worker crashes, the lease expires (
now() > visible_after). The message is automatically moved back to the Ready List.
Best Practice Tuning: Always set the visibility timeout to 2-3x the expected message processing time. - Too short (e.g. 5s for a 10s task): Leads to duplicate processing. - Too long (e.g. 10m for a 1s task): Node crashes lead to a 10-minute wait before retry.
5. Message Lifecycle Walkthrough
- A producer publishes a message to an exchange with a routing key.
- The exchange validates bindings and routes the message to appropriate queues.
- The broker persists the message (Quorum WAL write + replication).
- The broker sends a "publisher confirm" ACK back to the producer.
- The broker pushes the message to an idle worker (RabbitMQ) or awaits a long poll (SQS).
- The message status changes to IN-FLIGHT, making it invisible to other workers.
- The worker processes the task.
- The worker sends an ACK → Broker deletes the message permanently. If the worker fails, it sends a NACK → message is requeued immediately. If it crashes → visibility timeout expires and the message is requeued.
- If the delivery attempt count exceeds N, the broker moves the message to the DLQ.
6. Replication Mechanics
- Classic Mirrored Queues (Legacy): Replicates everything to all mirrors. It suffers from write amplification and is vulnerable to network partition split-brain issues.
- Quorum Queues (Raft) ⭐: Each queue is a Raft consensus group (majority-based replication). Producers write to the leader node, which replicates to a majority of followers before acknowledging the write. Raft guarantees a single leader, eliminating split-brains and providing bulletproof durability.
7. Push vs Pull Delivery Models
- RabbitMQ (Push-First): Consumers connect and register. The broker pushes messages to consumers up to a configured
prefetch_countlimit. This guarantees ultra-low sub-millisecond latencies, but requires active backpressure. - SQS (Pull / Long-Polling): Consumers poll the queue with a wait time (up to 20 seconds). If no messages are available, the broker holds the connection open, reducing empty responses. Highly scalable for serverless architectures.
Producer API (SQS-style / HTTP REST)
// Publish message
POST /queue/orders?Action=SendMessage
{
"MessageBody": "{'order_id': 99, 'user_id': 123}",
"DelaySeconds": 0,
"MessageAttributes": {
"x-priority": { "DataType": "Number", "StringValue": "9" }
},
"MessageDeduplicationId": "dedup-order-99",
"MessageGroupId": "user-123"
}
// Response: { "MessageId": "uuid-123-abc", "MD5OfMessageBody": "..." }Producer API (RabbitMQ-style / AMQP)
channel.confirmSelect(); // Enable publisher confirms
channel.basicPublish(
"order-exchange", // Exchange
"order.created", // Routing Key
Buffer.from(payload), // Message Body
{
deliveryMode: 2, // 2 = Persistent (durable)
priority: 9, // Priority 0-9
expiration: "60000", // Message TTL in milliseconds
correlationId: "rpc-req-456",
replyTo: "rpc-reply-queue"
}
);Consumer API (SQS-style / Pull)
// 1. Long poll for messages
POST /queue/orders?Action=ReceiveMessage&MaxNumberOfMessages=10&WaitTimeSeconds=20
// Response: messages[] with receiptHandle
// 2. Acknowledge (ACK) and delete message
POST /queue/orders?Action=DeleteMessage
{
"ReceiptHandle": "opaque-lease-token-xyz-123"
}
// 3. Extend processing lease
POST /queue/orders?Action=ChangeMessageVisibility
{
"ReceiptHandle": "opaque-lease-token-xyz-123",
"VisibilityTimeout": 60 // extend for another 60s
}Consumer API (RabbitMQ-style / Push)
channel.prefetch(10); // Prefetch limit to prevent starvation
channel.consume("order-processing-queue", (msg) => {
if (msg !== null) {
try {
processOrder(msg.content.toString());
channel.ack(msg); // Send ACK to delete message
} catch (err) {
if (shouldRetry(msg)) {
channel.nack(msg, false, true); // Requeue message
} else {
channel.reject(msg, false); // Send to DLQ
}
}
}
});Unlike event logs, which only store record arrays and offsets, the broker's storage must maintain comprehensive lease metadata, attempt counters, and priority metrics for every message.
Control Plane Metadata (ZooKeeper/KRaft Configs)
// GET /queues/orders/config
{
"queue_name": "orders",
"created_timestamp": 1710000000,
"config": {
"visibility_timeout_sec": 30,
"max_receive_count": 5,
"dead_letter_queue_target": "orders-dlq",
"retention_period_sec": 345600, // 4 days
"delay_seconds": 0,
"max_message_bytes": 262144, // 256 KB
"fifo_queue": true,
"content_based_deduplication": true
},
"bindings": [
{ "exchange": "order-events", "routing_key": "order.created" },
{ "exchange": "order-events", "routing_key": "order.cancelled" }
]
}| Potential Failure | System Solution Design |
|---|---|
| Broker Node Crash | Quorum queues use Raft consensus to automatically elect a new leader from followers. Replicated WAL logs prevent committed message loss. |
| Message Loss in Transit | Publisher confirms require the broker to safely persist and replicate a message before sending an ACK to the producer. UnACKed writes are retried. |
| Worker Node Crashes | The visibility timeout lease expires, causing the message to become visible in the Ready List again. Another worker automatically fetches it. |
| Poison Message (Infinite Loop) | The broker increments 'receive_count'. Once 'receive_count > max_receive_count', the message is immediately routed to the DLQ, unblocking the queue. |
| Network Partitions | Raft quorum prevents split-brain. The minority partition pauses writes, while the majority partition continues processing. |
| Worker Queue Overflow | Lazy queues page messages to disk aggressively to preserve RAM. Dynamic horizontal scale-out increases consumer processing capacity. |
1. Publisher Confirms: Absolute Zero Data Loss ⭐
Producers should never assume a sent message is safe. By default, standard publishers are asynchronous and fire-and-forget.Publisher Confirms guarantee message delivery.
- The producer puts the channel into confirm mode.
- The producer publishes a batch of messages.
- The broker receives, writes to WAL, replicates to a majority of quorum nodes, and only then returns a confirm ACK.
- If the broker fails before replicating, no ACK is sent. The producer's timeout fires, triggering a resend to the newly elected leader.
2. Race Condition: Double Processing (Visibility Timeout Race) ⭐
If a worker takes longer than the visibility timeout to process a message, a race condition occurs, leading to duplicate processing and data inconsistency.
Visibility Timeout Race Timeline: T=0s: Worker A retrieves message M1. visible_after = T + 30s. M1 is now invisible. T=25s: Worker A is still processing M1 due to a slow database call or GC pause. T=30s: M1's visibility timeout expires. M1 is marked as READY and visible again. T=31s: Worker B polls the queue, retrieves M1, and begins processing. T=35s: Worker A finishes processing and calls DeleteMessage(receipt_handle_A). T=36s: Worker B finishes processing and calls DeleteMessage(receipt_handle_B). Result: The message M1 is processed TWICE!
Solutions:
- ChangeMessageVisibility (Heartbeat) ⭐: The worker starts a background heartbeat thread. Every 20 seconds during processing, it sends a heartbeat to extend the visibility timeout for another 30 seconds.
- Idempotency Layer ⭐: The consumer must use the message ID as an idempotency key. It wraps database side effects in a conditional write or unique constraint check:
INSERT INTO processed_tasks (msg_id) VALUES (M1) ON CONFLICT DO NOTHING. - Receipt Handle Invalidation: SQS invalidates receipt handles when a message is redelivered. When Worker B retrieves M1, Worker A's old receipt handle becomes invalid. Worker A's delete request will fail, alerting the worker to abort or rollback.
3. Consumer Prefetch Starvation
Under a push model, if the broker pushes too many messages to a slow consumer, other idle consumers will starve, severely degrading system throughput.
Scenario: Queue has 1000 tasks. Prefetch count is set to 200. Consumer A is a slow node. Consumer B is a fast node. - Broker pushes 200 tasks to Consumer A. Consumer A is now busy for 200 seconds. - Broker pushes 200 tasks to Consumer B. Consumer B finishes in 5 seconds and is now idle. - Consumer B is starving, while Consumer A has 195 tasks sitting in its memory buffer! Fix: Keep the prefetch_count small (e.g. 10). This ensures fair work distribution.
4. FIFO Queue Head-of-Line (HoL) Blocking
Strict ordering requires messages within a group to be processed sequentially. If one message fails or processes slowly, it blocks all subsequent messages in that group.
- Cause: Group ID clustering. If all messages use
group_id="all-orders", the queue acts as a single thread. - Fix: Use highly fine-grained message group IDs (e.g.
group_id="order-id-xyz"). This allows messages across different groups to scale out in parallel, while enforcing sequential processing only for related actions.
5. Redelivery Storm After Broker Restart
When a broker crashes and restarts, all in-flight messages (e.g. 20,000 tasks) instantly expire and become visible at the same time. This causes a huge redelivery storm, flooding workers and potentially causing cascading out-of-memory crashes.
- Broker Staggering ⭐: On startup, the broker does not make all expired messages visible instantly. Instead, it staggers them (e.g., releasing 100 messages per second).
- Token Bucket Rate-Limiting: Consumers limit their own message ingestion rates, decoupling fetch rates from queue depth.
1. Delay Queue Implementations
- Approach 1: Visibility Expiry: Write the message with
visible_after = now() + delay_seconds. A cron scanner or background thread continuously polls the min-heap to push active messages to the ready queue. Highly accurate but scales poorly with high queue depths. - Approach 2: Hierarchical Timer Wheels ⭐: An O(1) scheduling structure. The broker maintains tiered circular arrays representing seconds, minutes, and hours. Ticks advance the wheel, dispatching scheduled messages instantly with zero query overhead.
- Approach 3: Redis Sorted Sets: Push delayed items to a sorted set using the target timestamp as the score:
ZADD delayed_queue <timestamp> <msg_body>. Workers poll usingZRANGEBYSCORE delayed_queue -inf <now>to fetch matured messages.
2. Priority Queue Implementation
A priority queue ensures high-priority tasks (e.g. transactional payments) bypass lower-priority ones (e.g. promo emails).
- Internal Ready Sub-Queues: The broker maintains distinct physical sub-queues for each priority level (e.g. Priority 1-10). The consumer dequeue selector always drains the highest non-empty queue first.
- Starvation Mitigation (Aging): If high-priority tasks keep arriving, low-priority tasks will starve. To mitigate this, a background worker increases the priority of messages that have been in the queue for too long.
3. Request-Reply Pattern (RPC over Queue)
Enables synchronous remote procedure calls over asynchronous worker queues.
- The RPC client creates a temporary, exclusive reply queue:
reply-queue-xyz. - The client publishes a request message to the
rpc-queue, attaching a uniquecorrelation_idandreply_to="reply-queue-xyz". - The client blocks and listens on the reply queue.
- The RPC server processes the message and publishes the response back to
reply-queue-xyz, preserving the originalcorrelation_id. - The client receives the response, matches the correlation ID, and returns the result.
Interview Walkthrough
- Frame as a decoupling problem: producers enqueue work, workers pull at their own pace — the queue absorbs traffic spikes.
- Explain at-least-once delivery: message stays invisible during processing (visibility timeout), returns to queue if not acked in time.
- Design the ack flow: worker processes → deletes or acks message → on failure, message reappears after visibility timeout expires.
- Introduce dead-letter queue (DLQ) for messages that exceed max retry count — prevents poison pills from blocking the main queue.
- Discuss priority queues and separate queues per task type to isolate slow jobs (video transcode) from fast ones (email send).
- Cover horizontal scaling: add workers freely because the queue is the shared buffer — no sticky routing required.
- Common pitfall: processing without idempotency when at-least-once delivery can duplicate messages — double charges or duplicate emails.
1. RabbitMQ vs SQS vs Kafka ⭐
Selecting the right broker technology depends heavily on whether your workload is task-driven (worker queues) or data-driven (event streaming logs).
| Feature Metric | RabbitMQ | AWS SQS | Apache Kafka |
|---|---|---|---|
| Architecture Model | Smart Broker (push) | Managed Distributed Queue (poll) | Durable Distributed Log (pull) |
| Message Lifecycle | Deleted instantly after ACK | Deleted instantly after ACK | Retained on disk for retention window |
| Max Throughput | ~50K msg/sec per node | Virtually infinite (sharded) | 1M+ msg/sec (sequential disk writes) |
| Routing Capabilities | Extremely rich (Exchanges, Topics) | None (queue-level only) | Partition key-based hashing only |
| Priority Support | Supported natively (0-255) | No (requires multiple queues) | No |
| Message Replay | No | No | Yes (seek consumer offset) ⭐ |
| Consumer Parallelism | Unlimited concurrent workers | Unlimited concurrent workers | Bounded by partition count |
2. Worker Queue vs. Event Log (Kafka)
| Architectural Dimension | Worker Queue (RabbitMQ / SQS) | Event Log (Kafka) |
|---|---|---|
| Mental Model | Shared Task List / Inbox | Immutable Event Ledger |
| Consumer Progress | Broker tracks individual message state | Consumer maintains its own offset pointer |
| Scale-out Parallelism | Add consumers freely to distribute the load | Adding consumers beyond partition count does nothing |
| Storage Characteristics | Transient. Bounded by active backlog. | Durable. Grows with retention limits. |
3. Credit-Based Flow Control (Backpressure)
To protect the broker and workers from memory overload, systems implement a multi-layered backpressure architecture:
- Level 1: Consumer Prefetch Limit: The broker only pushes up to N messages to a consumer before requiring ACKs.
- Level 2: Connection Blocking: If a worker stops ACKing, the broker pauses TCP reads from that worker's connection socket.
- Level 3: Broker Memory Alarm: If memory usage exceeds 40% of host RAM, the broker blocks all incoming publisher connections, preventing writes.
- Level 4: Broker Disk Alarm: If free disk space drops below 50 MB, the broker blocks all publishers, preserving remaining disk space for ACKs.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Simple queue (SQS/RabbitMQ)
Single queue, multiple workers pull messages. Visibility timeout for at-least-once. DLQ after 3 retries. No ordering guarantee.
Key components: SQS/RabbitMQ · Worker pool · Visibility timeout · DLQ · CloudWatch lag metrics
Move to next phase when: Need per-key ordering and higher throughput than SQS (3K msg/sec limit)
Phase 2 — Partitioned log (Kafka-style)
Topics with partitions for parallelism. Consumer groups with cooperative rebalance. Manual offset commit. ISR with acks=all. Idempotent producers.
Key components: Kafka · Consumer groups · Manual offset commit · ISR · Idempotent producer · DLQ topic
Move to next phase when: Consumer lag during peak; need backpressure and autoscaling
Phase 3 — Managed streaming with autoscaling
Autoscaling consumer groups based on lag. Transactional consume-process-produce for exactly-once. Schema registry for message evolution. Cross-region replication for DR.
Key components: Lag-based autoscaling · Kafka transactions · Schema registry · MirrorMaker DR · Backpressure API
Move to next phase when: Exactly-once required for billing events; cross-region disaster recovery
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Message delivery latency (produce to consume) | < 5s p99 | Queue exists to decouple — not add minutes of delay |
| Consumer lag | < 10K messages sustained | Growing lag = processing falling behind |
| DLQ rate | < 0.01% of messages | Higher = poison messages or systemic processing bug |
| Data loss rate | 0% | acks=all + min.insync.replicas=2 — non-negotiable for tasks |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Poison message blocks partition processing | Single partition lag growing; consumer logs show repeated exception on same offset | Skip to next offset (manual intervention); route to DLQ; fix consumer bug; replay from DLQ after fix |
| Consumer group rebalance storm during deploy | Rebalance events > 20/min; processing pauses; lag spike | Switch to cooperative-sticky assignor; enable static membership; increase session timeout; pause deploy, investigate |
| Broker failure with acks=1 — message loss detected | Producer acks received but message missing from topic; broker died before replication | Upgrade to acks=all; set min.insync.replicas=2; audit unclean leader election settings; replay from producer logs if available |
Cost Drivers (Staff lens)
- Storage: retention × message size × throughput (7-day retention at 50K msg/sec ≈ 60 TB)
- Consumer compute: scales with lag — dominant during catch-up
- Cross-AZ replication bandwidth for ISR
Multi-Region & DR
Single-region consumer groups initially. Cross-region: MirrorMaker replicates topics; consumers in DR region process from replica (lag acceptable for DR). Active-active requires careful offset management — avoid dual-processing. Failover: redirect producers to DR cluster; consumers start from committed offsets on DR.