Design a Distributed Worker Queue (RabbitMQ / SQS)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Kafka-adjacent problem — interviewers expect delivery semantics (at-most/at-least/exactly-once), consumer groups, offset management, ISR, DLQ, and backpressure. Draw the produce→consume flow.
Arch 50	Add partition rebalancing, consumer lag monitoring, idempotent producers, and transactional consume-process-produce.
Arch 75	Staff: compare with SQS/RabbitMQ, discuss min.insync.replicas trade-offs, and rebalance storm during deploy.

Interview Prompt

Design a distributed worker queue system where producers enqueue tasks and worker pools consume and process them. Support consumer groups for parallel processing, configurable delivery guarantees, dead letter queues for poison messages, and backpressure when consumers fall behind.

Clarifying Questions (ask before designing)

Question	Why it matters
At-most-once, at-least-once, or exactly-once delivery?	Determines offset commit timing, idempotency requirements, and complexity.
Message ordering — global, per-partition, or none?	Ordering requires single consumer per partition; relax ordering for higher parallelism.
What's the message size and throughput?	1 KB × 1M/sec vs 1 MB × 1K/sec drives partition count and storage design.
Processing time per message — ms or hours?	Long processing needs visibility timeout (SQS) or pause/resume (Kafka consumer pause).

Scope

In scope

Consumer groups and partition assignment
Delivery semantics (at-least-once default)
Offset tracking and commit strategies
In-Sync Replicas (ISR) for durability
Dead Letter Queue (DLQ) for failed messages
Backpressure when consumer lag grows

Out of scope (state explicitly)

Full Kafka broker internals (log segments, controller election)
Stream processing (Flink) — mention as downstream
Building a new consensus protocol

Assumptions

10M messages/day, 50K messages/sec peak
At-least-once delivery with idempotent consumers
Per-partition ordering required
Max message size: 1 MB; typical: 2 KB

Enqueue messages: Producers publish messages to named queues via exchanges/routers.
Dequeue & process: Workers receive messages, process them, and send an ACK; the message is deleted permanently after ACK.
At-least-once delivery: Unacknowledged messages are automatically redelivered to prevent data loss.
Visibility timeout: When a worker retrieves a message, it becomes hidden from other workers for a configurable window; if no ACK is received before timeout, it becomes visible again.
Dead Letter Queue (DLQ): Messages that fail processing N times are moved to a DLQ for offline analysis and debugging.
Delay queues: Messages can be scheduled to become visible to consumers only after a configurable delay (up to 15 mins).
Priority queues: Higher-priority messages (0-9 range) are processed before lower-priority ones.
Message TTL: Messages automatically expire and are discarded if not consumed within a specified Time-To-Live window.
Fan-out (pub/sub): One published message can be replicated to multiple bound queues via exchange routing.
Flexible Routing: Support content-based routing (direct routing keys, wildcard topics, headers, or fan-out broadcasts).
Request-Reply (RPC): Support synchronous-over-async communication using correlation IDs and temporary reply queues.
Strict FIFO ordering: Optional strict ordering within a message group (using group IDs) with automatic deduplication.

Metric	Value
Messages / day	5 Billion
Messages / sec (avg)	~58,000 / sec
Messages / sec (peak)	200,000 / sec
Average message size	2 KB
Network throughput (avg)	116 MB/s
Network throughput (peak)	400 MB/s
Average time-in-queue	50 ms (real-time) to 5 min (batch queues)
Active distinct queues	50,000
Workers / consumers	200,000
In-flight messages (unACKed)	58K × 30s = ~1.74 Million
Storage (1-day backlog worst-case)	5B × 2 KB = 10 TB (Transient Storage)

Key Design Distinction from Kafka:
Storage in worker queues is TRANSIENT. Messages are deleted immediately upon successful ACK. 
Thus, steady-state storage requirements are extremely small (just pending + in-flight messages),
never scaling to petabytes of historical logs.

The HLD comprises four major layers: Producers, the Exchange/Router Layer, the Queue Store (Broker Cluster), and the Worker Pool. Unlike a log, the Broker actively tracks the state of each message.

Loading...

1. Exchange / Router: Message Routing (RabbitMQ model) ⭐

Producers never send messages directly to a queue. Instead, they publish messages to an Exchange, which routes the messages to one or more queues based on Bindings and a Routing Key.

Exchange Type	Routing Mechanics	Common Use Case
Direct	Routes to queues whose binding key exactly matches the message routing key (1-to-1).	routing_key='order.created' → queue 'order-processing'
Fanout	Routes to ALL bound queues, ignoring the routing key (broadcast / pub-sub).	order.created → queues 'email', 'sms-notification', and 'analytics-pipeline'
Topic	Matches routing keys using wildcards: * (matches exactly one word) and # (matches zero or more words).	binding 'order.*.shipped' matches 'order.us.shipped'; 'order.#' matches 'order.eu.west.processed'
Headers	Routes based on message header attributes rather than routing keys. Support 'x-match=all' or 'any'.	Route pdf parsing tasks based on header key 'x-type=pdf' directly to specialized workers.

2. Queue Lifecycle & Internals ⭐

Unlike Kafka's append-only model, a worker queue manages individual state for each message. The broker maintains several internal structures per queue:

Ready List: A FIFO list (or priority binary heap) of messages that are waiting to be dispatched to workers.
Unacked Map: A hash map tracking active worker leases: Map<delivery_tag → {message, consumer, expiry_time}>.
Delayed Set: A sorted set ordered by visible_after timestamp, managed via a timer wheel or min-heap.
DLQ: A separate fallback queue for messages that have failed processing more than the threshold limit.

Loading...

3. Storage Engine: Handling Random Deletes

Because messages are deleted instantly upon ACK, storage engines must handle random deletes efficiently. This prevents append-only performance degradation and database fragmentation.

RabbitMQ (Classic Queues): Keep messages in memory. If memory usage gets too high, the broker pages messages to disk. ACKs trigger lazy deletes on disk, causing file fragmentation that requires garbage collection.
RabbitMQ (Quorum Queues) ⭐: Raft consensus-based replicated WAL. Messages are appended to the WAL. ACKs are also appended as tombstone records. Segments are compacted and deleted once all messages inside a segment are ACKed.
SQS-style (Distributed Hash + Storage): Standard distributed database layout where messages are entries.

Message ID	Queue ID	Payload Body	Visible After (Unix TS)	Receive Count
uuid-1	orders	{'{}'...'}'}	1710000000 (Ready)	0
uuid-2	orders	{'{}'...'}'}	1710000030 (In-Flight)	1
uuid-3	orders	{'{}'...'}'}	1710000000 (Ready - Poison)	3

4. Visibility Timeout: The Core Mechanism ⭐

This mechanism is the core protocol that achieves at-least-once delivery without heavy distributed transactions.

A consumer calls ReceiveMessage. The broker scans and retrieves a message (e.g., M1).
The broker updates M1.visible_after = now() + visibility_timeout (e.g., 30 seconds) and moves it to the Unacked Map.
The message is hidden from all other concurrent consumers during this 30-second lease.
If the worker processes successfully, it calls DeleteMessage(M1). The broker deletes the message from storage and memory.
If the worker crashes, the lease expires (now() > visible_after). The message is automatically moved back to the Ready List.

Best Practice Tuning:
Always set the visibility timeout to 2-3x the expected message processing time.
- Too short (e.g. 5s for a 10s task): Leads to duplicate processing.
- Too long (e.g. 10m for a 1s task): Node crashes lead to a 10-minute wait before retry.

5. Message Lifecycle Walkthrough

A producer publishes a message to an exchange with a routing key.
The exchange validates bindings and routes the message to appropriate queues.
The broker persists the message (Quorum WAL write + replication).
The broker sends a "publisher confirm" ACK back to the producer.
The broker pushes the message to an idle worker (RabbitMQ) or awaits a long poll (SQS).
The message status changes to IN-FLIGHT, making it invisible to other workers.
The worker processes the task.
The worker sends an ACK → Broker deletes the message permanently. If the worker fails, it sends a NACK → message is requeued immediately. If it crashes → visibility timeout expires and the message is requeued.
If the delivery attempt count exceeds N, the broker moves the message to the DLQ.

6. Replication Mechanics

Classic Mirrored Queues (Legacy): Replicates everything to all mirrors. It suffers from write amplification and is vulnerable to network partition split-brain issues.
Quorum Queues (Raft) ⭐: Each queue is a Raft consensus group (majority-based replication). Producers write to the leader node, which replicates to a majority of followers before acknowledging the write. Raft guarantees a single leader, eliminating split-brains and providing bulletproof durability.

7. Push vs Pull Delivery Models

RabbitMQ (Push-First): Consumers connect and register. The broker pushes messages to consumers up to a configuredprefetch_count limit. This guarantees ultra-low sub-millisecond latencies, but requires active backpressure.
SQS (Pull / Long-Polling): Consumers poll the queue with a wait time (up to 20 seconds). If no messages are available, the broker holds the connection open, reducing empty responses. Highly scalable for serverless architectures.

Producer API (SQS-style / HTTP REST)

JAVASCRIPT

// Publish message
POST /queue/orders?Action=SendMessage
{
  "MessageBody": "{'order_id': 99, 'user_id': 123}",
  "DelaySeconds": 0,
  "MessageAttributes": {
    "x-priority": { "DataType": "Number", "StringValue": "9" }
  },
  "MessageDeduplicationId": "dedup-order-99",
  "MessageGroupId": "user-123"
}
// Response: { "MessageId": "uuid-123-abc", "MD5OfMessageBody": "..." }

Producer API (RabbitMQ-style / AMQP)

TYPESCRIPT

channel.confirmSelect(); // Enable publisher confirms
channel.basicPublish(
  "order-exchange",      // Exchange
  "order.created",       // Routing Key
  Buffer.from(payload),  // Message Body
  {
    deliveryMode: 2,     // 2 = Persistent (durable)
    priority: 9,         // Priority 0-9
    expiration: "60000", // Message TTL in milliseconds
    correlationId: "rpc-req-456",
    replyTo: "rpc-reply-queue"
  }
);

Consumer API (SQS-style / Pull)

JAVASCRIPT

// 1. Long poll for messages
POST /queue/orders?Action=ReceiveMessage&MaxNumberOfMessages=10&WaitTimeSeconds=20
// Response: messages[] with receiptHandle

// 2. Acknowledge (ACK) and delete message
POST /queue/orders?Action=DeleteMessage
{
  "ReceiptHandle": "opaque-lease-token-xyz-123"
}

// 3. Extend processing lease
POST /queue/orders?Action=ChangeMessageVisibility
{
  "ReceiptHandle": "opaque-lease-token-xyz-123",
  "VisibilityTimeout": 60 // extend for another 60s
}

Consumer API (RabbitMQ-style / Push)

TYPESCRIPT

channel.prefetch(10); // Prefetch limit to prevent starvation
channel.consume("order-processing-queue", (msg) => {
  if (msg !== null) {
    try {
      processOrder(msg.content.toString());
      channel.ack(msg); // Send ACK to delete message
    } catch (err) {
      if (shouldRetry(msg)) {
        channel.nack(msg, false, true); // Requeue message
      } else {
        channel.reject(msg, false); // Send to DLQ
      }
    }
  }
});

Unlike event logs, which only store record arrays and offsets, the broker's storage must maintain comprehensive lease metadata, attempt counters, and priority metrics for every message.

Loading...

Control Plane Metadata (ZooKeeper/KRaft Configs)

JSON

// GET /queues/orders/config
{
  "queue_name": "orders",
  "created_timestamp": 1710000000,
  "config": {
    "visibility_timeout_sec": 30,
    "max_receive_count": 5,
    "dead_letter_queue_target": "orders-dlq",
    "retention_period_sec": 345600, // 4 days
    "delay_seconds": 0,
    "max_message_bytes": 262144, // 256 KB
    "fifo_queue": true,
    "content_based_deduplication": true
  },
  "bindings": [
    { "exchange": "order-events", "routing_key": "order.created" },
    { "exchange": "order-events", "routing_key": "order.cancelled" }
  ]
}

Potential Failure	System Solution Design
Broker Node Crash	Quorum queues use Raft consensus to automatically elect a new leader from followers. Replicated WAL logs prevent committed message loss.
Message Loss in Transit	Publisher confirms require the broker to safely persist and replicate a message before sending an ACK to the producer. UnACKed writes are retried.
Worker Node Crashes	The visibility timeout lease expires, causing the message to become visible in the Ready List again. Another worker automatically fetches it.
Poison Message (Infinite Loop)	The broker increments 'receive_count'. Once 'receive_count > max_receive_count', the message is immediately routed to the DLQ, unblocking the queue.
Network Partitions	Raft quorum prevents split-brain. The minority partition pauses writes, while the majority partition continues processing.
Worker Queue Overflow	Lazy queues page messages to disk aggressively to preserve RAM. Dynamic horizontal scale-out increases consumer processing capacity.

1. Publisher Confirms: Absolute Zero Data Loss ⭐

Producers should never assume a sent message is safe. By default, standard publishers are asynchronous and fire-and-forget.Publisher Confirms guarantee message delivery.

The producer puts the channel into confirm mode.
The producer publishes a batch of messages.
The broker receives, writes to WAL, replicates to a majority of quorum nodes, and only then returns a confirm ACK.
If the broker fails before replicating, no ACK is sent. The producer's timeout fires, triggering a resend to the newly elected leader.

2. Race Condition: Double Processing (Visibility Timeout Race) ⭐

If a worker takes longer than the visibility timeout to process a message, a race condition occurs, leading to duplicate processing and data inconsistency.

Visibility Timeout Race Timeline:
  T=0s:  Worker A retrieves message M1. visible_after = T + 30s. M1 is now invisible.
  T=25s: Worker A is still processing M1 due to a slow database call or GC pause.
  T=30s: M1's visibility timeout expires. M1 is marked as READY and visible again.
  T=31s: Worker B polls the queue, retrieves M1, and begins processing.
  T=35s: Worker A finishes processing and calls DeleteMessage(receipt_handle_A).
  T=36s: Worker B finishes processing and calls DeleteMessage(receipt_handle_B).

Result: The message M1 is processed TWICE!

Solutions:

ChangeMessageVisibility (Heartbeat) ⭐: The worker starts a background heartbeat thread. Every 20 seconds during processing, it sends a heartbeat to extend the visibility timeout for another 30 seconds.
Idempotency Layer ⭐: The consumer must use the message ID as an idempotency key. It wraps database side effects in a conditional write or unique constraint check:INSERT INTO processed_tasks (msg_id) VALUES (M1) ON CONFLICT DO NOTHING.
Receipt Handle Invalidation: SQS invalidates receipt handles when a message is redelivered. When Worker B retrieves M1, Worker A's old receipt handle becomes invalid. Worker A's delete request will fail, alerting the worker to abort or rollback.

3. Consumer Prefetch Starvation

Under a push model, if the broker pushes too many messages to a slow consumer, other idle consumers will starve, severely degrading system throughput.

Scenario: Queue has 1000 tasks. Prefetch count is set to 200.
Consumer A is a slow node. Consumer B is a fast node.
- Broker pushes 200 tasks to Consumer A. Consumer A is now busy for 200 seconds.
- Broker pushes 200 tasks to Consumer B. Consumer B finishes in 5 seconds and is now idle.
- Consumer B is starving, while Consumer A has 195 tasks sitting in its memory buffer!

Fix: Keep the prefetch_count small (e.g. 10). This ensures fair work distribution.

4. FIFO Queue Head-of-Line (HoL) Blocking

Strict ordering requires messages within a group to be processed sequentially. If one message fails or processes slowly, it blocks all subsequent messages in that group.

Cause: Group ID clustering. If all messages use group_id="all-orders", the queue acts as a single thread.
Fix: Use highly fine-grained message group IDs (e.g. group_id="order-id-xyz"). This allows messages across different groups to scale out in parallel, while enforcing sequential processing only for related actions.

5. Redelivery Storm After Broker Restart

When a broker crashes and restarts, all in-flight messages (e.g. 20,000 tasks) instantly expire and become visible at the same time. This causes a huge redelivery storm, flooding workers and potentially causing cascading out-of-memory crashes.

Broker Staggering ⭐: On startup, the broker does not make all expired messages visible instantly. Instead, it staggers them (e.g., releasing 100 messages per second).
Token Bucket Rate-Limiting: Consumers limit their own message ingestion rates, decoupling fetch rates from queue depth.

1. Delay Queue Implementations

Approach 1: Visibility Expiry: Write the message with visible_after = now() + delay_seconds. A cron scanner or background thread continuously polls the min-heap to push active messages to the ready queue. Highly accurate but scales poorly with high queue depths.
Approach 2: Hierarchical Timer Wheels ⭐: An O(1) scheduling structure. The broker maintains tiered circular arrays representing seconds, minutes, and hours. Ticks advance the wheel, dispatching scheduled messages instantly with zero query overhead.
Approach 3: Redis Sorted Sets: Push delayed items to a sorted set using the target timestamp as the score:ZADD delayed_queue <timestamp> <msg_body>. Workers poll usingZRANGEBYSCORE delayed_queue -inf <now> to fetch matured messages.

2. Priority Queue Implementation

A priority queue ensures high-priority tasks (e.g. transactional payments) bypass lower-priority ones (e.g. promo emails).

Internal Ready Sub-Queues: The broker maintains distinct physical sub-queues for each priority level (e.g. Priority 1-10). The consumer dequeue selector always drains the highest non-empty queue first.
Starvation Mitigation (Aging): If high-priority tasks keep arriving, low-priority tasks will starve. To mitigate this, a background worker increases the priority of messages that have been in the queue for too long.

3. Request-Reply Pattern (RPC over Queue)

Enables synchronous remote procedure calls over asynchronous worker queues.

The RPC client creates a temporary, exclusive reply queue: reply-queue-xyz.
The client publishes a request message to the rpc-queue, attaching a unique correlation_id and reply_to="reply-queue-xyz".
The client blocks and listens on the reply queue.
The RPC server processes the message and publishes the response back to reply-queue-xyz, preserving the original correlation_id.
The client receives the response, matches the correlation ID, and returns the result.

Interview Walkthrough

Frame as a decoupling problem: producers enqueue work, workers pull at their own pace — the queue absorbs traffic spikes.
Explain at-least-once delivery: message stays invisible during processing (visibility timeout), returns to queue if not acked in time.
Design the ack flow: worker processes → deletes or acks message → on failure, message reappears after visibility timeout expires.
Introduce dead-letter queue (DLQ) for messages that exceed max retry count — prevents poison pills from blocking the main queue.
Discuss priority queues and separate queues per task type to isolate slow jobs (video transcode) from fast ones (email send).
Cover horizontal scaling: add workers freely because the queue is the shared buffer — no sticky routing required.
Common pitfall: processing without idempotency when at-least-once delivery can duplicate messages — double charges or duplicate emails.

1. RabbitMQ vs SQS vs Kafka ⭐

Selecting the right broker technology depends heavily on whether your workload is task-driven (worker queues) or data-driven (event streaming logs).

Feature Metric	RabbitMQ	AWS SQS	Apache Kafka
Architecture Model	Smart Broker (push)	Managed Distributed Queue (poll)	Durable Distributed Log (pull)
Message Lifecycle	Deleted instantly after ACK	Deleted instantly after ACK	Retained on disk for retention window
Max Throughput	~50K msg/sec per node	Virtually infinite (sharded)	1M+ msg/sec (sequential disk writes)
Routing Capabilities	Extremely rich (Exchanges, Topics)	None (queue-level only)	Partition key-based hashing only
Priority Support	Supported natively (0-255)	No (requires multiple queues)	No
Message Replay	No	No	Yes (seek consumer offset) ⭐
Consumer Parallelism	Unlimited concurrent workers	Unlimited concurrent workers	Bounded by partition count

2. Worker Queue vs. Event Log (Kafka)

Architectural Dimension	Worker Queue (RabbitMQ / SQS)	Event Log (Kafka)
Mental Model	Shared Task List / Inbox	Immutable Event Ledger
Consumer Progress	Broker tracks individual message state	Consumer maintains its own offset pointer
Scale-out Parallelism	Add consumers freely to distribute the load	Adding consumers beyond partition count does nothing
Storage Characteristics	Transient. Bounded by active backlog.	Durable. Grows with retention limits.

3. Credit-Based Flow Control (Backpressure)

To protect the broker and workers from memory overload, systems implement a multi-layered backpressure architecture:

Level 1: Consumer Prefetch Limit: The broker only pushes up to N messages to a consumer before requiring ACKs.
Level 2: Connection Blocking: If a worker stops ACKing, the broker pauses TCP reads from that worker's connection socket.
Level 3: Broker Memory Alarm: If memory usage exceeds 40% of host RAM, the broker blocks all incoming publisher connections, preventing writes.
Level 4: Broker Disk Alarm: If free disk space drops below 50 MB, the broker blocks all publishers, preserving remaining disk space for ACKs.

SLOs & Error Budgets

Metric	Target	Rationale
Message delivery latency (produce to consume)	< 5s p99	Queue exists to decouple — not add minutes of delay
Consumer lag	< 10K messages sustained	Growing lag = processing falling behind
DLQ rate	< 0.01% of messages	Higher = poison messages or systemic processing bug
Data loss rate	0%	acks=all + min.insync.replicas=2 — non-negotiable for tasks

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Poison message blocks partition processing	Single partition lag growing; consumer logs show repeated exception on same offset	Skip to next offset (manual intervention); route to DLQ; fix consumer bug; replay from DLQ after fix
Consumer group rebalance storm during deploy	Rebalance events > 20/min; processing pauses; lag spike	Switch to cooperative-sticky assignor; enable static membership; increase session timeout; pause deploy, investigate
Broker failure with acks=1 — message loss detected	Producer acks received but message missing from topic; broker died before replication	Upgrade to acks=all; set min.insync.replicas=2; audit unclean leader election settings; replay from producer logs if available

Cost Drivers (Staff lens)

Storage: retention × message size × throughput (7-day retention at 50K msg/sec ≈ 60 TB)
Consumer compute: scales with lag — dominant during catch-up
Cross-AZ replication bandwidth for ISR

Multi-Region & DR

Single-region consumer groups initially. Cross-region: MirrorMaker replicates topics; consumers in DR region process from replica (lag acceptable for DR). Active-active requires careful offset management — avoid dual-processing. Failover: redirect producers to DR cluster; consumers start from committed offsets on DR.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

1. Exchange / Router: Message Routing (RabbitMQ model) ⭐

2. Queue Lifecycle & Internals ⭐

3. Storage Engine: Handling Random Deletes

4. Visibility Timeout: The Core Mechanism ⭐

5. Message Lifecycle Walkthrough

6. Replication Mechanics

7. Push vs Pull Delivery Models

Producer API (SQS-style / HTTP REST)

Producer API (RabbitMQ-style / AMQP)

Consumer API (SQS-style / Pull)

Consumer API (RabbitMQ-style / Push)

Control Plane Metadata (ZooKeeper/KRaft Configs)

1. Publisher Confirms: Absolute Zero Data Loss ⭐

2. Race Condition: Double Processing (Visibility Timeout Race) ⭐

3. Consumer Prefetch Starvation

4. FIFO Queue Head-of-Line (HoL) Blocking

5. Redelivery Storm After Broker Restart

1. Delay Queue Implementations

2. Priority Queue Implementation

3. Request-Reply Pattern (RPC over Queue)

Interview Walkthrough

1. RabbitMQ vs SQS vs Kafka ⭐

2. Worker Queue vs. Event Log (Kafka)

3. Credit-Based Flow Control (Backpressure)

Phase 1 — Simple queue (SQS/RabbitMQ)

Phase 2 — Partitioned log (Kafka-style)

Phase 3 — Managed streaming with autoscaling

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR