This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Theoretical backbone — staff interviewers expect you to explain leader election, log replication, and safety without hand-waving. Know Raft's Figure 2 rules. |
| Arch 50 | Compare Raft vs Multi-Paxos vs Zab (ZooKeeper). When to build vs use etcd/consul. |
| Arch 75 | Membership changes (joint consensus), split-brain prevention, and operational failure scenarios. |
Interview Prompt
Design a distributed consensus system (like Raft or Paxos) that allows a cluster of nodes to agree on a sequence of values despite failures.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| How many nodes can fail (f)? | Quorum = 2f+1. 5 nodes tolerate 2 failures. Drives cluster sizing. |
| Are we building consensus or using it (e.g., for a KV store)? | Building Raft is the interview; using etcd is the production answer. |
| What values are we agreeing on — config changes, client writes, or both? | Membership changes require joint consensus — hardest part of Raft. |
| Synchronous or asynchronous network model? | Raft assumes eventually-delivered messages; safety holds regardless of timing. |
Scope
In scope
- Leader election
- Log replication
- Safety (election safety, log matching, leader completeness)
- Fault tolerance (crash failures, not Byzantine)
Out of scope (state explicitly)
- Byzantine fault tolerance (PBFT)
- Full production KV store on top (mention as application layer)
- Performance optimization (batching, pipeline) — mention briefly
Assumptions
- Crash-stop failures only (not malicious nodes)
- 5-node cluster (tolerates 2 failures)
- Persistent storage on each node (WAL)
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Implement a replicated state machine across N nodes (typically 3, 5, or 7)
- Leader election: automatically elect a leader; re-elect on failure
- Log replication: leader replicates log entries to followers in order
- Linearizable reads and writes: clients see the most recent committed value
- Membership changes: add/remove nodes without downtime (joint consensus)
- Snapshot support: compact log by snapshotting state machine
- Client request forwarding: followers redirect to leader
- Safety: Never return incorrect results, even during partitions (no split-brain)
- Liveness: System makes progress as long as majority of nodes are alive (N/2 + 1)
- Latency: Write committed in 1 round-trip (leader → majority followers)
- Durability: Committed entries never lost (persisted to stable storage before ACK)
- Availability: Tolerate (N-1)/2 node failures (e.g., 2 of 5)
- Deterministic: Same log → same state machine state on all nodes
| Metric | Calculation | Value |
|---|---|---|
| Cluster size | Given (assumption documented in value) | 3 (dev), 5 (prod), 7 (highly critical) |
| Writes / sec | From Writes / day ÷ 86400 (+ peak factor in value) | 10K–100K |
| Read / sec | From Read / day ÷ 86400 (+ peak factor in value) | 100K+ (with read-only followers) |
| Log entry size | Given | ~100–500 bytes |
| Log growth | 100K entries/sec × 200 bytes | 20 MB/sec |
| Snapshot interval | Given (assumption documented in value) | Every 10K entries or 100 MB |
| Leader election time | Given (assumption documented in value) | 150–300ms (Raft election timeout) |
| Heartbeat interval | Given (assumption documented in value) | 50–100ms |
Raft Leader Election
Normal operation:
- Leader sends heartbeats (empty AppendEntries) every 100ms
- Followers reset election timer on heartbeat receipt
Leader failure:
1. Follower's election timer expires (random: 150-300ms)
2. Follower increments term, transitions to CANDIDATE
3. Votes for self, sends RequestVote to all peers
4. Includes: candidate's term, last log index, last log term
5. Other nodes vote YES if: candidate's term >= voter's term, voter hasn't voted,
candidate's log is at least as up-to-date
6. If candidate gets majority → becomes LEADER
7. If another leader discovered (higher term) → revert to FOLLOWERLog Replication
1. Client sends write request to leader
2. Leader appends entry to local log: {term, index, command}
3. Leader sends AppendEntries RPC to all followers
4. Follower checks log consistency at prev_log_index + prev_log_term
5. If mismatch → follower rejects; leader backtracks until logs match
6. When majority reply success → leader commits entry, applies to state machine
7. Commit index propagated to followers in next heartbeatRaft State Machine (Follower → Candidate → Leader)
FOLLOWER: Accept AppendEntries from leader; reset election timer on heartbeat. CANDIDATE: On election timeout → increment term, vote for self, RequestVote RPC. LEADER: Send heartbeats; replicate client writes via AppendEntries; commit when majority ack. Safety invariant: if entry committed in term T, it appears in logs of all future leaders.
Log Matching & Consistency Check
AppendEntries carries prev_log_index + prev_log_term. Follower rejects if its log doesn't match at that index — forces leader to decrement until logs align. This overwrites conflicting uncommitted entries on minority partitions after heal.
Snapshotting & Log Compaction
Without snapshots: log grows unbounded → replay on restart takes minutes. InstallSnapshot RPC: leader sends compressed state machine snapshot + last included index/term. Follower discards log prefix before snapshot index. Trade-off: snapshot frequency vs replay time.
Event Bus Design (Kafka)
Topic: distributed_consensus_raft_paxos-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "distributed_consensus_raft_paxos-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: distributed_consensus_raft_paxos-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Distributed Consensus System (Raft / Paxos): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Inter-Node RPCs (Raft Protocol)
service RaftNode {
rpc AppendEntries(AppendEntriesRequest) returns (AppendEntriesResponse);
rpc RequestVote(RequestVoteRequest) returns (RequestVoteResponse);
rpc InstallSnapshot(InstallSnapshotRequest) returns (InstallSnapshotResponse);
}
message AppendEntriesRequest {
uint64 term = 1; string leader_id = 2;
uint64 prev_log_index = 3; uint64 prev_log_term = 4;
repeated LogEntry entries = 5; uint64 leader_commit = 6;
}
message LogEntry {
uint64 term = 1; uint64 index = 2;
bytes command = 3; EntryType type = 4;
}Client API
PUT /api/kv/{key} → Write key-value (linearizable)
GET /api/kv/{key} → Read key-value (linearizable or stale)
DELETE /api/kv/{key} → Delete key
GET /api/cluster/status → Cluster health, leader info
POST /api/cluster/add_node → Add node (membership change)
POST /api/cluster/remove_node → Remove nodeCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
WAL (Write-Ahead Log): On-Disk Format
File: wal-segment-00042.log Each entry (binary, protobuf-encoded): ┌──────────┬──────────┬──────────┬──────────────┬──────────┐ │ Length(4B)│ CRC32(4B)│ Term(8B) │ Index(8B) │ Data(var)│ └──────────┴──────────┴──────────┴──────────────┴──────────┘ - CRC32 for corruption detection - fsync after every append (safety) or batch fsync (performance) - Segment rotation: new file every 64 MB
Persisted State
{
"current_term": 42,
"voted_for": "node-1",
"log": [...],
"snapshot": {
"last_included_index": 10000,
"last_included_term": 38,
"state_machine_data": <binary>
}
}Split Brain Prevention
Scenario: Network partition splits 5-node cluster into [A, B, C] and [D, E] Partition [A, B, C] (3 nodes = majority): - Continues operating normally - All writes succeed Partition [D, E] (2 nodes = minority): - Cannot elect leader (need 3 votes, only have 2) - No writes accepted When partition heals: - D and E discover higher term from majority - Revert to followers, sync log from new leader KEY GUARANTEE: At most ONE leader at any time - Because you need majority to win election - Two majorities always overlap by at least one node
Read Scalability
1. Follower Reads (stale, simplest): Read from any follower 2. Read Index (linearizable, no disk I/O): Leader records commit index → confirms leadership 3. Lease-Based Reads (linearizable, lowest latency): Leader uses time-limited lease 4. Follower Reads with Read Index: Follower asks leader for commit index, waits to catch up
Membership Changes (Joint Consensus)
Raft solution: Joint Consensus (two-phase)
Phase 1: Leader logs C_old,new configuration entry
- Both old AND new configs must agree (majority of old AND majority of new)
Phase 2: Leader logs C_new configuration entry
- Now only new config rules apply
Single-server changes (used in practice by etcd, CockroachDB):
Only add or remove ONE node at a time (3→4→5 or 5→4→3)Performance Optimizations
1. Batching: Buffer multiple client requests, replicate as one AppendEntries 2. Pipeline: Send AppendEntries for index N+1 before N is acknowledged 3. Parallel AppendEntries: Send to all followers simultaneously 4. Pre-vote: Before starting election, candidate asks "would you vote for me?" 5. Read optimization: Lease-based reads (no log append needed for reads)
Where Raft/Paxos Is Used
- etcd (Kubernetes config store): Raft
- ZooKeeper (Kafka, HBase): ZAB ≈ Paxos
- CockroachDB, TiKV: Multi-Raft
- Google Spanner: Multi-Paxos
- Kafka (KRaft mode): Raft
- Consul: Raft
- MongoDB replication: Raft-like
Multi-Raft
Single Raft group: all data on 3-5 nodes → limited by single node's disk/CPU Multi-Raft (CockroachDB, TiKV): - Data split into ranges/regions (e.g., key range [a-f] → one Raft group) - Each range has its own Raft group (3-5 replicas) - Hundreds of Raft groups per node Benefits: Parallel writes across ranges, better load distribution Challenge: Cross-range transactions need 2PC on top of Raft
Interview Walkthrough
- State the goal first: agree on a totally ordered log across unreliable nodes — one leader, majority quorum, linearizable writes.
- Walk through Raft leader election: heartbeat timeout → increment term → RequestVote → majority wins; at most one leader per term.
- Explain log replication via AppendEntries with
prev_log_index+prev_log_term— follower rejects mismatches, leader backtracks until logs align. - Split-brain prevention: minority partition (2 of 5 nodes) cannot elect a leader — majority overlap guarantees safety.
- Cover snapshots for log compaction and joint consensus for membership changes — never change quorum size in a single step.
- Mention real deployments: etcd (Kubernetes), Consul, CockroachDB Multi-Raft — interviewers want concrete examples, not just theory.
- Common pitfall: starting with Paxos notation — Raft's explicit leader model is easier to explain and what most production systems actually implement.
Raft vs Paxos vs ZAB Comparison
| Aspect | Raft | Multi-Paxos | ZAB (ZooKeeper) |
|---|---|---|---|
| Understandability | Designed to be simple | Notoriously complex | Medium |
| Leader | Strong leader | Proposer (weak leader) | Leader |
| Log | Contiguous, no gaps | Can have gaps, fill later | Contiguous |
| Membership change | Joint consensus | Complex | Atomic broadcast |
| Used by | etcd, CockroachDB, TiKV, Kafka (KRaft) | Cassandra (LWT), Spanner | ZooKeeper, HBase |
| Rounds for commit | 1 (AppendEntries) | 2 (prepare + accept) | 1 (proposal) |
Leader Election Trade-offs
Election timeout tuning:
Too short (e.g., 50ms):
✓ Fast re-election → high availability
✗ Network hiccup causes false elections
Too long (e.g., 5 seconds):
✓ Stable leadership, no false elections
✗ Leader fails → 5s unavailability for writes
Typical sweet spot: 150-300ms election timeout, 50-100ms heartbeat
The "unavailability window" on leader failure:
Worst case: ~500ms total write unavailability
Accepted trade-off vs. allowing multiple leaders → data corruptionWhen Consensus is NOT Needed
Use cases that DO need consensus: - Distributed lock, Leader election, Distributed counter - Configuration store, Distributed transactions Use cases that DON'T need consensus: - Read-heavy data: simple primary+replica replication - Best-effort counters: Redis INCR, no consensus - Event logs (Kafka): partition leader with ISR - Caches: Redis cluster with async replication Rule: use consensus for metadata/coordination; use primary replication for data.
Linearizability vs Sequential vs Eventual
Linearizability (Raft): Every op appears instantaneously between invocation and response. Cost: all reads through leader (or verify with majority) Sequential consistency: Ops appear in a sequence consistent with program order. Does not map to real time. Eventual consistency: All nodes converge eventually. No timing guarantees. For Raft: Linearizable reads via leader. Optimization: Lease-based reads (time-limited lease → local reads without consensus). Risk: Clock skew → stale reads. Use monotonic clocks + conservative lease timing.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Single leader + followers (basic Raft)
5 nodes, leader handles all client writes, followers replicate. In-memory state machine for demo.
Key components: Leader election · AppendEntries RPC · Commit index tracking · Persistent log (WAL)
Move to next phase when: Need linearizable reads or membership changes
Phase 2 — Production hardening
Snapshotting for log compaction, pre-vote, lease-based reads, batching AppendEntries.
Key components: Snapshot + InstallSnapshot RPC · Pre-vote · Read index / lease reads · Metrics + alerting
Move to next phase when: Log grows beyond disk; election storms under load
Phase 3 — Application layer (KV store / config service)
State machine applies committed log entries. Watch API for config changes. Multi-DC with witness nodes.
Key components: KV state machine · Watch/notify API · Witness nodes (non-voting) · Automated failover runbooks
Move to next phase when: Config service SLO requires sub-second failover globally
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Write availability (quorum reachable) | 99.99% | Cluster unavailable if majority down |
| Leader failover time | < 30s | Election timeout + log catch-up |
| Log replication lag | < 100ms p99 | Follower behind leader — affects read consistency |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Split brain suspected (two leaders reporting) | Conflicting term numbers in metrics; divergent commit indices | Stop writes immediately; identify true leader (highest term + log completeness); force step-down on stale leader; never manually assign leader without understanding term |
| Follower log corruption after disk failure | Checksum mismatch on WAL replay; node fails to join cluster | Remove node from cluster → wipe data → rejoin as new follower → leader replicates full log or sends snapshot |
| Election storm (continuous re-elections) | Leader changes > 5/min; elevated client errors | Check network latency between nodes; increase election timeout; enable pre-vote; verify clock skew isn't causing false timeouts |
Cost Drivers (Staff lens)
- 5+ nodes minimum for production (3 for dev only)
- SSD IOPS for WAL append (every write = fsync)
- Cross-AZ latency affects replication lag and write latency
Multi-Region & DR
Single-region quorum first. Multi-region: either (1) witness nodes in secondary region (non-voting, reduce cross-region writes) or (2) accept higher write latency for cross-region quorum. CockroachDB uses per-region leases for geo-partitioning.