Design a Distributed Consensus System (Raft / Paxos)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Theoretical backbone — staff interviewers expect you to explain leader election, log replication, and safety without hand-waving. Know Raft's Figure 2 rules.
Arch 50	Compare Raft vs Multi-Paxos vs Zab (ZooKeeper). When to build vs use etcd/consul.
Arch 75	Membership changes (joint consensus), split-brain prevention, and operational failure scenarios.

Interview Prompt

Design a distributed consensus system (like Raft or Paxos) that allows a cluster of nodes to agree on a sequence of values despite failures.

Clarifying Questions (ask before designing)

Question	Why it matters
How many nodes can fail (f)?	Cluster size N = 2f+1; quorum = f+1 (a majority). 5 nodes tolerate 2 failures. Drives cluster sizing.
Are we building consensus or using it (e.g., for a KV store)?	Building Raft is the interview; using etcd is the production answer.
What values are we agreeing on — config changes, client writes, or both?	Membership changes require joint consensus — hardest part of Raft.
Synchronous or asynchronous network model?	Raft assumes messages are eventually delivered; safety holds regardless of timing, but liveness (electing a leader, making progress) depends on timing.
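
The sizing arithmetic behind the first question can be sketched in a few lines. This is a minimal illustration, assuming majority quorums and crash failures only; the function names are hypothetical.

```python
def cluster_size(f: int) -> int:
    """Smallest cluster that tolerates f crashed nodes."""
    return 2 * f + 1


def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Crashed nodes an n-node cluster can lose and still form a quorum."""
    return (n - 1) // 2


for n in (3, 4, 5, 7):
    print(f"n={n} quorum={quorum(n)} tolerates={tolerated_failures(n)}")
# n=3 quorum=2 tolerates=1
# n=4 quorum=3 tolerates=1
# n=5 quorum=3 tolerates=2
# n=7 quorum=4 tolerates=3
```

Note that going from 3 to 4 nodes raises the quorum without raising the number of tolerated failures, which is why odd cluster sizes are the usual choice.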

Scope

In scope

Leader election
Log replication
Safety (election safety, log matching, leader completeness)
Fault tolerance (crash failures, not Byzantine)

Out of scope (state explicitly)

Byzantine fault tolerance (PBFT)
Full production KV store on top (mention as application layer)
Performance optimization (batching, pipeline) — mention briefly

Assumptions

Crash-stop failures only (not malicious nodes)
5-node cluster (tolerates 2 failures)
Persistent storage on each node (WAL)

Implement a replicated state machine across N nodes (typically 3, 5, or 7)
Leader election: automatically elect a leader; re-elect on failure
Log replication: leader replicates log entries to followers in order
Linearizable reads and writes: clients see the most recent committed value
Membership changes: add/remove nodes without downtime (joint consensus)
Snapshot support: compact log by snapshotting state machine
Client request forwarding: followers redirect to leader

Metric	Calculation	Value
Cluster size	Assumed	3 (dev), 5 (prod), 7 (highly critical)
Writes / sec	Writes / day ÷ 86,400, times a peak factor	10K–100K
Reads / sec	Reads / day ÷ 86,400, times a peak factor	100K+ (with read-only followers)
Log entry size	Assumed	~100–500 bytes
Log growth	100K entries/sec × 200 bytes (peak)	20 MB/sec
Snapshot interval	Assumed	Every 10K entries or 100 MB
Leader election time	Assumed	150–300ms (Raft election timeout)
Heartbeat interval	Assumed	50–100ms
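
As a rough check on these figures, the sketch below recomputes log growth and how often each snapshot trigger would fire at the peak write rate. The constants are the table's values; treating 1 MB as 10^6 bytes is an assumption.

```python
ENTRIES_PER_SEC = 100_000            # peak write rate from the table
ENTRY_BYTES = 200                    # entry size used in the log-growth row
SNAPSHOT_EVERY_ENTRIES = 10_000
SNAPSHOT_EVERY_BYTES = 100 * 10**6   # 100 MB, taking 1 MB = 10^6 bytes

growth = ENTRIES_PER_SEC * ENTRY_BYTES                      # bytes per second
secs_by_entries = SNAPSHOT_EVERY_ENTRIES / ENTRIES_PER_SEC  # entry-count trigger
secs_by_bytes = SNAPSHOT_EVERY_BYTES / growth               # size trigger
daily_bytes = growth * 86_400

print(f"log growth: {growth / 10**6:.0f} MB/sec")             # log growth: 20 MB/sec
print(f"entry-count trigger: every {secs_by_entries:.1f} s")  # every 0.1 s
print(f"size trigger: every {secs_by_bytes:.1f} s")           # every 5.0 s
print(f"uncompacted log per day: {daily_bytes / 10**12:.3f} TB")  # 1.728 TB
```

At the peak rate the 10K-entry trigger would fire ten times a second, so in an interview it is worth saying that the entry threshold only makes sense at the lower end of the write range.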


Raft Leader Election

Normal operation:
  - Leader sends heartbeats (empty AppendEntries) every 50-100ms, well under the election timeout
  - Followers reset election timer on heartbeat receipt

Leader failure:
  1. Follower's election timer expires (random: 150-300ms)
  2. Follower increments term, transitions to CANDIDATE
  3. Votes for self, sends RequestVote to all peers
  4. Includes: candidate's term, last log index, last log term
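  5. Other nodes vote YES if: candidate's term >= voter's term, voter hasn't already voted
     for a different candidate in that term, and candidate's log is at least as up-to-date
     (higher last log term, or same last term and last index >= voter's)
  6. If candidate gets votes from a majority → becomes LEADER
  7. If AppendEntries arrives from a leader with term >= candidate's term → revert to FOLLOWER
  8. If the timer expires with no winner (split vote) → increment term and start a new election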
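
The vote-granting rule is the part interviewers probe most, so here is a minimal sketch of the voter side, following the RequestVote rules in the Raft paper. `VoterState` and `handle_request_vote` are hypothetical names; persistence and networking are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VoterState:
    current_term: int = 0
    voted_for: Optional[str] = None
    # (term, command) pairs; an entry's log index is its list position + 1.
    log: List[Tuple[int, str]] = field(default_factory=list)

    def last_log(self) -> Tuple[int, int]:
        """Return (last_index, last_term), or (0, 0) for an empty log."""
        if not self.log:
            return 0, 0
        return len(self.log), self.log[-1][0]


def handle_request_vote(state: VoterState, term: int, candidate_id: str,
                        last_log_index: int, last_log_term: int) -> Tuple[int, bool]:
    """Return (voter's current term, vote granted)."""
    if term < state.current_term:
        return state.current_term, False      # stale candidate
    if term > state.current_term:
        state.current_term = term             # newer term: forget any old vote
        state.voted_for = None
    my_index, my_term = state.last_log()
    # "At least as up-to-date": compare last terms first, then last indices.
    up_to_date = (last_log_term, last_log_index) >= (my_term, my_index)
    can_vote = state.voted_for in (None, candidate_id)
    if can_vote and up_to_date:
        state.voted_for = candidate_id        # must reach disk before replying
        return state.current_term, True
    return state.current_term, False


voter = VoterState(current_term=2, log=[(1, "a"), (2, "b")])
print(handle_request_vote(voter, 3, "n2", 1, 1))  # (3, False): candidate log is behind
print(handle_request_vote(voter, 3, "n3", 2, 2))  # (3, True)
print(handle_request_vote(voter, 3, "n2", 5, 2))  # (3, False): already voted for n3
```

The first call still raises the voter's term even though the vote is refused, which is how a lagging candidate can force a term bump without winning.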

Log Replication

1. Client sends write request to leader
2. Leader appends entry to local log: {term, index, command}
3. Leader sends AppendEntries RPC to all followers
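4. Follower checks log consistency: it must hold an entry at prev_log_index whose term equals prev_log_term
5. If mismatch → follower rejects; leader decrements next_index for that follower and retries until logs match, then overwrites the follower's conflicting suffix
6. When a majority (leader included) has stored the entry → leader commits it and applies it to the state machine. A leader only commits entries from its own term this way; earlier-term entries commit indirectly
7. Commit index propagated to followers in the next AppendEntries (heartbeat or otherwise)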
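
The consistency check and the commit rule can be sketched as two small functions. This is an in-memory illustration under simplifying assumptions (no terms on the RPC itself, no persistence); `append_entries` and `advance_commit` are hypothetical names, and `match_index` is assumed to include the leader's own last index.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command); log index = list position + 1


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side: run the consistency check, then append."""
    # The follower must already hold an entry at prev_index with prev_term.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset              # 0-based slot for this entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]                      # drop the conflicting suffix
        if pos >= len(log):
            log.append(entry)                  # matching entries are left alone
    return True


def advance_commit(match_index: List[int], leader_log: List[Entry],
                   current_term: int, commit_index: int) -> int:
    """Leader side: highest index stored on a majority, current term only."""
    majority_index = sorted(match_index, reverse=True)[len(match_index) // 2]
    if (majority_index > commit_index
            and leader_log[majority_index - 1][0] == current_term):
        return majority_index
    return commit_index


follower = [(1, "a"), (1, "b"), (2, "x")]
print(append_entries(follower, 3, 1, [(3, "c")]))            # False: term mismatch at index 3
print(append_entries(follower, 2, 1, [(3, "c"), (3, "d")]))  # True
print(follower)  # [(1, 'a'), (1, 'b'), (3, 'c'), (3, 'd')]
print(advance_commit([4, 4, 2, 4, 1], follower, 3, 2))       # 4
```

Retrying the same successful call leaves the follower's log unchanged, which is what makes AppendEntries safe to resend after a timeout.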

Inter-Node RPCs (Raft Protocol)

PROTOBUF

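syntax = "proto3";

// The response messages, the RequestVote and InstallSnapshot requests,
// and the EntryType enum are omitted here.
service RaftNode {
  rpc AppendEntries(AppendEntriesRequest) returns (AppendEntriesResponse);
  rpc RequestVote(RequestVoteRequest) returns (RequestVoteResponse);
  rpc InstallSnapshot(InstallSnapshotRequest) returns (InstallSnapshotResponse);
}

message AppendEntriesRequest {
  uint64 term = 1;                // leader's current term
  string leader_id = 2;           // lets followers redirect clients
  uint64 prev_log_index = 3;      // index of the entry just before `entries`
  uint64 prev_log_term = 4;       // term of that entry
  repeated LogEntry entries = 5;  // empty for heartbeats
  uint64 leader_commit = 6;       // leader's commit index
}

message LogEntry {
  uint64 term = 1;
  uint64 index = 2;
  bytes command = 3;
  EntryType type = 4;
}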

Client API

PUT    /api/kv/{key}           → Write key-value (linearizable)
GET    /api/kv/{key}           → Read key-value (linearizable or stale)
DELETE /api/kv/{key}           → Delete key
GET    /api/cluster/status     → Cluster health, leader info
POST   /api/cluster/add_node   → Add node (membership change)
POST   /api/cluster/remove_node → Remove node

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

WAL (Write-Ahead Log): On-Disk Format

File: wal-segment-00042.log
Each entry (fixed-width binary header, then the protobuf-encoded payload):
┌──────────┬──────────┬──────────┬──────────────┬──────────┐
│ Length(4B)│ CRC32(4B)│ Term(8B) │ Index(8B)    │ Data(var)│
└──────────┴──────────┴──────────┴──────────────┴──────────┘

- CRC32 for corruption detection
- fsync after every append (safety) or batch fsync (performance)
- Segment rotation: new file every 64 MB
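
A record in this layout can be encoded and verified with the standard library alone. The sketch below makes three assumptions the format above does not state: fields are big-endian, Length counts only the Data bytes, and the CRC32 covers Term, Index, and Data.

```python
import struct
import zlib

# Length(4B) | CRC32(4B) | Term(8B) | Index(8B); ">" means big-endian, no padding.
HEADER = struct.Struct(">IIQQ")


def encode_entry(term: int, index: int, data: bytes) -> bytes:
    """Serialize one WAL record."""
    body = struct.pack(">QQ", term, index) + data
    crc = zlib.crc32(body)                      # assumed CRC coverage
    return struct.pack(">II", len(data), crc) + body


def decode_entry(buf: bytes, offset: int = 0):
    """Return (term, index, data, next_offset); raise ValueError on corruption."""
    length, crc, term, index = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("truncated record")    # e.g. crash in the middle of an append
    if zlib.crc32(struct.pack(">QQ", term, index) + data) != crc:
        raise ValueError("CRC mismatch")
    return term, index, data, start + length


record = encode_entry(42, 10001, b"set x=1")
print(len(record))           # 31 (24-byte header + 7 data bytes)
print(decode_entry(record))  # (42, 10001, b'set x=1', 31)
```

On replay, a truncated or corrupt record at the tail of the last segment is usually treated as an incomplete write and discarded; the same error in the middle of a segment points to real disk corruption.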

Persisted State

JSON

{
  "current_term": 42,
  "voted_for": "node-1",
  "log": [...],
  "snapshot": {
    "last_included_index": 10000,
    "last_included_term": 38,
    "state_machine_data": <binary>
  }
}

Raft vs Paxos vs ZAB Comparison

Aspect	Raft	Multi-Paxos	ZAB (ZooKeeper)
Understandability	Designed to be simple	Notoriously complex	Medium
Leader	Strong leader	Proposer (weak leader)	Leader
Log	Contiguous, no gaps	Can have gaps, fill later	Contiguous
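Membership change	Joint consensus (or single-server changes)	Complex	Dynamic reconfiguration (ZooKeeper 3.5+)
Used by	etcd, CockroachDB, TiKV, Kafka (KRaft)	Cassandra (LWT), Spanner	ZooKeeper (and systems that depend on it, such as HBase)
Rounds for commit	1 (AppendEntries)	1 in steady state (accept only); 2 (prepare + accept) when a new leader takes over	1 (proposal)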

Leader Election Trade-offs

Election timeout tuning:
  Too short (e.g., 50ms):
    ✓ Fast re-election → high availability
    ✗ Network hiccup causes false elections
  
  Too long (e.g., 5 seconds):
    ✓ Stable leadership, no false elections
    ✗ Leader fails → 5s unavailability for writes

  Typical sweet spot: 150-300ms election timeout, 50-100ms heartbeat

The "unavailability window" on leader failure:
  Worst case: ~500ms total write unavailability
  Accepted trade-off vs. allowing multiple leaders → data corruption
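
Randomizing the timeout is what keeps elections from colliding. The sketch below draws per-node timeouts from the 150-300ms range and flags the case where the two earliest timers fire within one network round trip, which is when a split vote becomes likely. The helper names and the round-trip threshold are illustrative assumptions.

```python
import random
from typing import Dict, Iterable

ELECTION_TIMEOUT_MS = (150.0, 300.0)  # range quoted above


def draw_timeouts(node_ids: Iterable[str], rng: random.Random) -> Dict[str, float]:
    """Give each follower an independent random election timeout."""
    return {node: rng.uniform(*ELECTION_TIMEOUT_MS) for node in node_ids}


def first_candidate(timeouts: Dict[str, float]) -> str:
    """The follower whose timer fires first starts the election."""
    return min(timeouts, key=timeouts.get)


def split_vote_risk(timeouts: Dict[str, float], rtt_ms: float) -> bool:
    """True if a second timer fires before the first candidate's
    RequestVote could plausibly arrive."""
    first, second = sorted(timeouts.values())[:2]
    return second - first < rtt_ms


timeouts = draw_timeouts(["n2", "n3", "n4", "n5"], random.Random())
print(first_candidate(timeouts), split_vote_risk(timeouts, rtt_ms=2.0))
```

A wider timeout range lowers the split-vote risk but lengthens the average time to detect a dead leader, which is the same trade-off as the timeout itself.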

When Consensus is NOT Needed

Use cases that DO need consensus:
  - Distributed lock, Leader election, Distributed counter
  - Configuration store, Distributed transactions

Use cases that DON'T need consensus:
  - Read-heavy data: simple primary+replica replication
  - Best-effort counters: Redis INCR, no consensus
  - Event logs (Kafka): partition leader with ISR
  - Caches: Redis cluster with async replication

Rule: use consensus for metadata/coordination; use primary replication for data.

Linearizability vs Sequential vs Eventual

Linearizability (Raft): Every op appears to take effect atomically at a single point between its invocation and its response.
  Cost: all reads through leader (or verify with majority)

Sequential consistency: Ops appear in a sequence consistent with program order.
  Does not map to real time.

Eventual consistency: All nodes converge eventually. No timing guarantees.

For Raft: Linearizable reads via leader.
  Optimization: Lease-based reads (time-limited lease → local reads without consensus).
  Risk: Clock skew → stale reads. Use monotonic clocks + conservative lease timing.
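
A lease check along these lines might look like the sketch below. It assumes the lease is measured from the moment a heartbeat was sent, lasts for the minimum election timeout minus a drift margin, and is renewed only after a majority acknowledges that heartbeat. `LeaderLease` and its parameters are hypothetical.

```python
import time


class LeaderLease:
    """Tracks whether a leader may answer reads locally."""

    def __init__(self, election_timeout_s: float, max_clock_drift_s: float,
                 clock=time.monotonic):
        if max_clock_drift_s >= election_timeout_s:
            raise ValueError("drift margin must be smaller than the election timeout")
        # Conservative: the lease ends before any follower could time out.
        self._duration = election_timeout_s - max_clock_drift_s
        self._clock = clock            # monotonic, so wall-clock jumps don't matter
        self._expires_at = 0.0

    def renew(self, heartbeat_sent_at: float) -> None:
        """Call once a majority has acked the heartbeat sent at this time."""
        self._expires_at = max(self._expires_at, heartbeat_sent_at + self._duration)

    def can_serve_local_read(self) -> bool:
        return self._clock() < self._expires_at


now = [100.0]                          # fake clock for the example
lease = LeaderLease(0.150, 0.050, clock=lambda: now[0])
lease.renew(heartbeat_sent_at=100.0)
now[0] = 100.05
print(lease.can_serve_local_read())    # True
now[0] = 100.20
print(lease.can_serve_local_read())    # False: fall back to a quorum-verified read
```

When the lease has lapsed, the leader falls back to confirming its leadership with a majority before answering, which costs a round trip but keeps the read linearizable.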

SLOs & Error Budgets

Metric	Target	Rationale
Write availability (quorum reachable)	99.99%	Cluster unavailable if majority down
Leader failover time	< 30s	Election timeout + log catch-up
Log replication lag	< 100ms p99	Follower behind leader — affects read consistency

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Split brain suspected (two leaders reporting)	Conflicting term numbers in metrics; divergent commit indices	Stop writes immediately; identify true leader (highest term + log completeness); force step-down on stale leader; never manually assign leader without understanding term
Follower log corruption after disk failure	Checksum mismatch on WAL replay; node fails to join cluster	Remove node from cluster → wipe data → rejoin as new follower → leader replicates full log or sends snapshot
Election storm (continuous re-elections)	Leader changes > 5/min; elevated client errors	Check network latency between nodes; increase election timeout; enable pre-vote; verify clock skew isn't causing false timeouts
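
The election-storm detection rule (more than 5 leader changes per minute) can be sketched as a sliding-window counter. `LeaderChangeMonitor` is a hypothetical helper; in practice the same rule would more likely be an alert on a leader-change metric.

```python
from collections import deque
from typing import Deque, Optional


class LeaderChangeMonitor:
    """Flags an election storm when leader changes exceed a rate threshold."""

    def __init__(self, max_changes: int = 5, window_s: float = 60.0):
        self.max_changes = max_changes
        self.window_s = window_s
        self._changes: Deque[float] = deque()   # timestamps of observed changes
        self._last_leader: Optional[str] = None

    def observe(self, leader_id: str, now_s: float) -> bool:
        """Record the current leader; return True if the storm threshold is exceeded."""
        if leader_id != self._last_leader:
            if self._last_leader is not None:   # the first sighting is not a change
                self._changes.append(now_s)
            self._last_leader = leader_id
        # Drop changes that have fallen out of the window.
        while self._changes and now_s - self._changes[0] > self.window_s:
            self._changes.popleft()
        return len(self._changes) > self.max_changes


monitor = LeaderChangeMonitor()
monitor.observe("n0", 0.0)
alerts = [monitor.observe(f"n{i % 2}", i * 5.0) for i in range(1, 7)]
print(alerts)  # [False, False, False, False, False, True]
```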

Cost Drivers (Staff lens)

5+ nodes minimum for production (3 for dev only)
SSD IOPS for WAL append (every write = fsync)
Cross-AZ latency affects replication lag and write latency

Multi-Region & DR

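Start with a single-region quorum. For multi-region, either (1) place non-voting replicas (often called learners or observers) in the secondary region, which support local reads and DR without adding cross-region hops to the write path, or (2) spread voters across regions and accept higher write latency for a cross-region quorum. CockroachDB uses per-region leases for geo-partitioning.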
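
The write-latency cost of option (2) follows from the quorum rule: the leader commits once the fastest followers needed for a majority have acknowledged. The sketch below estimates that from follower round-trip times; the RTT figures are hypothetical and disk fsync time is ignored.

```python
from typing import List


def commit_latency_ms(follower_rtts_ms: List[float]) -> float:
    """Network time until a majority (leader included) holds the entry."""
    cluster_size = len(follower_rtts_ms) + 1
    acks_needed = cluster_size // 2        # majority minus the leader itself
    if acks_needed == 0:
        return 0.0                         # single-node cluster
    return sorted(follower_rtts_ms)[acks_needed - 1]


# 3 voters in the leader's region, 2 in a remote region:
print(commit_latency_ms([1.0, 1.0, 70.0, 70.0]))    # 1.0
# 2 voters in the leader's region, then 2 + 1 across two remote regions:
print(commit_latency_ms([1.0, 70.0, 70.0, 140.0]))  # 70.0
```

Keeping a majority of voters in one region keeps commits fast but means losing that region loses the quorum, so the layout choice is a latency versus regional-failure trade-off.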
