This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Storage engine choice (LSM vs B-tree) with write amplification math. Partitioning via consistent hashing, replication quorum (N/R/W), and CAP trade-off for your consistency model. |
| Arch 50 | Compaction strategies (leveled vs sized-tiered), read repair, hinted handoff, anti-entropy with Merkle trees, and failure detection (phi accrual). |
| Arch 75 | Staff: tunable consistency per request, conflict resolution (LWW vs vector clocks), and operating LSM under sustained write-heavy workloads without read p99 blowups. |
Interview Prompt
Design a distributed key-value store like Dynamo or Cassandra. Support put/get/delete by key, replication across nodes, and configurable durability. Target web-scale workloads with partition tolerance.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What consistency does the client need — eventual or linearizable reads? | Determines R/W quorum sizes and whether you need a leader or CRDT-style merge. |
| Workload shape — read-heavy, write-heavy, or mixed? Value size? | LSM favors writes; B-tree favors read-heavy small values. Value size drives compaction and network cost. |
| Do we need compare-and-swap, transactions, or only single-key ops? | Multi-key transactions require coordination (Percolator-style) vs simple KV is partition-local only. |
| Tunable consistency per request or cluster-wide policy? | Dynamo model: ONE, QUORUM, ALL — staff differentiator for mixed SLAs. |
Scope
In scope
- Partitioning and replication
- Put/get/delete with quorum reads/writes
- Failure handling (node down, network partition)
- On-disk persistence and recovery
- Basic conflict resolution for concurrent writes
Out of scope (state explicitly)
- SQL queries, secondary indexes (unless brief mention)
- Full consensus layer like Raft for every key (optional compare to leader-based)
- Cross-datacenter active-active product semantics
Assumptions
- 1 PB total data, 1M keys/sec write peak, 3M reads/sec
- N=3 replicas, replication factor configurable
- Values 1 KB average; keys are opaque strings
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
put(key, value)— Store a key-value pair; overwrite if key existsget(key)→ value — Retrieve the value for a given key; return null if not founddelete(key)— Remove a key-value pair- Support keys up to 256 bytes, values up to 10 KB
- Support versioning / conflict resolution (optional: last-write-wins or vector clocks)
- Support TTL-based expiration on keys
- Support range queries on keys (if ordered store)
- High Availability: AP system (prefer availability over consistency) — always writable
- Tunable Consistency: Allow clients to choose (eventual, quorum, strong)
- Low Latency: < 10 ms p99 for reads and writes
- Scalability: Horizontal scaling — add nodes to increase capacity linearly
- Durability: Data must survive node failures (replicated)
- Partition Tolerant: System continues operating during network partitions
- Automatic failure detection and recovery
| Metric | Calculation | Value |
|---|---|---|
| Total key-value pairs | Given | 1 billion |
| Average key size | Given | 100 bytes |
| Average value size | Given | 5 KB |
| Storage per entry | Given | ~5.1 KB |
| Total data | Given | 1B × 5.1 KB ≈ 5 TB |
| Replication factor 3 | Given | 15 TB total |
| Writes/sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
| Reads/sec | Derived from daily volume ÷ 86400 (+ peak factor) | 200K |
| Nodes (if 1TB per node) | Given | 15 nodes |
Consistent Hashing with Virtual Nodes
- Why: Distributes data evenly across nodes; adding/removing nodes only affects neighboring keys
- How:
- Hash ring with 2^128 positions (MD5/SHA-256 of key)
- Each physical node owns multiple virtual nodes (150-200 vnodes)
- Key is placed on the ring → walk clockwise to find the first node → that's the coordinator
- Next N-1 distinct physical nodes clockwise are the replicas
- Why Virtual Nodes: Without vnodes, data distribution is uneven. Vnodes ensure uniform distribution and smooth rebalancing when nodes join/leave
Storage Engine — LSM Tree (Log-Structured Merge Tree)
Why LSM over B-Tree: Optimized for write-heavy workloads (sequential disk writes)
Write Path
- Write to WAL (Write-Ahead Log) — append-only file, fsynced on every write for durability
- Format per entry:
| CRC (4B) | Timestamp (8B) | Key_Len (4B) | Val_Len (4B) | Key | Value | - On crash: replay WAL entries to reconstruct the MemTable (no data loss)
- Truncated after MemTable successfully flushes to SSTable
- Format per entry:
- Write to MemTable — in-memory sorted skip list (O(log n) insert/lookup)
- When MemTable reaches threshold (~64 MB) → freeze current MemTable (still readable), create a new empty MemTable for new writes → flush frozen MemTable to disk as an immutable SSTable
- Background compaction merges SSTables to reduce read amplification
Read Path
- Check MemTable (most recent data, in-memory, fastest)
- Check Bloom Filter for each SSTable (probabilistic: "definitely not here" or "maybe here" — eliminates ~99% of unnecessary disk reads)
- If Bloom filter says "maybe" → check SSTable's sparse index → locate data block → read and decompress
- Check SSTables from newest to oldest (Level 0 first, then Level 1, etc.)
- Return first match found
SSTable Internal Structure
- Data blocks: Sorted key-value pairs, compressed (LZ4/Snappy)
- Sparse index: Every Nth key → block offset (binary search to find the right block)
- Bloom filter: Per-SSTable, ~1% false positive rate with 10 bits per entry
- Footer: Metadata, compression codec, format version, checksum
SSTable Levels
- Level 0: Freshly flushed from MemTable — key ranges MAY overlap (multiple SSTables can contain the same key)
- Level 1+: Compacted — key ranges are non-overlapping within a level (each key exists in at most one SSTable per level)
- Each level is ~10× the size of the previous level
Compaction Strategies
- Size-Tiered Compaction (STCS): Merge similarly-sized SSTables when count exceeds threshold. Good for write-heavy workloads. Trade-off: high space amplification (up to 2× during compaction)
- Leveled Compaction (LCS): Promote SSTables from Level N to Level N+1, re-splitting to maintain non-overlapping ranges. Good for read-heavy workloads. Trade-off: high write amplification (rewrites entire level on merge)
Replication
- Replication Factor (N): Configurable, default 3
- Replication Strategy:
- Coordinator writes to N replicas
- Quorum-based: W + R > N ensures consistency
- W=2, R=2, N=3 → strong consistency
- W=1, R=1 → eventual consistency (fast)
- W=3, R=1 → write-heavy consistency
Gossip Protocol (Failure Detection + Membership)
- Why: Decentralized — no single point of failure (unlike Raft/Paxos leader)
- How: Every node periodically (every 1 second) picks a random node and exchanges membership state
- Failure detection: Uses Phi Accrual Failure Detector — calculates a suspicion level (phi) based on heartbeat intervals. Declares node dead when phi exceeds threshold (e.g., 8)
- Information propagated: Node liveness, token ranges, load metrics
- Convergence: With N nodes, information reaches all nodes in O(log N) gossip rounds (~10 rounds for 1000 nodes)
Hinted Handoff (Temporary Failure Handling)
- When: A write's designated replica is DOWN but the write must still succeed (AP system)
- How:
- Coordinator detects that replica B is unreachable
- Instead of failing the write, coordinator writes B's copy to substitute node D
- D stores the data with a hint:
{target: B, key: K, value: V, timestamp: T} - When B recovers (detected via gossip heartbeat resuming), D forwards all hinted data to B
- D deletes the hints after B acknowledges receipt
- Hint TTL: 3 hours — if B doesn't recover within this window, hints are dropped. Full repair via Merkle trees is then needed to restore consistency.
- Sloppy quorum interaction: During partitions, writes can go to ANY W available nodes (not necessarily the "correct" replicas). This is what makes the system AP — always writable.
Anti-Entropy Repair (Merkle Trees)
- Purpose: Detect and repair data inconsistencies between replicas that hinted handoff couldn't fix (e.g., node was down longer than hint TTL)
- How:
- Each node maintains a Merkle tree per key range it owns
- Leaf node = hash(key + value) for each key-value pair
- Parent node = hash(left_child + right_child)
- Periodically (e.g., every hour), two replica nodes compare their Merkle tree root hashes
- If roots match → replicas are identical, done
- If roots differ → recursively walk down the tree comparing children until divergent leaf nodes are found
- Only the divergent keys are synchronized between replicas
- Efficiency: For 1 million keys, a Merkle tree has ~20 levels. If only 10 keys differ, you compare ~200 hashes (not 1 million) — O(log N × diff_count) bandwidth.
- Rebuild: Merkle tree is rebuilt after compaction (SSTables changed)
Conflict Resolution
- When: Two clients write to the same key concurrently (or during a network partition)
- Option 1 — Vector Clocks (Amazon Dynamo):
- Each version carries a vector clock:
{node_A: 3, node_B: 1} - On read, compare vector clocks: if one dominates (all entries ≥), take the dominant one
- If neither dominates (concurrent) → conflict → return BOTH versions to the client → client merges (application-aware resolution, e.g., union of shopping cart items)
- Problem: Vector clock size grows with number of writers — mitigate by truncating oldest entries when clock exceeds N entries
- Each version carries a vector clock:
- Option 2 — Last-Write-Wins (Cassandra):
- Compare timestamps → highest wins → other version silently discarded
- Simpler but can lose data — concurrent writes, one is dropped
- Use Hybrid Logical Clocks (HLC) instead of wall clock to avoid clock skew issues (HLC = max(wall_clock, last_HLC) + 1)
Coordinator Node
- Role: The node that receives the client request (determined by consistent hashing of the key)
- Write coordination:
- Append to local WAL (fsync for durability — survives crashes)
- Write to local MemTable (in-memory, fast)
- Forward write to N-1 replica nodes in parallel
- Wait for W total acknowledgments (including own)
- Return success to client
- If a replica is DOWN: Coordinator uses Hinted Handoff — writes the data to a substitute node with a hint. When the target recovers, the substitute forwards the hinted data.
- Read coordination:
- Send read request to all N replica nodes in parallel
- Wait for R responses
- Return the value with the highest timestamp (or latest vector clock) to the client
- Read Repair: If the R responses contain different versions, the coordinator sends the latest version to the stale replicas in the background — eventually converges all replicas
- Per-node read flow:
- a. Check MemTable (newest data, in-memory)
- b. Check Bloom Filter for each SSTable — if BF says "definitely not here," skip that SSTable entirely (saves disk I/O)
- c. If BF says "maybe" → read SSTable's sparse index → find the data block offset → read and decompress the data block
- d. Return the first match found (newest SSTable first)
Client APIs
put(key: bytes, value: bytes, context: Context) → void
context contains: version info (vector clock), consistency level
get(key: bytes, consistency: Level) → {value: bytes, context: Context}
Returns value + metadata for conflict resolution
delete(key: bytes, context: Context) → void
Actually writes a tombstone (deleted marker)Internal APIs
replicate(key, value, timestamp, vector_clock) → ack
Node-to-node replication
handoff(key_range, data) → ack
Transfer data during node join/leave
repair(key_range) → void
Anti-entropy repair: Merkle tree comparisonCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
On-Disk SSTable Format
The file is divided into sequential blocks, each compressed independently:
- Data Blocks — The actual key/value pairs sorted by key. Each entry stores:
key | timestamp | value. Multiple data blocks exist for large files. - Index Block (sparse) — A sparse index mapping the first key of each data block to its byte offset, enabling binary search to locate the right block.
- Filter Block — A Bloom filter for each data block. Check the filter first before reading the block to avoid unnecessary disk I/O.
- Meta Block — Compression type, encoding info, file creation metadata.
- Footer — Fixed-size (48 bytes), located at the end of the file. Contains offsets for the index block and filter block, plus a magic number for validation.
In-Memory MemTable
Data Structure: Skip List (O(log n) insert, lookup, range scan) Max Size: 64 MB On flush: Becomes immutable, new MemTable created
WAL Entry Format
| CRC (4B) | Timestamp (8B) | Key Length (4B) | Value Length (4B) | Key | Value |
Vector Clock (for conflict resolution)
{
"key": "user:123:profile",
"value": "{name: 'Alice'}",
"vector_clock": {
"node_A": 3,
"node_B": 1,
"node_C": 2
},
"timestamp": 1710320000
}Conflict Resolution:
- If one vector clock dominates the other → no conflict, take the dominant one
- If concurrent (neither dominates) → return both to client, let application resolve (Amazon Dynamo approach)
- Alternative: Last-Write-Wins (LWW) using timestamps — simpler but can lose data
Core Mechanisms
| Mechanism | How It Works |
|---|---|
| Replication | Each key stored on N=3 nodes. Survives up to N-1 simultaneous failures |
| WAL | Every write is first logged to WAL before MemTable. On crash, replay WAL to recover MemTable |
| Hinted Handoff | If a replica is down during write, coordinator stores a 'hint' locally. When the replica recovers, the hint is forwarded |
| Anti-Entropy (Merkle Trees) | Background process compares Merkle trees between replicas to detect and repair inconsistencies |
| Read Repair | During a read, if replicas return different versions, coordinator sends the latest version to stale replicas |
Problem-Specific Fault Tolerance
1. Node Failure (Temporary)
- Hinted Handoff ensures writes are not lost
- Gossip protocol detects failure and informs all nodes
- Reads still succeed from remaining replicas (if R ≤ remaining replicas)
2. Node Failure (Permanent)
- Gossip declares node permanently dead after timeout
- Token reassignment: Virtual nodes of dead node are distributed to remaining nodes
- Data re-replicated from surviving replicas to maintain RF=3
3. Network Partition
- AP system: Both sides of the partition continue accepting reads and writes
- Conflict resolution via vector clocks when partition heals
- Sloppy quorum: Write to any W available nodes (not necessarily the "correct" replicas)
4. Data Corruption
- CRC checksums on every SSTable data block and WAL entry
- Detected during reads → repair from replica
5. Compaction Storms
- Too many SSTables → background compaction consumes all I/O
- Mitigation: Rate-limit compaction I/O, prioritize user reads/writes, use leveled compaction
Merkle Trees for Anti-Entropy
- Each node maintains a Merkle tree per key range
- Leaf = hash of a key-value pair
- Parent = hash of children
- Two nodes compare root hashes → if different, recursively compare children to find divergent keys
- Only divergent keys are synchronized → efficient bandwidth usage
Bloom Filters
- One Bloom filter per SSTable
- Before reading an SSTable from disk, check its Bloom filter
- False positive rate ~1% with 10 bits per entry
- Eliminates 99% of unnecessary disk reads
Tombstones and Garbage Collection
- Deletes write a tombstone (marker) with a timestamp
- Tombstones must persist for a grace period (e.g., 10 days) so all replicas learn about the delete
- After grace period, compaction removes the tombstone
Hot Keys
- If one key is extremely popular → one node becomes a bottleneck
- Solution: Read from all replicas (load balance reads), add a cache layer in front
Tunable Consistency Examples
| Config | Behavior | Use Case |
|---|---|---|
| W=1, R=1 | Fastest, eventual consistency | Session data, click tracking |
| W=2, R=2, N=3 | Strong consistency | User profiles, inventory |
| W=3, R=1 | Strong write, fast read | Write-once, read-many |
| W=1, R=3 | Fast write, strong read | Logging with reliable reads |
Interview Walkthrough
- Frame the problem as a write-heavy, partition-tolerant store — state CAP Theorem choice (AP) and why most KV workloads prefer availability over linearizability.
- Walk through Consistent Hashing with virtual nodes before naming a storage engine; interviewers expect you to explain rebalancing when nodes join or leave.
- Contrast LSM-Tree vs B-Tree using Back-of-the-Envelope Estimation on write amplification — sequential writes matter at millions of ops/sec.
- Explain quorum reads/writes (R + W > N) and when sloppy quorum with hinted handoff is acceptable vs when strict quorum is required.
- Cover conflict resolution: vector clocks preserve concurrent writes, Last-Write-Wins is simpler but silently drops data — pick one and defend it.
- Mention Bloom filters and MemTable as read-path optimizations that keep LSM-Tree latency predictable under Caching Patterns at the hot-key layer.
- Common pitfall: designing as CP (like etcd) when the use case is session/cart data that tolerates seconds of staleness — over-engineering consistency kills availability.
Why LSM-Tree Over B-Tree?
| Factor | LSM-Tree (This Design) | B-Tree (MySQL/PostgreSQL) |
|---|---|---|
| Write performance | O(1) amortized (sequential append) | O(log N) per write (random I/O) |
| Read performance | O(N) worst case (check multiple SSTables) | O(log N) guaranteed |
| Write amplification | High (data rewritten during compaction) | Low (in-place updates) |
| Read amplification | High (may check multiple levels) | Low (single tree traversal) |
| Space amplification | Moderate (old versions during compaction) | Low (in-place updates) |
| Disk I/O pattern | Sequential (HDD-friendly, SSD-excellent) | Random (SSD preferred) |
Why LSM wins for a KV store: KV stores are typically write-heavy (ingesting events, session data, caches). The sequential write pattern of LSM trees achieves 10-100× higher write throughput compared to B-trees. The read penalty is mitigated by Bloom filters (eliminating 99% of false lookups) and the MemTable (most-recent data served from memory).
When B-Tree is better: Read-heavy OLTP workloads with point queries and range scans where predictable read latency matters more than write throughput (e.g., relational databases).
CAP Theorem: Why AP Over CP?
This design chooses AP (Availability + Partition Tolerance): → During a network partition, BOTH sides continue serving reads AND writes → Conflicts are resolved after partition heals (vector clocks / LWW) Alternative CP choice (e.g., etcd, ZooKeeper): → During partition, the minority side REJECTS writes → Guarantees no conflicting writes, but reduces availability Why AP for a general-purpose KV store? 1. Most KV store use cases tolerate stale reads for seconds 2. Shopping cart, session data, user preferences → availability > consistency 3. Downtime costs more than temporary inconsistency 4. Conflict resolution (vector clocks) handles the rare actual conflicts When CP is better: → Configuration management (etcd), leader election, distributed locks → Financial transactions, inventory counters → Any case where "two different values for the same key" is unacceptable
Vector Clocks vs Last-Write-Wins: The Conflict Resolution Trade-off
Vector Clocks (Dynamo approach):
✓ No data loss — concurrent writes are BOTH preserved
✓ Client chooses how to merge (application-aware resolution)
✗ Vector clock size grows with number of writers (unbounded in theory)
✗ Client must handle merge complexity
✗ Garbage collection of old clock entries is tricky
Mitigation: Truncate vector clock when it exceeds N entries (e.g., 10)
→ Lose some causal tracking but bounds size
Last-Write-Wins (Cassandra approach):
✓ Simple — highest timestamp wins, no client merge logic
✓ Fixed-size metadata (just a timestamp)
✗ LOSES data — concurrent writes, one is silently discarded
✗ Depends on synchronized clocks (NTP)
✗ Wall clock issues: clock skew can cause "future" writes to always win
Mitigation: Use Hybrid Logical Clocks (HLC) instead of wall clock
→ Combines wall clock + logical counter for causality without clock skew issuesRecommendation: For interview, discuss vector clocks (shows depth). In production, most systems use LWW (Cassandra) or CRDTs (Riak) because vector clocks' operational complexity is high.
Consistent Hashing: Why Virtual Nodes Are Essential
Without virtual nodes (3 physical nodes on the ring):
Node A: owns 60% of keys (unlucky hash distribution)
Node B: owns 25% of keys
Node C: owns 15% of keys
→ Massively unbalanced! Node A is overloaded.
With virtual nodes (each node has 200 vnodes):
Node A: 200 points on ring → ~33.3% of keys
Node B: 200 points on ring → ~33.3% of keys
Node C: 200 points on ring → ~33.3% of keys
→ Nearly perfectly balanced (law of large numbers)
Additional benefit — heterogeneous hardware:
Beefy server with 32GB RAM → assign 300 vnodes
Small server with 8GB RAM → assign 75 vnodes
→ Capacity-proportional data distribution
When a node DIES:
Without vnodes: ALL its data shifts to ONE neighbor → overload
With vnodes: Its 200 vnodes are spread across the ring →
load spreads evenly across ALL surviving nodesCompaction Strategy: Size-Tiered vs Leveled
| Factor | Size-Tiered (STCS) | Leveled (LCS) |
|---|---|---|
| Write amplification | Low (compact only similarly-sized files) | High (rewrite entire level on promotion) |
| Read amplification | High (many overlapping SSTables to check) | Low (no overlap within a level) |
| Space amplification | High (up to 2× during compaction) | Low (~10% overhead) |
| Best for | Write-heavy, space-insensitive | Read-heavy, space-sensitive |
Decision: Start with STCS (better write performance). If read latency becomes a problem (too many SSTables to check), switch to LCS. Many production deployments use STCS for recent data (high write rate) and LCS for older levels (optimize reads).
Sloppy Quorum vs Strict Quorum
Strict Quorum:
Write MUST go to the designated N replicas for this key
If 2 of 3 replicas are down → write FAILS
→ Higher consistency, lower availability
Sloppy Quorum (this design):
Write goes to ANY W available nodes (even if they're not the "correct" replicas)
Temporarily stored on substitute nodes → "hinted handoff" when original recovers
→ Higher availability, but data temporarily on wrong nodes
Trade-off:
Sloppy quorum during partition:
Client writes to nodes X, Y (substitutes) instead of A, B (correct replicas)
Read from node C (the only correct replica online) → MISS (data on X, Y!)
→ Consistency violation until hinted handoff completes
This is acceptable for AP systems where availability > consistencyStaff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Single-node RocksDB
Embedded LSM, replication via nightly backup. Proves API and on-disk format.
Key components: RocksDB · REST API · Single host
Move to next phase when: Durability and availability require multi-node replication
Phase 2 — Dynamo-style ring (AP)
Consistent hashing, gossip membership, quorum R/W, read repair, Merkle tree anti-entropy. Sloppy quorum + hinted handoff.
Key components: Consistent hash ring · Gossip · Quorum · Merkle repair
Move to next phase when: Clients need stronger consistency or range queries
Phase 3 — Operational scale (multi-DC)
Per-DC rings with cross-DC async replication. Separate compaction service. Tunable CL per request. Formal SLIs on compaction lag and repair backlog.
Key components: Multi-DC async rep · CL=LOCAL_QUORUM · Compaction isolation
Move to next phase when: Clock skew conflicts or compliance requires linearizable subset of keys
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Put/Get availability | 99.99% | Infrastructure tier — many services depend on it |
| Read p99 latency (QUORUM) | < 10ms | User-facing apps block on KV |
| Durability | 99.999999% annual | WAL + RF=3 across racks |
| Compaction backlog | < 24h to drain | Prevents read amp death spiral |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Node join causes mass partition migration | Streaming throughput saturates network; latency spike cluster-wide | Throttle stream rate; add node during low traffic; use incremental rebuild; verify vnode count pre-planned |
| Network partition splits ring | Phi accrual marks half cluster down; conflicting write volumes | CAP choice: favor availability with sloppy quorum, run reconciliation post-heal; pause writes if policy is CP |
| SSTable corruption on one replica | Read repair failures, checksum errors in logs | Rebuild replica from healthy peer streaming; verify Merkle subtree; mark node unhealthy until rebuild completes |
Cost Drivers (Staff lens)
- Disk: raw data × replication × (1 + write amplification from compaction)
- Cross-AZ replication bandwidth during repair and new node bootstrap
- Compaction compute — often 20–30% of cluster CPU at steady state
Multi-Region & DR
Leaderless per region with async cross-region replication (last-writer metadata includes region). Reads LOCAL_QUORUM in home region; global strong consistency only for small metadata keys via separate consensus service — don't force Paxos on every KV op.