Design a Distributed Cache (Redis / Memcached)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Infrastructure bread-and-butter — nail consistent hashing with virtual nodes, eviction policies (LRU vs LFU), cache stampede mitigation, and write strategies (through/behind/around).
Arch 50	Add hot key detection, cluster rebalancing during node add/remove, and TTL vs explicit invalidation trade-offs.
Arch 75	Staff: discuss cache coherence across multi-DC, near-cache vs remote cache hierarchy, and when NOT to cache (financial ledgers).

Interview Prompt

Design a distributed in-memory cache system (like Redis Cluster or Memcached) that supports get/set/delete with TTL, horizontal scaling across nodes, and high availability. Clients should see a single logical cache with automatic key distribution.

Clarifying Questions (ask before designing)

Question	Why it matters
What's the data size per key and total working set?	100-byte session keys vs 1 MB objects drives memory planning and eviction pressure.
Read:write ratio and consistency requirements?	1000:1 reads favor aggressive caching; write-heavy needs write-through or invalidation strategy.
Can we lose cached data on node failure?	Cache-aside tolerates loss (rebuild from DB). Session store requires replication.
Do clients connect directly or through a proxy?	Smart client (Redis cluster protocol) vs proxy (Twemproxy) affects failover complexity.

Scope

In scope

Consistent hashing with virtual nodes
LRU and LFU eviction policies
Cache stampede prevention
Write-through, write-behind, write-around
Hot key mitigation
Node add/remove with minimal key movement

Out of scope (state explicitly)

Persistent storage / AOF/RDB durability internals
Full Redis command set (streams, pub/sub)
Building a new consensus protocol for cache metadata

Assumptions

10 TB total working set, ~9 billion keys, avg 1 KB/value (about 1.2 KB per entry including key and metadata)
100K read QPS, 10K write QPS peak
Cache-aside pattern from application layer
Eventual consistency acceptable (TTL-based freshness)

Functional Requirements

PUT(key, value, TTL): Store a key-value pair with optional time-to-live
GET(key) → value: Retrieve value by key; return null if not found or expired
DELETE(key): Remove a key-value pair
Support eviction policies: LRU, LFU, FIFO, Random
Support data structures beyond strings: hashes, lists, sets, sorted sets (Redis-like)
Support TTL/expiry on keys
Support atomic operations: increment, compare-and-swap
Distributed across multiple nodes for scalability and availability

Capacity Estimation

Metric	Value
Total data cached	10 TB
Avg key size	100 bytes
Avg value size	1 KB
Entry overhead (metadata)	100 bytes per entry
Entries per node (64 GB RAM)	~50M entries
Nodes needed	10 TB / 64 GB = ~160 nodes (without replication)
With replication (2x)	~320 nodes
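Operations / sec (total)	110K peak (100K reads + 10K writes, per the assumptions)
Operations / sec / node	~700 (the fleet is sized by memory, not by throughput)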
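
The memory rows of the table can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes binary units (1 KB = 1024 bytes) and that all 64 GB of a node is usable for data, which a real deployment would not allow.

```python
# Back-of-envelope sizing from the table's figures (binary units assumed).
KEY_BYTES = 100
VALUE_BYTES = 1024              # 1 KB average value
OVERHEAD_BYTES = 100            # per-entry metadata
NODE_RAM_BYTES = 64 * 1024**3   # 64 GB per node, assumed fully usable
TOTAL_BYTES = 10 * 1024**4      # 10 TB working set

entry_bytes = KEY_BYTES + VALUE_BYTES + OVERHEAD_BYTES
entries_per_node = NODE_RAM_BYTES // entry_bytes
nodes = -(-TOTAL_BYTES // NODE_RAM_BYTES)   # ceiling division
nodes_with_replicas = nodes * 2             # one replica per master

print(f"{entry_bytes} bytes/entry, {entries_per_node:,} entries/node")
print(f"{nodes} nodes, {nodes_with_replicas} with replication")
```

In an interview it is worth saying out loud that real nodes need headroom for fragmentation and replication buffers, so the node count is a lower bound.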

L1: In-Process Cache (Caffeine / Guava)

Local cache within each application server's JVM: no network hop, < 0.1ms latency
Size: ~100 MB per server (limited by JVM heap: don't starve the application)
TTL: Short (30 seconds) to limit staleness: L1 has no cache invalidation from Redis
Eviction: Window TinyLFU (Caffeine): near-optimal hit rates, better than LRU for skewed workloads
Use case: Hottest keys only: configuration data, user session, frequently accessed product data
Trade-off: Each app server has its own L1 → after an update, different servers may return different values for up to 30 seconds (TTL). Acceptable for most read-heavy workloads; NOT suitable for data requiring strong consistency

Request flow:
  1. Check L1 (in-process) → hit? Return immediately (< 0.1ms)
  2. L1 miss → Check L2 (Redis) → hit? Return + populate L1 (< 1ms)
  3. L2 miss → Read from Primary DB → populate L2 + L1 (5-50ms)
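
The request flow above can be sketched as a small two-tier lookup. This is an illustration under assumptions: `TwoTierCache` is a hypothetical name, a plain dict stands in for Redis, and `load_from_db` is whatever function the application uses to read the primary database.

```python
import time


class TwoTierCache:
    """L1 in-process dict with a short TTL, in front of an L2 store and a DB loader."""

    def __init__(self, l2, load_from_db, l1_ttl=30.0, clock=time.monotonic):
        self.l1 = {}                      # key -> (value, expires_at)
        self.l2 = l2                      # dict-like stand-in for Redis
        self.load_from_db = load_from_db  # hypothetical loader: key -> value or None
        self.l1_ttl = l1_ttl
        self.clock = clock

    def get(self, key):
        # Step 1: L1 hit returns immediately if the entry has not expired.
        hit = self.l1.get(key)
        if hit is not None and hit[1] > self.clock():
            return hit[0], "L1"
        # Step 2: L1 miss, try L2.
        value = self.l2.get(key)
        source = "L2"
        if value is None:
            # Step 3: L2 miss, read the primary DB and populate L2.
            value = self.load_from_db(key)
            source = "DB"
            if value is not None:
                self.l2[key] = value
        if value is not None:
            self.l1[key] = (value, self.clock() + self.l1_ttl)
        return value, source


db = {"user:1": "alice"}
cache = TwoTierCache(l2={}, load_from_db=db.get)
print(cache.get("user:1"))  # first read comes from the DB
print(cache.get("user:1"))  # second read is served by L1
```

The returned source tag is only there to make the tiers visible; a production client would return just the value.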

Cache Client Library

Embedded in each application server: handles routing, connections, and resilience:

Hash slot routing: Compute CRC16(key) % 16384 → look up slot-to-node mapping → send directly to the responsible master node. No proxy overhead.
Slot map cache: Client stores the full slot→node mapping locally. Refreshed on MOVED redirects or periodically (every 60 seconds).
Connection pooling: Persistent TCP connections to each cache node (pool of 10-20 connections per node). Avoids TCP handshake overhead per request.
Pipelining: Batch multiple commands into a single network round-trip (e.g., fetch 50 keys in one pipeline call → 1 round trip instead of 50, so network wait time drops by up to ~50x versus sequential GETs)
Compression: For values > 1 KB, compress with LZ4 before storing (reduces memory usage and network bandwidth). Transparent to the caller.
Circuit breaker: Per-node circuit breaker: if a node fails 5 consecutive health checks, stop sending requests to it for 30 seconds (fallback to DB or return stale L1 data). Prevents cascading failures when a cache node is slow/down.
Retry with timeout: 2ms timeout per request + 1 retry on failure. Don't let a slow cache node block the application thread (cache miss to DB is better than hanging).
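
A per-node circuit breaker with the thresholds quoted above (5 consecutive failures, 30 seconds open) might look roughly like this. The class and method names are hypothetical, and the half-open behaviour (let traffic through again after the cool-down) is one common choice rather than something the list above prescribes.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and stays open for a fixed cool-down."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self):
        """Return True if a request may be sent to this node."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_seconds:
            # Cool-down elapsed: close the circuit and let traffic probe the node.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

The client would keep one breaker per cache node and fall back to the DB or stale L1 data whenever `allow()` returns False.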

Consistent Hashing: Data Distribution

Why not simple modulo hashing?

node = hash(key) % N → when N changes (node added/removed), almost ALL keys are remapped → massive cache miss storm

Consistent Hashing:

Hash ring: Both keys and nodes are hashed to positions on a circular ring (0 to 2^32-1)
Key assigned to the first node found clockwise from its hash position
Adding a node: Only keys between the new node and its predecessor are remapped (~1/N of keys)
Removing a node: Only that node's keys move to the next node clockwise

Virtual Nodes:

Each physical node has 100-200 virtual nodes spread across the ring
Ensures even distribution (without vnodes, distribution is uneven)
On node failure, load is spread across many surviving nodes (not just one neighbor)

Ring positions:
  vnode_A1: 1000    vnode_B1: 2500    vnode_A2: 5000
  vnode_C1: 6000    vnode_B2: 7500    vnode_C2: 9000
  
  Key "user:123" hashes to 3000 → assigned to vnode_A2 (position 5000)
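
The clockwise lookup in the example above can be sketched with a sorted list and binary search. `HashRing` is a hypothetical helper; the ring positions are the ones from the example, with each vnode mapped to its physical node.

```python
import bisect


class HashRing:
    """Maps a hash position to the first vnode clockwise from it."""

    def __init__(self, vnodes):
        # vnodes: dict of ring position -> physical node name
        self.positions = sorted(vnodes)
        self.owner = dict(vnodes)

    def lookup(self, key_hash):
        # First vnode at or after key_hash; wrap to the start past the last vnode.
        i = bisect.bisect_left(self.positions, key_hash)
        if i == len(self.positions):
            i = 0
        return self.owner[self.positions[i]]


ring = HashRing({1000: "A", 2500: "B", 5000: "A", 6000: "C", 7500: "B", 9000: "C"})
print(ring.lookup(3000))  # A  (vnode_A2 at position 5000)
print(ring.lookup(9500))  # A  (wraps around to vnode_A1 at position 1000)
```

Lookup is O(log V) in the number of vnodes, which stays small even with 100-200 vnodes per physical node.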

In-Memory Data Structure: Hash Map

Core storage: Hash table (like Java HashMap or C++ unordered_map)

Hash function: MurmurHash3 or xxHash (fast, low collision)
Collision resolution: Chaining (linked list per bucket) or open addressing
Load factor: Rehash when load factor > 0.75

Memory layout optimization:

Avoid per-entry heap allocations (fragmentation)
Use slab allocator (Memcached approach): Pre-allocate slabs of fixed sizes (64B, 128B, 256B, 512B, 1KB, ...). Each value stored in the smallest slab that fits
Jemalloc (Redis approach): Advanced memory allocator that reduces fragmentation

Eviction Policies: Deep Dive

LRU (Least Recently Used) (⭐):

Evict the entry that hasn't been accessed for the longest time
Implementation: Hash map + Doubly linked list
- On GET: Move entry to head of list (most recently used)
- On eviction: Remove entry from tail (least recently used)
- O(1) for both operations
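
A minimal sketch of the hash map plus doubly linked list design, using Python's `OrderedDict` (which is implemented as exactly that combination). In this sketch the most recently used entry sits at the end of the ordering rather than the head; the behaviour is otherwise the same.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now the most recently used
cache.put("c", 3)      # evicts "b"
print(list(cache.data))  # ['a', 'c']
```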

Approximated LRU (Redis approach):

True LRU requires maintaining a linked list → memory overhead
Redis samples 5 random keys → evicts the one with the oldest access time
Nearly as effective as true LRU with less memory overhead

LFU (Least Frequently Used):

Evict the entry with the lowest access count
Better for workloads with varying popularity (some keys are consistently hot)
Implementation: Morris counter (approximate frequency) with decay over time
Redis uses LFU with logarithmic counter + decay
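
A probabilistic logarithmic counter with decay can be sketched as below. It is loosely modeled on the Redis approach; the constants (`LFU_INIT_VAL = 5`, `LOG_FACTOR = 10`) and the one-decrement-per-idle-period decay rule are assumptions chosen for illustration, not a statement of what any particular Redis version does.

```python
import random

LFU_INIT_VAL = 5   # assumed starting value for new keys
LOG_FACTOR = 10    # assumed: higher means the counter saturates more slowly


def lfu_incr(counter, rng=random.random):
    """Increment an 8-bit counter with probability that shrinks as it grows."""
    if counter >= 255:
        return 255
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * LOG_FACTOR + 1)
    return counter + 1 if rng() < p else counter


def lfu_decay(counter, idle_minutes, decay_minutes=1):
    """Subtract one from the counter for every full decay period spent idle."""
    periods = idle_minutes // decay_minutes if decay_minutes else 0
    return max(counter - periods, 0)


# A key hit one million times still fits in one byte.
counter = LFU_INIT_VAL
for _ in range(1_000_000):
    counter = lfu_incr(counter)
print(counter)
```

Because increments become rarer as the counter grows, one byte can rank keys whose access counts differ by orders of magnitude.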

TTL-Based Eviction:

Lazy expiration: Check TTL on access; if expired, delete and return miss
Active expiration: Background thread periodically samples keys → deletes expired ones
Redis uses both: lazy + periodic sampling (10 times/sec, sample 20 keys)
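
Lazy and active expiration can be combined in one small store, sketched here. `ExpiringStore` is a hypothetical class; the sample size of 20 follows the figure quoted above, and a real server would run `sweep()` from a timer rather than on demand.

```python
import random
import time


class ExpiringStore:
    """Key-value store with lazy expiry on read and sampled active expiry."""

    def __init__(self, clock=time.monotonic):
        self.data = {}      # key -> value
        self.expires = {}   # key -> absolute expiry time (only keys with a TTL)
        self.clock = clock

    def set(self, key, value, ttl=None):
        self.data[key] = value
        if ttl is None:
            self.expires.pop(key, None)
        else:
            self.expires[key] = self.clock() + ttl

    def get(self, key):
        # Lazy expiration: check the TTL only when the key is touched.
        exp = self.expires.get(key)
        if exp is not None and exp <= self.clock():
            self._delete(key)
            return None
        return self.data.get(key)

    def sweep(self, sample_size=20):
        # Active expiration: sample keys that have a TTL, delete the expired ones.
        keys = random.sample(list(self.expires), min(sample_size, len(self.expires)))
        now = self.clock()
        expired = [k for k in keys if self.expires[k] <= now]
        for k in expired:
            self._delete(k)
        return len(expired)

    def _delete(self, key):
        self.data.pop(key, None)
        self.expires.pop(key, None)
```

Lazy expiry alone would leak memory for keys that are never read again, which is why the periodic sweep is needed.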

Replication

Redis Cluster Replication:

Each master node has 1+ replica nodes
Replication is asynchronous (eventual consistency)
Writes go to master → asynchronously replicated to replicas
If master fails → replica promoted to master (automatic failover)

Write flow: Client → Master → ACK to client → async replicate to replicas
Read flow: Reads from master (default) or replicas (for read scaling, with READONLY command)

Trade-off: Async replication means a write ACKed by master might be lost if master crashes before replicating. Acceptable for cache (data can be re-fetched from source of truth).

Cluster Architecture (Redis Cluster Style)

Hash slots: 16,384 hash slots distributed across masters
slot = CRC16(key) % 16384
Each master owns a range of slots (e.g., Master A: 0-5460, Master B: 5461-10922, Master C: 10923-16383)
Client library knows the slot→master mapping → sends directly to correct master
MOVED redirect: If client sends to wrong node, node responds with MOVED → client updates its mapping
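
Slot routing can be sketched in a few lines. The sketch assumes the CRC16 variant is CCITT/XMODEM (polynomial 0x1021, initial value 0), which is what Python's `binascii.crc_hqx` computes, and it ignores hash tags. The slot ranges are the three-master example from the list above.

```python
import binascii

SLOTS = 16384

# Example slot ownership from the text: three masters, contiguous ranges.
SLOT_RANGES = [
    (0, 5460, "Master A"),
    (5461, 10922, "Master B"),
    (10923, 16383, "Master C"),
]


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant, assumed) of the key, modulo the slot count."""
    return binascii.crc_hqx(key.encode(), 0) % SLOTS


def node_for(key: str) -> str:
    """Look up which master owns the key's slot in the local slot map."""
    slot = key_slot(key)
    for low, high, node in SLOT_RANGES:
        if low <= slot <= high:
            return node
    raise ValueError(f"slot {slot} is not assigned")


for k in ("user:123", "session:abc", "product:42"):
    print(k, key_slot(k), node_for(k))
```

On a MOVED reply, a smart client would update the entry for that slot in its local map and retry against the node named in the reply.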

In-Memory Entry Structure

C

#include <stdint.h>

struct CacheEntry {
    char *key;               // 8 bytes: pointer to key string
    char *value;             // 8 bytes: pointer to value (or embedded for small values)
    uint32_t key_length;     // 4 bytes
    uint32_t value_length;   // 4 bytes
    uint64_t expiry;         // 8 bytes (Unix timestamp, 0 = no expiry)
    uint64_t last_access;    // 8 bytes (for LRU)
    uint8_t  lfu_counter;    // 1 byte (logarithmic frequency counter) + 7 bytes padding
    struct CacheEntry *prev; // 8 bytes (LRU linked list)
    struct CacheEntry *next; // 8 bytes (LRU linked list)
    // Total overhead per entry: 57 bytes of fields, 64 bytes after alignment
    // on a typical 64-bit platform (excluding key and value)
};

Slab Allocator (Memcached)

Slab Classes:
  Class 1:  64-byte chunks   → for values ≤ 64 bytes
  Class 2:  128-byte chunks  → for values ≤ 128 bytes
  Class 3:  256-byte chunks
  Class 4:  512-byte chunks
  Class 5:  1 KB chunks
  ...
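  Class 15: 1 MB chunks   (64 B doubled 14 times; stock Memcached defaults to a 1.25 growth factor, giving more, finer-grained classes)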

Each slab page = 1 MB, divided into chunks of the class size
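
Choosing a slab class is a "smallest chunk that fits" search. The sketch below assumes the simplified doubling scheme shown above (64 bytes up to 1 MB); the function names are hypothetical, and the growth factor is a parameter so other spacings can be tried.

```python
def build_slab_classes(min_chunk=64, max_chunk=1024 * 1024, factor=2.0):
    """Return chunk sizes from min_chunk up to max_chunk, growing by `factor`."""
    sizes = []
    size = float(min_chunk)
    while size < max_chunk:
        sizes.append(int(size))
        size *= factor
    sizes.append(max_chunk)
    return sizes


def pick_class(sizes, item_bytes):
    """Return (class number, chunk size) of the smallest chunk that fits the item."""
    for number, chunk in enumerate(sizes, start=1):
        if item_bytes <= chunk:
            return number, chunk
    raise ValueError("item is larger than the largest chunk")


classes = build_slab_classes()
number, chunk = pick_class(classes, 100)
print(number, chunk, chunk - 100)   # class, chunk size, bytes wasted inside the chunk
```

The wasted bytes inside a chunk are the price paid for avoiding external fragmentation; a smaller growth factor reduces that waste at the cost of more classes.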

Cluster Slot Mapping

Slot Range    Master Node    Replica Node
0 - 5460      Node A         Node A'
5461 - 10922  Node B         Node B'
10923 - 16383 Node C         Node C'

Concern	Solution
Node failure	Replica promoted to master; client library redirected
Data loss on failure	Acceptable: cache is a secondary store. Data can be re-populated from primary DB
Cache stampede	Singleflight pattern: one thread fetches from DB; others wait
Hot key	Local in-process cache (L1) in front of distributed cache (L2)
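Network partition	In Redis Cluster, masters cut off from the majority stop accepting writes after the node timeout; the majority side keeps serving and may promote replicas. Stale entries age out via TTL once the partition heals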
Thundering herd on cold start	Warm-up script pre-populates cache from DB before accepting traffic

Cache Invalidation Strategies

Strategy	How	When to Use
TTL-based	Key auto-expires after N seconds	Most common; simple
Write-through	Update cache on every DB write	Strong consistency needed
Write-behind	Update cache first, async write to DB	Write-heavy, tolerance for stale DB
Cache-aside (⭐)	App reads from cache; on miss, reads from DB and populates cache	Default strategy
Pub/Sub invalidation	DB change → publish event → all cache nodes invalidate the key	Multi-node, near-real-time consistency

Cache Warming Strategies

On deploy: Pre-populate cache with top 1000 most accessed keys from access logs
Lazy warming: Let cache populate naturally from misses (cold start may cause latency spike)
Hybrid: Warm critical keys proactively; let long tail populate lazily

Monitoring

Hit ratio (target > 90%): hits / (hits + misses)
Memory usage: Per node, per slab class
Eviction rate: High eviction rate → need more memory or better TTLs
Latency percentiles: p50, p95, p99 per operation type
Connection count: Per client, per node
Replication lag: Delay between master write and replica receiving it

Redis vs. Memcached

Feature	Redis	Memcached
Data structures	String, Hash, List, Set, Sorted Set, Stream	Only strings
Persistence	RDB + AOF	None
Replication	Built-in (async)	None (client-side)
Clustering	Redis Cluster (hash slots)	Client-side consistent hashing
Threading	Single-threaded (6.0+ has I/O threads)	Multi-threaded
Memory efficiency	Higher overhead per key	Lower overhead (slab allocator)
Use case	Feature-rich caching, pub/sub, leaderboards	Simple high-throughput caching

Multi-Level Caching

L1: In-process cache (Caffeine/Guava)    → < 0.1 ms, 100 MB
L2: Distributed cache (Redis)             → ~1-5 ms, 100 GB
L3: CDN cache                             → ~10-50 ms (edge tier in front of the application, for cacheable HTTP responses)
L4: Database                              → ~10-100 ms

Cache Penetration, Breakdown, and Avalanche

Problem	Description	Solution
Penetration	Query for key that will NEVER exist → always hits DB	Bloom filter; cache null result with short TTL
Breakdown	Hot key expires → thousands of requests simultaneously hit DB	Mutex lock (singleflight); never expire hot keys
Avalanche	Many keys expire at the same time → massive DB load	Jittered TTL (random ±10%); staggered cache warming
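
Two of the mitigations in the table, caching null results and jittering TTLs, fit in a short sketch. The names are hypothetical, a dict stands in for the cache, and the TTL is stored alongside the value only to make it visible; a real cache would enforce it.

```python
import random

MISSING = object()   # sentinel cached for keys known not to exist


def jittered_ttl(base_seconds, jitter_fraction=0.10, rng=random.uniform):
    """Spread expiry times by +/- jitter_fraction so keys do not expire together."""
    return base_seconds * (1 + rng(-jitter_fraction, jitter_fraction))


def get_with_null_caching(cache, db, key, ttl=60, null_ttl=5):
    """cache: dict of key -> (value, ttl_seconds); db: dict-like source of truth."""
    if key in cache:
        value, _ = cache[key]
        return None if value is MISSING else value
    value = db.get(key)
    if value is None:
        # Penetration guard: remember the miss briefly so repeats skip the DB.
        cache[key] = (MISSING, null_ttl)
        return None
    # Avalanche guard: each key gets a slightly different TTL.
    cache[key] = (value, jittered_ttl(ttl))
    return value


cache, db = {}, {"product:1": "widget"}
print(get_with_null_caching(cache, db, "product:1"))   # widget
print(get_with_null_caching(cache, db, "product:999")) # None, and the miss is cached
```

The null TTL should stay short so that a key created shortly after a miss becomes visible quickly.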

Interview Walkthrough

Default to cache-aside (lazy loading) from Caching Patterns — it is the most common pattern and easiest to reason about under failure.
Walk through the failure trinity — penetration, breakdown, avalanche — and pair each with its mitigation (Bloom filter, singleflight, TTL jitter).
Explain shard placement with Consistent Hashing so adding/removing nodes minimizes key redistribution.
Pick an eviction policy (LRU for general workloads) and state your target hit ratio (> 90%) with monitoring on eviction rate.
Layer a multi-level cache: in-process (L1) → Redis (L2) → CDN (L3) → DB, citing latency at each tier.
Compare Redis vs Memcached trade-offs: data structures, persistence, and replication — choose based on required features, not hype.
Discuss cache warming on deploy to avoid cold-start latency spikes that trigger avalanches.
Common pitfall: setting identical TTLs on millions of keys so they all expire simultaneously and stampede the database.

Write-Through vs Write-Behind vs Cache-Aside: The Critical Pattern Choice

Cache-Aside (Lazy Loading) ⭐ (Most Common):
  Read:  App checks cache → miss → read DB → write to cache → return
  Write: App writes to DB → invalidate cache (or do nothing)
  
  ✓ Simple to implement
  ✓ Only requested data is cached (no unnecessary data in cache)
  ✓ Cache failure doesn't prevent reads (just slower — goes to DB)
  ✗ First request is always a cache miss (cold start)
  ✗ Data can become stale if DB is updated without invalidating cache
  Best for: General purpose caching (most applications)

Write-Through:
  Write: App writes to cache → cache synchronously writes to DB → return
  Read:  From cache; fresh as long as every write goes through the cache
  
  ✓ Cache always has the latest data
  ✓ Reads are always fast
  ✗ Every write has double latency (cache + DB)
  ✗ Newly written data may never be read → wasted cache space
  Best for: Read-heavy workloads where consistency is critical

Write-Behind (Write-Back):
  Write: App writes to cache → return immediately → cache async writes to DB
  
  ✓ Lowest write latency (just cache write)
  ✓ Can batch DB writes (1000 cache writes → 1 bulk DB insert)
  ✗ DATA LOSS risk: if cache crashes before async write → data lost forever
  ✗ Complex failure handling
  Best for: Write-heavy workloads where some data loss is acceptable (metrics, counters)

Read-Through:
  Read:  App reads from cache → cache auto-fetches from DB on miss → returns
  (Like cache-aside but cache handles the DB fetch, not the application)
  
  ✓ Application code is simpler
  ✗ Cache must know how to talk to DB (coupling)
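
The write paths of the three write patterns differ only in when the DB is touched, which a sketch with dicts standing in for the cache and the DB makes concrete. Class names are hypothetical, and the write-behind queue here coalesces repeated writes to the same key, which is one possible batching choice.

```python
class CacheAside:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)          # miss: read the DB, then populate
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value              # DB is the source of truth
        self.cache.pop(key, None)         # invalidate; next read repopulates


class WriteThrough(CacheAside):
    def write(self, key, value):
        self.cache[key] = value           # cache and DB are both updated
        self.db[key] = value              # before the write returns


class WriteBehind(CacheAside):
    def __init__(self, cache, db):
        super().__init__(cache, db)
        self.pending = {}                 # writes not yet persisted

    def write(self, key, value):
        self.cache[key] = value           # return after the cache write only
        self.pending[key] = value

    def flush(self):
        """Persist pending writes in one batch; anything unflushed is lost on a crash."""
        flushed = len(self.pending)
        self.db.update(self.pending)
        self.pending.clear()
        return flushed
```

The `pending` dict is exactly the data at risk in write-behind: whatever sits there when the cache node dies never reaches the DB.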

Why Single-Threaded Redis Outperforms Multi-Threaded Alternatives

Intuition: "Multi-threaded must be faster than single-threaded!"
Reality: For small in-memory key-value operations, a single-threaded event loop is usually fast enough that extra threads add more overhead than they remove.

Why:
  1. No lock contention: Multi-threaded maps need locks (mutex, CAS). 
     Lock contention at 1M ops/sec creates massive overhead.
     Redis: zero locks, zero contention.
  
  2. Memory operations are fast: A GET/SET in memory takes ~100 nanoseconds.
     Thread context switching takes ~1,000 nanoseconds.
     At Redis's workload, context switching costs MORE than the actual work.
  
  3. Network I/O is the bottleneck, not CPU:
     Parsing a network packet takes ~1 microsecond.
     The in-memory operation takes ~100 nanoseconds.
     The data-operation thread spends most of its time waiting on the network, not computing.
     
     Redis 6.0+ solution: Keep single-threaded for data operations,
     but use multiple I/O threads for network reading/writing.
     This gives the benefits of multi-threading WHERE IT MATTERS
     without any locking on data operations.

When multi-threading helps (Memcached advantage):
  - Very large values (> 1 KB): CPU time for serialization/copying becomes significant
  - Very high connection count (> 100K): I/O thread pool handles more connections
  - Simple GET/SET only: No complex data structures → locking is simpler

Consistent Hashing: Why It's Non-Negotiable

Without consistent hashing (simple modulo):
  node = hash(key) % N
  
  N = 3 servers: key "user:123" → hash=7 → 7 % 3 = 1 → Server 1
  
  Add a 4th server (N = 4):
  key "user:123" → hash=7 → 7 % 4 = 3 → Server 3 (MOVED!)
  
  Result: ~75% of all keys are remapped → MASSIVE cache miss storm
  All keys hit the database simultaneously → database crashes
  
With consistent hashing:
  Adding Server 4 only moves ~1/N = 25% of keys
  The other 75% stay on their current servers → no cache miss
  
  Cost of getting this wrong:
    1M keys x 75% remapped x $0.001 per DB query = $750 in DB load
    ... and that's for a small system. At 1B keys, it's catastrophic.
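
The 75% versus 25% claim can be checked empirically. This sketch hashes 20,000 synthetic keys with MD5 (an arbitrary choice for illustration), then compares how many change owner when going from 3 to 4 servers under modulo placement and under a ring with 150 vnodes per server. Server names and helper names are hypothetical.

```python
import bisect
import hashlib


def h(s: str) -> int:
    """32-bit hash derived from MD5 (illustrative choice, not a recommendation)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")


def modulo_owner(key, n):
    return h(key) % n


def build_ring(nodes, vnodes=150):
    ring = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
    return [pos for pos, _ in ring], [node for _, node in ring]


def ring_owner(key, positions, owners):
    # First vnode clockwise from the key's hash, wrapping past the end.
    i = bisect.bisect_left(positions, h(key)) % len(positions)
    return owners[i]


keys = [f"user:{i}" for i in range(20000)]
ring3 = build_ring(["s1", "s2", "s3"])
ring4 = build_ring(["s1", "s2", "s3", "s4"])

moved_mod = sum(modulo_owner(k, 3) != modulo_owner(k, 4) for k in keys) / len(keys)
moved_ring = sum(ring_owner(k, *ring3) != ring_owner(k, *ring4) for k in keys) / len(keys)

print(f"modulo: {moved_mod:.1%} of keys moved")
print(f"ring:   {moved_ring:.1%} of keys moved")
```

With the ring, every key that moves lands on the new server, so the existing servers keep their warm data.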

Eviction Policy: LRU vs LFU: When Each Wins

LRU (Least Recently Used):
  Evict the key that hasn't been accessed for the longest time
  Assumption: "Recently accessed keys will be accessed again soon"
  
  ✓ Works well for temporal locality (web sessions, recent pages)
  ✗ One-time scan pollutes cache: scanning 1M keys once pushes out hot data
  ✗ Doesn't consider frequency (a key accessed 1000x/day but not in last second 
     gets evicted over a key accessed once 0.5 seconds ago)

LFU (Least Frequently Used):
  Evict the key with the lowest access count
  Assumption: "Frequently accessed keys are more important"
  
  ✓ Better for skewed workloads (80/20 rule: 20% of keys get 80% of requests)
  ✓ One-time scans don't pollute cache
  ✗ "Frequency aging" problem: a key popular last month but not now stays forever
  ✗ New keys start with low frequency → evicted immediately before proving value
  
  Redis's LFU solution:
    - Logarithmic counter (8-bit) that decays while a key sits idle (tunable via lfu-decay-time, default 1 minute)
    - New keys start with count = 5 (not 0) → get a fair chance
    - This handles both the aging and cold-start problems

Recommendation:
  - General purpose: LRU (simpler, good enough for most workloads)
  - Mixed traffic with hot spots: LFU (better hit ratio, 5-10% improvement)
  - Redis: the default maxmemory-policy is noeviction, so a pure cache should set allkeys-lru (or allkeys-lfu) explicitly

Cache Warming: Cold Start Problem

Scenario: Deploy new cache cluster → all empty → 100% miss rate → 
          all traffic hits DB → DB overwhelmed → cascading failure

Solution 1: Lazy warming (do nothing)
  Let misses populate the cache organically
  ✗ 5-30 minutes of high DB load during warmup
  ✗ Latency spike visible to users
  
Solution 2: Pre-warming ⭐
  Before routing traffic to new cache:
  1. Query access logs: identify top 10K most-accessed keys
  2. Fetch those keys from DB → populate cache
  3. Route traffic to the new cache (70%+ hit rate from minute 1)
  
Solution 3: Cache replication
  Instead of cold start, replicate from existing cache cluster
  Redis: REPLICAOF (SLAVEOF before Redis 5) → full sync → promote to master
  ✓ Zero cold start
  ✗ Requires existing cache to be available

Solution 4: Gradual traffic shift
  Route 10% traffic to new cache → warm up → increase to 25% → 50% → 100%
  ✓ Controlled load on DB
  ✗ Slower rollout

The Thundering Herd / Cache Stampede: Deep Dive

Timeline:
  T=0:     Hot key "product:123" cached (TTL = 60s)
  T=0-59:  1000 requests/sec → all served from cache ✓
  T=60:    Key expires
  T=60.001: 1000 requests arrive simultaneously → ALL miss cache → ALL hit DB
  T=60.050: DB overwhelmed, response time spikes to 5 seconds
  T=60.100: DB connection pool exhausted → errors for all users
  T=65:    One response comes back → cache populated → cache works again
  
  The damage: 5 seconds of downtime from ONE key expiring.

Solutions:

1. Singleflight / Request Coalescing ⭐
   First request: acquires a mutex lock on the key → fetches from DB → populates cache
   Remaining 999 requests: wait for the lock → get the cached result
   Only 1 DB query instead of 1000.
   
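   Implementation (Redis SET NX lock, sketched with redis-py; Go's singleflight is the in-process equivalent):
     import time

     def get_product(r, db, key="product:123"):    # r: redis-py client, db: hypothetical DB handle
         for _ in range(20):                        # bounded wait (~1s) instead of a single sleep
             data = r.get(key)
             if data is not None:
                 return data
             if r.set("lock:" + key, "1", nx=True, ex=5):   # only one caller acquires the lock
                 data = db.fetch(key)
                 r.set(key, data, ex=60)
                 return data                        # the lock is left to expire after 5s
             time.sleep(0.05)                       # another caller holds the lock: wait, re-check
         return db.fetch(key)                       # fallback if the lock holder never finished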

2. Stale-While-Revalidate
   Cache returns STALE data immediately while fetching fresh data in background
   User gets fast (slightly stale) response; cache refreshes asynchronously

3. Jittered TTL
   Instead of TTL = 60s for all keys, use TTL = 60 ± random(0, 10)s
   Keys expire at different times → no synchronized stampede

4. Never Expire + Background Refresh
   Set no TTL. Background worker refreshes popular keys every 30 seconds.
   ✗ Stale data for up to 30 seconds
   ✓ No expiry-driven misses for keys in the refresh set
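
Within a single application process, request coalescing needs no Redis lock at all. The following is a minimal sketch of the idea behind Go's singleflight, written with Python threads; `SingleFlight` and `slow_db_fetch` are hypothetical names, and the 0.2 second sleep simply stands in for a slow DB query.

```python
import threading
import time


class _Call:
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Collapses concurrent calls for the same key into one execution of fn."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}   # key -> in-flight _Call

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()          # only the leader hits the DB
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()             # wake the followers
        else:
            call.done.wait()                # followers wait for the leader's result
        if call.error is not None:
            raise call.error
        return call.result


db_queries = []

def slow_db_fetch():
    db_queries.append(1)
    time.sleep(0.2)        # stand-in for a slow DB query
    return "product-data"

sf = SingleFlight()
barrier = threading.Barrier(8)
results = []

def worker():
    barrier.wait()         # release all threads at the same moment
    results.append(sf.do("product:123", slow_db_fetch))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_queries), "DB query for", len(results), "requests")
```

This protects the DB from one process's duplicate requests; across many app servers it is usually combined with the Redis lock or stale-while-revalidate.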

SLOs & Error Budgets

Metric	Target	Rationale
Cache hit ratio	> 95%	Below 90% means DB becomes bottleneck
Get p99 latency	< 2ms	Cache exists to be faster than DB (10-50ms)
Availability	99.9%	Cache miss fallback to DB — not revenue-critical but latency-critical
Eviction rate	< 50/sec sustained	High eviction = working set exceeds capacity

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Cache cluster failover during traffic spike	Hit ratio drops from 95% to 20%; DB QPS 10× normal; p99 latency > 500ms	Enable L1 local cache emergency mode (extend TTL to 5 min); rate-limit DB reads; pre-warm top-10K keys from DB; scale DB read replicas
Hot key causes single shard CPU 100%	Per-shard CPU metric; single key QPS > 100K; other shards idle	Promote key to L1 replication; split key into sub-keys; add read replicas to hot shard; alert on-call for manual key replication
Memory exhaustion triggers mass eviction	Eviction rate > 1000/sec; hit ratio cliff; latency spike	Emergency: increase maxmemory or add nodes; identify large keys (MEMORY USAGE); set tighter TTLs on low-value keys; evict keys that have a TTL first (a volatile-* maxmemory-policy)

Cost Drivers (Staff lens)

RAM: 10 TB working set × $4/GB/month ≈ $40K/month (largest cost)
Cross-AZ network for replication: proportional to write QPS
DB read cost when cache misses — indirect but often exceeds cache infra at low hit ratios

Multi-Region & DR

Cache is regional-first — no cross-region cache coherence (too expensive). Each region has independent cluster; cache-aside rebuilds from regional DB replica. Global hot keys: CDN or read-only global replica with 60s TTL. Invalidation: publish to regional pub/sub, not global sync.