This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Distributed Lock Manager.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Redlock algorithm, Fencing tokens, Lease-based locks? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Redlock algorithm
- Fencing tokens
- Lease-based locks
- ZooKeeper vs etcd for coordination
- Lock contention & fairness
- Clock drift issues
Out of scope (state explicitly)
- Application business logic that acquires and releases locks
- Building ZooKeeper or etcd from scratch (use as coordination backends)
- Workflow / saga orchestration (#103) — locks are a primitive, not an orchestrator
- Multi-region active-active lock federation (unless staff asks)
Assumptions
- Distributed systems interview — prioritize correctness under partition/failure
- Clarify consistency vs availability trade-off before picking quorum sizes
- Team can run managed Kafka/etcd/RDS; focus on application semantics
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Acquire lock: A process acquires a named lock with a timeout (lease duration)
- Release lock: The lock holder explicitly releases the lock
- Auto-release: Lock automatically released after TTL expires (prevents deadlocks from crashed holders)
- Mutual exclusion: At most one process holds the lock at any time
- Reentrant locks (optional): Same process can acquire the same lock multiple times
- Read-Write locks (optional): Multiple readers OR one writer
- Try-lock: Non-blocking attempt to acquire; return immediately if unavailable
- Lock metadata: See who holds the lock, when it was acquired, when it expires
- Correctness: The lock MUST provide mutual exclusion: this is the primary requirement
- High Availability: Lock service must be highly available (SPOF means entire system blocks)
- Low Latency: Lock acquisition in < 5 ms p99
- Fault Tolerant: Survive node failures without leaving orphaned (permanently held) locks
- Deadlock Prevention: Auto-release via TTL prevents permanent deadlocks
- Fairness: Optional: FIFO ordering for lock waiters
Distributed locks are lightweight: the challenge is correctness, not scale.
| Metric | Calculation | Value |
|---|---|---|
| Active locks | Given concurrent workflows | 1M |
| Lock operations / sec | Given peak | 100K |
| Avg lock hold time | Typical lease | 5 seconds |
| Lock metadata size | owner + token + TTL | 200 bytes |
| Total memory | 1M × 200B | 200 MB |
Approach 1: Single Redis Node (Fastest but Weakest)
ACQUIRE: SET lock:{name} {owner_id} NX EX {ttl_seconds}
NX = only set if Not eXists (atomic), EX = expire after ttl_seconds
Returns "OK" → lock acquired, nil → lock held by someone else
RELEASE: (must be atomic — use Lua script)
if redis.call("GET", key) == owner_id then
return redis.call("DEL", key)
else
return 0 -- not the lock holder; don't delete!
endApproach 2: Redlock Algorithm (Distributed Redis: Safer)
Redlock — 5 Independent Redis Nodes: 1. Record start time 2. Attempt SET key uuid NX PX ttl on all 5 nodes (in parallel) 3. Count successes: if >= 3 nodes succeed AND elapsed time < TTL → lock held 4. If quorum not reached, release on all nodes immediately Survives up to 2 out of 5 Redis instances failing
Approach 3: ZooKeeper-Based Lock (Strongest Correctness)
Lock path: /locks/resource-name/
Algorithm:
1. Create ephemeral sequential znode under /locks/{resource}/
2. Get all children
3. If your znode has the lowest sequence number → you hold the lock
4. If not → set a watch on the znode with the next-lower sequence number
5. When that znode is deleted → you're notified → recheck
6. To release: delete your znode
7. If client crashes: ephemeral znode auto-deletedApproach 4: etcd-Based Lock (Modern Alternative)
lease, _ := client.Grant(ctx, 30) // 30-second TTL lease
_, err := client.Put(ctx, "/locks/resource", "owner-id", clientv3.WithLease(lease.ID))Redlock vs ZooKeeper vs etcd: When to Use Which
Redis SET NX: single-node, <1ms, but no fencing token and async replication can lose locks on failover. Use for non-critical coordination only.
Redlock: quorum on 5 independent Redis nodes; survives 2 failures. Martin Kleppmann critique: clock drift + GC pauses can still cause dual holders: always pair with fencing tokens on the protected resource.
ZooKeeper / etcd: consensus-backed; ephemeral znodes / leases auto-release on session expiry. Built-in monotonic fencing (zxid / revision). Use for leader election, config locks, and anything protecting shared mutable state.
Fencing Tokens (Critical for Correctness)
Without fencing: Client A acquires lock → GC pause → TTL expires Client B acquires lock → writes to storage Client A wakes up → still thinks it holds lock → corrupts B's write With fencing: Lock service returns token=33 (monotonic) Storage rejects writes with token <= last_seen_token (32) Client A's stale write (token=33) rejected if B already wrote with token=34
Lock Granularity & Hold Time
- Coarse lock (whole table): simple but kills throughput
- Row-level lock: higher concurrency; watch for deadlock if acquiring multiple rows
- Keep TTL short (10–30s) + heartbeat extend for long work: limits blast radius of crashed holder
Acquire Lock
acquire(lock_name: string, owner_id: string, ttl_ms: int)
→ {acquired: bool, lock_token: string, expires_at: timestamp}
// With blocking wait
acquire_blocking(lock_name, owner_id, ttl_ms, wait_timeout_ms)
// Non-blocking try
try_acquire(lock_name, owner_id, ttl_ms)
→ {acquired: bool} (returns immediately)Release / Extend / Info
release(lock_name, lock_token) → {released: bool}
extend(lock_name, lock_token, additional_ttl_ms) → {extended: bool, new_expires_at}
info(lock_name) → {held, owner, acquired_at, expires_at}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Redis Lock Entry
Key: lock:{resource_name}
Value: {owner_id}:{uuid_token} // identifies who holds it
TTL: 30 seconds (configurable)ZooKeeper Lock Structure
/locks/
/resource-A/
/lock-0000000001 (ephemeral sequential)
/lock-0000000002
/resource-B/
/lock-0000000001Fencing Token
On lock acquisition, increment a global counter:
fencing_token = INCR global:fence:{lock_name}
When making the protected operation (e.g., writing to DB):
Include fencing_token in the request
Storage server rejects requests with fencing_token < last_seen_tokenThe Martin Kleppmann Problem (Zombie Lock Holders)
Scenario: 1. Client A acquires lock (TTL=30s) 2. Client A starts a long GC pause (60 seconds!) 3. Lock auto-expires at 30s 4. Client B acquires the lock 5. Client A wakes up, thinks it still holds the lock 6. BOTH A and B operate on the shared resource → DATA CORRUPTION Solution: Fencing Tokens T=0: Client A acquires lock, gets fencing_token = 33 T=31: Client B acquires lock, gets fencing_token = 34 T=60: Client A wakes up, sends write with fencing_token = 33 Storage server: "fencing_token 33 < last_seen 34 → REJECT"
| Concern | Solution |
|---|---|
| Lock holder crashes | TTL auto-expires the lock; ZK ephemeral node auto-deletes |
| Network partition | Majority quorum (Redlock/ZK/etcd) ensures only one side can acquire |
| Clock drift (Redlock) | Use bounded clock drift assumption; set TTL conservatively |
| GC pause extends past TTL | Fencing tokens prevent stale holders from causing damage |
| Split brain (two holders) | Consensus-based systems (ZK, etcd) prevent this |
| Deadlock | TTL-based auto-release prevents permanent deadlocks |
Choosing the Right Approach
Lock Renewal / Heartbeat Pattern
Lock acquired with TTL = 30s Background thread: Every 10s, extend lock by 30s If holder crashes → thread stops → lock expires naturally (Redisson implements this as RLock)
Read-Write Locks
- Read lock: Multiple readers allowed simultaneously
- Write lock: Exclusive: no readers or other writers
- ZooKeeper: Read locks don't block other reads; write lock blocks all
Distributed Semaphore
Generalization: Allow N concurrent holders (not just 1). ZooKeeper: Allow lock if count of existing children < N. Redis: DECR semaphore:{name}; if result >= 0 → acquired.
Advisory vs. Mandatory Locks
- Advisory: Lock is cooperative: processes must voluntarily check the lock
- Mandatory: Enforced by the system (e.g., DB row locks). Cannot be bypassed
- Distributed locks are almost always advisory
Interview Walkthrough
- Start with use cases: inventory decrement, leader election, cron singleton — not every coordination need requires a lock.
- Compare Redis Redlock vs ZooKeeper/etcd ephemeral sequential nodes — trade speed for fencing-token safety.
- Explain lease TTL + heartbeat renewal so crashed holders auto-release without manual intervention.
- Cover fencing tokens to prevent stale lock holders from writing after losing the lease.
- Mention when optimistic concurrency (version column) beats a distributed lock for low-contention writes.
- Common pitfall: Redis lock without fencing token — a paused process can resume and corrupt shared state after TTL expiry.
Comparison Summary
| Feature | Redis SET NX | Redlock | ZooKeeper | etcd |
|---|---|---|---|---|
| Consistency | Weak (async replication) | Stronger (majority) | Strong (ZAB consensus) | Strong (Raft) |
| Latency | < 1 ms | ~5-10 ms | ~5-20 ms | ~5-10 ms |
| Complexity | Very simple | Moderate | Complex (JVM, session mgmt) | Moderate |
| Fairness | No | No | Yes (sequential znodes) | Yes (lease revision) |
| Auto-release | TTL | TTL | Ephemeral node + session | Lease TTL |
| Fencing token | Manual | Manual | Built-in (zxid) | Built-in (revision) |
Redlock Algorithm: Step-by-Step
Lock acquisition for resource "inventory:item-42": Generate: lock_key = "lock:inventory:item-42", owner_id = UUID "abc-123", ttl = 10s Send SET lock_key abc-123 NX PX 10000 to ALL 5 Redis instances (in parallel) 3 out of 5 respond OK → majority → Compute validity: elapsed = 50ms, validity = ttl - elapsed - clock_drift_bound validity > 0 → LOCK ACQUIRED
ZooKeeper Lock: Why It's Stronger (and Slower)
Why ZK is correct: - ZAB consensus: CREATE is linearizable → total order guaranteed - Ephemeral: if client crashes → session expires → znode deleted → lock released - No TTL needed: session heartbeat (not wall clock) determines liveness - No clock drift problem: ZK uses logical ordering, not timestamps Performance: Acquire: 2 ZK operations → ~15-20ms Release: 1 ZK operation → ~5-10ms Compare: Redis SET NX → ~0.5ms
Lock Renewal: The Watchdog Pattern
main_thread:
lock = acquire("resource-X", ttl=30s)
watchdog = start_renewal_thread(lock, renewal_interval=10s)
do_critical_work() // may take 45 seconds
watchdog.stop()
lock.release()
renewal_thread (runs every 10 seconds):
success = extend_lock(lock.key, lock.owner_id, new_ttl=30s)
if not success: signal_main_thread_to_abort()Decision Tree: When to Use Each Approach
Is correctness critical (financial, inventory)? YES → ZooKeeper or etcd + fencing tokens NO → Continue below Is latency critical (< 1ms)? YES → Single Redis SET NX (if operations are idempotent) NO → Continue below Do you need to survive Redis failover? YES → Redlock (5 independent Redis) + fencing tokens for critical ops NO → Single Redis SET NX is sufficient Summary: "This is just to avoid redundant work" → Single Redis, no fencing "This protects a database write" → Redlock + fencing token "This involves money" → ZooKeeper/etcd + fencing token, always
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core distributed lock manager flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.