Design a Distributed Lock Manager

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a distributed lock manager.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Redlock algorithm, Fencing tokens, Lease-based locks?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Redlock algorithm
Fencing tokens
Lease-based locks
ZooKeeper vs etcd for coordination
Lock contention & fairness
Clock drift issues

Out of scope (state explicitly)

Application business logic that acquires and releases locks
Building ZooKeeper or etcd from scratch (use as coordination backends)
Workflow / saga orchestration (#103) — locks are a primitive, not an orchestrator
Multi-region active-active lock federation (unless staff asks)

Assumptions

Distributed systems interview — prioritize correctness under partition/failure
Clarify consistency vs availability trade-off before picking quorum sizes
Team can run managed Kafka/etcd/RDS; focus on application semantics

Functional Requirements

Acquire lock: A process acquires a named lock with a timeout (lease duration)
Release lock: The lock holder explicitly releases the lock
Auto-release: Lock automatically released after TTL expires (prevents deadlocks from crashed holders)
Mutual exclusion: At most one process holds the lock at any time
Reentrant locks (optional): Same process can acquire the same lock multiple times
Read-Write locks (optional): Multiple readers OR one writer
Try-lock: Non-blocking attempt to acquire; return immediately if unavailable
Lock metadata: See who holds the lock, when it was acquired, when it expires

Capacity Estimation

Distributed locks are lightweight: the full lock table fits in the memory of a single node, so the challenge is correctness, not scale.

Metric	Calculation	Value
Active locks	Given concurrent workflows	1M
Lock operations / sec	Given peak	100K
Avg lock hold time	Typical lease	5 seconds
Lock metadata size	owner + token + TTL	200 bytes
Total memory	1M × 200B	200 MB
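
The arithmetic in the table can be checked in a few lines. This is a minimal Python sketch; the constant and function names are illustrative, and the Little's law cross-check is an added sanity check rather than a figure from the table.

```python
# Figures taken from the capacity table.
ACTIVE_LOCKS = 1_000_000      # concurrently held locks
AVG_HOLD_SECONDS = 5          # typical lease / hold time
METADATA_BYTES = 200          # owner + token + TTL per lock


def total_memory_mb(active_locks: int, bytes_per_lock: int) -> float:
    """Memory needed to keep every active lock's metadata resident (decimal MB)."""
    return active_locks * bytes_per_lock / 1_000_000


def implied_acquires_per_sec(active_locks: int, hold_seconds: float) -> float:
    """Little's law (L = lambda * W): steady-state acquire rate for a given hold time."""
    return active_locks / hold_seconds


print(total_memory_mb(ACTIVE_LOCKS, METADATA_BYTES))
print(implied_acquires_per_sec(ACTIVE_LOCKS, AVG_HOLD_SECONDS))
```

If the implied acquire rate differs from the stated peak of 100K operations per second, that is a cue to clarify with the interviewer whether the 1M figure is a peak and whether many locks are renewed and held for longer than the typical lease.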


Approach 1: Single Redis Node (Fastest but Weakest)

ACQUIRE:  SET lock:{name} {owner_id} NX EX {ttl_seconds}
  NX = only set if Not eXists (atomic), EX = expire after ttl_seconds
  Returns "OK" → lock acquired, nil → lock held by someone else

RELEASE:  (must be atomic: use a Lua script; KEYS[1] = lock key, ARGV[1] = owner_id)
  if redis.call("GET", KEYS[1]) == ARGV[1] then
      return redis.call("DEL", KEYS[1])
  else
      return 0  -- not the lock holder; don't delete!
  end
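
The acquire and release semantics above can be modelled without a Redis server. The following Python sketch is an in-memory stand-in, not a Redis client: `SingleNodeLockStore` and its injected `clock` are hypothetical names used only to make the set-if-absent and compare-and-delete behaviour testable.

```python
from typing import Callable, Dict, Optional, Tuple


class SingleNodeLockStore:
    """In-memory model of one node: SET NX EX plus owner-checked delete."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        # key -> (owner_id, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def _live_owner(self, key: str) -> Optional[str]:
        """Return the current owner, dropping the entry if its TTL has passed."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        owner_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return owner_id

    def acquire(self, key: str, owner_id: str, ttl_seconds: float) -> bool:
        """Models SET key owner NX EX ttl: succeeds only if no live holder exists."""
        if self._live_owner(key) is not None:
            return False
        self._entries[key] = (owner_id, self._clock() + ttl_seconds)
        return True

    def release(self, key: str, owner_id: str) -> bool:
        """Models the Lua script: delete only when the caller is the holder."""
        if self._live_owner(key) != owner_id:
            return False
        del self._entries[key]
        return True


now = [0.0]  # fake clock so expiry can be exercised deterministically
store = SingleNodeLockStore(clock=lambda: now[0])
print(store.acquire("lock:report", "worker-a", ttl_seconds=30))
print(store.acquire("lock:report", "worker-b", ttl_seconds=30))
```

The owner check in `release` is the important part: without it, a holder whose lease already expired could delete a lock that now belongs to someone else.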

Approach 2: Redlock Algorithm (Distributed Redis: Safer)

Redlock — 5 Independent Redis Nodes:
1. Record start time
2. Attempt SET key uuid NX PX ttl on all 5 nodes (in parallel)
3. Count successes: if >= 3 nodes succeed AND validity (TTL - elapsed time - clock drift bound) > 0 → lock held
4. If quorum is not reached or validity <= 0, release on all 5 nodes immediately (including nodes that did not reply)

Survives up to 2 out of 5 Redis instances failing
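
The quorum logic can be sketched against in-memory nodes. This is a simplified Python model under stated assumptions: `FakeRedisNode` is hypothetical, the nodes are contacted sequentially rather than in parallel to keep the example deterministic, and times are in milliseconds from an injected clock.

```python
import uuid
from typing import Callable, Dict, List, Optional, Tuple


class FakeRedisNode:
    """Hypothetical in-memory node; up=False simulates an unreachable instance."""

    def __init__(self, clock: Callable[[], float], up: bool = True) -> None:
        self.clock = clock
        self.up = up
        self.data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at_ms)

    def set_nx_px(self, key: str, value: str, ttl_ms: float) -> bool:
        if not self.up:
            return False
        entry = self.data.get(key)
        if entry is not None and self.clock() < entry[1]:
            return False  # a live holder exists on this node
        self.data[key] = (value, self.clock() + ttl_ms)
        return True

    def delete_if_equal(self, key: str, value: str) -> None:
        if self.up and key in self.data and self.data[key][0] == value:
            del self.data[key]


def redlock_acquire(
    nodes: List[FakeRedisNode],
    key: str,
    ttl_ms: float,
    clock: Callable[[], float],
    drift_ms: float = 0.0,
) -> Optional[Tuple[str, float]]:
    """Return (token, validity_ms) when a majority is reached in time, else None."""
    token = str(uuid.uuid4())
    start = clock()
    successes = sum(1 for node in nodes if node.set_nx_px(key, token, ttl_ms))
    validity = ttl_ms - (clock() - start) - drift_ms
    quorum = len(nodes) // 2 + 1
    if successes >= quorum and validity > 0:
        return token, validity
    # Failed: undo any partial acquisition on every node.
    for node in nodes:
        node.delete_if_equal(key, token)
    return None
```

With five nodes the quorum is three, so the call still succeeds when two nodes are down and fails, with cleanup, when three are.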

Approach 3: ZooKeeper-Based Lock (Strongest Correctness)

Lock path: /locks/resource-name/
Algorithm:
1. Create ephemeral sequential znode under /locks/{resource}/
2. Get all children
3. If your znode has the lowest sequence number → you hold the lock
4. If not → set a watch on the znode with the next-lower sequence number
5. When that znode is deleted → you're notified → recheck
6. To release: delete your znode
7. If client crashes: ephemeral znode auto-deleted
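
Steps 2 to 4 reduce to a pure function over the list of children. The Python sketch below assumes znode names ending in ZooKeeper's zero-padded 10-digit sequence suffix; the `lock-` prefix and the function names are illustrative, and no ZooKeeper client is involved.

```python
from typing import List, Optional, Tuple


def sequence_of(znode: str) -> int:
    """Extract the 10-digit counter that ZooKeeper appends to sequential znodes."""
    return int(znode[-10:])


def lock_state(children: List[str], mine: str) -> Tuple[bool, Optional[str]]:
    """Return (holds_lock, znode_to_watch) for the znode `mine`.

    The lowest sequence number holds the lock; every other waiter watches only
    its immediate predecessor.
    """
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(mine)
    if index == 0:
        return True, None
    return False, ordered[index - 1]


children = ["lock-0000000012", "lock-0000000007", "lock-0000000009"]
print(lock_state(children, "lock-0000000007"))
print(lock_state(children, "lock-0000000012"))
```

Watching only the predecessor, rather than the whole directory, means a release wakes one waiter instead of all of them.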

Approach 4: etcd-Based Lock (Modern Alternative)

GO

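lease, _ := client.Grant(ctx, 30)  // 30-second TTL lease
// Create the key only if absent (a bare Put would overwrite the current holder); resp.Succeeded reports acquisition.
resp, err := client.Txn(ctx).If(clientv3.Compare(clientv3.CreateRevision("/locks/resource"), "=", 0)).
    Then(clientv3.OpPut("/locks/resource", "owner-id", clientv3.WithLease(lease.ID))).Commit()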
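
To see why a lease-backed, create-if-absent key behaves like a lock, the following Python sketch models the relevant behaviour in memory. `FakeEtcd` is a hypothetical toy, not the etcd API: it keeps a global revision counter, stores a create revision per key, and deletes a key when its lease expires.

```python
from typing import Dict, Optional, Tuple


class FakeEtcd:
    """Toy single-process model of revisions, leases, and a put-if-absent transaction."""

    def __init__(self) -> None:
        self.revision = 0  # global, monotonically increasing
        # key -> (value, create_revision, lease_id)
        self.kv: Dict[str, Tuple[str, int, int]] = {}
        self._next_lease = 0

    def grant(self) -> int:
        """Hand out a new lease id (TTL bookkeeping is omitted in this model)."""
        self._next_lease += 1
        return self._next_lease

    def put_if_absent(self, key: str, value: str, lease_id: int) -> Optional[int]:
        """Models 'if create_revision(key) == 0 then put'. Returns the create revision."""
        if key in self.kv:
            return None
        self.revision += 1
        self.kv[key] = (value, self.revision, lease_id)
        return self.revision

    def expire_lease(self, lease_id: int) -> None:
        """Lease expiry removes every key attached to the lease."""
        doomed = [k for k, (_, _, lid) in self.kv.items() if lid == lease_id]
        for k in doomed:
            del self.kv[k]
        if doomed:
            self.revision += 1


etcd = FakeEtcd()
lease_a = etcd.grant()
print(etcd.put_if_absent("/locks/resource", "owner-a", lease_a))
```

Because each successful acquisition returns a strictly larger revision than the previous one, the create revision can double as a fencing token.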

The Martin Kleppmann Problem (Zombie Lock Holders)

Scenario:
1. Client A acquires lock (TTL=30s)
2. Client A starts a long GC pause (60 seconds!)
3. Lock auto-expires at 30s
4. Client B acquires the lock
5. Client A wakes up, thinks it still holds the lock
6. BOTH A and B operate on the shared resource → DATA CORRUPTION

Solution: Fencing Tokens
  T=0:   Client A acquires lock, gets fencing_token = 33
  T=31:  Client B acquires lock, gets fencing_token = 34
  T=32:  Client B writes with fencing_token = 34 → storage records last_seen = 34
  T=60:  Client A wakes up, sends write with fencing_token = 33
  Storage server: "fencing_token 33 < last_seen 34 → REJECT"
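
The storage-side check is small enough to sketch in full. This Python example is an illustration under assumptions: `FencedStorage` is a hypothetical class guarding a single resource, and it accepts a repeated token so that the current holder can issue several writes.

```python
from typing import Dict


class FencedStorage:
    """Rejects any write whose fencing token is older than the newest one seen."""

    def __init__(self) -> None:
        self.last_seen_token = 0
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str, fencing_token: int) -> bool:
        if fencing_token < self.last_seen_token:
            return False  # stale holder: its lease expired and a newer holder has written
        self.last_seen_token = fencing_token
        self.data[key] = value
        return True


storage = FencedStorage()
print(storage.write("row", "from B", fencing_token=34))  # B writes first
print(storage.write("row", "from A", fencing_token=33))  # A wakes up late
print(storage.data["row"])
```

The check only protects the resource if every writer passes its token and the storage layer enforces it; the lock service alone cannot stop a paused client from sending a late write.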

Concern	Solution
Lock holder crashes	TTL auto-expires the lock; ZK ephemeral node auto-deletes
Network partition	Majority quorum (Redlock/ZK/etcd) ensures only one side can acquire
Clock drift (Redlock)	Assume a bounded clock drift and subtract it from the lock validity time; set TTL conservatively
GC pause extends past TTL	Fencing tokens prevent stale holders from causing damage
Split brain (two holders)	Consensus-based systems (ZK, etcd) prevent this
Deadlock	TTL-based auto-release prevents permanent deadlocks

Choosing the Right Approach

Scenario	Recommendation
Efficiency lock (cache dedup, non-critical)	Single Redis SET NX EX
Correctness lock (billing, inventory)	ZooKeeper or etcd — consensus-backed
Middle ground (important but not financial)	Redlock (5 independent Redis)

Comparison Summary

Feature	Redis SET NX	Redlock	ZooKeeper	etcd
Consistency	Weak (async replication)	Stronger (majority)	Strong (ZAB consensus)	Strong (Raft)
Latency	< 1 ms	~5-10 ms	~5-20 ms	~5-10 ms
Complexity	Very simple	Moderate	Complex (JVM, session mgmt)	Moderate
Fairness	No	No	Yes (sequential znodes)	Yes (waiters ordered by key create revision)
Auto-release	TTL	TTL	Ephemeral node + session	Lease TTL
Fencing token	Manual	Manual	Built-in (zxid)	Built-in (revision)

Redlock Algorithm: Step-by-Step

Lock acquisition for resource "inventory:item-42":
  Generate: lock_key = "lock:inventory:item-42", owner_id = UUID "abc-123", ttl = 10s
  
  Send SET lock_key abc-123 NX PX 10000 to ALL 5 Redis instances (in parallel)
  3 out of 5 respond OK → majority reached
  
  Compute validity: elapsed = 50ms, validity = ttl - elapsed - clock_drift_bound
  validity > 0 → LOCK ACQUIRED
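
The validity calculation can be worked through with the numbers above. The walkthrough does not give a value for `clock_drift_bound`, so this Python sketch assumes a bound of 1% of the TTL plus 2 ms, a convention used by several Redlock client libraries; treat that default as an assumption and substitute your own bound.

```python
from typing import Optional


def redlock_validity_ms(ttl_ms: int, elapsed_ms: int, drift_bound_ms: Optional[int] = None) -> int:
    """Remaining time during which the lock may safely be considered held."""
    if drift_bound_ms is None:
        # Assumed default: 1% of the TTL plus a 2 ms floor.
        drift_bound_ms = ttl_ms // 100 + 2
    return ttl_ms - elapsed_ms - drift_bound_ms


validity = redlock_validity_ms(ttl_ms=10_000, elapsed_ms=50)
print(validity, "acquired" if validity > 0 else "not acquired")
```

Under that assumed bound, the client should finish its critical section within the returned validity window instead of relying on the full 10-second TTL.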

ZooKeeper Lock: Why It's Stronger (and Slower)

Why ZK is correct:
  - ZAB consensus: CREATE is linearizable → total order guaranteed
  - Ephemeral: if client crashes → session expires → znode deleted → lock released
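  - No client-side TTL needed: session heartbeats (not wall-clock timestamps) determine liveness; a client paused past its session timeout can still act as a zombie, so keep fencing tokens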
  - No clock drift problem: ZK uses logical ordering, not timestamps

Performance:
  Acquire: 2 ZK operations → ~15-20ms
  Release: 1 ZK operation → ~5-10ms
  Compare: Redis SET NX → ~0.5ms

Lock Renewal: The Watchdog Pattern

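main_thread:
  lock = acquire("resource-X", ttl=30s)
  watchdog = start_renewal_thread(lock, renewal_interval=10s)
  try:
    do_critical_work()  // may take 45 seconds; must check for the abort signal
  finally:
    watchdog.stop()     // stop renewing before releasing, even if the work fails
    lock.release()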

renewal_thread (runs every 10 seconds):
  success = extend_lock(lock.key, lock.owner_id, new_ttl=30s)
  if not success: signal_main_thread_to_abort()
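
A runnable version of the pattern, as a minimal Python sketch using only the standard library. The `extend` and `release` callables stand in for calls to the lock service and are assumptions, as are the names `Watchdog` and `run_with_lock`; the critical work is expected to poll the `lost` event and abort when it is set.

```python
import threading
from typing import Callable


class Watchdog:
    """Calls `extend` every `interval` seconds; sets `lost` if an extension fails."""

    def __init__(self, extend: Callable[[], bool], interval: float) -> None:
        self._extend = extend
        self._interval = interval
        self._stop = threading.Event()
        self.lost = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> "Watchdog":
        self._thread.start()
        return self

    def _run(self) -> None:
        # Event.wait returns False on timeout, which means it is time to renew.
        while not self._stop.wait(self._interval):
            if not self._extend():
                self.lost.set()  # tell the worker the lock is gone
                return

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


def run_with_lock(
    extend: Callable[[], bool],
    release: Callable[[], None],
    work: Callable[[threading.Event], None],
    interval: float,
) -> None:
    """Run `work` while renewing the lock; always stop renewal before releasing."""
    watchdog = Watchdog(extend, interval).start()
    try:
        work(watchdog.lost)
    finally:
        watchdog.stop()
        release()
```

Renewing at one third of the TTL, as in the pseudocode (10 s against 30 s), leaves room for a couple of failed renewal attempts before the lease actually expires.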

Decision Tree: When to Use Each Approach

Is correctness critical (financial, inventory)?
  YES → ZooKeeper or etcd + fencing tokens
  NO → Continue below

Is latency critical (< 1ms)?
  YES → Single Redis SET NX (if operations are idempotent)
  NO → Continue below

Do you need to survive Redis failover?
  YES → Redlock (5 independent Redis) + fencing tokens for critical ops
  NO → Single Redis SET NX is sufficient
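
The three questions above translate directly into a function. This Python sketch only encodes the tree as written; the function and parameter names are illustrative.

```python
def recommend_lock(
    correctness_critical: bool,
    latency_critical: bool,
    must_survive_failover: bool,
) -> str:
    """Walk the decision tree top to bottom and return the recommended approach."""
    if correctness_critical:
        return "ZooKeeper or etcd + fencing tokens"
    if latency_critical:
        return "Single Redis SET NX (if operations are idempotent)"
    if must_survive_failover:
        return "Redlock (5 independent Redis) + fencing tokens for critical ops"
    return "Single Redis SET NX"


print(recommend_lock(correctness_critical=False, latency_critical=False, must_survive_failover=True))
```

The order of the checks matters: correctness is asked first, so a billing path never falls through to a Redis-only answer even when latency is also a concern.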

Summary:
  "This is just to avoid redundant work" → Single Redis, no fencing
  "This protects a database write" → Redlock + fencing token
  "This involves money" → ZooKeeper/etcd + fencing token, always

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
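
An availability target becomes easier to reason about once it is converted into an error budget. The Python sketch below does that conversion; the 30-day window is an assumption, since the table does not state a measurement window.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    return (1 - availability_target) * window_days * 24 * 60


print(round(error_budget_minutes(0.9995), 1))
```

At 99.95% over 30 days the budget is about 21.6 minutes, which bounds how long a lock-service failover or a bad deploy can last before the SLO is breached.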

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
