Design a Leaderboard System

Interview Prompt

Design Leaderboard System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Redis sorted sets, Rank queries, Sharded leaderboards?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Redis sorted sets
Rank queries
Sharded leaderboards
Time-scoped boards
Tie-breaking
Capacity estimation with shown math

Out of scope (state explicitly)

Full catalog/search infrastructure (#12)
Payment checkout flow (#24)
Fraud and abuse ML pipelines

Assumptions

Clarify scale (DAU, QPS, data volume) for leaderboard system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Real-time global leaderboard showing top-K users by score
User can query their own rank among millions of players
Update scores in real time (increment, set, decay)
Multiple leaderboards: daily, weekly, seasonal, all-time, per-game-mode
Friends leaderboard: rank among friends only
Historical leaderboards: view past week/season final standings
Tie-breaking: same score → earlier achievement time ranks higher
Percentile ranking: "You are in the top 5% of all players"

Metric	Calculation	Value
Total players	Given (assumption documented in value)	100M
Active players per leaderboard	Given (assumption documented in value)	10M
Score updates / sec (peak)	From Score updates / day ÷ 86400 (+ peak factor in value)	100K
Top-K queries / sec	From Top-K queries / day ÷ 86400 (+ peak factor in value)	50K
Rank lookup queries / sec	From Rank lookup queries / day ÷ 86400 (+ peak factor in value)	200K
Leaderboards (total)	Given	~1,000
Memory per leaderboard	Given	~1.5 GB

Loading...

Redis Sorted Set: Core Data Structure

ZADD lb:weekly {score} {user_id}           → O(log N) insert/update
ZREVRANK lb:weekly {user_id}               → O(log N) get user's rank
ZREVRANGE lb:weekly 0 99 WITHSCORES        → O(log N + K) top K
ZINCRBY lb:weekly {delta} {user_id}        → O(log N) atomic increment
ZREVRANGE lb:weekly rank-5 rank+5          → O(log N + K) neighbors

Internal: Skip List — balanced probabilistic data structure
  10M members: ~20 operations per query (log₂(10M) ≈ 23)
  Each operation: ~1µs → total: ~20µs per query
  Memory: ~1.5 GB for 10M entries (key + score + overhead)

Tie-Breaking Mechanism

Problem: Two users have score=1000 → who ranks higher?
Rule: Earlier achievement = higher rank

Implementation: Encode timestamp into the score
  score_key = score × 10^10 + (MAX_TIMESTAMP - actual_timestamp)

Example (MAX_TIMESTAMP = 9999999999, must be < 10^10 so it never
         disturbs the score ordering):
  User A: score=1000 at t=100 → key = 1000_0000000000 + 9999999899
  User B: score=1000 at t=200 → key = 1000_0000000000 + 9999999799
  User A's key > User B's key → User A ranks higher ✓

Scaling Beyond 100M Members

Approach 1: Single Redis Sorted Set (recommended up to ~100M)
  100M × 150 bytes = 15 GB → fits in single large Redis instance
  All operations still O(log N) ≈ O(27)

Approach 2: Sharded by Score Range (for > 100M)
  Shard 0: scores 0-999 → Redis instance 0
  Shard N: scores 9000-9999 → Redis instance N
  Top-K: query highest shard → get top from that shard
  Rank: sum counts in all higher shards + ZREVRANK within shard

Approach 3: Fenwick Tree / Binary Indexed Tree (for percentile rank)
  Score range [0, MAX_SCORE] → array of counts
  rank(score) = prefix_sum(MAX_SCORE) - prefix_sum(score)
  O(log S) where S = max score range, regardless of user count

Friends Leaderboard

Fetch all friend IDs from social graph, ZSCORE lb:weekly {friend_id} for each friend (pipelined), sort in application. Cost: O(F) where F = friend count.

Percentile Rank at Scale

Approach 1: ZCARD + ZREVRANK (exact, two Redis calls). Approach 2: Maintain 1000 score buckets, each tracking count. percentile = sum(buckets above user's score) / total_users. O(1) lookup, refreshed every 1 minute.

Event Bus Design (Kafka)

Topic: leaderboard_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "leaderboard_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: leaderboard_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a Leaderboard System: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

HTTP

# Score updates
POST /api/leaderboard/{lb_id}/score
{ "user_id": "u_123", "score_delta": 50 }

# Read operations
GET /api/leaderboard/{lb_id}/top?k=100            → Top K players
GET /api/leaderboard/{lb_id}/rank/{user_id}        → User's rank + score
GET /api/leaderboard/{lb_id}/around/{user_id}?n=10 → 10 above + 10 below
GET /api/leaderboard/{lb_id}/friends/{user_id}     → Rank among friends
GET /api/leaderboard/{lb_id}/percentile/{user_id}  → "Top 5%" info

# Historical
GET /api/leaderboard/{lb_id}/history?date=2026-03-01 → Archived standings

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

SQL

-- PostgreSQL (Historical Snapshots)
CREATE TABLE leaderboard_snapshots (
    leaderboard_id TEXT,
    snapshot_date  DATE,
    rank           INT,
    user_id        UUID,
    score          BIGINT,
    PRIMARY KEY (leaderboard_id, snapshot_date, rank)
);

-- Redis (Active Leaderboard)
-- ZADD lb:weekly:game1 {score_with_tiebreak} {user_id}
-- HSET user:display:{user_id} name "Alice" avatar_url "..." level 42

Kafka (Score Events)

Topic: score-events
  Key: user_id
  Value: { "lb_id": "weekly:game1", "user_id": "u_123", "delta": 50,
           "new_score": 1250, "timestamp": "..." }

Technique	Application
Redis persistence	RDB snapshots + AOF → recover on restart
Redis Cluster	3 masters + 3 replicas; auto-failover
Rebuild from Kafka	If Redis lost → replay score events → rebuild leaderboard
Read replicas	Separate read replicas for top-K (heavy read load)
Leaderboard rotation	Weekly reset → archive to PostgreSQL → DEL key

Leaderboard Rotation: Race Condition

Problem: At midnight, rotate weekly leaderboard

Naive approach (WRONG):
  At 00:00:
    1. ZRANGEBYSCORE → snapshot top 10K to PostgreSQL
    2. DEL lb:weekly
    3. New week begins
  Race: score update arrives between step 1 and 2 → lost forever

Correct approach: Key rotation with grace period ⭐
  Leaderboard key includes time bucket: lb:weekly:2026-W11
  At 00:00:
    1. Update "current week" pointer: SET current_lb_week "2026-W12"
    2. All NEW scores go to lb:weekly:2026-W12
    3. Background job: snapshot old to PostgreSQL
    4. After snapshot confirmed: DEL old key
  ✓ No race: "current week" atomically switches with SET

Cache Stampede

Many users request top-K simultaneously. Solution: Read from Redis replicas (eventual consistency OK). If Redis master fails → replica promotes in ~seconds.

Score Tampering

Server validates scores (never trust client-submitted scores). Anti-cheat: anomaly detection (score increase too fast → flag + review).

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.