Interview Prompt
Design Leaderboard System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Redis sorted sets, Rank queries, Sharded leaderboards? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Redis sorted sets
- Rank queries
- Sharded leaderboards
- Time-scoped boards
- Tie-breaking
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full catalog/search infrastructure (#12)
- Payment checkout flow (#24)
- Fraud and abuse ML pipelines
Assumptions
- Clarify scale (DAU, QPS, data volume) for leaderboard system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Real-time global leaderboard showing top-K users by score
- User can query their own rank among millions of players
- Update scores in real time (increment, set, decay)
- Multiple leaderboards: daily, weekly, seasonal, all-time, per-game-mode
- Friends leaderboard: rank among friends only
- Historical leaderboards: view past week/season final standings
- Tie-breaking: same score → earlier achievement time ranks higher
- Percentile ranking: "You are in the top 5% of all players"
- Low Latency: Top-K query < 10ms; rank lookup < 50ms
- High Write Throughput: 100K score updates/sec during peak
- Consistency: Eventual consistency OK (1-2 second delay acceptable)
- Scalability: 100M+ players per leaderboard
- Availability: 99.99%: leaderboard is core game feature
| Metric | Calculation | Value |
|---|---|---|
| Total players | Given (assumption documented in value) | 100M |
| Active players per leaderboard | Given (assumption documented in value) | 10M |
| Score updates / sec (peak) | From Score updates / day ÷ 86400 (+ peak factor in value) | 100K |
| Top-K queries / sec | From Top-K queries / day ÷ 86400 (+ peak factor in value) | 50K |
| Rank lookup queries / sec | From Rank lookup queries / day ÷ 86400 (+ peak factor in value) | 200K |
| Leaderboards (total) | Given | ~1,000 |
| Memory per leaderboard | Given | ~1.5 GB |
Redis Sorted Set: Core Data Structure
ZADD lb:weekly {score} {user_id} → O(log N) insert/update
ZREVRANK lb:weekly {user_id} → O(log N) get user's rank
ZREVRANGE lb:weekly 0 99 WITHSCORES → O(log N + K) top K
ZINCRBY lb:weekly {delta} {user_id} → O(log N) atomic increment
ZREVRANGE lb:weekly rank-5 rank+5 → O(log N + K) neighbors
Internal: Skip List — balanced probabilistic data structure
10M members: ~20 operations per query (log₂(10M) ≈ 23)
Each operation: ~1µs → total: ~20µs per query
Memory: ~1.5 GB for 10M entries (key + score + overhead)Tie-Breaking Mechanism
Problem: Two users have score=1000 → who ranks higher?
Rule: Earlier achievement = higher rank
Implementation: Encode timestamp into the score
score_key = score × 10^10 + (MAX_TIMESTAMP - actual_timestamp)
Example (MAX_TIMESTAMP = 9999999999, must be < 10^10 so it never
disturbs the score ordering):
User A: score=1000 at t=100 → key = 1000_0000000000 + 9999999899
User B: score=1000 at t=200 → key = 1000_0000000000 + 9999999799
User A's key > User B's key → User A ranks higher ✓Scaling Beyond 100M Members
Approach 1: Single Redis Sorted Set (recommended up to ~100M) 100M × 150 bytes = 15 GB → fits in single large Redis instance All operations still O(log N) ≈ O(27) Approach 2: Sharded by Score Range (for > 100M) Shard 0: scores 0-999 → Redis instance 0 Shard N: scores 9000-9999 → Redis instance N Top-K: query highest shard → get top from that shard Rank: sum counts in all higher shards + ZREVRANK within shard Approach 3: Fenwick Tree / Binary Indexed Tree (for percentile rank) Score range [0, MAX_SCORE] → array of counts rank(score) = prefix_sum(MAX_SCORE) - prefix_sum(score) O(log S) where S = max score range, regardless of user count
Friends Leaderboard
Fetch all friend IDs from social graph, ZSCORE lb:weekly {friend_id} for each friend (pipelined), sort in application. Cost: O(F) where F = friend count.
Percentile Rank at Scale
Approach 1: ZCARD + ZREVRANK (exact, two Redis calls). Approach 2: Maintain 1000 score buckets, each tracking count. percentile = sum(buckets above user's score) / total_users. O(1) lookup, refreshed every 1 minute.
Event Bus Design (Kafka)
Topic: leaderboard_system-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "leaderboard_system-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: leaderboard_system-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Leaderboard System: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
# Score updates
POST /api/leaderboard/{lb_id}/score
{ "user_id": "u_123", "score_delta": 50 }
# Read operations
GET /api/leaderboard/{lb_id}/top?k=100 → Top K players
GET /api/leaderboard/{lb_id}/rank/{user_id} → User's rank + score
GET /api/leaderboard/{lb_id}/around/{user_id}?n=10 → 10 above + 10 below
GET /api/leaderboard/{lb_id}/friends/{user_id} → Rank among friends
GET /api/leaderboard/{lb_id}/percentile/{user_id} → "Top 5%" info
# Historical
GET /api/leaderboard/{lb_id}/history?date=2026-03-01 → Archived standingsCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
-- PostgreSQL (Historical Snapshots)
CREATE TABLE leaderboard_snapshots (
leaderboard_id TEXT,
snapshot_date DATE,
rank INT,
user_id UUID,
score BIGINT,
PRIMARY KEY (leaderboard_id, snapshot_date, rank)
);
-- Redis (Active Leaderboard)
-- ZADD lb:weekly:game1 {score_with_tiebreak} {user_id}
-- HSET user:display:{user_id} name "Alice" avatar_url "..." level 42Kafka (Score Events)
Topic: score-events
Key: user_id
Value: { "lb_id": "weekly:game1", "user_id": "u_123", "delta": 50,
"new_score": 1250, "timestamp": "..." }| Technique | Application |
|---|---|
| Redis persistence | RDB snapshots + AOF → recover on restart |
| Redis Cluster | 3 masters + 3 replicas; auto-failover |
| Rebuild from Kafka | If Redis lost → replay score events → rebuild leaderboard |
| Read replicas | Separate read replicas for top-K (heavy read load) |
| Leaderboard rotation | Weekly reset → archive to PostgreSQL → DEL key |
Leaderboard Rotation: Race Condition
Problem: At midnight, rotate weekly leaderboard
Naive approach (WRONG):
At 00:00:
1. ZRANGEBYSCORE → snapshot top 10K to PostgreSQL
2. DEL lb:weekly
3. New week begins
Race: score update arrives between step 1 and 2 → lost forever
Correct approach: Key rotation with grace period ⭐
Leaderboard key includes time bucket: lb:weekly:2026-W11
At 00:00:
1. Update "current week" pointer: SET current_lb_week "2026-W12"
2. All NEW scores go to lb:weekly:2026-W12
3. Background job: snapshot old to PostgreSQL
4. After snapshot confirmed: DEL old key
✓ No race: "current week" atomically switches with SETCache Stampede
Many users request top-K simultaneously. Solution: Read from Redis replicas (eventual consistency OK). If Redis master fails → replica promotes in ~seconds.
Score Tampering
Server validates scores (never trust client-submitted scores). Anti-cheat: anomaly detection (score increase too fast → flag + review).
Score Decay
Inactive players sit at top forever → leaderboard feels stale. Solutions: Separate leaderboards with time windows (weekly resets), daily decay job (ZINCRBY), or recency-weighted score.
Interview Walkthrough
- Explain Redis sorted set (ZADD/ZREVRANK) as the canonical leaderboard structure for O(log N) updates.
- Cover sharded leaderboards for global vs friends vs weekly scopes.
- Discuss approximate rank at billion-player scale via HyperLogLog or bucketed tiers.
- Mention async score updates from game events via queue — don't block gameplay on leaderboard write.
- Cover tie-breaking policy (timestamp, user_id) and state it explicitly.
- Common pitfall: SQL ORDER BY on every read — relational DB cannot serve real-time global rankings at scale.
Why Redis Sorted Set Wins
| Solution | Top-K | User Rank | Update | Memory | Complexity | |---|---|---|---|---|---|---| | MySQL ORDER BY score | O(N log N) | O(N) | O(log N) | Disk | High | | Redis Sorted Set ⭐ | O(log N + K) | O(log N) | O(log N) | RAM | Low | | Fenwick Tree | O(S) | O(log S) | O(log S) | RAM | Medium | | Pre-computed ranks | O(1) | O(1) | O(N) recompute | RAM | High | Redis Sorted Set wins because: 1. O(log N) for ALL operations — no trade-offs 2. Built-in ZREVRANK for instant rank lookup 3. ZREVRANGE for top-K in one command 4. ZINCRBY for atomic score updates 5. Fits in memory for realistic user counts (100M = 15 GB)
Multi-Game, Multi-Region at Scale
2,500 leaderboards × 1.5 GB avg = 3.75 TB. Won't fit in single Redis cluster. Solution: Shard by leaderboard_id (consistent hashing). Hot leaderboards (> 1M active): dedicated Redis shard. Cold (< 100K): shared.
Percentile at 1B+ Players
Score Histogram: 1000 buckets × 8 bytes = 8 KB. Accuracy ±0.1%. Update: decrement old bucket, increment new. Precompute prefix sums → O(1) lookup. This is how games like Candy Crush show "Top 5%" without 15 GB Redis sorted sets.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core leaderboard system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.