Interview Prompt
Design a Multiplayer Game Backend.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Game loop tick rate, Client-side prediction, Server reconciliation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Game loop tick rate
- Client-side prediction
- Server reconciliation
- Lag compensation
- Spatial partitioning (interest management)
- UDP vs TCP
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the multiplayer game backend in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Real-time game state synchronization (tick-based: 20-60 ticks/sec)
- Matchmaking: group players of similar skill into game sessions
- Lobby system: create/join rooms, invite friends, ready-up
- Player input processing with server-authoritative validation (anti-cheat)
- Game state persistence: save progress, inventory, stats
- Leaderboard: global and seasonal rankings
- In-game chat (text and voice)
- Replay system: record and playback full game sessions
- Spectator mode: watch live games with slight delay
- Cross-platform play (PC, console, mobile)
- Ultra-Low Latency: < 50ms round-trip (ideally < 20ms)
- Tick Rate: Server processes game state 20-64 times per second
- High Availability: 99.9% target; a game server crash means the in-progress game is lost
- Consistency: Server is authoritative; all clients see the same game state
- Scalability: 10M+ concurrent players, 500K+ sessions
- Anti-Cheat: Server validates all actions; client is untrusted
| Metric | Calculation | Value |
|---|---|---|
| Concurrent players | Given | 10M |
| Concurrent game sessions | Given | 500K |
| Tick rate | Given | 30 ticks/sec |
| Total state updates / sec | 10M players × 30 ticks/sec | 300M |
| Network per player | Given | ~150 KB/sec |
| Total bandwidth | 10M players × ~150 KB/sec | ~1.5 TB/sec |
| Game servers needed | Given | ~50K |
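The table's arithmetic can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constants are the figures given in the table, decimal units are assumed (1 KB = 1,000 bytes), and the derived ratios (players per session, sessions per server, bytes per tick) are implied by those figures rather than stated in the source.

```python
# Back-of-envelope capacity math using the figures from the table above.
CONCURRENT_PLAYERS = 10_000_000
CONCURRENT_SESSIONS = 500_000
TICK_RATE_HZ = 30
BYTES_PER_PLAYER_PER_SEC = 150_000  # ~150 KB/sec, decimal units assumed
GAME_SERVERS = 50_000


def capacity_summary() -> dict:
    """Derive totals and per-unit ratios from the given inputs."""
    return {
        "state_updates_per_sec": CONCURRENT_PLAYERS * TICK_RATE_HZ,
        "total_bandwidth_tb_per_sec": CONCURRENT_PLAYERS * BYTES_PER_PLAYER_PER_SEC / 1e12,
        # Ratios implied by the table, not stated in it:
        "players_per_session": CONCURRENT_PLAYERS / CONCURRENT_SESSIONS,
        "sessions_per_server": CONCURRENT_SESSIONS / GAME_SERVERS,
        "bytes_per_player_per_tick": BYTES_PER_PLAYER_PER_SEC / TICK_RATE_HZ,
    }


if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value:,}")
```

The implied ratios are useful sanity checks in an interview: roughly 20 players per session, 10 sessions per server, and about 5 KB per player per tick before any compression.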
Netcode: Client-Side Prediction + Server Reconciliation
Problem: 50ms round-trip -> 50ms delay between pressing "move" and seeing movement
Solution: Client-Side Prediction + Server Reconciliation
1. Player presses "move forward"
2. Client: immediately moves character locally (feels responsive)
3. Client: sends input to server (timestamp + input)
4. Server: simulates movement, sends authoritative position back
5. Client: receives server position at time T
- If matches prediction -> great, no correction needed
- If different -> rewind to server state at T, replay all inputs since T
Netcode: Interpolation + Lag Compensation
Interpolation (other players):
Render other players slightly in the past (one tick is ~50ms at 20 ticks/sec, ~33ms at 30 ticks/sec)
Smooth interpolation between known positions
Result: smooth movement despite discrete server ticks
Lag Compensation (shooting):
Server-side rewind ("favor the shooter")
1. Player shoots at tick 100, sees world at tick 99
2. Server receives shot -> rewinds to tick 99
3. Checks if shot would hit at tick 99 -> YES -> register hit
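The rewind steps above can be sketched as a small history buffer on the server. This is a minimal illustration, not a production hit-registration system: `LagCompensator`, the 2D positions, the circular hitbox, and the 32-tick history length are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    tick: int
    positions: dict  # player_id -> (x, y)


class LagCompensator:
    """Keeps recent world snapshots so shots can be checked against the past."""

    def __init__(self, max_history_ticks: int = 32):  # history length is an assumption
        self._history = deque(maxlen=max_history_ticks)

    def record(self, tick: int, positions: dict) -> None:
        # Copy the positions so later mutation does not rewrite history.
        self._history.append(Snapshot(tick, dict(positions)))

    def rewind(self, tick: int):
        """Return the snapshot for `tick`, or None if it has aged out."""
        for snap in self._history:
            if snap.tick == tick:
                return snap
        return None

    def check_hit(self, shooter_view_tick, target_id, aim_point, hit_radius=1.0) -> bool:
        """Test the shot against where the target was at the tick the shooter saw."""
        snap = self.rewind(shooter_view_tick)
        if snap is None or target_id not in snap.positions:
            return False  # too old to verify: reject rather than trust the client
        tx, ty = snap.positions[target_id]
        ax, ay = aim_point
        return (tx - ax) ** 2 + (ty - ay) ** 2 <= hit_radius ** 2
```

Bounding the history also bounds how far back a client can claim to have seen the world, which limits abuse by players who fake high latency.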
Trade-off: shooter's experience prioritized, victim gets shot "around corners"
Matchmaking: Glicko-2 Algorithm
Queue process:
1. Player enters queue with (rating, deviation, region)
2. Initially: search for exact match (+/-50 rating, same region)
3. Over time: widen search (+/-100, +/-200, then cross-region)
4. Quality score: 1 - (avg_rating_diff / max_diff) x (1 - wait_penalty)
5. Accept match when quality > threshold
search_radius increases with wait time:
- 0-30s: +/-100 (tight, fair)
- 30-60s: +/-200
- 60-120s: +/-400
- 120s+: +/-1000 (match anyone)
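The widening search radius can be expressed as a simple step function. The sketch below uses the wait-time bands listed above; the half-open band boundaries and the rule that a pair only matches when the rating gap fits inside the smaller of the two players' radii are assumptions, and `search_radius` and `can_match` are hypothetical names.

```python
# (upper bound of wait in seconds, allowed rating difference)
RADIUS_BANDS = [
    (30, 100),    # 0-30s: tight, fair
    (60, 200),    # 30-60s
    (120, 400),   # 60-120s
]
MAX_RADIUS = 1000  # 120s+: match anyone


def search_radius(wait_seconds: float) -> int:
    """Rating window for a player who has waited `wait_seconds` in the queue."""
    for upper_bound, radius in RADIUS_BANDS:
        if wait_seconds < upper_bound:
            return radius
    return MAX_RADIUS


def can_match(rating_a: float, wait_a: float, rating_b: float, wait_b: float) -> bool:
    """Assumption: both players must tolerate the gap, so use the smaller radius."""
    allowed = min(search_radius(wait_a), search_radius(wait_b))
    return abs(rating_a - rating_b) <= allowed


if __name__ == "__main__":
    print(can_match(1500, 0, 1650, 45))   # a fresh player keeps a tight window
    print(can_match(1500, 40, 1650, 45))  # both have waited long enough
```

Using the smaller radius protects a player who just joined from being pulled into a lopsided match by someone who has waited a long time; using the larger radius would shorten queues at the cost of fairness.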
State Compression (Bandwidth Optimization)
Delta compression: only send what changed since the last acknowledged tick
- Player didn't move? Skip (0 bytes)
- Typical delta: 200 bytes vs 2 KB full state (10x reduction)
Priority-based updates:
- Nearby players: full update every tick (30/sec)
- Far players: reduced rate (10/sec)
- Off-screen: minimal updates (2/sec)
Result: 5-10x bandwidth reduction
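A minimal sketch of delta encoding against the last acknowledged state follows. It treats the world state as a flat dictionary of entity id to value, which is an assumption for illustration; real snapshots would be binary and field-level. `encode_delta` and `apply_delta` are hypothetical names.

```python
_MISSING = object()  # sentinel so a stored value of None is still compared correctly


def encode_delta(baseline: dict, current: dict) -> dict:
    """Return only what changed relative to the client's last acknowledged state."""
    changed = {
        key: value
        for key, value in current.items()
        if baseline.get(key, _MISSING) != value
    }
    removed = [key for key in baseline if key not in current]
    return {"changed": changed, "removed": removed}


def apply_delta(baseline: dict, delta: dict) -> dict:
    """Client side: rebuild the current state from the baseline plus a delta."""
    state = dict(baseline)
    state.update(delta["changed"])
    for key in delta["removed"]:
        state.pop(key, None)
    return state


if __name__ == "__main__":
    acked = {"p1": (0, 0), "p2": (5, 5), "p3": (9, 9)}
    now = {"p1": (0, 0), "p2": (6, 5)}
    delta = encode_delta(acked, now)
    print(delta)
    print(apply_delta(acked, delta) == now)
```

The baseline must be a state the client has acknowledged, not simply the previous tick. Otherwise a single lost UDP packet would leave the client applying deltas to a state it never received.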
Event Bus Design (Kafka)
Topic: multiplayer_game_backend-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (for example player_id or match_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "multiplayer_game_backend-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: multiplayer_game_backend-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
- Sync path: validate → persist source of truth → publish event → return 201
- Async path: consumers update caches, indexes, notifications, aggregates
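At-least-once delivery means a consumer can see the same event twice, so the handler has to deduplicate. The sketch below shows the idea without any Kafka client: `IdempotentHandler` is a hypothetical class, the in-memory set stands in for a durable dedup store, the list stands in for the DLQ topic, and "3 retries" is simplified to three attempts in total.

```python
class IdempotentHandler:
    """Wraps a side-effecting callback with dedup-by-event_id and a dead-letter list."""

    def __init__(self, apply, max_attempts: int = 3):
        self._apply = apply
        self._max_attempts = max_attempts
        self._seen = set()       # stand-in for a durable dedup store
        self.dead_letters = []   # stand-in for the DLQ topic

    def handle(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self._seen:
            return "duplicate"  # redelivery: skip the side effect
        for _ in range(self._max_attempts):
            try:
                self._apply(event)
            except Exception:
                continue  # a real consumer would log and back off here
            self._seen.add(event_id)
            return "processed"
        # Poison message: park it so the partition is not blocked.
        self.dead_letters.append(event)
        return "dead_lettered"


if __name__ == "__main__":
    applied = []
    handler = IdempotentHandler(applied.append)
    event = {"event_id": "e1", "type": "match_finished"}
    print(handler.handle(event), handler.handle(event), len(applied))
```

In a real deployment the dedup record and the side effect would need to be written atomically (or the side effect itself made idempotent); otherwise a crash between the two can still double-apply an event.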
# Matchmaking
POST /api/matchmaking/queue -> Enter queue {rating, region, party_ids}
DELETE /api/matchmaking/queue -> Leave queue
GET /api/matchmaking/status -> Queue status, estimated wait time
# Lobby
POST /api/lobbies -> Create lobby room
POST /api/lobbies/{id}/join -> Join lobby
POST /api/lobbies/{id}/ready -> Ready up
POST /api/lobbies/{id}/start -> Start game (host only)
# Game connection
{ "game_server": "gs-us-east-42.game.com", "port": 27015, "token": "jwt..." }
# Profile & Stats
GET /api/players/{id}/stats -> Win rate, K/D, matches played
# Leaderboard
GET /api/leaderboard?season=s12&page=1 -> Top players
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 440 Login Timeout (non-standard code): WebSocket session expired; reconnect required
PostgreSQL: Persistent Player Data
CREATE TABLE players (
player_id UUID PRIMARY KEY, username TEXT UNIQUE,
rating FLOAT DEFAULT 1500.0, rating_dev FLOAT DEFAULT 350.0,
volatility FLOAT DEFAULT 0.06, matches_played INT DEFAULT 0,
wins INT DEFAULT 0, created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE match_history (
match_id UUID, player_id UUID, team INT,
result TEXT CHECK (result IN ('win','loss','draw')),
kills INT DEFAULT 0, deaths INT DEFAULT 0,
rating_change FLOAT, played_at TIMESTAMPTZ,
PRIMARY KEY (match_id, player_id)
);
Redis + S3
# Redis
ZADD mm:queue:{region} {rating} {player_id} -- Matchmaking queue
HSET game:session:{game_id} server "gs-42" ... -- Active sessions
ZADD leaderboard:season:s12 {rating} {player_id} -- Rankings
# S3
Path: s3://replays/{match_id}.bin
Format: compressed tick log (protobuf per tick, gzipped)
Size: ~5 MB per 20-min game
| Technique | Application |
|---|---|
| Agones orchestration | Pre-warmed server pool; auto-replace crashed pods |
| Health checks | Gateway monitors game servers; redirect on failure |
| UDP resilience | Lost packets -> interpolation fills gaps |
| Kafka RF=3 | Match results, events survive broker failure |
| Redis Cluster | Session state, leaderboard survives node failure |
Game Server Crash Recovery
Short matches (< 10 min): game lost -> don't count for ranking
Long/ranked matches:
1. Checkpoint game state every 30s to Redis
2. On crash -> spin up new server -> load checkpoint -> players reconnect
3. Resume from ~30 seconds ago (acceptable)
Tournament matches:
- Shadow server receives all inputs in real-time
- On primary crash -> promote shadow -> seamless failover (less than 1s)
Anti-Cheat
Server-authoritative prevents: speed hacks, teleporting, infinite health, item duplication, wall shooting
Client-side cheats (harder):
- Wallhack: mitigate by only sending positions of players within view range
- Aimbot: statistical analysis (inhuman accuracy -> flag -> ban)
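To show why a server-authoritative simulation defeats speed hacks, here is a minimal sketch in which the client sends only a movement intent and the server computes the resulting position. The function name `apply_move_input`, the 2D coordinates, and the idea of clamping the intent vector to unit length are assumptions for illustration.

```python
import math


def apply_move_input(pos, move_vector, dt, max_speed):
    """Server side: compute the new position from an untrusted movement intent.

    The client never sends a position. Its intent vector is clamped to unit
    length, so inflating it cannot move the character faster than max_speed.
    """
    mx, my = move_vector
    magnitude = math.hypot(mx, my)
    if magnitude > 1.0:
        mx, my = mx / magnitude, my / magnitude
    return (pos[0] + mx * max_speed * dt, pos[1] + my * max_speed * dt)


if __name__ == "__main__":
    # A tampered client asks for 100x movement; the server still moves it at max_speed.
    print(apply_move_input((0.0, 0.0), (100.0, 0.0), dt=0.1, max_speed=5.0))
    # An honest half-tilt on an analog stick moves at half speed.
    print(apply_move_input((0.0, 0.0), (0.5, 0.0), dt=0.1, max_speed=5.0))
```

The same pattern applies to cooldowns and collisions: the server derives the outcome from inputs and its own state, so there is no client-reported value to forge.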
Why UDP Over TCP for Game State
TCP: guaranteed delivery + ordering -> but if a packet is lost, everything behind it waits
UDP: no guarantees -> if a packet is lost, skip it and use the next one
- Getting tick 101 is more important than retransmitting tick 99
- Lost tick -> interpolation fills the gap -> player barely notices
Use TCP for: chat messages, inventory changes, kill feed (reliable events)
Use UDP for: position updates, state snapshots (real-time, loss-tolerant)
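The "latest state wins" rule for UDP snapshots can be sketched as a receiver that tracks the newest tick it has applied and discards anything older. No sockets are involved here: `LatestStateReceiver` is a hypothetical class and packets are modelled as `(tick, state)` pairs.

```python
class LatestStateReceiver:
    """Applies only snapshots newer than the last one seen; late packets are dropped."""

    def __init__(self):
        self.latest_tick = -1
        self.latest_state = None
        self.dropped = 0

    def on_packet(self, tick: int, state) -> bool:
        """Return True if the snapshot was applied, False if it was stale or a duplicate."""
        if tick <= self.latest_tick:
            self.dropped += 1
            return False
        self.latest_tick = tick
        self.latest_state = state
        return True


if __name__ == "__main__":
    rx = LatestStateReceiver()
    # Tick 99 is delayed in the network and arrives after tick 100.
    for tick, state in [(98, "s98"), (100, "s100"), (99, "s99")]:
        print(tick, rx.on_packet(tick, state))
    print(rx.latest_tick, rx.dropped)
```

Because the receiver never waits for a missing tick, one lost or reordered packet costs a single update instead of stalling the stream, which is the head-of-line blocking that TCP would impose.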
Game Server Orchestration with Agones
Agones (Kubernetes-native game server orchestrator):
1. Maintain pool of "Ready" game servers (pre-warmed containers)
2. Matchmaker calls Agones API: "allocate server in us-east"
3. Agones marks Ready -> Allocated -> returns IP:port
4. Players connect -> game begins
5. Game ends -> server transitions to Shutdown -> pod recycled
Interview Walkthrough
- Lead with server-authoritative design — the server owns physics, health, and inventory; clients send inputs, never state.
- UDP for position ticks (loss-tolerant, latest state wins); TCP for chat, inventory, and kill feed (must be reliable).
- Fixed tick loop with a strict budget (~33ms at 30 ticks/sec): receive → validate → simulate → snapshot → broadcast → record.
- Delta-encoded personalized snapshots — each client receives only what changed in their view, not the full world state.
- Entity interpolation on clients renders smooth movement between server ticks ~100ms behind real-time.
- Agones pre-warms game server pods; matchmaker allocates by region to minimize player-to-server latency.
- Common pitfall: using TCP for real-time movement — head-of-line blocking on one lost packet freezes all players until retransmit completes.
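To make the first walkthrough point concrete (clients send inputs, the server owns state), here is a minimal sketch of the client side of prediction and reconciliation described in the netcode section. It is one-dimensional for brevity and uses input sequence numbers in place of timestamps; `PredictingClient`, `step`, and `SPEED` are hypothetical names, and the sketch assumes client and server share the same deterministic movement rule.

```python
from dataclasses import dataclass, field

SPEED = 1.0  # units moved per input; hypothetical constant


def step(position: float, move: float) -> float:
    """Movement rule shared by client and server (assumed deterministic)."""
    return position + move * SPEED


@dataclass
class PredictingClient:
    position: float = 0.0
    next_seq: int = 0
    pending: list = field(default_factory=list)  # (seq, move) not yet acknowledged

    def apply_input(self, move: float) -> int:
        """Predict locally right away and remember the input for later replay."""
        seq = self.next_seq
        self.next_seq += 1
        self.position = step(self.position, move)
        self.pending.append((seq, move))
        return seq  # the (seq, move) pair is what would be sent to the server

    def on_server_state(self, last_acked_seq: int, server_position: float) -> None:
        """Rewind to the authoritative state, then replay unacknowledged inputs."""
        self.pending = [(s, m) for s, m in self.pending if s > last_acked_seq]
        position = server_position
        for _, move in self.pending:
            position = step(position, move)
        self.position = position


if __name__ == "__main__":
    client = PredictingClient()
    for _ in range(3):
        client.apply_input(1.0)
    client.on_server_state(last_acked_seq=1, server_position=1.5)  # server disagreed
    print(client.position, client.pending)
```

When the server's result matches the prediction, the replay lands on the same position and the player sees no correction; when it differs, only the still-unacknowledged inputs are re-applied on top of the server's answer.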
The Game Loop: One Tick End-to-End (33ms Budget)
Server tick rate: 30 ticks/sec -> 33ms per tick
Tick 1042 processing:
- T=0ms: RECEIVE - Read all pending UDP packets (~60 inputs)
- T=2ms: VALIDATE - Check each input (speed, cooldown, collision)
- T=8ms: SIMULATE - Physics, projectiles, hit detection
- T=15ms: SNAPSHOT - Build game state for tick 1042
- T=18ms: BROADCAST - Personalized snapshots with delta encoding
- T=22ms: RECORD - Append to replay buffer
- T=23ms: IDLE (10ms headroom for spikes)
Total: ~23ms of 33ms budget.
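A fixed-rate loop of this shape can be sketched as follows. The phase callbacks stand in for receive, validate, simulate, snapshot, broadcast, and record; `run_ticks` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time

TICK_RATE_HZ = 30
TICK_SECONDS = 1 / TICK_RATE_HZ  # ~33ms budget per tick


def run_ticks(num_ticks, phases, clock=time.perf_counter, sleep=time.sleep):
    """Run every phase once per tick, then sleep off whatever budget is left.

    Returns the tick numbers that overran their budget. Deadlines advance by a
    fixed step rather than from "now", so one slow tick does not shift the
    schedule of every later tick.
    """
    overruns = []
    next_deadline = clock() + TICK_SECONDS
    for tick in range(num_ticks):
        for phase in phases:
            phase(tick)
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)       # idle headroom
        else:
            overruns.append(tick)  # budget blown: worth alerting on
        next_deadline += TICK_SECONDS
    return overruns


if __name__ == "__main__":
    def simulate(tick):
        pass  # physics, hit detection, and so on would run here

    print(run_ticks(5, [simulate]))
```

Tracking overruns per tick gives a direct signal for the headroom mentioned above: a server that regularly reports overruns is hosting too many sessions or simulating too much per tick.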
Entity Interpolation: Making Other Players Move Smoothly
Without interpolation: choppy movement between ticks
With interpolation: the client renders BETWEEN two known server states (100ms behind real-time):
- T=0ms: player at (100, 50) - tick 100
- T=11ms: player at (101.7, 50) - 1/3 between ticks
- T=22ms: player at (103.3, 50) - 2/3 between ticks
- T=33ms: player at (105, 50) - tick 101
Trade-off: other players rendered ~100ms behind real-time. Competitive games use tick rates of 64-128 (Valorant: 128-tick = 7.8ms per tick).
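The positions in the example above come from plain linear interpolation between two snapshots. A minimal sketch, assuming each snapshot is a `(time_ms, (x, y))` pair and that the render time is clamped to the interval between them (`interpolate_position` is a hypothetical name):

```python
def interpolate_position(snap_a, snap_b, render_time_ms):
    """Linearly interpolate between two server snapshots.

    Each snapshot is (time_ms, (x, y)). The blend factor is clamped to [0, 1],
    so this never extrapolates beyond the newer snapshot.
    """
    (t_a, pos_a), (t_b, pos_b) = snap_a, snap_b
    if t_b <= t_a:
        raise ValueError("snap_b must be newer than snap_a")
    alpha = (render_time_ms - t_a) / (t_b - t_a)
    alpha = max(0.0, min(1.0, alpha))
    return tuple(a + (b - a) * alpha for a, b in zip(pos_a, pos_b))


if __name__ == "__main__":
    tick_100 = (0.0, (100.0, 50.0))
    tick_101 = (33.0, (105.0, 50.0))
    for t in (0, 11, 22, 33):
        x, y = interpolate_position(tick_100, tick_101, t)
        print(f"T={t}ms: ({x:.1f}, {y:.1f})")
```

Rendering behind real time is what guarantees a newer snapshot is usually available to interpolate toward; if it is not, clamping holds the entity at its last known position instead of guessing.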
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core multiplayer game backend flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
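An availability target becomes easier to reason about once it is converted into an error budget. The sketch below does that conversion for the 99.95% target in the table; the 30-day window is an assumption, and `error_budget_minutes` is a hypothetical helper.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    total_minutes = window_days * 24 * 60
    return (1 - availability_target) * total_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9995, 0.9999):
        print(f"{target:.2%}: {error_budget_minutes(target):.2f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is the number to weigh against the rollback and failover times quoted in an incident plan.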
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.