Design a Multiplayer Game Backend

Interview Prompt

Design a multiplayer game backend.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Game loop tick rate, Client-side prediction, Server reconciliation?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Game loop tick rate
Client-side prediction
Server reconciliation
Lag compensation
Spatial partitioning (interest management)
UDP vs TCP

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the multiplayer game backend in the first 5 minutes
Standard reliability target is 99.9%–99.99% unless the problem implies a higher one (e.g. payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Real-time game state synchronization (tick-based: 20-60 ticks/sec)
Matchmaking: group players of similar skill into game sessions
Lobby system: create/join rooms, invite friends, ready-up
Player input processing with server-authoritative validation (anti-cheat)
Game state persistence: save progress, inventory, stats
Leaderboard: global and seasonal rankings
In-game chat (text and voice)
Replay system: record and playback full game sessions
Spectator mode: watch live games with slight delay
Cross-platform play (PC, console, mobile)

Metric	Calculation	Value
Concurrent players	Given	10M
Concurrent game sessions	Given	500K
Tick rate	Given	30 ticks/sec
Total state updates / sec	10M players × 30 ticks/sec	300M
Network per player	Given	~150 KB/sec
Total bandwidth	10M players × ~150 KB/sec	~1.5 TB/sec
Game servers needed	Given	~50K
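
The figures in the table can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes); the derived per-session and per-server ratios are not stated in the table and are only implied by its numbers.

```python
# Back-of-envelope check of the capacity table (decimal units assumed).
CONCURRENT_PLAYERS = 10_000_000
CONCURRENT_SESSIONS = 500_000
TICK_RATE_HZ = 30
BYTES_PER_PLAYER_PER_SEC = 150_000   # ~150 KB/sec per player
GAME_SERVERS = 50_000

state_updates_per_sec = CONCURRENT_PLAYERS * TICK_RATE_HZ
total_bandwidth_tb_per_sec = CONCURRENT_PLAYERS * BYTES_PER_PLAYER_PER_SEC / 1e12

# Ratios implied by the table, not given in it.
players_per_session = CONCURRENT_PLAYERS / CONCURRENT_SESSIONS
sessions_per_server = CONCURRENT_SESSIONS / GAME_SERVERS

# Time available to simulate one tick.
tick_budget_ms = 1000 / TICK_RATE_HZ
```

The implied ratios (20 players per session, 10 sessions per server) are worth saying out loud in an interview, because they determine how much CPU and memory one game server process needs.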


Netcode: Client-Side Prediction + Server Reconciliation

Problem: with a 50ms round trip and no prediction, movement appears at least 50ms after the player presses "move"

Solution: Client-Side Prediction + Server Reconciliation
  1. Player presses "move forward"
  2. Client: immediately moves character locally (feels responsive)
  3. Client: sends input to server (timestamp + input)
  4. Server: simulates movement, sends authoritative position back
  5. Client: receives server position at time T
     - If matches prediction -> great, no correction needed
     - If different -> rewind to server state at T, replay all inputs since T
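
The steps above can be sketched for one-dimensional movement. This is an illustrative sketch, not a definitive implementation: the names `Input`, `PredictingClient`, and `apply_input` are hypothetical, and it identifies inputs by a sequence number rather than the timestamp T mentioned above (an assumption made for simplicity).

```python
from dataclasses import dataclass, field


@dataclass
class Input:
    seq: int      # client-assigned sequence number (stands in for the timestamp)
    dx: float     # requested movement for this input


def apply_input(x: float, inp: Input, speed: float = 1.0) -> float:
    """Shared movement rule; client and server must run the same logic."""
    return x + inp.dx * speed


@dataclass
class PredictingClient:
    x: float = 0.0
    next_seq: int = 0
    pending: list = field(default_factory=list)  # inputs not yet acked by the server

    def press(self, dx: float) -> Input:
        inp = Input(self.next_seq, dx)
        self.next_seq += 1
        self.x = apply_input(self.x, inp)  # predict locally, no waiting
        self.pending.append(inp)           # remember it for possible replay
        return inp                         # this is what would be sent to the server

    def on_server_state(self, server_x: float, last_acked_seq: int) -> None:
        # Discard inputs the server has already simulated.
        self.pending = [i for i in self.pending if i.seq > last_acked_seq]
        # Rewind to the authoritative position, then replay unacked inputs.
        x = server_x
        for inp in self.pending:
            x = apply_input(x, inp)
        self.x = x
```

Always rewinding and replaying gives the same result as the "if it matches, do nothing" branch when the prediction was right, so a real client can skip the explicit comparison.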

Netcode: Interpolation + Lag Compensation

Interpolation (other players):
  Render other players ~50ms in the past (one tick at 20 ticks/sec; about 1.5 ticks at 30 ticks/sec)
  Smooth interpolation between known positions
  Result: smooth movement despite discrete server ticks
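
Interpolation between known positions can be sketched as a linear blend between the two snapshots that bracket the render time. The function name, the snapshot layout `(time_ms, x)`, and the 50ms default delay are assumptions for illustration.

```python
from bisect import bisect_right


def interpolate_position(snapshots, now_ms, delay_ms=50):
    """Return a remote player's x position, rendered delay_ms in the past.

    snapshots: list of (time_ms, x) tuples sorted by time_ms.
    """
    render_time = now_ms - delay_ms
    times = [t for t, _ in snapshots]
    i = bisect_right(times, render_time)
    if i == 0:
        return snapshots[0][1]       # render time precedes all known snapshots
    if i == len(snapshots):
        return snapshots[-1][1]      # no newer snapshot yet: hold last known position
    (t0, x0), (t1, x1) = snapshots[i - 1], snapshots[i]
    alpha = (render_time - t0) / (t1 - t0)   # 0.0 at t0, 1.0 at t1
    return x0 + (x1 - x0) * alpha
```

Holding the last known position when snapshots run out is the simplest fallback; games often extrapolate briefly instead, which is outside this sketch.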

Lag Compensation (shooting):
  Server-side rewind ("favor the shooter")
  1. Player shoots at tick 100, sees world at tick 99
  2. Server receives shot -> rewinds to tick 99
  3. Checks if shot would hit at tick 99 -> YES -> register hit
  Trade-off: shooter's experience prioritized, victim gets shot "around corners"
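
Server-side rewind needs a short history of past positions. The sketch below keeps a bounded buffer of per-tick positions and tests the shot against the tick the shooter was seeing. The class name, the circular hit test, and the 32-tick history length are assumptions for illustration.

```python
from collections import deque


class LagCompensator:
    def __init__(self, max_ticks=32):
        # Bounded history of (tick, {player_id: (x, y)}); oldest entries fall off.
        self.history = deque(maxlen=max_ticks)

    def record(self, tick, positions):
        self.history.append((tick, dict(positions)))

    def positions_at(self, tick):
        for t, positions in reversed(self.history):
            if t == tick:
                return positions
        return None  # tick is too old or never recorded

    def check_hit(self, shooter_view_tick, target_id, aim_point, hit_radius=0.5):
        """Rewind to the tick the shooter saw and test the shot there."""
        past = self.positions_at(shooter_view_tick)
        if past is None or target_id not in past:
            return False  # refuse to rewind beyond the stored history
        tx, ty = past[target_id]
        ax, ay = aim_point
        return (tx - ax) ** 2 + (ty - ay) ** 2 <= hit_radius ** 2
```

Capping the history length also caps how far back a high-latency (or dishonest) client can ask the server to rewind.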

Matchmaking: Glicko-2 Algorithm

Queue process:
  1. Player enters queue with (rating, deviation, region)
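  2. Initially: search for a close match (tight rating window, same region)
  3. Over time: widen the rating window (schedule below), then allow cross-region
  4. Quality score: quality = 1 - (avg_rating_diff / max_diff) x (1 - wait_penalty), assuming wait_penalty rises from 0 to 1 with wait time so the rating gap counts less the longer a player has waited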
  5. Accept match when quality > threshold

search_radius increases with wait time:
  0-30s: +/-100 (tight, fair)
  30-60s: +/-200
  60-120s: +/-400
  120s+: +/-1000 (match anyone)
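
The widening schedule and quality score can be written down directly. This sketch makes several assumptions: each bucket includes its lower bound, the quality formula is read with standard operator precedence, and two players match only when the rating gap fits inside both players' current radius (`can_match` is a hypothetical helper, not part of the text above).

```python
def search_radius(wait_seconds: float) -> int:
    """Rating window (+/-) as a function of time spent in the queue."""
    if wait_seconds < 30:
        return 100
    if wait_seconds < 60:
        return 200
    if wait_seconds < 120:
        return 400
    return 1000


def match_quality(avg_rating_diff: float, max_diff: float, wait_penalty: float) -> float:
    """quality = 1 - (avg_rating_diff / max_diff) * (1 - wait_penalty).

    wait_penalty is assumed to be in [0, 1]; at 1 the rating gap is ignored.
    """
    diff_ratio = min(avg_rating_diff / max_diff, 1.0)
    return 1.0 - diff_ratio * (1.0 - wait_penalty)


def can_match(rating_a, wait_a, rating_b, wait_b) -> bool:
    """Assumed rule: the gap must fit inside both players' search radius."""
    gap = abs(rating_a - rating_b)
    return gap <= search_radius(wait_a) and gap <= search_radius(wait_b)
```

Requiring the gap to fit both radii protects a player who just joined from being pulled into a lopsided match by someone who has waited two minutes.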

State Compression (Bandwidth Optimization)

Delta compression: only send what changed since last acknowledged tick
  Player didn't move? Skip (0 bytes)
  Typical delta: 200 bytes vs 2 KB full state (10x reduction)
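
Delta compression can be sketched as a dictionary diff against the last state the client acknowledged. The function names and the `{"set": ..., "del": ...}` wire shape are assumptions; a real protocol would use a compact binary encoding.

```python
def make_delta(acked_state: dict, current_state: dict) -> dict:
    """Build a delta relative to the last state the client acknowledged."""
    changed = {k: v for k, v in current_state.items() if acked_state.get(k) != v}
    removed = [k for k in acked_state if k not in current_state]
    return {"set": changed, "del": removed}


def apply_delta(base_state: dict, delta: dict) -> dict:
    """Client side: rebuild the full state from the acked base plus a delta."""
    new_state = dict(base_state)
    new_state.update(delta["set"])
    for key in delta["del"]:
        new_state.pop(key, None)
    return new_state
```

Diffing against the last acknowledged tick, rather than the last sent tick, is what makes this safe over a lossy transport: a dropped packet never leaves the client with a base state the server did not expect.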

Priority-based updates:
  Nearby players: full update every tick (30/sec)
  Far players: reduced rate (10/sec)
  Off-screen: minimal updates (2/sec)
  Result: 5-10x bandwidth reduction
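
The tiered update rates can be expressed as a distance lookup plus a per-tick send check. The radii below are hypothetical values, and "off-screen" is approximated as "beyond the view radius"; the sketch also assumes each rate divides the 30 ticks/sec server rate evenly.

```python
import math


def updates_per_second(observer, entity, near_radius=50.0, view_radius=150.0) -> int:
    """Pick an update rate from the distance between observer and entity."""
    distance = math.dist(observer, entity)
    if distance <= near_radius:
        return 30   # nearby: every tick
    if distance <= view_radius:
        return 10   # far but visible
    return 2        # treated as off-screen


def should_send(tick: int, rate_hz: int, tick_rate_hz: int = 30) -> bool:
    """Send on every Nth tick, where N = tick_rate_hz / rate_hz."""
    interval = tick_rate_hz // rate_hz
    return tick % interval == 0
```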

Event Bus Design (Kafka)

Topic: multiplayer_game_backend-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (player_id / match_id; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "multiplayer_game_backend-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: multiplayer_game_backend-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers
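
The consumer-side guarantees (at-least-once delivery, dedup by event_id, dead-lettering after 3 retries) can be sketched without any Kafka client. Everything here is an in-memory stand-in: `EventProcessor` is a hypothetical name, the `set` would be a durable store in practice, and the `dead_letters` list stands in for the DLQ topic.

```python
class EventProcessor:
    MAX_RETRIES = 3  # retries after the first attempt, per the DLQ rule above

    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()   # stand-in for a durable dedup store
        self.dead_letters = []        # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"        # redelivery: side effects already applied
        for _ in range(1 + self.MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue              # retry; a real worker would back off here
            else:
                self.seen_event_ids.add(event_id)
                return "processed"
        self.dead_letters.append(event)   # poison message: park it, keep consuming
        return "dead_lettered"
```

Recording the event_id only after the handler succeeds means a crash mid-handler leads to a retry rather than a silently dropped event, which is why the handler itself must tolerate being run twice.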

Rule: async side effects must not block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

PostgreSQL: Persistent Player Data

SQL

CREATE TABLE players (
    player_id      UUID PRIMARY KEY,
    username       TEXT NOT NULL UNIQUE,
    -- Glicko-2 state: rating, rating deviation, volatility
    rating         FLOAT DEFAULT 1500.0,
    rating_dev     FLOAT DEFAULT 350.0,
    volatility     FLOAT DEFAULT 0.06,
    matches_played INT DEFAULT 0,
    wins           INT DEFAULT 0,
    created_at     TIMESTAMPTZ DEFAULT NOW()
);

-- One row per player per match
CREATE TABLE match_history (
    match_id      UUID,
    player_id     UUID REFERENCES players (player_id),
    team          INT,
    result        TEXT CHECK (result IN ('win','loss','draw')),
    kills         INT DEFAULT 0,
    deaths        INT DEFAULT 0,
    rating_change FLOAT,
    played_at     TIMESTAMPTZ,
    PRIMARY KEY (match_id, player_id)
);

Redis + S3

# Redis
ZADD mm:queue:{region} {rating} {player_id}     -- Matchmaking queue
HSET game:session:{game_id} server "gs-42" ...   -- Active sessions
ZADD leaderboard:season:s12 {rating} {player_id}  -- Rankings

# S3
Path: s3://replays/{match_id}.bin
Format: compressed tick log (protobuf per tick, gzipped)
Size: ~5 MB per 20-min game

Technique	Application
Agones orchestration	Pre-warmed server pool; auto-replace crashed pods
Health checks	Gateway monitors game servers; redirect on failure
UDP resilience	Lost packets -> interpolation fills gaps
Kafka RF=3	Match results, events survive broker failure
Redis Cluster	Session state, leaderboard survives node failure

Game Server Crash Recovery

Short matches (< 10 min): match is abandoned -> do not count it toward ranking

Long/ranked matches:
  1. Checkpoint game state every 30s to Redis
  2. On crash -> spin up new server -> load checkpoint -> players reconnect
  3. Resume from the last checkpoint, losing at most ~30 seconds of play (acceptable)
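
The checkpoint-and-restore flow can be sketched with a dictionary standing in for Redis. The key format `game:checkpoint:{game_id}`, the JSON encoding, and all function names are assumptions for illustration; no real Redis client is used.

```python
import json

CHECKPOINT_INTERVAL_S = 30


class CheckpointStore:
    """In-memory stand-in for Redis, exposing only set/get."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def maybe_checkpoint(store, game_id, state, now_s, last_checkpoint_s):
    """Save the state if the interval has elapsed; return the last save time."""
    if now_s - last_checkpoint_s < CHECKPOINT_INTERVAL_S:
        return last_checkpoint_s
    payload = json.dumps({"saved_at": now_s, "state": state})
    store.set(f"game:checkpoint:{game_id}", payload)
    return now_s


def restore(store, game_id):
    """Called by the replacement server after a crash; None if nothing was saved."""
    raw = store.get(f"game:checkpoint:{game_id}")
    return None if raw is None else json.loads(raw)["state"]
```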

Tournament matches:
  Shadow server receives all inputs in real-time
  On primary crash -> promote shadow -> seamless failover (less than 1s)

Anti-Cheat

Server-authoritative prevents:
  Speed hacks, teleporting, infinite health, item duplication, wall shooting
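
Server-authoritative validation of movement can be sketched as a per-tick distance cap: the server accepts a requested position only if it is reachable at the maximum allowed speed. `MAX_SPEED`, the 5% tolerance, and the clamping policy are hypothetical choices; a real server might reject the input outright instead of clamping.

```python
import math

MAX_SPEED = 7.0       # units per second (hypothetical game constant)
TICK_DT = 1 / 30      # seconds per tick at 30 ticks/sec


def validate_move(prev_pos, requested_pos, dt=TICK_DT, max_speed=MAX_SPEED, tolerance=1.05):
    """Return the position the server accepts for this tick."""
    max_step = max_speed * dt * tolerance   # small slack for float and timing jitter
    dx = requested_pos[0] - prev_pos[0]
    dy = requested_pos[1] - prev_pos[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return requested_pos                # legal move: accept as-is
    # Too far for one tick (speed hack or teleport): clamp along the same direction.
    scale = max_step / dist
    return (prev_pos[0] + dx * scale, prev_pos[1] + dy * scale)
```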

Client-side cheats (harder):
  Wallhack: only send positions of players within view range
  Aimbot: statistical analysis (inhuman accuracy -> flag -> ban)

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
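
An availability target translates directly into an allowed-downtime budget. The helper below is a small sketch of that arithmetic; the 30-day window is an assumption, since the table does not state a measurement window.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Error budget, in minutes of full outage, for a given availability target."""
    return (1.0 - availability) * window_days * 24 * 60


budget_9995 = allowed_downtime_minutes(0.9995)   # the 99.95% target from the table
budget_999 = allowed_downtime_minutes(0.999)
budget_9999 = allowed_downtime_minutes(0.9999)
```

At 99.95% over 30 days the budget is about 21.6 minutes, so a single slow manual failover can consume most of a month's budget.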

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
