Design a Video Conferencing System (like Zoom)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a video conferencing system (like Zoom).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: SFU vs MCU media servers, WebRTC data channels, Bandwidth estimation (REMB/TWCC)?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

SFU vs MCU media servers
WebRTC data channels
Bandwidth estimation (REMB/TWCC)
Simulcast/SVC layers
Screen sharing
Recording pipeline

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the video conferencing system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

One-on-one and group video/audio calls (up to 1,000 participants)
Screen sharing with annotation support
Real-time chat during calls (text, file sharing)
Meeting scheduling with calendar integration
Meeting recording with cloud storage and playback
Virtual backgrounds, noise cancellation
Breakout rooms for large meetings
Waiting room with host admission control
Raise hand, reactions, polls during meetings
Join via browser (WebRTC), desktop app, or phone (PSTN dial-in)
End-to-end encryption for 1:1 calls

Metric	Calculation	Value
Concurrent meetings	Given (peak load assumption)	10M
Avg participants per meeting	Given (typical workload assumption)	5
Concurrent streams	10M meetings × 5 participants	50M
Bandwidth per participant	Given (assumption documented in value)	2 Mbps down + 1.5 Mbps up (720p)
Total bandwidth	50M × 3.5 Mbps	175 Tbps → Edge distribution critical
Recording storage / day	5M recorded meetings (assumed) × 30 min × 500 MB/hr	1.25 PB/day
Signaling messages / sec	10M meetings × 2 signals/sec	20M/sec
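
The estimates above can be re-derived in a few lines, which makes it easy to change an assumption mid-interview. This is a minimal sketch: every constant is an input taken from the table, and the function and key names are illustrative only.

```python
# Back-of-envelope capacity math. All inputs are assumptions from the table above.
CONCURRENT_MEETINGS = 10_000_000
AVG_PARTICIPANTS = 5
MBPS_PER_PARTICIPANT = 2.0 + 1.5      # 2 Mbps down + 1.5 Mbps up (720p)
RECORDED_MEETINGS_PER_DAY = 5_000_000
AVG_RECORDING_MINUTES = 30
RECORDING_MB_PER_HOUR = 500
SIGNALS_PER_MEETING_PER_SEC = 2


def capacity_estimates() -> dict:
    """Return the derived rows of the capacity table."""
    streams = CONCURRENT_MEETINGS * AVG_PARTICIPANTS
    total_tbps = streams * MBPS_PER_PARTICIPANT / 1_000_000          # Mbps -> Tbps
    recording_hours = RECORDED_MEETINGS_PER_DAY * AVG_RECORDING_MINUTES / 60
    storage_pb = recording_hours * RECORDING_MB_PER_HOUR / 1_000_000_000  # MB -> PB
    signals_per_sec = CONCURRENT_MEETINGS * SIGNALS_PER_MEETING_PER_SEC
    return {
        "concurrent_streams": streams,
        "total_bandwidth_tbps": total_tbps,
        "recording_storage_pb_per_day": storage_pb,
        "signaling_msgs_per_sec": signals_per_sec,
    }


if __name__ == "__main__":
    for name, value in capacity_estimates().items():
        print(f"{name}: {value}")
```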

Participants connect to signaling servers (WebSocket) for session management, then exchange media with SFU (Selective Forwarding Unit) servers via SRTP. TURN servers relay media when NAT traversal fails. Recording bots join as invisible participants.


Why SFU Over MCU and Mesh?

Topology	Pros	Cons	Use When
Mesh	No server cost, lowest latency for 1:1	N² streams, unscalable past 4-5 users	1:1 calls
MCU	Client uploads 1, downloads 1 stream	Massive server CPU (decode+encode), no per-client quality	Legacy telephony
SFU ✅	Low server CPU (no decode), per-client quality selection, simulcast	More downstream bandwidth than MCU	Group calls (3+ participants)

Zoom, Google Meet, and Teams all rely on SFU-style forwarding for three reasons: the server forwards packets without decoding them (on the order of 10× less CPU than an MCU), simulcast allows per-client quality adaptation, and the SFU can forward only the visible or speaking participants.
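
To make the trade-off concrete, the per-client stream counts for each topology can be written down directly. This is a simplified sketch: the function name and the returned keys are hypothetical, and it counts one upload per sender, ignoring the extra simulcast encodings an SFU client may send.

```python
def stream_counts(topology: str, n: int) -> dict:
    """Per-client stream counts and server decode work for an n-party call."""
    if n < 2:
        raise ValueError("a call needs at least 2 participants")
    if topology == "mesh":
        # Every client sends to and receives from every other client.
        return {"up": n - 1, "down": n - 1, "server_decodes": 0}
    if topology == "mcu":
        # Server decodes all n inputs and mixes them into one output per client.
        return {"up": 1, "down": 1, "server_decodes": n}
    if topology == "sfu":
        # Server forwards packets untouched; each client receives the other n - 1.
        return {"up": 1, "down": n - 1, "server_decodes": 0}
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for topo in ("mesh", "mcu", "sfu"):
        print(topo, stream_counts(topo, 5))
```

The mesh row shows why the approach stops scaling at 4-5 users: upload cost grows with every participant, while the SFU keeps upload flat and pushes the cost to the downlink.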

Simulcast & SVC

Simulcast (used by most): Sender encodes 3 independent streams (High: 1080p @ 2.5 Mbps, Medium: 720p @ 1 Mbps, Low: 180p @ 150 Kbps). SFU selects per receiver: active speaker gets High from speaker, Low from others; gallery view gets Medium from all; mobile on 3G gets Low.
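
The per-receiver selection described above might look like the following sketch. The layer bitrates come from the text; the function name, the `view` values, and the idea of a per-stream bandwidth budget are assumptions made for illustration.

```python
# Simulcast layers from the text, ordered best to worst: (name, bitrate in kbps).
LAYERS = [("high", 2500), ("medium", 1000), ("low", 150)]


def pick_layer(view: str, is_active_speaker: bool, budget_kbps: int) -> str:
    """Choose which simulcast layer of one sender to forward to one receiver."""
    if view == "speaker":
        desired = "high" if is_active_speaker else "low"
    elif view == "gallery":
        desired = "medium"
    else:
        raise ValueError(f"unknown view: {view}")

    names = [name for name, _ in LAYERS]
    # Start at the desired layer and step down until one fits the budget.
    for name, kbps in LAYERS[names.index(desired):]:
        if kbps <= budget_kbps:
            return name
    return "low"  # floor: never go below the lowest layer


if __name__ == "__main__":
    print(pick_layer("speaker", True, 5000))
    print(pick_layer("gallery", False, 300))
```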

SVC: Single layered stream (base + enhancement layers). SFU drops upper layers for constrained receivers. Advantage: single encode, seamless quality transitions. Used by Google Meet (VP9-SVC), Zoom (newer versions).
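
With SVC the SFU's decision reduces to how many layers to forward. A minimal sketch, assuming each layer's incremental bitrate is known; the function name and the example bitrates are hypothetical.

```python
from typing import Sequence


def svc_layers_to_forward(layer_kbps: Sequence[int], budget_kbps: int) -> int:
    """Return how many layers (base first) fit in the receiver's budget.

    layer_kbps[0] is the base layer; later entries are the incremental cost of
    each enhancement layer. The base layer is always forwarded.
    """
    total = 0
    kept = 0
    for index, kbps in enumerate(layer_kbps):
        if index > 0 and total + kbps > budget_kbps:
            break  # dropping this layer also drops everything above it
        total += kbps
        kept += 1
    return kept


if __name__ == "__main__":
    # Hypothetical layers: 150 kbps base, +850 kbps, +1500 kbps.
    print(svc_layers_to_forward([150, 850, 1500], 1200))
```

Because upper layers depend on lower ones, the loop stops at the first layer that does not fit instead of searching for a better combination.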

SFU Architecture & Cascading

Each meeting has a primary SFU and a hot standby. The hot standby receives the participant list and ICE candidates from signaling. On primary failure, signaling detects the missed heartbeats (within 3 sec) and tells clients to reconnect to the standby; clients then perform an ICE restart (~1-2 sec).
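
The detection half of that failover can be sketched as a small monitor on the signaling side. This is an illustration under assumptions: the class name, the `reconnect` message shape, and the 3-second timeout default are hypothetical, and time is passed in explicitly so the logic is easy to test.

```python
from typing import Optional


class SfuFailoverMonitor:
    """Tracks primary SFU heartbeats and promotes the standby on timeout."""

    def __init__(self, primary: str, standby: Optional[str], timeout_s: float = 3.0):
        self.active = primary
        self.standby = standby
        self.timeout_s = timeout_s
        self.last_heartbeat = 0.0

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def check(self, now: float) -> Optional[dict]:
        """Return a message to broadcast to clients if failover just happened."""
        if self.standby is None or now - self.last_heartbeat < self.timeout_s:
            return None
        # Promote the standby; there is no further standby until one is assigned.
        self.active, self.standby = self.standby, None
        self.last_heartbeat = now
        return {"type": "reconnect", "sfu": self.active, "ice_restart": True}


if __name__ == "__main__":
    monitor = SfuFailoverMonitor("sfu-a", "sfu-b")
    monitor.on_heartbeat(10.0)
    print(monitor.check(12.0))
    print(monitor.check(13.5))
```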

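Multi-region cascading: for a meeting with participants in US-East, EU-West, and APAC, SFU-US-East, SFU-EU-West, and SFU-APAC each fan out locally, and each stream crosses a region boundary once via SFU-to-SFU relay. Bandwidth: 30 Mbps inter-region vs 870 Mbps if every receiver pulled every stream directly (29× saving). These figures are consistent with 30 participants sending 1 Mbps each: 30 streams crossing once is 30 Mbps, while 30 × 29 direct streams is 870 Mbps.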
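
A minimal sketch of that arithmetic, assuming 30 participants at 1 Mbps each (the inputs that reproduce the 30 and 870 Mbps figures). It keeps the text's simplification that a cascaded stream crosses a region boundary once, and treats the direct case as an upper bound where every receiver is remote; the function name is illustrative.

```python
def inter_region_mbps(participants: int, mbps_per_stream: float, cascaded: bool) -> float:
    """Estimate inter-region traffic for one meeting."""
    if cascaded:
        # Each sender's stream crosses a region boundary once via SFU-to-SFU relay.
        return participants * mbps_per_stream
    # Upper bound: every receiver pulls every other participant's stream directly.
    return participants * (participants - 1) * mbps_per_stream


if __name__ == "__main__":
    relay = inter_region_mbps(30, 1.0, cascaded=True)
    direct = inter_region_mbps(30, 1.0, cascaded=False)
    print(relay, direct, direct / relay)
```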

Network Quality Adaptation

Client monitors packet loss, RTT, and available bandwidth, and steps quality down as loss rises:

Loss < 2% → 1080p 30fps
Loss 2-5% → 720p
Loss 5-10% → 360p 15fps
Loss > 10% → audio-only

FEC: send redundant packets (about 10% overhead) so the receiver can recover losses without retransmission.
NACK: request retransmission from the SFU's packet cache (only useful if RTT < 100ms; beyond that the retransmitted packet arrives too late to play).
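
The loss thresholds translate directly into a lookup. This sketch assumes the boundary values (exactly 2%, 5%, 10%) fall into the lower-quality band up to and including 10%, which the text leaves unspecified; the function names and the quality labels are illustrative.

```python
def target_quality(loss_pct: float) -> str:
    """Map observed packet loss (percent) to a target send quality."""
    if loss_pct < 2:
        return "1080p30"
    if loss_pct < 5:
        return "720p"
    if loss_pct <= 10:
        return "360p15"
    return "audio-only"


def nack_worthwhile(rtt_ms: float) -> bool:
    """NACK only helps when the retransmission can arrive before playout."""
    return rtt_ms < 100


if __name__ == "__main__":
    for loss in (1.0, 3.0, 7.0, 12.0):
        print(loss, target_quality(loss))
```

A production client would add hysteresis so quality does not flap when loss hovers around a threshold.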

Large Meeting Optimization (1000+ participants)

Tiered architecture:

Tier 1, Speakers (5-10) → full SFU path, send + receive video
Tier 2, Active participants → receive video over WebRTC, can unmute
Tier 3, Viewers → receive-only via CDN (HLS, 5-10s delay)

In the extreme case where everyone except the speakers is a viewer, this reduces 1000 WebRTC connections to ~10 WebRTC + 990 CDN viewers.
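
The connection budget for a tiered meeting can be sketched as a simple split. The function and key names are hypothetical, and the split between tiers is an input rather than something the text prescribes.

```python
def plan_large_meeting(total: int, speakers: int, active: int) -> dict:
    """Split participants across the three tiers and their transports."""
    if speakers + active > total:
        raise ValueError("speakers + active participants exceed total")
    viewers = total - speakers - active
    return {
        "webrtc_sendrecv": speakers,   # Tier 1: speakers on the SFU
        "webrtc_recvonly": active,     # Tier 2: can be promoted when they unmute
        "cdn_hls": viewers,            # Tier 3: receive-only, 5-10s behind live
    }


if __name__ == "__main__":
    print(plan_large_meeting(total=1000, speakers=10, active=0))
```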

REST APIs

POST   /api/meetings                        → Create/schedule meeting
GET    /api/meetings/{meeting_id}           → Get meeting details
POST   /api/meetings/{meeting_id}/join      → Join meeting (get SFU endpoint)
POST   /api/meetings/{meeting_id}/record    → Start/stop recording
POST   /api/meetings/{meeting_id}/breakout  → Create breakout rooms

WebSocket Signaling Protocol

JSON

// Join meeting
{ "type": "join", "meeting_id": "m_123", "token": "jwt...", "media": {"audio": true, "video": true} }

// SDP Offer/Answer (WebRTC handshake)
{ "type": "offer", "sdp": "v=0...", "target": "sfu" }
{ "type": "answer", "sdp": "...", "from": "sfu" }

// ICE candidate exchange
{ "type": "ice_candidate", "candidate": "candidate:... udp ...", "sdpMid": "0" }

// Meeting controls
{ "type": "mute", "target_user": "u_456", "media": "audio" }
{ "type": "raise_hand" }
{ "type": "screen_share_start", "stream_id": "screen_1" }

// Participant events (server → client)
{ "type": "participant_joined", "user": {"id": "u_789", "name": "Alice"} }
{ "type": "active_speaker", "user_id": "u_456" }
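
On the server, each client-to-server message would typically be validated before it is routed. A minimal sketch in Python, assuming the required fields are exactly those shown in the samples above; the `REQUIRED_FIELDS` table and `parse_signal` name are illustrative, not part of any real signaling library.

```python
import json

# Required fields per client -> server message type, taken from the samples above.
REQUIRED_FIELDS = {
    "join": {"meeting_id", "token", "media"},
    "offer": {"sdp", "target"},
    "answer": {"sdp", "from"},
    "ice_candidate": {"candidate", "sdpMid"},
    "mute": {"target_user", "media"},
    "raise_hand": set(),
    "screen_share_start": {"stream_id"},
}


def parse_signal(raw: str) -> dict:
    """Parse one signaling frame and reject unknown or incomplete messages."""
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        raise ValueError("signaling message must be a JSON object")
    msg_type = msg.get("type")
    if msg_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown message type: {msg_type!r}")
    missing = REQUIRED_FIELDS[msg_type] - set(msg)
    if missing:
        raise ValueError(f"{msg_type} missing fields: {sorted(missing)}")
    return msg


if __name__ == "__main__":
    print(parse_signal('{"type": "mute", "target_user": "u_456", "media": "audio"}'))
```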

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
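440 Login Timeout (nonstandard, IIS-specific): session expired; re-authenticate. An expired WebSocket session is normally signaled with a close code (application-defined codes use the 4000-4999 range) rather than an HTTP status; reconnect required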
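
A client-side retry policy for these responses might look like the sketch below. Which statuses count as retryable, the base delay, and the cap are assumptions chosen for illustration; the function name is hypothetical. It uses exponential backoff with full jitter and prefers the server's Retry-After value when one is supplied.

```python
import random
from typing import Callable, Optional

RETRYABLE = {429, 500, 503}  # assumption: everything else is a caller error


def retry_delay_s(
    status: int,
    attempt: int,
    retry_after_s: Optional[float] = None,
    base_s: float = 0.5,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the call should not be retried."""
    if status not in RETRYABLE:
        return None
    if retry_after_s is not None:
        return float(retry_after_s)  # the server's hint wins over local backoff
    backoff = min(cap_s, base_s * 2 ** attempt)
    return backoff * rng()  # full jitter spreads out retry storms


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, retry_delay_s(503, attempt))
```

Retries of write requests should carry the same idempotency key so the server can deduplicate them.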

PostgreSQL

SQL

CREATE TABLE meetings (
    meeting_id      UUID PRIMARY KEY,
    host_id         UUID NOT NULL,
    title           TEXT,
    scheduled_start TIMESTAMPTZ,
    actual_start    TIMESTAMPTZ,
    actual_end      TIMESTAMPTZ,
    status          TEXT DEFAULT 'scheduled',
    password        TEXT,
    settings        JSONB,
    max_participants INT DEFAULT 100,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE meeting_participants (
    meeting_id   UUID REFERENCES meetings(meeting_id),
    user_id      UUID,
    display_name TEXT,
    join_time    TIMESTAMPTZ,
    leave_time   TIMESTAMPTZ,
    role         TEXT DEFAULT 'participant',
    PRIMARY KEY (meeting_id, user_id, join_time)
);

Redis: Active Meeting State

HSET meeting:active:{meeting_id} host "u_123" sfu_endpoint "sfu-us-east-3.zoom.us:443" participant_count 15
ZADD meeting:participants:{meeting_id} {join_timestamp} {user_id}
SET turn:alloc:{user_id}:{meeting_id} "turn-server-17:3478" EX 3600

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
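
An availability target only becomes actionable once it is converted into an error budget. A minimal sketch, assuming a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return observed_error_rate / (1 - slo)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.9995), 1))   # budget for 99.95% over 30 days
    print(round(burn_rate(0.005, 0.9995), 1))       # 0.5% errors against that SLO
```

A burn rate well above 1 sustained over a short window is the usual trigger for paging, while a rate near 1 means the budget lasts exactly the window.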

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
