Interview Prompt
Design Video Conferencing System (like Zoom).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: SFU vs MCU media servers, WebRTC data channels, Bandwidth estimation (REMB/TWCC)? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- SFU vs MCU media servers
- WebRTC data channels
- Bandwidth estimation (REMB/TWCC)
- Simulcast/SVC layers
- Screen sharing
- Recording pipeline
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the video conferencing system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- One-on-one and group video/audio calls (up to 1,000 participants)
- Screen sharing with annotation support
- Real-time chat during calls (text, file sharing)
- Meeting scheduling with calendar integration
- Meeting recording with cloud storage and playback
- Virtual backgrounds, noise cancellation
- Breakout rooms for large meetings
- Waiting room with host admission control
- Raise hand, reactions, polls during meetings
- Join via browser (WebRTC), desktop app, or phone (PSTN dial-in)
- End-to-end encryption for 1:1 calls
- Ultra-Low Latency: < 150ms glass-to-glass for acceptable experience; < 400ms tolerable
- High Availability: 99.99% target, because outages during calls are catastrophic
- Scalability: 100M+ concurrent users, 10M+ concurrent meetings
- Adaptive Quality: Gracefully degrade on poor networks
- Global: Edge servers in 50+ regions to minimize round-trip
- Reliability: No dropped frames under normal conditions; reconnect within 2 seconds
- Security: E2E encryption for 1:1 calls; DTLS-SRTP for group-call media (hop-by-hop, terminated at the SFU) and TLS for signaling
| Metric | Calculation | Value |
|---|---|---|
| Concurrent meetings | Given (peak load assumption) | 10M |
| Avg participants per meeting | Given (typical workload assumption) | 5 |
| Concurrent streams | 10M meetings × 5 participants | 50M |
| Bandwidth per participant | Given (assumption documented in value) | 2 Mbps down + 1.5 Mbps up (720p) |
| Total bandwidth | 50M × 3.5 Mbps | 175 Tbps → Edge distribution critical |
| Recording storage / day | 5M meetings × 30 min × 500 MB/hr | 1.25 PB/day |
| Signaling messages / sec | 10M meetings × 2 signals/sec | 20M/sec |
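The arithmetic behind the table can be checked with a short script. This is a minimal sketch: every input is one of the table's stated assumptions, and the constant and function names are illustrative only.

```python
# Back-of-envelope capacity math. All inputs are the table's assumptions.
CONCURRENT_MEETINGS = 10_000_000
AVG_PARTICIPANTS = 5
MBPS_PER_PARTICIPANT = 2.0 + 1.5        # 2 Mbps down + 1.5 Mbps up (720p)
RECORDED_MEETINGS_PER_DAY = 5_000_000
AVG_MEETING_HOURS = 0.5                 # 30 minutes
RECORDING_MB_PER_HOUR = 500
SIGNALS_PER_MEETING_PER_SEC = 2


def capacity_estimate() -> dict:
    """Return the derived rows of the capacity table."""
    streams = CONCURRENT_MEETINGS * AVG_PARTICIPANTS
    # 1 Tbps = 1,000,000 Mbps
    total_tbps = streams * MBPS_PER_PARTICIPANT / 1_000_000
    # 1 PB = 1,000,000,000 MB (decimal units)
    storage_pb = (RECORDED_MEETINGS_PER_DAY * AVG_MEETING_HOURS
                  * RECORDING_MB_PER_HOUR) / 1_000_000_000
    signals_per_sec = CONCURRENT_MEETINGS * SIGNALS_PER_MEETING_PER_SEC
    return {
        "concurrent_streams": streams,
        "total_bandwidth_tbps": total_tbps,
        "recording_storage_pb_per_day": storage_pb,
        "signaling_msgs_per_sec": signals_per_sec,
    }


if __name__ == "__main__":
    for name, value in capacity_estimate().items():
        print(f"{name}: {value:,}")
```

Keeping the inputs as named constants makes it easy to rerun the estimate when the interviewer changes an assumption, such as average meeting size.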
Participants connect to signaling servers (WebSocket) for session management, then exchange media with SFU (Selective Forwarding Unit) servers via SRTP. TURN servers relay media when NAT traversal fails. Recording bots join as invisible participants.
Why SFU Over MCU and Mesh?
| Topology | Pros | Cons | Use When |
|---|---|---|---|
| Mesh | No server cost, lowest latency for 1:1 | N² streams, unscalable past 4-5 users | 1:1 calls |
| MCU | Client uploads 1, downloads 1 stream | Massive server CPU (decode+encode), no per-client quality | Legacy telephony |
| SFU ✅ | Low server CPU (no decode), per-client quality selection, simulcast | More downstream bandwidth than MCU | 2+ participant calls |
Zoom, Google Meet, and Teams all use SFUs because the server doesn't decode media (roughly 10× less CPU than an MCU), simulcast allows per-client quality adaptation, and the SFU can selectively forward only visible or speaking participants.
Simulcast & SVC
Simulcast (used by most): Sender encodes 3 independent streams (High: 1080p @ 2.5 Mbps, Medium: 720p @ 1 Mbps, Low: 180p @ 150 Kbps). SFU selects per receiver: active speaker gets High from speaker, Low from others; gallery view gets Medium from all; mobile on 3G gets Low.
SVC: Single layered stream (base + enhancement layers). SFU drops upper layers for constrained receivers. Advantage: single encode, seamless quality transitions. Used by Google Meet (VP9-SVC), Zoom (newer versions).
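The per-receiver selection described above can be sketched as a small function. This assumes the three simulcast layers and bitrates quoted in this section; the `Layer` type, the function names, and the budget-splitting policy (give every non-speaker the Low layer, then spend what is left on the active speaker) are hypothetical simplifications, not a real SFU's algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Layer:
    name: str
    height: int
    kbps: int


# Layers from the text, ordered from highest to lowest quality.
LAYERS = [
    Layer("high", 1080, 2500),
    Layer("medium", 720, 1000),
    Layer("low", 180, 150),
]


def pick_layer(budget_kbps: int) -> Layer:
    """Best layer that fits the budget; fall back to the lowest layer."""
    for layer in LAYERS:
        if layer.kbps <= budget_kbps:
            return layer
    return LAYERS[-1]


def plan_speaker_view(senders: List[str], active_speaker: str,
                      receiver_kbps: int) -> Dict[str, Layer]:
    """Speaker view: Low from everyone else, best affordable from the speaker."""
    low = LAYERS[-1]
    others = [s for s in senders if s != active_speaker]
    plan = {s: low for s in others}
    remaining = receiver_kbps - low.kbps * len(others)
    if active_speaker in senders:
        plan[active_speaker] = pick_layer(remaining)
    return plan


if __name__ == "__main__":
    plan = plan_speaker_view(["alice", "bob", "carol"], "alice", 3000)
    print({user: layer.name for user, layer in plan.items()})
```

A production SFU would also smooth layer switches over time to avoid oscillating when the bandwidth estimate hovers near a layer boundary.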
SFU Architecture & Cascading
Each meeting has a primary SFU and hot-standby. Hot-standby receives participant list + ICE candidates from signaling. On primary failure: signaling detects (heartbeat miss, < 3 sec), signals clients to reconnect to backup, clients perform ICE restart (~1-2 sec).
Multi-region cascading: Meeting with participants in US-East, EU-West, APAC → SFU-US-East, SFU-EU-West, SFU-APAC each fan-out locally. Each stream crosses region ONCE via SFU-to-SFU relay. Bandwidth: 30 Mbps inter-region vs 870 Mbps if all direct (29× saving).
Network Quality Adaptation
Client monitors packet loss, RTT, and available bandwidth. Loss < 2% → 1080p 30fps. Loss 2-5% → 720p. Loss 5-10% → 360p 15fps. Loss > 10% → audio-only. FEC: Send redundant packets (10% overhead) for recovery without retransmission. NACK: Request retransmission from SFU jitter buffer (only useful if RTT < 100ms).
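The loss thresholds above map directly onto a lookup function. A minimal sketch: the tier labels and function names are illustrative, and treating exactly 10% loss as still eligible for 360p is an assumption, since the text leaves that boundary open.

```python
def quality_for_loss(loss_pct: float) -> str:
    """Map measured packet loss (percent) to a target quality tier."""
    if loss_pct < 2:
        return "1080p30"
    if loss_pct < 5:
        return "720p"
    if loss_pct <= 10:   # assumption: exactly 10% still gets 360p
        return "360p15"
    return "audio-only"


def nack_useful(rtt_ms: float) -> bool:
    """Retransmission only helps when the round trip is under 100 ms."""
    return rtt_ms < 100


if __name__ == "__main__":
    for loss in (0.5, 3.0, 7.5, 12.0):
        print(f"loss {loss}% -> {quality_for_loss(loss)}")
```

In practice the client would likely add hysteresis, for example requiring several consecutive good samples before stepping back up, so that quality does not flap around a threshold.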
Large Meeting Optimization (1000+ participants)
Tiered architecture: Tier 1 Speakers (5-10) → Full SFU mesh, send+receive video. Tier 2 Active participants → Receive video, can unmute. Tier 3 Viewers → Receive-only via CDN (HLS, 5-10s delay). Reduces from 1000 WebRTC connections to ~10 WebRTC + 990 CDN viewers.
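One way to picture the tiering is as a role-to-transport assignment. The sketch below is hypothetical: the role names (`speaker`, `active`, `viewer`), the transport labels, and the rule that speakers beyond the cap drop to Tier 2 are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict

# Transport used by each tier, following the tiers described above.
TRANSPORT = {1: "webrtc-sendrecv", 2: "webrtc-recv-video", 3: "hls-cdn"}


def plan_meeting(roles: Dict[str, str], max_speakers: int = 10) -> Dict[str, int]:
    """Assign each user to tier 1, 2, or 3 based on a role string."""
    tiers = {}
    speakers = 0
    for user_id, role in roles.items():
        if role == "speaker" and speakers < max_speakers:
            tiers[user_id] = 1
            speakers += 1
        elif role in ("speaker", "active"):
            tiers[user_id] = 2   # overflow speakers still get WebRTC video
        else:
            tiers[user_id] = 3   # passive viewers go through the CDN
    return tiers


def transport_counts(tiers: Dict[str, int]) -> Counter:
    """Count connections per transport type."""
    return Counter(TRANSPORT[t] for t in tiers.values())


if __name__ == "__main__":
    roles = {f"s{i}": "speaker" for i in range(8)}
    roles.update({f"a{i}": "active" for i in range(40)})
    roles.update({f"v{i}": "viewer" for i in range(952)})
    print(transport_counts(plan_meeting(roles)))
```

The counts make the saving visible: only Tier 1 and Tier 2 users hold WebRTC connections, while the bulk of the audience is served from the CDN.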
REST APIs
POST /api/meetings → Create/schedule meeting
GET /api/meetings/{meeting_id} → Get meeting details
POST /api/meetings/{meeting_id}/join → Join meeting (get SFU endpoint)
POST /api/meetings/{meeting_id}/record → Start/stop recording
POST /api/meetings/{meeting_id}/breakout → Create breakout rooms
WebSocket Signaling Protocol
// Join meeting
{ "type": "join", "meeting_id": "m_123", "token": "jwt...", "media": {"audio": true, "video": true} }
// SDP Offer/Answer (WebRTC handshake)
{ "type": "offer", "sdp": "v=0...", "target": "sfu" }
{ "type": "answer", "sdp": "...", "from": "sfu" }
// ICE candidate exchange
{ "type": "ice_candidate", "candidate": "candidate:... udp ...", "sdpMid": "0" }
// Meeting controls
{ "type": "mute", "target_user": "u_456", "media": "audio" }
{ "type": "raise_hand" }
{ "type": "screen_share_start", "stream_id": "screen_1" }
// Participant events (server → client)
{ "type": "participant_joined", "user": {"id": "u_789", "name": "Alice"} }
{ "type": "active_speaker", "user_id": "u_456" }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 440 Login Timeout: WebSocket session expired; reconnect required
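The retry guidance in these error responses (honor Retry-After on 429, back off exponentially on 503) could look roughly like this on the client. It is a sketch under assumptions: the function name, the base and cap values, and the choice of full jitter are illustrative, Retry-After is assumed to arrive as a number of seconds, and 409 handling is left to the caller because it depends on the idempotency key.

```python
import random
from typing import Callable, Optional

# Status codes this sketch retries automatically.
RETRYABLE = {429, 500, 503}


def retry_delay(status: int, attempt: int,
                retry_after: Optional[float] = None,
                base: float = 0.5, cap: float = 30.0,
                rng: Callable[[], float] = random.random) -> Optional[float]:
    """Seconds to wait before the next attempt, or None if not retryable.

    attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    if status not in RETRYABLE:
        return None
    if status == 429 and retry_after is not None:
        return float(retry_after)            # the server told us how long to wait
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling                   # full jitter spreads out retries


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, retry_delay(503, attempt))
```

Jitter matters here because many clients in the same meeting tend to fail at the same moment; without it they would all retry in lockstep.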
PostgreSQL
CREATE TABLE meetings (
meeting_id UUID PRIMARY KEY,
host_id UUID NOT NULL,
title TEXT,
scheduled_start TIMESTAMPTZ,
actual_start TIMESTAMPTZ,
actual_end TIMESTAMPTZ,
status TEXT DEFAULT 'scheduled',
password_hash TEXT, -- store a salted hash of the meeting passcode, never plaintext
settings JSONB,
max_participants INT DEFAULT 100,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE meeting_participants (
meeting_id UUID REFERENCES meetings(meeting_id),
user_id UUID,
display_name TEXT,
join_time TIMESTAMPTZ,
leave_time TIMESTAMPTZ,
role TEXT DEFAULT 'participant',
PRIMARY KEY (meeting_id, user_id, join_time)
);
Redis: Active Meeting State
HSET meeting:active:{meeting_id} host "u_123" sfu_endpoint "sfu-us-east-3.zoom.us:443" participant_count 15
ZADD meeting:participants:{meeting_id} {join_timestamp} {user_id}
SET turn:alloc:{user_id}:{meeting_id} "turn-server-17:3478" EX 3600
SFU Failure Recovery
Each meeting has a primary SFU and hot-standby. On primary failure: signaling detects (heartbeat miss, < 3 sec), signals all clients to reconnect to backup, clients perform ICE restart (~1-2s). "Warm" SFU optimization: standby already has ICE candidates cached → skip full ICE negotiation → reconnect in < 1 sec.
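The detection step of this recovery flow can be sketched as a heartbeat monitor on the signaling side. Everything here is hypothetical scaffolding: the class name, the in-memory dictionaries, and the explicit `now` argument (passed in so the logic is testable without a real clock) are assumptions. The sketch covers only detection and standby promotion, not the client-side ICE restart.

```python
from typing import Dict, Optional, Tuple


class SfuHealthMonitor:
    """Tracks SFU heartbeats and promotes standbys when a primary goes silent."""

    def __init__(self, timeout_s: float = 3.0):
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}
        # meeting_id -> (primary SFU, standby SFU or None)
        self.assignments: Dict[str, Tuple[str, Optional[str]]] = {}

    def assign(self, meeting_id: str, primary: str, standby: Optional[str]) -> None:
        self.assignments[meeting_id] = (primary, standby)

    def heartbeat(self, sfu_id: str, now: float) -> None:
        self.last_seen[sfu_id] = now

    def _alive(self, sfu_id: str, now: float) -> bool:
        return now - self.last_seen.get(sfu_id, float("-inf")) <= self.timeout_s

    def failovers(self, now: float) -> Dict[str, str]:
        """Return {meeting_id: new_primary} for meetings that must move."""
        moved = {}
        for meeting_id, (primary, standby) in list(self.assignments.items()):
            if self._alive(primary, now):
                continue
            if standby is not None and self._alive(standby, now):
                moved[meeting_id] = standby
                # A fresh standby would be chosen out of band.
                self.assignments[meeting_id] = (standby, None)
        return moved


if __name__ == "__main__":
    mon = SfuHealthMonitor()
    mon.assign("m_123", "sfu-a", "sfu-b")
    mon.heartbeat("sfu-a", 0.0)
    mon.heartbeat("sfu-b", 2.5)
    print(mon.failovers(4.0))
```

For each entry returned by `failovers`, signaling would then tell the meeting's clients to reconnect to the new primary.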
End-to-End Encryption (E2E)
Challenge: the SFU needs RTP headers for routing but shouldn't decrypt media. Solution: use the Insertable Streams API with SFrame-style frame encryption, where the client encrypts each encoded media frame before RTP packetization and the RTP headers stay readable for routing. Key exchange runs via Diffie-Hellman over the signaling channel. Limitations: it is practical only for small meetings, recording is impossible without a trusted recorder, and server-side features must move to the client.
Recording Architecture
Recording bot joins as invisible participant on SFU, receives all streams, performs real-time compositing (active speaker layout or gallery view), writes chunks to local SSD. On meeting end: upload to S3, trigger transcoding, generate searchable transcript via speech-to-text.
Noise Cancellation & Virtual Background
Run ML models locally on client (RNNoise for audio, MediaPipe for segmentation). Client-side processing: no server involvement, privacy preserved. GPU acceleration via WebGPU/WebGL.
Why WebRTC Specifically?
WebRTC provides NAT traversal (ICE/STUN/TURN), encrypted media (SRTP/DTLS), adaptive bitrate (built-in bandwidth estimation), codec negotiation (H.264/VP8/VP9/AV1 via SDP), browser-native support (no plugins), and sub-second latency. Alternative: a custom UDP protocol (Zoom's original approach), which offers even lower latency but doesn't work in browsers without a plugin.
Meeting Quality Monitoring
Every 5 seconds, each client reports packet loss, jitter, RTT, resolution, CPU usage, encode time, and available bandwidth. Backend aggregates in ClickHouse for real-time dashboards, alerts (if > 20% participants have poor quality), regional analysis, and historical 95th percentile quality by region.
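The alerting rule and the percentile view can be illustrated in a few lines. This is a sketch only: the report fields, the thresholds that define "poor" quality (5% loss, 300 ms RTT), and the nearest-rank percentile method are assumptions, and a real deployment would run the aggregation as ClickHouse queries rather than in application code.

```python
from typing import Dict, List


def is_poor(report: Dict[str, float],
            max_loss_pct: float = 5.0, max_rtt_ms: float = 300.0) -> bool:
    """Hypothetical definition of a poor-quality client report."""
    return report["loss_pct"] > max_loss_pct or report["rtt_ms"] > max_rtt_ms


def should_alert(reports: List[Dict[str, float]],
                 poor_fraction: float = 0.20) -> bool:
    """Alert when more than 20% of participants report poor quality."""
    if not reports:
        return False
    poor = sum(1 for r in reports if is_poor(r))
    return poor / len(reports) > poor_fraction


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 of empty list")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    good = {"loss_pct": 0.5, "rtt_ms": 40.0}
    bad = {"loss_pct": 8.0, "rtt_ms": 120.0}
    print(should_alert([good] * 7 + [bad] * 3))
    print(p95([float(v) for v in range(1, 101)]))
```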
Interview Walkthrough
- Separate signaling (WebSocket over TCP, must be reliable) from media transport (UDP/SRTP, loss-tolerant) — interviewers expect this split immediately.
- Recommend SFU over mesh and MCU for 2+ participants: server forwards packets without decode/encode, enabling simulcast and per-client quality selection.
- Explain simulcast: sender encodes 3 layers (1080p/720p/180p); SFU forwards the appropriate layer per receiver based on bandwidth estimation.
- Regional SFU cascading for global meetings — each stream crosses a region boundary once via SFU-to-SFU relay, not N² direct connections.
- Large meetings (1000+): tier active speakers on WebRTC, route passive viewers through CDN (HLS) to cap connection count.
- Network adaptation: packet loss thresholds trigger resolution downgrade; FEC and NACK for recovery; audio-only fallback above 10% loss.
- Common pitfall: choosing MCU because clients upload only one stream — server must decode and re-encode N streams, making CPU the bottleneck at scale.
P2P (Mesh) vs SFU vs MCU: Media Routing Architecture
P2P (Mesh): Zero server cost, true E2E encryption. Client upload = N-1 streams. Fails at > 4-5 participants (bandwidth explosion). Use for 1:1 calls.
SFU ⭐: Client upload = 1 stream (fixed). Server just forwards packets (no transcoding). Near-realtime (< 150ms). Client download still scales with N (at 100 participants → 99 streams). Use for 2-1000 participants.
MCU: Client download = 1 stream (works on weak devices). Server compute: N decoders + N encoders per meeting (EXPENSIVE). Adds 500-1000ms latency. Use for PSTN integration and weak-client meetings.
Zoom's hybrid: < 25 participants → SFU (high quality). > 25 participants → MCU for compositing. Large webinars → MCU for panelists, SFU for attendees.
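The scaling behaviour of the three topologies reduces to a stream count per client. A minimal sketch, using the simplification from the comparison above that an SFU client uploads one stream (simulcast layers are not counted separately); the function names are illustrative.

```python
from typing import Tuple


def streams_per_client(topology: str, n: int) -> Tuple[int, int]:
    """Return (upload_streams, download_streams) for one client in an n-person call."""
    if n < 2:
        raise ValueError("a call needs at least 2 participants")
    if topology == "mesh":
        return (n - 1, n - 1)   # send to and receive from every peer
    if topology == "sfu":
        return (1, n - 1)       # one upload, server fans out
    if topology == "mcu":
        return (1, 1)           # server mixes everything into one stream
    raise ValueError(f"unknown topology: {topology}")


def server_transcodes(topology: str, n: int) -> int:
    """Decoders plus encoders the server runs per meeting."""
    return 2 * n if topology == "mcu" else 0


if __name__ == "__main__":
    for topo in ("mesh", "sfu", "mcu"):
        print(topo, streams_per_client(topo, 100), server_transcodes(topo, 100))
```

At 100 participants the SFU row shows the trade-off plainly: upload stays at one stream while download grows to 99, which is why selective forwarding and simulcast matter.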
UDP vs TCP for Media Transport
TCP retransmits lost packets, which blocks subsequent packets → audio/video FREEZES for 100-300ms.
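UDP with custom loss handling: apply FEC or simply skip the loss. A lost video packet typically damages only part of one frame, which the decoder conceals; because later frames reference it, the artifact can linger until a retransmission or keyframe repairs it. A lost audio packet (20 ms) is filled in by packet loss concealment (PLC) and is usually inaudible. The Opus codec has built-in FEC.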
Rule: Use UDP with application-level FEC and PLC for media. Use TCP/QUIC for signaling (must be reliable).
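To make application-level FEC concrete, here is a minimal sketch of the simplest scheme, a single XOR parity packet per group. This is a swapped-in illustration, not the FEC that Opus or any particular WebRTC stack uses. It assumes all packets in a group have the same length and can recover at most one loss per group; with a group of 10 packets the overhead is 10%.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """XOR all packets together byte by byte (packets assumed equal length)."""
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild a group in which at most one packet (marked None) was lost."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only one loss per group")
    if not missing:
        return [p for p in received if p is not None]
    present = [p for p in received if p is not None]
    # XOR of the parity with every surviving packet yields the lost one.
    rebuilt = xor_parity(present + [parity])
    return [rebuilt if p is None else p for p in received]


if __name__ == "__main__":
    group = [b"aaaa", b"bbbb", b"cccc"]
    parity = xor_parity(group)
    print(recover([group[0], None, group[2]], parity))
```

The receiver repairs the loss locally with no round trip, which is why FEC remains useful on high-RTT paths where NACK-based retransmission arrives too late.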
Simulcast: Serving All Clients at Their Optimal Quality
Each sender encodes 3 resolutions simultaneously (180p, 360p, 720p). SFU forwards appropriate layer per receiver based on bandwidth estimation (RTCP feedback). Sender uploads ~50% more (2.2 Mbps vs 1.5 Mbps). Benefit: each receiver gets optimal quality. SVC alternative: single layered bitstream, one upload stream, but more complex codec. VP9/AV1 support SVC natively. Zoom uses simulcast (H.264 compatibility).
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving the core video conferencing flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
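An availability target translates into an error budget with one line of arithmetic. The sketch below assumes a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_pct / 100) * window_minutes


def budget_remaining(slo_pct: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_pct, window_days)
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99):
        print(f"{slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, so a single 10-minute incident spends close to half of it.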
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.