This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | VOD-first (YouTube/Netflix style). Nail HLS vs DASH segment model, transcoding ladder, CDN edge caching with pre-signed URLs, and view-count aggregation without blocking playback. |
| Arch 50 | Add live streaming (LL-HLS), DRM key delivery, per-title encoding, and cold-start / viral video hot-key mitigation on origin. |
| Arch 75 | Staff: multi-CDN failover, cost of egress at 1B hours/month, and how to migrate encoding profiles without re-transcoding the entire library. |
Interview Prompt
Design a video streaming platform like YouTube or Netflix. Users upload videos, the platform transcodes them into multiple quality levels, and viewers stream with adaptive bitrate playback. Support upload, playback, and view count analytics.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| VOD, live, or both? | VOD is upload → transcode → CDN cache. Live adds ingest RTMP/SRT, low-latency segment publishing, and no full-file pre-transcode. |
| What's the average video length and upload volume? | 10-min avg × 500K uploads/day = transcoding farm sizing. 1-hour 4K uploads dominate queue depth. |
| Do we need DRM, or are signed URLs enough? | Premium content needs a Widevine/FairPlay license server. User-generated content often uses time-limited pre-signed CDN URLs. |
| View counts — real-time or eventually consistent? | Real-time counters create hot keys on viral videos. Batch aggregation (Kafka → Flink) is standard for display counts. |
Scope
In scope
- Video upload, transcoding pipeline, and metadata storage
- Adaptive bitrate playback (HLS/DASH)
- CDN edge delivery with pre-signed URLs
- View count aggregation
- Capacity estimation for storage and egress
Out of scope (state explicitly)
- Recommendation / home feed ranking (#48, #65)
- Live chat and comments (#36)
- Full DRM license server internals
- Content moderation pipeline (#81)
Assumptions
- 500K new uploads/day, avg 10 min, 1080p source
- 200M DAU, avg 1 hour watch time/day
- 99.9% playback availability; upload can retry
- 5 renditions per video: 360p–1080p + audio-only
Functional Requirements
- Upload videos: Users upload videos with title, description, tags, thumbnails
- Stream videos: Adaptive bitrate streaming (adjust quality based on bandwidth)
- Search videos: By title, description, tags, channel
- Recommendations: Personalized "what to watch next"
- Channels/Subscriptions: Subscribe to channels, receive updates
- Engagement: Like, dislike, comment, share
- Watch history and resume playback
- Live streaming (optional: YouTube Live)
- Monetization: Ads, premium subscriptions
Non-Functional Requirements
- High Availability: 99.99%
- Low Startup Latency: Video starts playing within 2 seconds
- Smooth Playback: No buffering under normal network conditions
- Scalability: 2B+ monthly active users, 500 hours of video uploaded per minute
- Durability: Uploaded videos must never be lost
- Global: Low latency streaming worldwide (CDN)
- Cost Efficient: Video storage and bandwidth are the biggest costs
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given (product assumption) | 800M |
| Videos watched / day | 800M DAU × ~6 videos | 5B |
| Avg video duration | Given (typical workload assumption) | 5 min |
| Streaming bandwidth / video | Given (assumption documented in value) | 5 Mbps (avg bitrate) |
| Concurrent viewers | Given (peak load assumption) | 100M |
| Peak bandwidth | 100M × 5 Mbps | 500 Tbps |
| Video hours uploaded / day | 500 hrs/min × 60 min × 24 | 720K hours |
| Avg original video size | Given (typical workload assumption) | 500 MB |
| Upload storage / day | 720K × 500 MB (simplification: one ~500 MB original per uploaded hour) | 360 TB |
| Transcoded versions (6 resolutions) | 360 TB × 3 (full ladder ≈ 3× the original) | ~1 PB/day |
| Total storage (existing) | Given | ~1 EB (exabyte) |
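The arithmetic in the table can be checked with a few lines of Python. This is a minimal sketch using decimal units; the variable names are illustrative, and the 3× multiplier for the rendition ladder is the table's own assumption rather than a measured figure.

```python
# Back-of-the-envelope check of the capacity table (decimal units throughout).
TBPS = 10**12  # bits per second in one Tbps
TB = 10**12    # bytes in one terabyte
MB = 10**6     # bytes in one megabyte

concurrent_viewers = 100_000_000
avg_bitrate_bps = 5_000_000  # 5 Mbps
peak_bandwidth_tbps = concurrent_viewers * avg_bitrate_bps / TBPS

# 500 hours uploaded per minute -> hours of video uploaded per day.
upload_hours_per_day = 500 * 60 * 24

# The table multiplies that count by the 500 MB average original size.
upload_storage_tb = upload_hours_per_day * 500 * MB / TB

# Assumption from the table: the full rendition ladder totals ~3x the original.
transcoded_storage_tb = upload_storage_tb * 3

print(f"peak bandwidth : {peak_bandwidth_tbps:,.0f} Tbps")
print(f"uploads per day: {upload_hours_per_day:,} hours")
print(f"original bytes : {upload_storage_tb:,.0f} TB/day")
print(f"transcoded     : {transcoded_storage_tb / 1000:,.2f} PB/day")
```

Changing a single input, such as the average bitrate, shows quickly which assumption dominates the result.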
Upload Flow
- Client requests a pre-signed upload URL from Upload Service
- Client uploads video directly to S3 (chunked upload for large files)
- Upload Service creates a metadata record (status = "processing")
- Publishes video-uploaded event to Kafka
- Video Processing Pipeline picks up the event
Video Processing Pipeline: The Most Complex Part
Step 1: Transcoding: Convert the original video to multiple resolutions and bitrates, from 240p (400 Kbps) through 4K (20 Mbps). Codec: H.264 (broad compatibility), H.265/HEVC (roughly 50% better compression than H.264), or AV1 (royalty-free, best compression). Parallel processing: split the video into 10-second segments at keyframe boundaries, transcode each segment in parallel, then reassemble.
Step 2: Adaptive Bitrate Streaming Packaging: HLS (HTTP Live Streaming) uses a master playlist (.m3u8) that references one media playlist per quality level. The client downloads the master playlist and selects a quality based on available bandwidth. If bandwidth drops, the client switches to a lower quality at the next segment boundary. Each segment is 2-10 seconds long and independently cacheable by the CDN. DASH is the open-standard alternative, using an .mpd manifest and .m4s segments.
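To make the master playlist and the client-side quality switch concrete, here is a minimal sketch. The ladder is hypothetical (only the 240p/400 Kbps and 4K/20 Mbps endpoints come from the text), the 0.8 safety factor is an assumption, and real players use more elaborate heuristics and usually advertise a CODECS attribute as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    width: int
    height: int
    bandwidth_bps: int  # peak bitrate advertised to the player


# Hypothetical ladder; the middle rungs are illustrative values.
LADDER = [
    Rendition("240p", 426, 240, 400_000),
    Rendition("480p", 854, 480, 1_500_000),
    Rendition("720p", 1280, 720, 3_000_000),
    Rendition("1080p", 1920, 1080, 6_000_000),
    Rendition("2160p", 3840, 2160, 20_000_000),
]


def master_playlist(ladder):
    """Build an HLS master playlist that points at one media playlist per rendition."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for r in ladder:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth_bps},"
            f"RESOLUTION={r.width}x{r.height}"
        )
        lines.append(f"{r.name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


def pick_rendition(ladder, measured_bps, safety=0.8):
    """Pick the highest rendition that fits within a fraction of measured throughput."""
    budget = measured_bps * safety
    fitting = [r for r in ladder if r.bandwidth_bps <= budget]
    if not fitting:
        # Nothing fits: fall back to the lowest rung rather than stalling.
        return min(ladder, key=lambda r: r.bandwidth_bps)
    return max(fitting, key=lambda r: r.bandwidth_bps)


print(master_playlist(LADDER))
print(pick_rendition(LADDER, measured_bps=5_000_000).name)
```

The player would re-run something like `pick_rendition` after each segment download, which is what lets quality change at segment boundaries without restarting playback.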
Step 3: DRM Encryption: Widevine (Google/Android), FairPlay (Apple), PlayReady (Microsoft). Each segment encrypted with AES-128. License server provides decryption keys to authenticated clients.
CDN (Content Delivery Network)
- Popular videos (top 20%) cached at edge → serve 80% of traffic
- Less popular → fetch from origin, cache with shorter TTL
- Long tail → direct from origin, no caching
- Global deployment: 200+ edge locations (PoPs) worldwide
- Origin shield: Intermediate cache layer reduces origin load
- Popular content: Push to edge proactively; Long tail: Pull on first request
Recommendation Service
- Collaborative filtering: "Users who watched X also watched Y"
- Content-based: Similar genre, director, actors, tags
- Deep learning: Video embeddings (analyze visual/audio content)
- Serving: Pre-compute recommendations offline (Spark) → cache in Redis → serve in < 50 ms
View Count Aggregation Pipeline
Real-time view counting at scale: Client sends "view" event → buffered on API server for 5 seconds → Kafka topic view-events absorbs the write burst → Flink streaming job aggregates per video per minute → Redis INCR for real-time approximate display count → ClickHouse stores granular view data for analytics. Hourly reconciliation batch job: exact count from ClickHouse → update Redis and MySQL metadata.
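The 5-second buffering step on the API server might look like the sketch below. `ViewBuffer` is a hypothetical name; the returned batch stands in for the message that would be published to the view-events topic, and the clock is passed in explicitly so the behavior is easy to test.

```python
from collections import Counter


class ViewBuffer:
    """Accumulates view events in memory and emits one batch per flush interval."""

    def __init__(self, flush_interval_s=5.0):
        self.flush_interval_s = flush_interval_s
        self._counts = Counter()
        self._last_flush = 0.0

    def record(self, video_id, now):
        """Count one view; return a batch if the flush interval has elapsed, else None."""
        self._counts[video_id] += 1
        if now - self._last_flush >= self.flush_interval_s:
            return self.flush(now)
        return None

    def flush(self, now):
        """Hand back the accumulated counts and reset the buffer."""
        batch = dict(self._counts)
        self._counts.clear()
        self._last_flush = now
        return batch


buf = ViewBuffer(flush_interval_s=5.0)
for t, vid in [(1.0, "a"), (2.0, "a"), (3.0, "b")]:
    buf.record(vid, now=t)
print(buf.record("a", now=5.0))  # one batch covering all four views
```

A viral video then costs one message per server per interval instead of one write per view. The trade-off is that a crashed server loses up to one interval of counts, which the hourly reconciliation is meant to correct.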
DAG Scheduler (Pipeline Orchestrator)
The video processing pipeline has task dependencies that form a DAG. Orchestrator options: Temporal (recommended), AWS Step Functions, Apache Airflow. Tasks can run in parallel but merge must wait for ALL to complete. Each task retries independently: if 720p transcode fails, only that task retries.
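The dependency and retry behavior can be sketched with the standard-library `graphlib` module. The task names and graph shape are hypothetical, and this toy runner executes each wave sequentially; an orchestrator such as Temporal would dispatch the tasks in a wave in parallel and persist state between attempts.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key depends on the tasks in its set.
PIPELINE = {
    "split": {"validate"},
    "transcode_360p": {"split"},
    "transcode_720p": {"split"},
    "transcode_1080p": {"split"},
    "thumbnail": {"validate"},
    "merge": {"transcode_360p", "transcode_720p", "transcode_1080p"},
    "package_hls": {"merge"},
    "publish": {"package_hls", "thumbnail"},
}


def run_pipeline(graph, run_task, max_attempts=3):
    """Run tasks in dependency order, retrying each failed task on its own."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        for task in ready:
            for attempt in range(1, max_attempts + 1):
                try:
                    run_task(task)
                    break
                except RuntimeError:
                    if attempt == max_attempts:
                        raise  # give up; a real system would route this to a DLQ
            sorter.done(task)
        waves.append(ready)
    return waves


for wave in run_pipeline(PIPELINE, run_task=lambda task: None):
    print(wave)
```

Because `merge` only becomes ready once every transcode task is marked done, a retry of one rendition delays the merge but never re-runs the renditions that already succeeded.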
Search Service (Elasticsearch)
- Indexed fields: Video title, description, tags, channel name, transcript
- Features: Full-text search with BM25 ranking, fuzzy matching, autocomplete
- Sync: Video metadata changes in MySQL → Kafka CDC → Elasticsearch consumer updates index (< 2 second lag)
Event Bus Design (Kafka)
- Topic: video_streaming_platform-events
- Partitions: 64 (scale consumers horizontally)
- Partition key: entity_id (video_id or user_id), which preserves per-entity ordering
- Retention: 7 days (compliance) or 24h (high-volume telemetry)
- Replication factor: 3, min.insync.replicas: 2
- Producer: idempotent producer enabled (enable.idempotence=true)
- Consumer: consumer group "video_streaming_platform-processors"
- Delivery: at-least-once + idempotent handlers (dedup by event_id)
- DLQ topic: video_streaming_platform-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
- Sync path: validate → persist source of truth → publish event → return 201
- Async path: consumers update caches, indexes, notifications, aggregates
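At-least-once delivery means a handler can see the same event twice, so the consumer side needs deduplication and a dead-letter path. The sketch below shows that logic in isolation. `IdempotentConsumer` is a hypothetical class, the in-memory set and list stand in for a durable dedup store and the DLQ topic, and no Kafka client is involved.

```python
class IdempotentConsumer:
    """Applies each event at most once and dead-letters events that keep failing."""

    def __init__(self, handler, max_retries=3):
        self.handler = handler
        self.max_retries = max_retries
        self.seen = set()  # stand-in for a durable dedup store keyed by event_id
        self.dlq = []      # stand-in for the DLQ topic

    def process(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"  # redelivery: side effect already applied
        for _ in range(self.max_retries):
            try:
                self.handler(event)
            except Exception:
                continue  # retry the same event
            self.seen.add(event_id)
            return "processed"
        self.dlq.append(event)  # poison message: park it and move on
        return "dead-lettered"


applied = []


def handler(event):
    if event.get("poison"):
        raise ValueError("bad payload")
    applied.append(event["event_id"])


consumer = IdempotentConsumer(handler)
print(consumer.process({"event_id": "e1"}))
print(consumer.process({"event_id": "e1"}))
print(consumer.process({"event_id": "e2", "poison": True}))
```

In production the dedup record and the side effect would need to be committed together (or the side effect itself made idempotent); otherwise a crash between the two can still double-apply an event.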
Upload Video
POST /api/v1/videos/upload-url
Response: 200 OK
{
"upload_url": "https://s3.amazonaws.com/uploads/...",
"video_id": "video-uuid"
}
POST /api/v1/videos/{video_id}/metadata
{
"title": "System Design in 10 Minutes",
"description": "...",
"tags": ["system design", "tutorial"],
"category": "education",
"visibility": "public"
}
Stream Video
GET /api/v1/videos/{video_id}/manifest
Response: 200 OK (body carries the CDN manifest URL)
{
"manifest_url": "https://cdn.example.com/videos/{video_id}/manifest.m3u8",
"thumbnail_url": "https://cdn.example.com/videos/{video_id}/thumb.jpg"
}
Search
GET /api/v1/search?q=system+design&type=video&sort=relevance
Get Recommendations
GET /api/v1/recommendations?limit=20
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
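408 Request Timeout: server timed out waiting for the complete request; resend it (a job that is still processing returns 200 with its current status, so keep polling)
MySQL: Video Metadata (Sharded by video_id)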
CREATE TABLE videos (
video_id BIGINT PRIMARY KEY,
channel_id BIGINT NOT NULL,
title VARCHAR(100),
description TEXT,
tags JSON,
category VARCHAR(50),
duration_sec INT,
status ENUM('processing', 'ready', 'failed', 'removed'),
visibility ENUM('public', 'unlisted', 'private'),
view_count BIGINT DEFAULT 0,
like_count INT DEFAULT 0,
dislike_count INT DEFAULT 0,
manifest_url TEXT,
thumbnail_url TEXT,
upload_date TIMESTAMP,
INDEX idx_channel (channel_id, upload_date DESC)
);
S3: Video Storage Structure
Bucket: video-originals
/{video_id}/original.mp4
Bucket: video-transcoded
/{video_id}/manifest.m3u8
/{video_id}/240p/playlist.m3u8
/{video_id}/240p/segment_001.ts
...
Cassandra: View Events
CREATE TABLE view_events (
video_id BIGINT,
view_date DATE,
view_hour INT,
user_id UUID,
watch_duration INT,
quality VARCHAR,
device VARCHAR,
country VARCHAR,
PRIMARY KEY ((video_id, view_date), view_hour, user_id)
);
Redis: View Counters + Hot Video Cache
Key: views:{video_id}
Value: counter (INCR)
Key: video:meta:{video_id}
Value: Hash { title, channel, manifest_url, thumbnail_url }
TTL: 3600
| Concern | Solution |
|---|---|
| Upload failure | S3 multipart upload (resumable); client retries from last chunk |
| Transcoding failure | Retry failed segments; DLQ for persistent failures |
| CDN edge failure | CDN automatically routes to next closest PoP |
| Origin failure | S3 cross-region replication; CDN caches absorb the load |
| Video corruption | Checksum verification at each stage; re-transcode from original |
| Popularity surge | CDN pre-warming for predicted viral content; auto-scale origin |
Specific: Handling a Viral Video
- Video starts getting 10M views/minute
- CDN edge caches fill up → 90% of requests served from edge
- View counter: Don't write to DB for every view. Batch in memory → flush every 5 seconds
- Comment section: Rate limit comments per user; paginate aggressively
Video Segment Prefetching
- Client prefetches next 2-3 segments while playing current segment
- If user seeks to a new position → cancel prefetch, start buffering from seek point
- Adaptive: if bandwidth is high, prefetch more; if low, prefetch less
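One way to express the adaptive prefetch rule is to scale the window with bandwidth headroom. This is a hypothetical heuristic, not a standard player algorithm; the function names and the 1 to 5 segment bounds are assumptions for illustration.

```python
def prefetch_count(bandwidth_bps, segment_bitrate_bps, min_segments=1, max_segments=5):
    """More headroom over the playing bitrate means more segments fetched ahead."""
    if segment_bitrate_bps <= 0:
        raise ValueError("segment_bitrate_bps must be positive")
    headroom = bandwidth_bps / segment_bitrate_bps
    return max(min_segments, min(max_segments, int(headroom)))


def segments_to_prefetch(current_index, count, last_index):
    """Indices of the next `count` segments, clipped at the end of the video."""
    first = current_index + 1
    return list(range(first, min(first + count, last_index + 1)))


# Playing a 3 Mbps rendition on a 6 Mbps link: fetch two segments ahead.
n = prefetch_count(bandwidth_bps=6_000_000, segment_bitrate_bps=3_000_000)
print(segments_to_prefetch(current_index=10, count=n, last_index=100))
```

On a seek, the client would cancel in-flight requests and call `segments_to_prefetch` again with the new `current_index`.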
Thumbnail Generation
- Extract frames at 10% intervals → pick the most "interesting" frame (highest entropy, face detection)
- Or: generate "video preview" (6-second animated summary) shown on hover
Copyright / Content ID
- Audio fingerprinting: Match audio against copyrighted music database
- Video fingerprinting: Perceptual hashing to detect re-uploads
- Action: Block upload, mute audio, add ads, or allow with claim
Cost Optimization
- Storage tiering: Frequently accessed videos on S3 Standard; rarely viewed on S3 Glacier
- Encoding optimization: Only transcode to 4K if original is 4K; don't upscale
- CDN cost: Negotiate bandwidth tiers; use multi-CDN for cost and resilience
- Keep original: Always keep the original file: codecs improve, and you can re-transcode later
Live Streaming Architecture
- Ingest: RTMP from broadcaster → transcoding server
- Real-time transcoding: Must be fast (< 1 second per segment)
- Delivery: HLS/DASH with very short segments (2 seconds) for low latency
- Glass-to-glass latency: Target < 5 seconds (use LL-HLS for < 3s)
- DVR: Store live segments for rewind/replay
Interview Walkthrough
- Clarify VOD vs live upfront — live adds real-time transcoding, short HLS segments, and glass-to-glass latency constraints.
- Walk the upload path: ingest → transcode to multiple bitrates/resolutions → package as HLS/DASH segments → store in object storage.
- Place a CDN in front of segment delivery; discuss adaptive bitrate switching based on client bandwidth measurements.
- For live streams, target <5s latency with 2-second segments and LL-HLS; trade segment size against buffering and CDN cache efficiency.
- Separate metadata (title, thumbnails, view counts) in a database from video blobs in S3 — never serve video through the API tier.
- Quantify bandwidth with a back-of-the-envelope estimate: 1M concurrent viewers × 5 Mbps = 5 Tbps peak egress, so a CDN is mandatory.
- Common pitfall: serving video files directly from origin without CDN — a viral video takes down the entire platform.
HLS vs DASH vs WebRTC: Choosing the Streaming Protocol
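HLS has the broadest device support: it plays natively on Apple platforms and through Media Source Extensions players such as hls.js elsewhere. DASH is the open standard. WebRTC offers < 500ms latency, but it has no DRM and doesn't scale to millions of viewers. Large VOD platforms such as YouTube and Netflix rely on HTTP segment streaming (HLS and/or DASH) because it leverages existing CDN infrastructure. CMAF defines a common segment format that both HLS and DASH manifests can reference, so one set of segments can serve both.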
Codec Selection: H.264 vs H.265 vs AV1
H.264 has universal support. H.265 offers roughly 50% better compression than H.264 but carries expensive licensing. AV1 is roughly 30% better than H.265 and royalty-free, but encoding is far more compute-intensive (on the order of 50-100x slower). Netflix's approach: encode in multiple codecs per resolution and serve the best codec the client supports.
Pre-Signed Upload URL: Why Upload Directly to S3?
Naive approach (Client → API Server → S3) makes the API server a bottleneck proxying gigabytes of video with double bandwidth cost. Direct upload via pre-signed S3 URL: API server handles only metadata, S3 handles the heavy lifting, resumable via multipart upload.
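The idea behind a pre-signed URL is that the API server signs the object path and an expiry, and the storage tier verifies that signature without calling back. The sketch below uses a generic HMAC scheme to show the mechanism. It is not S3's actual signing algorithm (an SDK generates real S3 URLs), and `SECRET`, `sign_url`, `verify_url`, and the example host are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-signing-key"  # hypothetical; a real key lives in a secrets manager


def sign_url(base_url, method, expires_at, secret=SECRET):
    """Append an expiry and an HMAC over (method, path, expiry) to the URL."""
    payload = f"{method}\n{urlparse(base_url).path}\n{expires_at}".encode()
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{base_url}?{urlencode({'expires': expires_at, 'sig': sig})}"


def verify_url(url, method, now, secret=SECRET):
    """Accept the request only if the signature matches and the URL has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    payload = f"{method}\n{parsed.path}\n{expires_at}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)  # constant-time comparison


url = sign_url("https://uploads.example.com/video-uuid/original.mp4", "PUT", expires_at=1000)
print(verify_url(url, "PUT", now=900))   # within the validity window
print(verify_url(url, "PUT", now=1001))  # expired
```

Because the method and path are part of the signed payload, a URL issued for uploading one object cannot be reused to read it or to overwrite a different object.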
View Count: Why Not Just INCREMENT a Database Counter?
At 1M concurrent viewers, naively incrementing a DB counter causes row-lock contention on the hottest videos. A common approach at this scale: client sends "view" event → buffered in memory → batch flush to Kafka → Flink aggregates per video per minute → Redis INCR for approximate real-time count → exact counts reconciled hourly.
Storage Tiering: The 80/20 Rule for Video
20% of videos account for 80% of views (power law). Hot Storage: top 20% most-viewed + all videos uploaded in last 7 days. Warm Storage: videos with > 10 views/month. Cold Storage (Glacier): videos with < 10 views/month and older than 1 year. Auto-tiering monitors access patterns and moves content between tiers. At Netflix scale, this saves hundreds of millions per year.
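The tier rules above translate into a small classifier. This is a sketch under stated assumptions: the function name and parameters are illustrative, and the text leaves two cases unspecified (exactly 10 views/month, and low-traffic videos younger than one year), which this version assigns to warm storage.

```python
def storage_tier(views_last_month, age_days, in_top_20_percent):
    """Classify a video as hot, warm, or cold using the thresholds from the text."""
    if in_top_20_percent or age_days <= 7:
        return "hot"   # most-viewed videos plus everything uploaded in the last 7 days
    if views_last_month < 10 and age_days > 365:
        return "cold"  # rarely watched and older than one year
    return "warm"      # assumption: everything in between stays warm


print(storage_tier(views_last_month=5, age_days=3, in_top_20_percent=False))
print(storage_tier(views_last_month=2, age_days=400, in_top_20_percent=False))
```

An auto-tiering job would run a function like this periodically over access statistics and enqueue moves only for videos whose tier changed.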
Why MySQL for Video Metadata
Video metadata access patterns (listing by channel sorted by date, complex admin queries with JOINs) need relational queries: MySQL excels. Video metadata is small (< 1KB per video, even at 1B videos = 1TB). Cassandra has no JOINs and known issues with counter columns.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — MVP (single region, HLS only)
Monolith handles upload metadata + playback URLs. S3 origin, single FFmpeg worker pool, CloudFront CDN. PostgreSQL for video metadata. Synchronous transcode on upload (works up to ~1K uploads/day).
Key components: Monolith · S3 · FFmpeg workers · CloudFront · PostgreSQL
Move to next phase when: Transcode queue backlog exceeds 2 hours; playback stalls on upload spike
Phase 2 — Scale (async transcode, ABR ladder)
Kafka job queue + GPU worker autoscaling. Multi-rendition HLS with master manifest. Pre-signed CDN URLs. Redis for view count aggregates. Separate upload service (multipart) from playback API.
Key components: Kafka · GPU transcode fleet · Redis counters · Signed CDN URLs · Metadata service
Move to next phase when: Viral video origin egress spike; view count Redis hot key
Phase 3 — Global (multi-CDN, per-title encoding)
Multi-region S3 with cross-region replication. Origin shield + secondary CDN failover. Per-title encoding (skip 1080p for low-complexity content). Flink view aggregation. AV1 renditions for supported devices.
Key components: Multi-CDN · Origin shield · Flink analytics · Per-title encoding · Cross-region replication
Move to next phase when: Single CDN outage causes 30-min playback blackout; encoding cost exceeds storage cost
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Playback start p99 latency | < 2 sec | Manifest fetch + first segment from edge |
| Playback availability | 99.9% | Core product — ~43 min downtime/month |
| Transcode completion p95 | < 30 min | Upload-to-ready for 10-min 1080p video |
| CDN cache hit ratio | > 95% | Origin egress cost control |
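The "~43 min downtime/month" figure follows directly from the availability target. A minimal sketch, assuming a 30-day month; the function names are illustrative.

```python
def downtime_budget_minutes(slo, days=30):
    """Minutes of allowed unavailability in a window of `days` days."""
    return (1.0 - slo) * days * 24 * 60


def budget_remaining(slo, downtime_so_far_min, days=30):
    """Fraction of the error budget still unspent (negative means the SLO is missed)."""
    budget = downtime_budget_minutes(slo, days)
    return (budget - downtime_so_far_min) / budget


print(f"{downtime_budget_minutes(0.999):.1f} min/month at 99.9%")
print(f"{budget_remaining(0.999, downtime_so_far_min=21.6):.0%} of budget left")
```

Tracking the remaining fraction gives a simple rule for release decisions: slow down risky rollouts once most of the month's budget is spent.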
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| CDN cache miss storm on new viral video | Origin egress 10× baseline; CloudFront origin error rate spikes | Enable origin shield; pre-warm top 50 segments via cache-prefetch API; temporarily extend segment TTL; rate-limit manifest requests per IP |
| Transcode worker pool exhausted during upload spike | Kafka consumer lag > 100K; upload-to-ready SLA breach alerts | Autoscale GPU workers; deprioritize re-transcode jobs; serve 720p-only for new uploads until backlog clears |
| Signed URL key compromise | Abnormal bandwidth on premium content; URLs shared on forums | Rotate signing key immediately; shorten TTL to 15 min; enable IP/session binding; audit access logs for pattern |
Cost Drivers (Staff lens)
- CDN egress: 200M DAU × 1 hr/day × 2 Mbps avg ≈ 180 PB/day ≈ 5.4 EB/month raw. The CDN reduces origin cost, but egress still dominates
- Transcoding: 500K uploads × 10 min × 5 renditions × ~$0.03/min GPU ≈ $750K/day at scale
- S3 storage: 500K × 10 min × 5 renditions × ~50 MB per rendition-minute ≈ 1.25 PB/day ingested (lifecycle to Glacier for old renditions)
Multi-Region & DR
Active-active playback via geo-routed CDN (content replicated to regional origins). Upload lands in nearest region; async cross-region replication for popular content. Transcode runs in upload region; remote regions pull on first cache miss via origin shield.