Design a Video Streaming Platform (YouTube / Netflix) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	VOD-first (YouTube/Netflix style). Nail HLS vs DASH segment model, transcoding ladder, CDN edge caching with pre-signed URLs, and view-count aggregation without blocking playback.
Arch 50	Add live streaming (LL-HLS), DRM key delivery, per-title encoding, and cold-start / viral video hot-key mitigation on origin.
Arch 75	Staff: multi-CDN failover, cost of egress at 1B hours/month, and how to migrate encoding profiles without re-transcoding the entire library.

Interview Prompt

Design a video streaming platform like YouTube or Netflix. Users upload videos, the platform transcodes them into multiple quality levels, and viewers stream with adaptive bitrate playback. Support upload, playback, and view count analytics.

Clarifying Questions (ask before designing)

Question	Why it matters
VOD, live, or both?	VOD is upload → transcode → CDN cache. Live adds ingest RTMP/SRT, low-latency segment publishing, and no full-file pre-transcode.
What's the average video length and upload volume?	10-min avg × 500K uploads/day = transcoding farm sizing. 1-hour 4K uploads dominate queue depth.
Do we need DRM, or are signed URLs enough?	Premium content needs a Widevine/FairPlay license server. User-generated content often uses time-limited pre-signed CDN URLs.
View counts — real-time or eventually consistent?	Real-time counters create hot keys on viral videos. Batch aggregation (Kafka → Flink) is standard for display counts.

Scope

In scope

Video upload, transcoding pipeline, and metadata storage
Adaptive bitrate playback (HLS/DASH)
CDN edge delivery with pre-signed URLs
View count aggregation
Capacity estimation for storage and egress

Out of scope (state explicitly)

Recommendation / home feed ranking (#48, #65)
Live chat and comments (#36)
Full DRM license server internals
Content moderation pipeline (#81)

Assumptions

500K new uploads/day, avg 10 min, 1080p source
200M DAU, avg 1 hour watch time/day
99.9% playback availability; upload can retry
5 renditions per video: 360p–1080p + audio-only

Functional Requirements

Upload videos: Users upload videos with title, description, tags, thumbnails
Stream videos: Adaptive bitrate streaming (adjust quality based on bandwidth)
Search videos: By title, description, tags, channel
Recommendations: Personalized "what to watch next"
Channels/Subscriptions: Subscribe to channels, receive updates
Engagement: Like, dislike, comment, share
Watch history and resume playback
Live streaming (optional: YouTube Live)
Monetization: Ads, premium subscriptions

Capacity Estimation (YouTube-scale upper bound; the Assumptions section above uses a smaller 200M-DAU, 500K-uploads/day workload)

Metric	Calculation	Value
DAU	Given (product assumption)	800M
Videos watched / day	800M DAU × ~6 videos	~5B
Avg video duration	Given (typical workload assumption)	5 min
Streaming bandwidth / video	Given (assumption documented in value)	5 Mbps (avg bitrate)
Concurrent viewers	Given (peak load assumption)	100M
Peak bandwidth	100M × 5 Mbps	500 Tbps
Video uploaded / day	500 hrs/min × 60 min × 24	720K hours
Videos uploaded / day	720K hours × 60 min ÷ 5 min avg	~8.6M
Avg original video size	Given (typical workload assumption)	500 MB
Upload storage / day	8.64M × 500 MB	~4.3 PB
Transcoded versions (6 renditions)	4.3 PB × 3 (all renditions together ≈ 3× the original)	~13 PB/day
Total storage (existing)	Given	~1 EB (exabyte)
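
The estimate can be re-derived with units carried explicitly. The sketch below is only a sanity check under the table's own inputs and decimal prefixes; note that 500 hrs/min multiplies out to hours of video per day, so it converts to a video count using the 5-minute average before sizing storage. The variable names are illustrative.

```python
# Back-of-envelope check of the capacity table (decimal units).
CONCURRENT_VIEWERS = 100_000_000
AVG_BITRATE_MBPS = 5
UPLOAD_HOURS_PER_MIN = 500
AVG_VIDEO_MIN = 5
AVG_ORIGINAL_MB = 500        # per uploaded video
TRANSCODE_MULTIPLIER = 3     # all renditions together, relative to the original

# 1 Tbps = 1,000,000 Mbps
peak_tbps = CONCURRENT_VIEWERS * AVG_BITRATE_MBPS / 1_000_000

# 500 hours arrive every minute, so this is HOURS of video per day.
upload_hours_per_day = UPLOAD_HOURS_PER_MIN * 60 * 24
videos_per_day = upload_hours_per_day * 60 / AVG_VIDEO_MIN

# 1 PB = 1,000,000,000 MB
upload_pb_per_day = videos_per_day * AVG_ORIGINAL_MB / 1_000_000_000
transcoded_pb_per_day = upload_pb_per_day * TRANSCODE_MULTIPLIER

print(f"peak egress:        {peak_tbps:.0f} Tbps")
print(f"uploaded per day:   {upload_hours_per_day:,} hours (~{videos_per_day:,.0f} videos)")
print(f"original storage:   {upload_pb_per_day:.2f} PB/day")
print(f"transcoded storage: {transcoded_pb_per_day:.2f} PB/day")
```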


Upload Flow

Client requests a pre-signed upload URL from Upload Service
Client uploads video directly to S3 (chunked upload for large files)
Upload Service creates a metadata record (status = "processing")
Publishes video-uploaded event to Kafka
Video Processing Pipeline picks up the event
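
A chunked upload needs a part plan that the client can resume from. The following is a minimal sketch of that planning step only (no network calls); the function names and the 64 MiB default are assumptions, while the 5 MiB minimum part size and 10,000-part cap are S3 multipart limits.

```python
from typing import List, Set, Tuple

MIN_PART_BYTES = 5 * 1024 * 1024   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000                 # S3 maximum number of parts per upload

Part = Tuple[int, int, int]        # (part_number, first_byte, last_byte)


def plan_parts(file_size: int, part_size: int = 64 * 1024 * 1024) -> List[Part]:
    """Split a file into 1-indexed byte ranges suitable for a multipart upload."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART_BYTES:
        raise ValueError("part_size is below the multipart minimum")
    # Grow the part size if the file would otherwise need too many parts.
    while -(-file_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts: List[Part] = []
    offset, number = 0, 1
    while offset < file_size:
        last = min(offset + part_size, file_size) - 1
        parts.append((number, offset, last))
        offset, number = last + 1, number + 1
    return parts


def remaining_parts(parts: List[Part], uploaded: Set[int]) -> List[Part]:
    """Parts still to send after a failure, given the part numbers already stored."""
    return [p for p in parts if p[0] not in uploaded]
```

For a 150 MiB file this yields three parts, and after parts 1 and 2 succeed a retry only resends part 3.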

Video Processing Pipeline: The Most Complex Part

Step 1: Transcoding: Convert the original video to multiple resolutions and bitrates, from 240p (400 Kbps) through 4K (20 Mbps). Codec: H.264 (broad compatibility), H.265/HEVC (roughly 50% better compression), or AV1 (royalty-free, best compression of the three). Parallel processing: split the video into 10-second chunks at keyframe boundaries, transcode each chunk in parallel, then reassemble.
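
The split-transcode-reassemble idea can be sketched as below. This is an illustration only: `transcode_chunk` is a hypothetical stand-in that returns an output name instead of invoking an encoder, and a real splitter would cut on keyframe boundaries rather than exact 10-second marks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Chunk = Tuple[float, float]  # (start_sec, end_sec)


def split_into_chunks(duration_sec: float, chunk_sec: float = 10.0) -> List[Chunk]:
    """Cut a timeline into fixed-length chunks; the last one may be shorter."""
    chunks: List[Chunk] = []
    start = 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)
        chunks.append((start, end))
        start = end
    return chunks


def transcode_chunk(chunk: Chunk, rendition: str) -> str:
    """Hypothetical encoder call: returns the name of the encoded chunk."""
    start, end = chunk
    return f"{rendition}/{int(start):06d}-{int(end):06d}.mp4"


def transcode_parallel(duration_sec: float, rendition: str, workers: int = 4) -> List[str]:
    """Encode all chunks concurrently and return outputs in playback order."""
    chunks = split_into_chunks(duration_sec)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the results can be concatenated as-is.
        return list(pool.map(lambda c: transcode_chunk(c, rendition), chunks))
```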

Step 2: Adaptive Bitrate Streaming Packaging: HLS (HTTP Live Streaming) uses a master playlist (.m3u8) that references one media playlist per quality level. The client downloads the master playlist and selects a quality based on measured bandwidth; if bandwidth drops, it switches to a lower quality at the next segment boundary. Each segment is 2-10 seconds long and independently cacheable by the CDN. DASH is the open-standard alternative, using an .mpd manifest and, typically, .m4s segments.
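
To make the master-playlist model concrete, here is a small sketch that renders a master playlist and picks a rendition for a measured bandwidth. The ladder values other than 240p at 400 Kbps, the 0.8 safety factor, and the function names are assumptions; real players use more elaborate heuristics (buffer level, throughput history).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rendition:
    name: str
    width: int
    height: int
    bandwidth_bps: int  # peak bitrate advertised to the player


LADDER = [
    Rendition("240p", 426, 240, 400_000),
    Rendition("720p", 1280, 720, 3_000_000),
    Rendition("1080p", 1920, 1080, 6_000_000),
]


def master_playlist(ladder: List[Rendition]) -> str:
    """Render an HLS master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for r in ladder:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth_bps},RESOLUTION={r.width}x{r.height}"
        )
        lines.append(f"{r.name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


def pick_rendition(ladder: List[Rendition], measured_bps: float, safety: float = 0.8) -> Rendition:
    """Highest rendition that fits within a safety margin of measured bandwidth."""
    budget = measured_bps * safety
    affordable = [r for r in ladder if r.bandwidth_bps <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest rung rather than stalling.
        return min(ladder, key=lambda r: r.bandwidth_bps)
    return max(affordable, key=lambda r: r.bandwidth_bps)
```

With 5 Mbps measured, the budget is 4 Mbps, so the sketch selects 720p rather than 1080p.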

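Step 3: DRM Encryption: Widevine (Google: Android, Chrome), FairPlay (Apple), PlayReady (Microsoft). Segments are encrypted with AES-128, typically under Common Encryption (CENC) so the same encrypted media can be licensed by more than one DRM system. The license server provides decryption keys to authenticated clients.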

CDN (Content Delivery Network)

Popular videos (top 20%) cached at edge → serve 80% of traffic
Less popular → fetch from origin, cache with shorter TTL
Long tail → direct from origin, no caching
Global deployment: 200+ edge locations (PoPs) worldwide
Origin shield: Intermediate cache layer reduces origin load
Popular content: Push to edge proactively; Long tail: Pull on first request

Recommendation Service

Collaborative filtering: "Users who watched X also watched Y"
Content-based: Similar genre, director, actors, tags
Deep learning: Video embeddings (analyze visual/audio content)
Serving: Pre-compute recommendations offline (Spark) → cache in Redis → serve in < 50 ms

View Count Aggregation Pipeline

Real-time view counting at scale: Client sends "view" event → buffered on API server for 5 seconds → Kafka topic view-events absorbs the write burst → Flink streaming job aggregates per video per minute → Redis INCRBY applies each aggregated delta for the approximate real-time display count → ClickHouse stores granular view data for analytics. Hourly reconciliation batch job: exact count from ClickHouse → update Redis and MySQL metadata.
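
The per-minute aggregation stage can be sketched in plain Python as a stand-in for the streaming job. The function names are illustrative, and the dictionary plays the role of the counter store: each window produces one delta per video instead of one write per view.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

ViewEvent = Tuple[str, int]  # (video_id, unix_timestamp_sec)


def aggregate_per_minute(events: Iterable[ViewEvent]) -> Counter:
    """Count views per (video_id, minute_start) tumbling window."""
    windows: Counter = Counter()
    for video_id, ts in events:
        windows[(video_id, ts - ts % 60)] += 1
    return windows


def apply_to_counters(counters: Dict[str, int], windows: Counter) -> None:
    """Apply one aggregated delta per window (the in-memory analogue of INCRBY)."""
    for (video_id, _minute), delta in windows.items():
        counters[video_id] = counters.get(video_id, 0) + delta
```

Four raw events across two videos collapse into three window deltas here; at viral scale the reduction is from millions of events to one write per video per minute.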

DAG Scheduler (Pipeline Orchestrator)

The video processing pipeline has task dependencies that form a DAG. Orchestrator options: Temporal (recommended), AWS Step Functions, Apache Airflow. Tasks can run in parallel but merge must wait for ALL to complete. Each task retries independently: if 720p transcode fails, only that task retries.
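
The dependency rule (parallel tasks, merge waits for all) can be illustrated with the standard-library `graphlib` module. This is a sketch of the scheduling order only, not of Temporal or Step Functions; the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# task -> set of tasks it depends on (names are illustrative)
PIPELINE: Dict[str, Set[str]] = {
    "probe": set(),
    "transcode_360p": {"probe"},
    "transcode_720p": {"probe"},
    "transcode_1080p": {"probe"},
    "thumbnails": {"probe"},
    "package_hls": {"transcode_360p", "transcode_720p", "transcode_1080p"},
    "publish": {"package_hls", "thumbnails"},
}


def run_in_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Return groups of tasks that may run in parallel, in execution order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # A real orchestrator would mark each task done as it finishes (and
        # retry failures individually); here a whole wave completes at once.
        sorter.done(*ready)
    return waves
```

All three transcodes and the thumbnail job land in the same wave, while `package_hls` only becomes ready once every transcode has been marked done.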


Search Service (Elasticsearch)

Indexed fields: Video title, description, tags, channel name, transcript
Features: Full-text search with BM25 ranking, fuzzy matching, autocomplete
Sync: Video metadata changes in MySQL → Kafka CDC → Elasticsearch consumer updates index (< 2 second lag)

Event Bus Design (Kafka)

Topic: video_streaming_platform-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: video_id (or user_id for per-user events; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "video_streaming_platform-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: video_streaming_platform-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
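
At-least-once delivery means a handler can see the same event twice, so the consumer side needs deduplication plus a dead-letter path. A minimal in-memory sketch, assuming events carry an `event_id` field; the class name is hypothetical, the seen-set would be a durable store in practice, and "3 retries" is simplified to three attempts.

```python
from typing import Callable, List, Set

MAX_ATTEMPTS = 3  # simplified reading of "poison messages after 3 retries"


class IdempotentConsumer:
    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.seen: Set[str] = set()   # processed event_ids
        self.dlq: List[dict] = []     # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"        # redelivery: skip side effects
        for _attempt in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue              # transient failure: try again
            self.seen.add(event_id)   # record only after success
            return "processed"
        self.dlq.append(event)        # give up and park the message
        return "dead-lettered"
```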

Upload Video

HTTP

POST /api/v1/videos/upload-url
Response: 200 OK
{
  "upload_url": "https://s3.amazonaws.com/uploads/...",
  "video_id": "video-uuid"
}

POST /api/v1/videos/{video_id}/metadata
{
  "title": "System Design in 10 Minutes",
  "description": "...",
  "tags": ["system design", "tutorial"],
  "category": "education",
  "visibility": "public"
}

Stream Video

HTTP

GET /api/v1/videos/{video_id}/manifest
Response: 200 OK (body carries the CDN manifest URL; the player fetches the manifest and segments from the CDN, not from the API)
{
  "manifest_url": "https://cdn.example.com/videos/{video_id}/manifest.m3u8",
  "thumbnail_url": "https://cdn.example.com/videos/{video_id}/thumb.jpg"
}

Search

HTTP

GET /api/v1/search?q=system+design&type=video&sort=relevance

Get Recommendations

HTTP

GET /api/v1/recommendations?limit=20

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
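
Async job responses (not errors)

202 Accepted: job queued; poll GET /jobs/{id} for status
200 OK with status "processing": job still running; continue polling (408 Request Timeout means the client was too slow to send its request, so it should not signal a pending job)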

MySQL: Video Metadata (Sharded by video_id)

SQL

CREATE TABLE videos (
    video_id        BIGINT PRIMARY KEY,
    channel_id      BIGINT NOT NULL,
    title           VARCHAR(100),
    description     TEXT,
    tags            JSON,
    category        VARCHAR(50),
    duration_sec    INT,
    status          ENUM('processing', 'ready', 'failed', 'removed'),
    visibility      ENUM('public', 'unlisted', 'private'),
    view_count      BIGINT DEFAULT 0,
    like_count      INT DEFAULT 0,
    dislike_count   INT DEFAULT 0,
    manifest_url    TEXT,
    thumbnail_url   TEXT,
    upload_date     TIMESTAMP,
    INDEX idx_channel (channel_id, upload_date DESC)
);

S3: Video Storage Structure

Bucket: video-originals
  /{video_id}/original.mp4

Bucket: video-transcoded
  /{video_id}/manifest.m3u8
  /{video_id}/240p/playlist.m3u8
  /{video_id}/240p/segment_001.ts
  ...

Cassandra: View Events

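CQL

CREATE TABLE view_events (
    video_id        BIGINT,
    view_date       DATE,
    view_hour       INT,
    user_id         UUID,
    event_id        TIMEUUID,   -- added so repeat views by one user in the same hour do not overwrite each other
    watch_duration  INT,
    quality         VARCHAR,
    device          VARCHAR,
    country         VARCHAR,
    -- One partition per video per day; a viral video may need an extra bucket column to cap partition size.
    PRIMARY KEY ((video_id, view_date), view_hour, user_id, event_id)
);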

Redis: View Counters + Hot Video Cache

Key:    views:{video_id}
Value:  counter (INCR)

Key:    video:meta:{video_id}
Value:  Hash { title, channel, manifest_url, thumbnail_url }
TTL:    3600

Failure Handling

Concern	Solution
Upload failure	S3 multipart upload (resumable); client retries from last chunk
Transcoding failure	Retry failed segments; DLQ for persistent failures
CDN edge failure	CDN automatically routes to next closest PoP
Origin failure	S3 cross-region replication; CDN caches absorb the load
Video corruption	Checksum verification at each stage; re-transcode from original
Popularity surge	CDN pre-warming for predicted viral content; auto-scale origin

Specific: Handling a Viral Video

Video starts getting 10M views/minute
CDN edge caches warm up → 90%+ of requests served from edge
View counter: Don't write to DB for every view. Batch in memory → flush every 5 seconds
Comment section: Rate limit comments per user; paginate aggressively
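
The "batch in memory, flush every 5 seconds" step can be sketched as a small buffer on each API server. The class name and the injected `flush` callback are assumptions (in the pipeline described earlier the callback would publish to Kafka), and time is passed in explicitly to keep the example deterministic. Views buffered at the moment of a crash are lost, which is acceptable for approximate counts.

```python
from collections import Counter
from typing import Callable, Dict, Optional


class ViewBuffer:
    """Coalesce per-view increments into one batch per flush interval."""

    def __init__(self, flush: Callable[[Dict[str, int]], None], interval_sec: float = 5.0) -> None:
        self._flush = flush
        self._interval = interval_sec
        self._pending: Counter = Counter()
        self._last_flush: Optional[float] = None

    def record(self, video_id: str, now: float) -> None:
        if self._last_flush is None:
            self._last_flush = now          # start the first interval
        self._pending[video_id] += 1
        if now - self._last_flush >= self._interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self._pending:
            self._flush(dict(self._pending))  # one batch instead of N writes
            self._pending.clear()
        self._last_flush = now
```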

Video Segment Prefetching

Client prefetches next 2-3 segments while playing current segment
If user seeks to a new position → cancel prefetch, start buffering from seek point
Adaptive: if bandwidth is high, prefetch more; if low, prefetch less

Thumbnail Generation

Extract frames at 10% intervals → pick the most "interesting" frame (highest entropy, face detection)
Or: generate "video preview" (6-second animated summary) shown on hover

Copyright / Content ID

Audio fingerprinting: Match audio against copyrighted music database
Video fingerprinting: Perceptual hashing to detect re-uploads
Action: Block upload, mute audio, add ads, or allow with claim
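
Perceptual hashing can be illustrated with the average hash, one of the simplest variants; the text does not say which algorithm a production Content ID system uses, so treat this as an example of the idea only. The input is assumed to be a frame already downscaled to a small grayscale grid.

```python
from typing import List


def average_hash(gray: List[List[int]]) -> int:
    """Bit-pack a small grayscale grid: 1 where a pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest the same content."""
    return bin(a ^ b).count("1")
```

A uniformly brightened copy of a frame keeps the same hash (distance 0), which is why this kind of hash survives re-encoding better than a cryptographic checksum does.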

Cost Optimization

Storage tiering: Frequently accessed videos on S3 Standard; rarely viewed on S3 Glacier
Encoding optimization: Only transcode to 4K if original is 4K; don't upscale
CDN cost: Negotiate bandwidth tiers; use multi-CDN for cost and resilience
Keep original: Always keep the original file: codecs improve, and you can re-transcode later
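
The "don't upscale" rule amounts to trimming the encoding ladder at the source resolution. A minimal sketch: the 240p and 4K bitrates come from the transcoding step above, while the intermediate rungs and the function name are assumptions.

```python
from typing import List, Tuple

Rung = Tuple[int, int]  # (height_px, bitrate_kbps)

FULL_LADDER: List[Rung] = [
    (240, 400),
    (360, 800),
    (480, 1_500),
    (720, 3_000),
    (1080, 6_000),
    (2160, 20_000),
]


def ladder_for_source(source_height: int) -> List[Rung]:
    """Keep only rungs at or below the source height (never upscale)."""
    rungs = [r for r in FULL_LADDER if r[0] <= source_height]
    if not rungs:
        # Source is smaller than the lowest rung: encode once at its native height.
        return [(source_height, FULL_LADDER[0][1])]
    return rungs
```

A 1080p upload gets five renditions and skips the 4K encode entirely, which is usually the most expensive rung.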

Live Streaming Architecture

Ingest: RTMP from broadcaster → transcoding server
Real-time transcoding: Must run faster than real time; encoding a 2-second segment should take well under 2 seconds (target < 1 second per segment)
Delivery: HLS/DASH with very short segments (2 seconds) for low latency
Glass-to-glass latency: Target < 5 seconds (use LL-HLS for < 3s)
DVR: Store live segments for rewind/replay

Interview Walkthrough

Clarify VOD vs live upfront — live adds real-time transcoding, short HLS segments, and glass-to-glass latency constraints.
Walk the upload path: ingest → transcode to multiple bitrates/resolutions → package as HLS/DASH segments → store in object storage.
Place a CDN in front of segment delivery; discuss adaptive bitrate switching based on client bandwidth measurements.
For live streams, target <5s latency with 2-second segments and LL-HLS; trade segment size against buffering and CDN cache efficiency.
Separate metadata (title, thumbnails, view counts) in a database from video blobs in S3 — never serve video through the API tier.
Quantify bandwidth with a back-of-the-envelope estimate: 1M concurrent viewers × 5 Mbps = 5 Tbps peak egress, so a CDN is mandatory.
Common pitfall: serving video files directly from origin without CDN — a viral video takes down the entire platform.

HLS vs DASH vs WebRTC: Choosing the Streaming Protocol

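HLS has the widest reach: it plays natively on Apple devices and Safari, and through JavaScript players built on Media Source Extensions (such as hls.js) in other browsers. DASH is the open (MPEG) standard. WebRTC offers < 500 ms latency, but it has no standard DRM integration and is hard to scale to millions of viewers because it cannot rely on plain HTTP caching. Large VOD platforms such as YouTube and Netflix deliver over HTTP with DASH and/or HLS so they can reuse existing CDN infrastructure. CMAF standardizes a common fragmented-MP4 segment format, so one set of segments can be referenced by both HLS and DASH manifests.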

Codec Selection: H.264 vs H.265 vs AV1

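H.264 has universal support. H.265 offers roughly 50% better compression at similar quality but carries expensive patent licensing. AV1 is roughly 30% better than H.265 and royalty-free, but encoding is much slower (figures of 50-100x are quoted; the gap depends heavily on encoder and preset). The approach attributed to Netflix: encode in multiple codecs per resolution and serve the best codec the client supports.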

Pre-Signed Upload URL: Why Upload Directly to S3?

Naive approach (Client → API Server → S3) makes the API server a bottleneck proxying gigabytes of video with double bandwidth cost. Direct upload via pre-signed S3 URL: API server handles only metadata, S3 handles the heavy lifting, resumable via multipart upload.
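
The core idea behind a pre-signed URL is that the server signs the path and an expiry time, and the storage tier verifies the signature without calling back to the API. The sketch below shows that idea with a plain HMAC; it is deliberately not the AWS Signature Version 4 scheme S3 actually uses, and the secret, parameter names, and path are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-signing-key"  # hypothetical; a real key lives in a secrets manager


def _signature(path: str, expires_at: int, secret: bytes) -> str:
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Append an expiry timestamp and an HMAC over (path, expiry)."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at, secret)})
    return f"{path}?{query}"


def verify_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Accept only unexpired URLs whose signature matches the path and expiry."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires_at, secret)
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sig, expected) and now < expires_at
```

Because the path is part of the signed message, a URL issued for one video cannot be edited to upload or fetch another.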

View Count: Why Not Just INCREMENT a Database Counter?

At 1M concurrent viewers, naively incrementing a DB counter would serialize every view on a single hot row and cause lock contention. A common approach at this scale (YouTube's internals are not public): client sends "view" event → buffered in memory → batch flush to Kafka → Flink aggregates per video per minute → Redis INCRBY applies the aggregated delta for an approximate real-time count → exact counts reconciled hourly.

Storage Tiering: The 80/20 Rule for Video

Roughly 20% of videos account for 80% of views (power law). Hot Storage: top 20% most-viewed + all videos uploaded in the last 7 days. Warm Storage: videos with > 10 views/month. Cold Storage (Glacier): videos with < 10 views/month and older than 1 year. Auto-tiering monitors access patterns and moves content between tiers. At the scale of a large streaming service, the savings can be substantial.
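
The tier rules can be written as a small classifier. The thresholds are the ones stated above; the function name is illustrative, and two cases the rules leave open (exactly 10 views/month, and low-traffic videos younger than a year) are assumed to stay warm.

```python
def storage_tier(views_last_month: int, age_days: int, in_top_20_percent: bool) -> str:
    """Classify a video into hot / warm / cold storage."""
    if in_top_20_percent or age_days <= 7:
        return "hot"    # most-viewed, or uploaded within the last 7 days
    if views_last_month < 10 and age_days > 365:
        return "cold"   # rarely viewed AND older than one year
    return "warm"       # everything else (assumption for the undefined cases)
```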

Why MySQL for Video Metadata

Video metadata access patterns (listing by channel sorted by date, complex admin queries with JOINs) call for relational queries, which is where MySQL excels. Video metadata is also small (< 1 KB per video; even 1B videos is about 1 TB). Cassandra has no JOINs, and its counter columns are not idempotent, so retried increments can over-count.

SLOs & Error Budgets

Metric	Target	Rationale
Playback start p99 latency	< 2 sec	Manifest fetch + first segment from edge
Playback availability	99.9%	Core product — ~43 min downtime/month
Transcode completion p95	< 30 min	Upload-to-ready for 10-min 1080p video
CDN cache hit ratio	> 95%	Origin egress cost control
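
The "~43 min downtime/month" figure follows directly from the availability target. A minimal sketch of the error-budget arithmetic, assuming a 30-day month; the function names are illustrative.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime for an availability SLO over a window of `days`."""
    return (1.0 - slo) * days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is blown)."""
    budget = error_budget_minutes(slo, days)
    return (budget - downtime_minutes) / budget


print(f"{error_budget_minutes(0.999):.1f} minutes/month at 99.9%")
```

A single 20-minute playback outage would consume a little under half of the month's budget.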

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
CDN cache miss storm on new viral video	Origin egress 10× baseline; CloudFront origin error rate spikes	Enable origin shield; pre-warm top 50 segments via cache-prefetch API; temporarily extend segment TTL; rate-limit manifest requests per IP
Transcode worker pool exhausted during upload spike	Kafka consumer lag > 100K; upload-to-ready SLA breach alerts	Autoscale GPU workers; deprioritize re-transcode jobs; serve 720p-only for new uploads until backlog clears
Signed URL key compromise	Abnormal bandwidth on premium content; URLs shared on forums	Rotate signing key immediately; shorten TTL to 15 min; enable IP/session binding; audit access logs for pattern

Cost Drivers (Staff lens)

CDN egress: 200M DAU × 1 hr/day × 2 Mbps avg ≈ 900 MB per user per day ≈ 180 PB/day ≈ 5.4 EB/month raw; the CDN reduces origin cost, but egress still dominates
Transcoding: 500K uploads × 10 min × 5 renditions × ~$0.03 per output minute (GPU) ≈ $750K/day at scale
S3 storage: 500K × 10 min × 5 renditions × ~50 MB per rendition-minute ≈ 1.25 PB/day ingested (lifecycle old renditions to Glacier)
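
Recomputing the three cost lines with explicit units is a quick sanity check. The sketch assumes decimal prefixes, a 30-day month, a 10-minute average upload, and reads the ~50 MB figure as per rendition-minute; with those inputs the egress product works out to about 5.4 EB/month.

```python
# Inputs taken from the cost-driver lines and the Assumptions section.
DAU = 200_000_000
WATCH_HOURS_PER_DAY = 1
AVG_MBPS = 2
UPLOADS_PER_DAY = 500_000
MINUTES_PER_UPLOAD = 10
RENDITIONS = 5
USD_PER_OUTPUT_MINUTE = 0.03
MB_PER_RENDITION_MINUTE = 50

# Egress: Mbps -> bytes per second -> bytes per user per day -> EB per month.
bytes_per_user_day = AVG_MBPS * 1_000_000 / 8 * 3600 * WATCH_HOURS_PER_DAY
egress_pb_per_day = DAU * bytes_per_user_day / 1e15
egress_eb_per_month = egress_pb_per_day * 30 / 1000

# Transcoding and storage both scale with output minutes per day.
output_minutes = UPLOADS_PER_DAY * MINUTES_PER_UPLOAD * RENDITIONS
transcode_usd_per_day = output_minutes * USD_PER_OUTPUT_MINUTE
storage_pb_per_day = output_minutes * MB_PER_RENDITION_MINUTE / 1e9  # 1 PB = 1e9 MB

print(f"egress:      {egress_pb_per_day:.0f} PB/day, {egress_eb_per_month:.1f} EB/month")
print(f"transcoding: ${transcode_usd_per_day:,.0f}/day")
print(f"storage:     {storage_pb_per_day:.2f} PB/day")
```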

Multi-Region & DR

Active-active playback via geo-routed CDN (content replicated to regional origins). Upload lands in nearest region; async cross-region replication for popular content. Transcode runs in upload region; remote regions pull on first cache miss via origin shield.