Design a Video Transcoding Pipeline

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a video transcoding pipeline.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is the highest priority: DAG-based job orchestration, GPU worker pools, or codec selection?	Forces scope negotiation: senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

DAG-based job orchestration
GPU worker pools
Codec selection
Chunk-level parallelism
Retry & poison pill handling
Capacity estimation with shown math

Out of scope (state explicitly)

Recommendation / home feed ranking (#48, #65)
Live chat and comments (#36)
DRM license server internals

Assumptions

Clarify scale (DAU, QPS, data volume) for the video transcoding pipeline in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Ingest videos: Accept uploaded video files in any format (MP4, AVI, MOV, MKV, WebM)
Multi-resolution transcoding: Convert to multiple resolutions (240p, 360p, 480p, 720p, 1080p, 4K)
Multi-codec support: Encode in H.264, H.265/HEVC, VP9, AV1
Adaptive bitrate packaging: Package into HLS (.m3u8 + .ts) and DASH (.mpd + .m4s)
Audio processing: Extract, normalize, and transcode audio (AAC, Opus) at multiple bitrates
Subtitle extraction: Auto-generate subtitles via speech-to-text; support uploaded subtitle files
Thumbnail generation: Extract keyframes, generate sprite sheets for seek preview
DRM encryption: Encrypt segments with Widevine/FairPlay/PlayReady
Watermarking: Forensic or visible watermarking for content protection
Progress tracking: Real-time progress reporting for upload → transcode → ready
Priority queues: Premium content (paid creators) gets transcoded faster

Metric	Calculation	Value
Videos uploaded / hour	Given (assumption documented in value)	10,000
Avg original video duration	Given (typical workload assumption)	10 minutes
Avg original file size	Given (typical workload assumption)	1 GB
Upload storage / day	10,000 videos/hour × 24 h × 1 GB	240 TB
Transcoded variants per video	6 resolutions × 2 codecs	12
Expansion factor (transcoded/original)	Given	~3×
Transcoded storage / day	240 TB × 3 (expansion factor)	720 TB
CPU-hours per video (H.264)	Given	~2 CPU-hours
GPU-hours per video (H.265/AV1)	Given	~0.5 GPU-hours
Total compute / day	240K videos × (2 CPU-h + 0.5 GPU-h)	~480K CPU-hours + ~120K GPU-hours
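
The derived rows above can be reproduced with a few lines of arithmetic. The sketch below uses only the table's given values; the two steady-state fleet figures at the end are an added derivation that assumes load is spread evenly over 24 hours, which real upload traffic will not be.

```python
# Inputs copied from the capacity table; everything else is derived.
VIDEOS_PER_HOUR = 10_000
AVG_FILE_SIZE_GB = 1
EXPANSION_FACTOR = 3          # transcoded bytes / original bytes
CPU_HOURS_PER_VIDEO = 2.0     # H.264
GPU_HOURS_PER_VIDEO = 0.5     # H.265/AV1


def daily_capacity() -> dict:
    """Return per-day storage and compute figures for the stated workload."""
    videos_per_day = VIDEOS_PER_HOUR * 24
    upload_tb = videos_per_day * AVG_FILE_SIZE_GB / 1000   # decimal TB
    cpu_hours = videos_per_day * CPU_HOURS_PER_VIDEO
    gpu_hours = videos_per_day * GPU_HOURS_PER_VIDEO
    return {
        "videos_per_day": videos_per_day,
        "upload_tb": upload_tb,
        "transcoded_tb": upload_tb * EXPANSION_FACTOR,
        "cpu_hours": cpu_hours,
        "gpu_hours": gpu_hours,
        # Assumption: perfectly flat load, so a day's work fits in 24 hours.
        "cpu_cores_busy": cpu_hours / 24,
        "gpus_busy": gpu_hours / 24,
    }


if __name__ == "__main__":
    for name, value in daily_capacity().items():
        print(f"{name}: {value:,.0f}")
```

Under the flat-load assumption this works out to about 20,000 busy CPU cores and 5,000 busy GPUs, which is the number to size worker pools against before adding peak headroom.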

The transcoding pipeline accepts uploaded videos, probes them for format info, splits into segments, transcodes in parallel across multiple resolutions/codecs, packages into HLS/DASH, and runs post-processing (thumbnails, subtitles, moderation, DRM).

Segment-Based Parallel Transcoding: The Core Optimization

Why split into segments? Without splitting, a 10-minute video at 6 resolutions is 6 tasks of about 20 minutes each, so wall-clock time is about 20 minutes even when all 6 run in parallel. With splitting into 10-second segments, 60 segments × 6 resolutions = 360 tasks of ~3 seconds each; on 40 workers that is 9 waves of tasks, or roughly 30 seconds of wall-clock time.
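
The comparison above can be checked with a small model. This is a rough sketch that assumes uniform task durations and ignores scheduling, upload, and download overhead; the function names are illustrative.

```python
import math


def segment_count(duration_s: float, segment_s: float) -> int:
    """Number of fixed-length segments needed to cover the video."""
    return math.ceil(duration_s / segment_s)


def wall_clock_seconds(tasks: int, seconds_per_task: float, workers: int) -> float:
    """Ideal wall-clock time when tasks run in waves of `workers` at a time."""
    waves = math.ceil(tasks / workers)
    return waves * seconds_per_task


# Unsplit: 6 resolution tasks of 20 minutes each, one worker per resolution.
unsplit = wall_clock_seconds(tasks=6, seconds_per_task=20 * 60, workers=6)

# Split: a 10-minute video in 10-second segments, 6 resolutions, 40 workers.
segments = segment_count(duration_s=600, segment_s=10)
split = wall_clock_seconds(tasks=segments * 6, seconds_per_task=3, workers=40)

if __name__ == "__main__":
    print(f"unsplit: {unsplit:.0f} s, split: {split:.0f} s")
```

The model gives 1,200 seconds unsplit against 27 seconds split, matching the ~30 second figure quoted above.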

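GOP-aligned splitting: Splits must fall on GOP boundaries (I-frame positions). FFmpeg: ffmpeg -i input.mp4 -c copy -f segment -segment_time 10 -reset_timestamps 1 segment_%03d.mp4. The -c copy flag avoids re-encoding during the split (fast, lossless). Because a stream copy can only cut at keyframes, segments are approximately 10 seconds long and their exact length depends on the source's keyframe spacing.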

Reassembly: use the FFmpeg concat demuxer when a single output file is needed. For HLS, the transcoded segments are the final output, so no reassembly is needed; just generate the .m3u8 playlist.
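
Generating the playlist is mostly string formatting. The sketch below writes a minimal VOD media playlist for one rendition; the function name and the choice of HLS version 3 are assumptions, and a production packager would also emit a master playlist listing every rendition.

```python
import math


def build_media_playlist(segment_uris: list[str], durations_s: list[float]) -> str:
    """Build a minimal HLS VOD media playlist for one rendition."""
    if len(segment_uris) != len(durations_s):
        raise ValueError("each segment needs exactly one duration")
    # The target duration must be an integer at least as long as any segment.
    target = math.ceil(max(durations_s))
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for uri, duration in zip(segment_uris, durations_s):
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(uri)
    lines.append("#EXT-X-ENDLIST")   # marks the playlist as complete
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_media_playlist(["segment_000.ts", "segment_001.ts"], [10.0, 8.5]))
```

Durations should come from probing the transcoded segments rather than from the nominal 10 seconds, since keyframe-aligned cuts rarely land on exact boundaries.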

FFmpeg Command Breakdown

H.264 (CPU)

BASH

ffmpeg -i segment_005.mp4 -vf "scale=1280:720" -c:v libx264 -preset medium -crf 23 -profile:v high -level 4.0 -maxrate 3M -bufsize 6M -g 48 -sc_threshold 0 -an segment_005_720p.ts

H.265 (GPU NVENC)

BASH

ffmpeg -i segment_005.mp4 -vf "scale=1280:720" -c:v hevc_nvenc -preset p5 -rc:v vbr -cq 28 -maxrate 1.5M -bufsize 3M -g 48 -tag:v hvc1 -an segment_005_720p_h265.ts

Pipeline Orchestrator: Temporal/Step Functions

Kafka consumers alone can't express complex DAG dependencies. Temporal ⭐ provides DAG definition, per-step retries with exponential backoff, workflow state persistence, visibility dashboard, timeouts, and versioning. Each step has a retry policy with configurable initial interval, maximum interval, maximum attempts, and non-retryable errors.

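import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Options shared by every step; the timeout and attempt count are illustrative.
OPTS = dict(start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=3))


def step(name, *args):
    # Activities are addressed by name; workers register the implementations.
    return workflow.execute_activity(name, args=list(args), **OPTS)


@workflow.defn
class TranscodePipeline:
    @workflow.run
    async def run(self, video_id: str, s3_key: str) -> None:
        probe_result = await step("probe_video", s3_key)
        segments = await step("split_video", s3_key, probe_result)
        resolutions = await step("determine_resolutions", probe_result)
        # Fan out: one transcode task per (resolution, segment) pair.
        transcoded = await asyncio.gather(*[
            step("transcode_segment", segment, res, "h264")
            for res in resolutions
            for segment in segments
        ])
        manifest = await step("package_hls_dash", transcoded, video_id)
        # Post-processing steps are independent, so run them concurrently.
        await asyncio.gather(
            step("generate_thumbnails", s3_key, probe_result),
            step("extract_audio", s3_key),
            step("content_moderation", s3_key),
            step("generate_subtitles", s3_key),
        )
        await step("encrypt_drm", manifest, video_id)
        await step("update_video_status", video_id, "ready")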
Event Bus Design (Kafka)

Topic: video_transcoding_pipeline-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: video_id (preserves per-video ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "video_transcoding_pipeline-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: video_transcoding_pipeline-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
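
At-least-once delivery means a consumer can see the same event twice, so the handler has to deduplicate. The sketch below is an in-memory model of that consumer loop with no Kafka client involved; the class name is illustrative, the seen-set would be a durable store in production, and "3 retries" is interpreted here as three total attempts.

```python
MAX_ATTEMPTS = 3  # assumption: "3 retries" read as three total attempts


class IdempotentConsumer:
    """Deduplicates by event_id and dead-letters events that keep failing."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()   # processed event_ids (durable store in production)
        self.dlq = []       # stand-in for the -dlq topic

    def consume(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"          # redelivery: skip side effects
        for _ in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue                # transient or poison: try again
            self.seen.add(event_id)     # record only after success
            return "processed"
        self.dlq.append(event)          # poison message: park it for ops
        return "dead-lettered"


if __name__ == "__main__":
    consumer = IdempotentConsumer(lambda e: print("handled", e["event_id"]))
    print(consumer.consume({"event_id": "evt-1"}))
    print(consumer.consume({"event_id": "evt-1"}))
```

Recording the event_id only after the handler succeeds keeps a crash mid-handler from silently dropping the event; the cost is that the handler itself must tolerate being re-run.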

Initiate Upload

HTTP

POST /api/v1/videos/upload
{
  "filename": "vacation.mp4",
  "content_type": "video/mp4",
  "file_size_bytes": 1073741824,
  "title": "Summer Vacation 2025"
}
Response: 200 OK
{
  "video_id": "vid-uuid",
  "upload_url": "https://s3.amazonaws.com/originals/vid-uuid/upload?X-Amz-...",
  "upload_id": "multipart-upload-id",
  "max_chunk_size": 104857600
}
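
On the client side, the response's max_chunk_size determines how the file is cut into parts. The sketch below plans the byte ranges for the example request; the function name is illustrative, and the 1-based part numbers follow the S3 multipart convention.

```python
import math


def plan_parts(file_size_bytes: int, max_chunk_size: int) -> list[tuple[int, int, int]]:
    """Return (part_number, first_byte, last_byte) for each upload part."""
    count = math.ceil(file_size_bytes / max_chunk_size)
    parts = []
    for i in range(count):
        start = i * max_chunk_size
        end = min(start + max_chunk_size, file_size_bytes) - 1  # inclusive
        parts.append((i + 1, start, end))
    return parts


if __name__ == "__main__":
    # Values from the example request and response above.
    parts = plan_parts(file_size_bytes=1_073_741_824, max_chunk_size=104_857_600)
    print(len(parts), "parts; last part:", parts[-1])
```

For the 1 GiB example this yields 11 parts: ten full 100 MiB parts and a final 24 MiB part. Parts can be uploaded in parallel and retried individually, which is the main reason to prefer multipart over a single PUT for files this size.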

Check Transcoding Status

HTTP

GET /api/v1/videos/{video_id}/status
Response: 200 OK
{
  "video_id": "vid-uuid",
  "status": "transcoding",
  "pipeline_progress": {
    "probe": "completed",
    "split": "completed",
    "transcode": { "completed": 45, "total": 60, "percent": 75 },
    "package": "pending",
    "thumbnails": "completed",
    "moderation": "pending",
    "drm": "pending"
  },
  "estimated_completion": "2025-03-14T11:15:00Z",
  "started_at": "2025-03-14T10:45:00Z"
}

Retry Failed Pipeline

HTTP

POST /api/v1/videos/{video_id}/retry
{ "from_step": "transcode" }
Response: 200 OK
{ "status": "retrying", "retry_from": "transcode" }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Server Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
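202 Accepted (success, not an error): job queued; poll GET /api/v1/videos/{video_id}/status for progress
408 Request Timeout: the server timed out waiting for the request; retry. A job that is still processing is reported by the status endpoint, so continue polling there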

MySQL: Video and Job Metadata

SQL

CREATE TABLE videos (
    video_id        VARCHAR(36) PRIMARY KEY,
    creator_id      VARCHAR(36) NOT NULL,
    title           VARCHAR(255),
    original_s3_key TEXT NOT NULL,
    original_format VARCHAR(10),
    duration_sec    INT,
    original_width  INT,
    original_height INT,
    original_codec  VARCHAR(20),
    original_bitrate_kbps INT,
    file_size_bytes BIGINT,
    status          ENUM('uploaded','probing','transcoding','packaging',
                         'moderating','ready','failed','removed') DEFAULT 'uploaded',
    preset          VARCHAR(20) DEFAULT 'standard',
    priority        ENUM('low','normal','high','urgent') DEFAULT 'normal',
    manifest_url    TEXT,
    thumbnail_url   TEXT,
    error_message   TEXT,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at    TIMESTAMP,
    INDEX idx_status (status),
    INDEX idx_creator (creator_id, created_at DESC)
);

CREATE TABLE transcoding_tasks (
    task_id         VARCHAR(36) PRIMARY KEY,
    video_id        VARCHAR(36) NOT NULL,
    task_type       ENUM('probe','split','transcode','package','thumbnail',
                         'audio','moderation','drm','subtitle'),
    resolution      VARCHAR(10),
    codec           VARCHAR(10),
    segment_index   INT,
    status          ENUM('pending','running','completed','failed','retrying'),
    worker_id       VARCHAR(36),
    s3_input_key    TEXT,
    s3_output_key   TEXT,
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    duration_ms     INT,
    error_message   TEXT,
    retry_count     INT DEFAULT 0,
    INDEX idx_video (video_id, task_type),
    INDEX idx_status (status),
    INDEX idx_worker (worker_id)
);

S3: Storage Structure

Bucket: video-originals (NEVER deleted, cross-region replicated)
  /{video_id}/original.mp4

Bucket: video-segments (temporary, auto-delete after 7 days)
  /{video_id}/segments/segment_000.mp4

Bucket: video-transcoded (long-term, lifecycle policies)
  /{video_id}/h264/720p/segment_000.ts
  /{video_id}/h264/720p/playlist.m3u8
  /{video_id}/h265/720p/segment_000.ts
  /{video_id}/manifest.m3u8
  /{video_id}/manifest.mpd
  /{video_id}/thumbnails/thumb_001.jpg
  /{video_id}/subtitles/en.vtt
  /{video_id}/audio/aac_128k.m4a

Redis: Pipeline State + Queue Management

pipeline:{video_id}   → Hash { status, current_step, transcode_completed, transcode_total, started_at, estimated_completion }
task_queue:transcode:cpu  → Sorted Set { task_id: priority_score }
task_queue:transcode:gpu  → Sorted Set { task_id: priority_score }
worker:{worker_id}  → Hash { status, current_task, last_heartbeat }
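
The priority_score in the sorted sets has to encode both priority and arrival order. One possible scheme, sketched below with an in-memory heap standing in for the Redis sorted set, puts the priority rank in the high digits and the enqueue time in the low digits so that the lowest score is always the next task. The rank values and the weight are assumptions, not part of the design above.

```python
import heapq

# Lower rank is served first; names match the priority ENUM in the videos table.
PRIORITY_RANK = {"urgent": 0, "high": 1, "normal": 2, "low": 3}
RANK_WEIGHT = 10**10  # larger than any epoch-seconds timestamp


def priority_score(priority: str, enqueued_at_epoch_s: int) -> int:
    """Rank dominates; within a rank, earlier tasks get lower scores (FIFO)."""
    return PRIORITY_RANK[priority] * RANK_WEIGHT + enqueued_at_epoch_s


class TaskQueue:
    """In-memory stand-in for a sorted set that is popped lowest-score first."""

    def __init__(self):
        self._heap = []

    def push(self, task_id: str, priority: str, enqueued_at_epoch_s: int) -> None:
        score = priority_score(priority, enqueued_at_epoch_s)
        heapq.heappush(self._heap, (score, task_id))

    def pop(self):
        return heapq.heappop(self._heap)[1] if self._heap else None


if __name__ == "__main__":
    q = TaskQueue()
    q.push("t1", "normal", 1_700_000_100)
    q.push("t2", "urgent", 1_700_000_200)
    q.push("t3", "normal", 1_700_000_050)
    print([q.pop(), q.pop(), q.pop()])
```

A strict-priority scheme like this can starve low-priority work under sustained load, so it is worth pairing with an aging rule or a reserved share of workers per tier.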

Concern	Solution
Original file lost	S3 cross-region replication; 11 nines durability; versioning enabled
Transcode worker crash	Temporal detects heartbeat timeout → re-schedule on another worker
Segment transcode failure	Retry 3 times with backoff; after 3 failures → DLQ, alert ops
S3 upload failure	Retry with exponential backoff; S3 multipart for large segments
Worker pool exhaustion	Auto-scaling based on queue depth; alert if depth > 1000 for > 10 min
Corrupt input video	Probe step detects invalid file → fail fast, notify creator
Spot instance preemption	Task checkpointing; preempted task re-queued automatically

Handle Spot Instance Preemption

GPU instances are expensive, and spot instances typically cost 60-80% less. Strategy: run transcoding workers on spot capacity; tasks are segment-based and take 3-10 s each, so most finish before a preemption lands. If a worker is preempted, it marks its task as "interrupted" and Temporal re-schedules the task on another worker. Critical-path steps (probe, package, publish) run on on-demand instances.
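
Worker crashes and preemptions that arrive without warning both show up the same way: heartbeats stop. The sketch below models heartbeat-timeout detection over records shaped like the worker:{worker_id} hash; it illustrates the idea rather than how Temporal implements it, and the 30-second timeout and the status values are assumptions.

```python
def find_stale_workers(workers: dict, now_s: float, timeout_s: float = 30.0) -> list[str]:
    """IDs of busy workers whose last heartbeat is older than the timeout."""
    return sorted(
        worker_id
        for worker_id, w in workers.items()
        if w["status"] == "busy" and now_s - w["last_heartbeat"] > timeout_s
    )


def requeue_tasks(workers: dict, stale_ids: list[str]) -> list[str]:
    """Mark stale workers dead and return their tasks for re-scheduling."""
    tasks = []
    for worker_id in stale_ids:
        w = workers[worker_id]
        if w["current_task"] is not None:
            tasks.append(w["current_task"])
        w["status"], w["current_task"] = "dead", None
    return tasks


if __name__ == "__main__":
    fleet = {
        "w1": {"status": "busy", "current_task": "t1", "last_heartbeat": 100.0},
        "w2": {"status": "busy", "current_task": "t2", "last_heartbeat": 125.0},
    }
    stale = find_stale_workers(fleet, now_s=140.0)
    print(stale, requeue_tasks(fleet, stale))
```

Because each task covers only a few seconds of video, re-running it from scratch is cheap, and the timeout can be short without much wasted work.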

Quality Verification After Transcoding

Automated checks: duration match (±0.5s), frame count match, VMAF score (> 80 for 720p, > 85 for 1080p), audio-video sync (< 50ms drift), black frame/freeze detection, and bitrate compliance (±20% of target). These checks add ~5 seconds per video and catch defects in roughly 0.1% of outputs.
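
Most of these checks reduce to threshold comparisons once the measurements exist. The sketch below applies the listed thresholds to already-measured values; the dictionary field names are hypothetical, the measurements themselves (VMAF, drift) would come from separate tools, and black-frame and freeze detection are left out because they need frame analysis.

```python
# VMAF floors from the text; other resolutions have no stated floor.
VMAF_MIN = {"720p": 80.0, "1080p": 85.0}


def verify_output(source: dict, output: dict, resolution: str,
                  target_kbps: float) -> list[str]:
    """Return the names of failed checks; an empty list means the output passes."""
    failures = []
    if abs(output["duration_s"] - source["duration_s"]) > 0.5:
        failures.append("duration")
    if output["frame_count"] != source["frame_count"]:
        failures.append("frame_count")
    floor = VMAF_MIN.get(resolution)
    if floor is not None and output["vmaf"] <= floor:
        failures.append("vmaf")
    if abs(output["av_drift_ms"]) >= 50:
        failures.append("av_sync")
    if abs(output["bitrate_kbps"] - target_kbps) > 0.20 * target_kbps:
        failures.append("bitrate")
    return failures


if __name__ == "__main__":
    src = {"duration_s": 600.0, "frame_count": 18000}
    out = {"duration_s": 600.2, "frame_count": 18000, "vmaf": 88.0,
           "av_drift_ms": 12, "bitrate_kbps": 2900}
    print(verify_output(src, out, "720p", target_kbps=3000))
```

Returning the list of failed checks, rather than a single boolean, lets the pipeline route each failure differently: a bitrate miss can trigger a re-encode, while a duration mismatch points at the split step.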

CRF vs CBR vs VBR: Bitrate Control Strategies

Strategy	Description	Best For
CBR (Constant Bitrate)	Every second uses same bitrate. Predictable but wastes bits on simple scenes.	Live streaming (consistent bandwidth)
VBR (Variable Bitrate)	Bitrate varies by scene complexity. Better perceptual quality.	VOD (pre-recorded)
CRF ⭐ (Constant Rate Factor)	Target constant QUALITY, let bitrate vary. Best quality for given size.	VOD transcoding (YouTube, Netflix). Use CRF + maxrate for best of both worlds.

Netflix's per-title encoding: Encode test segment at multiple CRFs → measure VMAF → pick optimal CRF per video. Animated content (CRF 28) vs action movie (CRF 20). Result: 20-40% bitrate savings over fixed-CRF.
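
The "encode, measure, pick" loop can be sketched as a selection over trial results. This is a simplification that picks a single CRF against a quality target, not the full per-title analysis; the target VMAF and the sample scores below are made-up illustrative values, not measurements.

```python
def pick_crf(vmaf_by_crf: dict[int, float], target_vmaf: float) -> int:
    """Pick the highest CRF (smallest file) whose VMAF still meets the target.

    Falls back to the lowest CRF tried (highest quality) if none qualifies.
    """
    acceptable = [crf for crf, vmaf in vmaf_by_crf.items() if vmaf >= target_vmaf]
    return max(acceptable) if acceptable else min(vmaf_by_crf)


if __name__ == "__main__":
    # Hypothetical test-encode results for two kinds of content.
    animation = {20: 98.0, 24: 96.5, 28: 93.2, 32: 88.0}
    action = {20: 93.5, 24: 89.0, 28: 82.1, 32: 74.0}
    print(pick_crf(animation, target_vmaf=93.0), pick_crf(action, target_vmaf=93.0))
```

With these sample numbers the animated clip tolerates CRF 28 while the action clip needs CRF 20, which is the pattern the paragraph above describes: simple content hits the quality target at a much lower bitrate.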

Per-Title vs Per-Shot Encoding

Per-Title: For each video, run convex hull analysis across CRF values and resolutions. Select optimal CRF per resolution maximizing VMAF/bitrate ratio. Up to 40% bandwidth savings.

Per-Shot (state of the art): Split video into shots (scene changes). Each shot gets its own encoding parameters. Dialogue (static): CRF 28, car chase (motion): CRF 20, credits: CRF 30. Additional 10-20% savings over per-title.

Cost Optimization: When to Encode Which Codec

Tier 1 (all videos): H.264 at 480p, 720p, 1080p. Cost: ~$0.02/video.

Tier 2 (> 100 views in first hour): Add H.265. Cost: ~$0.08/video (GPU).

Tier 3 (> 10K views): Add AV1. Cost: ~$0.50/video but saves 30% more bandwidth than H.265. At 10K views, ROI: 330×.

Implementation: upload → immediate H.264; hourly cron checks view counts → queue H.265; daily cron queues AV1 for very popular videos.
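
The tier rules translate directly into the check the cron jobs would run. The sketch below applies the thresholds literally; the function and parameter names are illustrative, and whether Tier 3 should also force H.265 for a slow-starting video is left open by the rules above.

```python
def codecs_to_encode(views_first_hour: int, total_views: int) -> list[str]:
    """Codecs a video should have, given its view counts and the tier rules."""
    codecs = ["h264"]                 # Tier 1: every video
    if views_first_hour > 100:
        codecs.append("h265")         # Tier 2: early traction
    if total_views > 10_000:
        codecs.append("av1")          # Tier 3: proven popularity
    return codecs


def missing_codecs(already_encoded: set[str], views_first_hour: int,
                   total_views: int) -> list[str]:
    """What a cron run still needs to queue for this video."""
    wanted = codecs_to_encode(views_first_hour, total_views)
    return [c for c in wanted if c not in already_encoded]


if __name__ == "__main__":
    print(missing_codecs({"h264"}, views_first_hour=250, total_views=12_000))
```

Comparing against what is already encoded keeps the cron jobs idempotent: re-running a job queues nothing for a video that already has every codec it qualifies for.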

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
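
An availability target only becomes actionable once it is converted into an error budget. The sketch below does that conversion and computes a burn rate; the 30-day window is an assumption, since the table does not state one.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)


if __name__ == "__main__":
    print(f"99.95% over 30 days: {error_budget_minutes(0.9995):.1f} min")
    print(f"burn rate at 0.1% errors: {burn_rate(0.001, 0.9995):.1f}x")
```

At 99.95% the budget is about 21.6 minutes per 30 days, and a sustained 0.1% error rate spends it at twice the sustainable pace, so a single slow rollback can consume most of a month's budget.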

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
