This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design a Video Transcoding Pipeline.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: DAG-based job orchestration, GPU worker pools, Codec selection? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- DAG-based job orchestration
- GPU worker pools
- Codec selection
- Chunk-level parallelism
- Retry & poison pill handling
- Capacity estimation with shown math
Out of scope (state explicitly)
- Recommendation / home feed ranking (#48, #65)
- Live chat and comments (#36)
- DRM license server internals
Assumptions
- Clarify scale (DAU, QPS, data volume) for video transcoding pipeline in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Ingest videos: Accept uploaded video files in any format (MP4, AVI, MOV, MKV, WebM)
- Multi-resolution transcoding: Convert to multiple resolutions (240p, 360p, 480p, 720p, 1080p, 4K)
- Multi-codec support: Encode in H.264, H.265/HEVC, VP9, AV1
- Adaptive bitrate packaging: Package into HLS (.m3u8 + .ts) and DASH (.mpd + .m4s)
- Audio processing: Extract, normalize, and transcode audio (AAC, Opus) at multiple bitrates
- Subtitle extraction: Auto-generate subtitles via speech-to-text; support uploaded subtitle files
- Thumbnail generation: Extract keyframes, generate sprite sheets for seek preview
- DRM encryption: Encrypt segments with Widevine/FairPlay/PlayReady
- Watermarking: Forensic or visible watermarking for content protection
- Progress tracking: Real-time progress reporting for upload → transcode → ready
- Priority queues: Premium content (paid creators) gets transcoded faster
Non-Functional Requirements
- Throughput: Process 10,000+ videos/hour
- Latency: Standard video ready within 30 minutes; short videos (< 5 min) within 5 minutes
- Durability: Original uploaded file NEVER lost; transcoded outputs can be regenerated
- Fault Tolerance: Any step failure → retry that step, not the entire pipeline
- Cost Efficient: GPU for H.265/AV1; CPU for H.264; spot instances for non-urgent work
- Scalability: Auto-scale based on queue depth; handle viral upload spikes
- Quality: Output perceptually close to the source at each rendition (transcoding cannot improve on the input)
- Idempotent: Re-running a failed job produces the same output
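One way to approach the idempotency requirement is to derive every output location purely from the task's identity, so a retried task overwrites the same object instead of creating a new one. The sketch below assumes the key layout shown for the video-transcoded bucket in this sheet; the function name `output_key` is hypothetical.

```python
def output_key(video_id: str, codec: str, resolution: str, segment_index: int) -> str:
    """Build a deterministic object key for one transcoded segment.

    The key depends only on the task's identity, so re-running the same
    task writes to the same location rather than producing a duplicate.
    """
    return f"{video_id}/{codec}/{resolution}/segment_{segment_index:03d}.ts"


# Two attempts of the same task resolve to the same key.
first_attempt = output_key("vid-uuid", "h264", "720p", 0)
retry_attempt = output_key("vid-uuid", "h264", "720p", 0)
```

Deterministic keys only cover where the output lands; byte-identical output additionally depends on the encoder settings being fixed per task.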
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Videos uploaded / hour | Given (assumption) | 10,000 |
| Videos uploaded / day | 10,000 per hour × 24 hours | 240,000 |
| Avg original video duration | Given (typical workload assumption) | 10 minutes |
| Avg original file size | Given (typical workload assumption) | 1 GB |
| Upload storage / day | 240,000 videos × 1 GB | 240 TB |
| Transcoded variants per video | 6 resolutions × 2 codecs | 12 |
| Expansion factor (transcoded/original) | Given | ~3× |
| Transcoded storage / day | 240 TB × 3 | 720 TB |
| CPU-hours per video (H.264) | Given | ~2 CPU-hours |
| GPU-hours per video (H.265/AV1) | Given | ~0.5 GPU-hours |
| Total compute / day | 240K videos × (2 CPU-h + 0.5 GPU-h) | ~480K CPU-hours + ~120K GPU-hours |
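The estimates in the table can be reproduced with a few lines of arithmetic. This is a sketch using the table's stated assumptions; the steady-state fleet sizes at the end are a derived illustration (daily work divided by 24 hours) and ignore peaks and utilization headroom.

```python
VIDEOS_PER_HOUR = 10_000
AVG_FILE_GB = 1
EXPANSION_FACTOR = 3
CPU_HOURS_PER_VIDEO = 2.0
GPU_HOURS_PER_VIDEO = 0.5

videos_per_day = VIDEOS_PER_HOUR * 24                         # 240,000 videos
upload_tb_per_day = videos_per_day * AVG_FILE_GB / 1000       # decimal TB
transcoded_tb_per_day = upload_tb_per_day * EXPANSION_FACTOR
cpu_hours_per_day = videos_per_day * CPU_HOURS_PER_VIDEO
gpu_hours_per_day = videos_per_day * GPU_HOURS_PER_VIDEO

# Derived: average number of cores / GPUs kept busy around the clock.
avg_busy_cpu_cores = cpu_hours_per_day / 24
avg_busy_gpus = gpu_hours_per_day / 24
```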
The transcoding pipeline accepts uploaded videos, probes them for format info, splits into segments, transcodes in parallel across multiple resolutions/codecs, packages into HLS/DASH, and runs post-processing (thumbnails, subtitles, moderation, DRM).
Segment-Based Parallel Transcoding: The Core Optimization
Why split into segments? Without splitting: 1 video × 6 resolutions = 6 tasks at 20 minutes each = 20 min wall-clock, even with one worker per resolution. With splitting (10-second segments): 60 segments × 6 resolutions = 360 tasks at ~3 seconds each ≈ 1,080 seconds of work, which 40 workers finish in ~30 seconds wall-clock.
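The wall-clock figure for the split case follows from running equal-length tasks in waves across the worker pool. A minimal sketch of that estimate, assuming every task takes the same time and ignoring scheduling overhead; `wall_clock_seconds` is a hypothetical helper name.

```python
import math


def wall_clock_seconds(num_tasks: int, task_seconds: float, workers: int) -> float:
    """Estimate wall-clock time when equal-length tasks run in waves."""
    waves = math.ceil(num_tasks / workers)
    return waves * task_seconds


# Split case from the text: 60 segments x 6 resolutions, ~3 s each, 40 workers.
split_estimate = wall_clock_seconds(60 * 6, 3, 40)

# Unsplit case: 6 whole-video tasks of 20 minutes, one worker per resolution.
unsplit_estimate = wall_clock_seconds(6, 20 * 60, 6)
```

With these inputs the split pipeline needs 9 waves of 3 seconds (27 s, in line with the ~30 s quoted), while the unsplit pipeline stays at 1,200 s no matter how many extra workers are added.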
GOP-aligned splitting: Must split at GOP boundaries (I-frame positions). FFmpeg:
ffmpeg -i input.mp4 -c copy -f segment -segment_time 10 -reset_timestamps 1 segment_%03d.mp4
The -c copy flag ensures no re-encoding during split (fast, lossless). Because stream copy can only cut at keyframes, segment lengths are approximately, not exactly, 10 seconds.
Reassembly: FFmpeg concat demuxer. For HLS, segments ARE the final output: no reassembly needed! Just generate the .m3u8 playlist.
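Since the HLS segments are already the final output, the remaining step is writing the media playlist. The sketch below builds a VOD playlist as a string; it assumes every segment has the same duration (a real pipeline would use the probed duration of each segment), and `build_media_playlist` is a hypothetical helper name.

```python
import math


def build_media_playlist(segment_names: list[str], segment_seconds: float) -> str:
    """Return the text of a VOD HLS media playlist listing the given segments."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        # Target duration must be an integer >= the longest segment duration.
        f"#EXT-X-TARGETDURATION:{math.ceil(segment_seconds)}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for name in segment_names:
        lines.append(f"#EXTINF:{segment_seconds:.3f},")
        lines.append(name)
    lines.append("#EXT-X-ENDLIST")  # marks the playlist as complete
    return "\n".join(lines) + "\n"


playlist = build_media_playlist(
    ["segment_000.ts", "segment_001.ts", "segment_002.ts"], 10.0
)
```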
FFmpeg Command Breakdown
H.264 (CPU)
ffmpeg -i segment_005.mp4 -vf "scale=1280:720" -c:v libx264 -preset medium -crf 23 -profile:v high -level 4.0 -maxrate 3M -bufsize 6M -g 48 -sc_threshold 0 -an segment_005_720p.ts
H.265 (GPU NVENC)
ffmpeg -i segment_005.mp4 -vf "scale=1280:720" -c:v hevc_nvenc -preset p5 -rc:v vbr -cq 28 -maxrate 1.5M -bufsize 3M -g 48 -tag:v hvc1 segment_005_720p_h265.ts
Pipeline Orchestrator: Temporal/Step Functions
Kafka consumers alone can't express complex DAG dependencies. Temporal ⭐ provides DAG definition, per-step retries with exponential backoff, workflow state persistence, visibility dashboard, timeouts, and versioning. Each step has a retry policy with configurable initial interval, maximum interval, maximum attempts, and non-retryable errors.
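The retry policy fields named above can be illustrated without any orchestrator. This is a plain-Python sketch of how such a policy behaves, not Temporal's API; the backoff coefficient of 2.0 and the error type name in the usage line are assumptions added for the example.

```python
def backoff_intervals(initial_s: float, max_interval_s: float,
                      max_attempts: int, coefficient: float = 2.0) -> list[float]:
    """Waits before each retry; the first attempt runs immediately."""
    return [
        min(initial_s * coefficient ** i, max_interval_s)
        for i in range(max_attempts - 1)
    ]


def should_retry(error_type: str, attempt: int, max_attempts: int,
                 non_retryable: set[str]) -> bool:
    """Retry only while attempts remain and the error is not marked non-retryable."""
    return attempt < max_attempts and error_type not in non_retryable


schedule = backoff_intervals(initial_s=1.0, max_interval_s=10.0, max_attempts=6)
retry_corrupt = should_retry("CorruptInputError", 1, 6, {"CorruptInputError"})
```

Here the waits grow 1, 2, 4, 8 seconds and are then capped at the 10-second maximum, while a corrupt-input error is failed immediately because retrying cannot fix the file.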
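import asyncio

# Sketch only: probe_video, split_video, transcode_segment and the other steps
# are assumed to be async activity functions defined elsewhere. In Temporal,
# each call would be an activity invocation with its own retry policy.
async def transcode_pipeline(video_id, s3_key):
    probe_result = await probe_video(s3_key)
    segments = await split_video(s3_key, probe_result)
    resolutions = determine_resolutions(probe_result)

    # Fan out: one transcode task per (resolution, segment) pair.
    transcode_tasks = [
        transcode_segment(segment, res, 'h264')
        for res in resolutions
        for segment in segments
    ]
    transcoded = await asyncio.gather(*transcode_tasks)

    manifest = await package_hls_dash(transcoded, video_id)

    # Post-processing steps are independent of each other, so run them concurrently.
    await asyncio.gather(
        generate_thumbnails(s3_key, probe_result),
        extract_audio(s3_key),
        content_moderation(s3_key),
        generate_subtitles(s3_key),
    )
    await encrypt_drm(manifest, video_id)
    await update_video_status(video_id, 'ready')
Event Bus Design (Kafka)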
Topic: video_transcoding_pipeline-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (here video_id; preserves per-video ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "video_transcoding_pipeline-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: video_transcoding_pipeline-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
For the video transcoding pipeline, async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
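At-least-once delivery means a consumer can see the same event twice, so the handler side has to deduplicate and give up on poison messages. The sketch below shows that logic with in-memory stand-ins: the `EventProcessor` class is hypothetical, the seen-ID set would be a durable store in practice, and the list named `dlq` stands in for the DLQ topic.

```python
MAX_RETRIES = 3  # retries after the first attempt, as in the DLQ rule above


class EventProcessor:
    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()  # assumption: durable store in production
        self.dlq = []                # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"       # redelivery: skip side effects
        for _ in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue             # a real consumer would back off here
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dlq.append(event)       # poison message: park it for inspection
        return "dead-lettered"


handled = []
processor = EventProcessor(lambda e: handled.append(e["event_id"]))
first = processor.process({"event_id": "evt-1"})
second = processor.process({"event_id": "evt-1"})
```

Recording the event ID only after the handler succeeds keeps a crashed attempt retryable, at the cost of requiring the handler itself to tolerate a partial earlier run.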
Initiate Upload
POST /api/v1/videos/upload
{
"filename": "vacation.mp4",
"content_type": "video/mp4",
"file_size_bytes": 1073741824,
"title": "Summer Vacation 2025"
}
Response: 200 OK
{
"video_id": "vid-uuid",
"upload_url": "https://s3.amazonaws.com/originals/vid-uuid/upload?X-Amz-...",
"upload_id": "multipart-upload-id",
"max_chunk_size": 104857600
}
Check Transcoding Status
GET /api/v1/videos/{video_id}/status
Response: 200 OK
{
"video_id": "vid-uuid",
"status": "transcoding",
"pipeline_progress": {
"probe": "completed",
"split": "completed",
"transcode": { "completed": 45, "total": 60, "percent": 75 },
"package": "pending",
"thumbnails": "completed",
"moderation": "pending",
"drm": "pending"
},
"estimated_completion": "2025-03-14T11:15:00Z",
"started_at": "2025-03-14T10:45:00Z"
}
Retry Failed Pipeline
POST /api/v1/videos/{video_id}/retry
{ "from_step": "transcode" }
Response: 200 OK
{ "status": "retrying", "retry_from": "transcode" }
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
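202 Accepted: job queued; poll GET /api/v1/videos/{video_id}/status for status
408 Request Timeout: server timed out waiting for the request; resend it. A job that is still processing is not an error, so continue polling the status endpoint
MySQL: Video and Job Metadata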
CREATE TABLE videos (
video_id VARCHAR(36) PRIMARY KEY,
creator_id VARCHAR(36) NOT NULL,
title VARCHAR(255),
original_s3_key TEXT NOT NULL,
original_format VARCHAR(10),
duration_sec INT,
original_width INT,
original_height INT,
original_codec VARCHAR(20),
original_bitrate_kbps INT,
file_size_bytes BIGINT,
status ENUM('uploaded','probing','transcoding','packaging',
'moderating','ready','failed','removed') DEFAULT 'uploaded',
preset VARCHAR(20) DEFAULT 'standard',
priority ENUM('low','normal','high','urgent') DEFAULT 'normal',
manifest_url TEXT,
thumbnail_url TEXT,
error_message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP NULL,
INDEX idx_status (status),
INDEX idx_creator (creator_id, created_at DESC)
);
CREATE TABLE transcoding_tasks (
task_id VARCHAR(36) PRIMARY KEY,
video_id VARCHAR(36) NOT NULL,
task_type ENUM('probe','split','transcode','package','thumbnail',
'audio','moderation','drm','subtitle'),
resolution VARCHAR(10),
codec VARCHAR(10),
segment_index INT,
status ENUM('pending','running','completed','failed','retrying'),
worker_id VARCHAR(36),
s3_input_key TEXT,
s3_output_key TEXT,
started_at TIMESTAMP NULL,
completed_at TIMESTAMP NULL,
duration_ms INT,
error_message TEXT,
retry_count INT DEFAULT 0,
INDEX idx_video (video_id, task_type),
INDEX idx_status (status),
INDEX idx_worker (worker_id)
);
S3: Storage Structure
Bucket: video-originals (NEVER deleted, cross-region replicated)
/{video_id}/original.mp4
Bucket: video-segments (temporary, auto-delete after 7 days)
/{video_id}/segments/segment_000.mp4
Bucket: video-transcoded (long-term, lifecycle policies)
/{video_id}/h264/720p/segment_000.ts
/{video_id}/h264/720p/playlist.m3u8
/{video_id}/h265/720p/segment_000.ts
/{video_id}/manifest.m3u8
/{video_id}/manifest.mpd
/{video_id}/thumbnails/thumb_001.jpg
/{video_id}/subtitles/en.vtt
/{video_id}/audio/aac_128k.m4a
Redis: Pipeline State + Queue Management
pipeline:{video_id} → Hash { status, current_step, transcode_completed, transcode_total, started_at, estimated_completion }
task_queue:transcode:cpu → Sorted Set { task_id: priority_score }
task_queue:transcode:gpu → Sorted Set { task_id: priority_score }
worker:{worker_id} → Hash { status, current_task, last_heartbeat }

| Concern | Solution |
|---|---|
| Original file lost | S3 cross-region replication; 11 nines durability; versioning enabled |
| Transcode worker crash | Temporal detects heartbeat timeout → re-schedule on another worker |
| Segment transcode failure | Retry 3 times with backoff; after 3 failures → DLQ, alert ops |
| S3 upload failure | Retry with exponential backoff; S3 multipart for large segments |
| Worker pool exhaustion | Auto-scaling based on queue depth; alert if depth > 1000 for > 10 min |
| Corrupt input video | Probe step detects invalid file → fail fast, notify creator |
| Spot instance preemption | Task checkpointing; preempted task re-queued automatically |
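The queue-depth rule in the table can be sketched as two small functions: one that sizes the pool so the backlog drains within a target time, and one that fires the alert when depth stays above 1000 for 10 minutes. The function names, the target drain time, and the one-sample-per-minute convention are assumptions for illustration.

```python
import math


def desired_workers(queue_depth: int, avg_task_seconds: float,
                    target_drain_seconds: float,
                    min_workers: int, max_workers: int) -> int:
    """Workers needed to drain the current backlog within the target time."""
    needed = math.ceil(queue_depth * avg_task_seconds / target_drain_seconds)
    return max(min_workers, min(max_workers, needed))


def should_alert(depth_samples: list[int], threshold: int = 1000,
                 sustained_minutes: int = 10) -> bool:
    """depth_samples holds one queue-depth reading per minute, oldest first."""
    recent = depth_samples[-sustained_minutes:]
    return len(recent) == sustained_minutes and all(d > threshold for d in recent)


pool_size = desired_workers(360, 3, 30, min_workers=1, max_workers=100)
alert_now = should_alert([1500] * 10)
```

Clamping to a maximum keeps a viral spike from scaling costs without bound; the alert covers the case where the clamp (or capacity limits) leaves the backlog standing.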
Handle Spot Instance Preemption
GPU instances are expensive. Spot instances save 60-80%. Strategy: use spot for transcoding workers (segment-based, each takes 3-10s, usually finishes before preemption). If preempted: worker marks task as "interrupted", Temporal re-schedules on another worker. Critical path (probe, package, publish) uses on-demand instances.
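Preempted or crashed workers stop sending heartbeats, which is what lets their tasks be re-scheduled. Temporal handles this detection itself; the sketch below only illustrates the idea using the worker hash fields from the Redis layout (current_task, last_heartbeat). The function name and the 30-second timeout are assumptions.

```python
def find_stale_tasks(workers: dict, now: float, timeout_seconds: float) -> list:
    """Return task IDs held by workers whose last heartbeat is too old."""
    stale = []
    for worker_id, state in workers.items():
        busy = state["current_task"] is not None
        silent_for = now - state["last_heartbeat"]
        if busy and silent_for > timeout_seconds:
            stale.append(state["current_task"])
    return stale


workers = {
    "w1": {"current_task": "task-a", "last_heartbeat": 100.0},  # silent for 50 s
    "w2": {"current_task": "task-b", "last_heartbeat": 145.0},  # healthy
    "w3": {"current_task": None, "last_heartbeat": 10.0},       # idle, nothing to requeue
}
to_requeue = find_stale_tasks(workers, now=150.0, timeout_seconds=30.0)
```

Because each segment task is short and writes to a fixed output location, re-running a requeued task from scratch is cheap and safe.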
Quality Verification After Transcoding
Automated checks: duration match (±0.5s), frame count match, VMAF score (> 80 for 720p, > 85 for 1080p), audio-video sync (< 50ms drift), black frame/freeze detection, and bitrate compliance (±20% of target). These checks add ~5 seconds per video and catch the roughly 0.1% of outputs that contain errors.
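Most of these checks reduce to threshold comparisons on measured values. The sketch below applies the thresholds listed above to a dictionary of metrics; the metric field names are hypothetical, the measurements themselves (VMAF, drift, and so on) are assumed to come from separate tools, and black frame/freeze detection is omitted.

```python
VMAF_MIN = {"720p": 80, "1080p": 85}


def failed_checks(m: dict) -> list[str]:
    """Return the names of the quality checks that this output fails."""
    failures = []
    if abs(m["output_duration_s"] - m["source_duration_s"]) > 0.5:
        failures.append("duration")
    if m["output_frames"] != m["source_frames"]:
        failures.append("frame_count")
    if m["vmaf"] <= VMAF_MIN[m["resolution"]]:
        failures.append("vmaf")
    if abs(m["av_drift_ms"]) >= 50:
        failures.append("av_sync")
    if abs(m["bitrate_kbps"] - m["target_bitrate_kbps"]) > 0.2 * m["target_bitrate_kbps"]:
        failures.append("bitrate")
    return failures


good_output = {
    "resolution": "720p", "source_duration_s": 600.0, "output_duration_s": 600.2,
    "source_frames": 18000, "output_frames": 18000, "vmaf": 91.0,
    "av_drift_ms": 12, "bitrate_kbps": 3100, "target_bitrate_kbps": 3000,
}
result = failed_checks(good_output)
```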
Interview Walkthrough
- Frame upload as async: accept the file to S3, return a job ID immediately — transcoding is minutes-long and must not block the API.
- Walk through the pipeline stages: probe → segment split → parallel transcode (CPU/GPU queues) → package HLS/DASH → publish to CDN origin.
- Explain why Temporal (or similar) orchestrates the workflow — heartbeat timeouts re-schedule failed segments on another worker automatically.
- Cover priority queues in Redis sorted sets: premium creators and trending videos jump ahead of long-tail backlog.
- Mention tiered codec strategy — H.264 for all uploads immediately, H.265/AV1 added only when view counts justify the GPU cost.
- Discuss spot instances for segment workers with checkpointing, keeping probe/package steps on on-demand instances.
- Common pitfall: monolithic FFmpeg on a single worker for a 2-hour 4K video — one crash loses all progress instead of retrying individual segments.
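The priority-queue point in the walkthrough can be demonstrated with an in-memory stand-in for a Redis sorted set, where the lowest score is popped first (as with ZPOPMIN). The rank mapping reuses the priority values from the videos table; the `TaskQueue` class and the tie-breaking counter are assumptions for the sketch.

```python
import heapq
import itertools

# Lower rank is served first.
PRIORITY_RANK = {"urgent": 0, "high": 1, "normal": 2, "low": 3}


class TaskQueue:
    """In-memory stand-in for a sorted-set queue ordered by (priority, arrival)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # preserves FIFO order within a priority

    def push(self, task_id: str, priority: str) -> None:
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._arrival), task_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


queue = TaskQueue()
queue.push("long-tail-1", "normal")
queue.push("backfill-1", "low")
queue.push("premium-1", "urgent")
queue.push("long-tail-2", "normal")
order = [queue.pop() for _ in range(4)]
```

A strict priority order like this can starve low-priority work under sustained load, so a production score would typically also age tasks by wait time.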
CRF vs CBR vs VBR: Bitrate Control Strategies
| Strategy | Description | Best For |
|---|---|---|
| CBR (Constant Bitrate) | Every second uses same bitrate. Predictable but wastes bits on simple scenes. | Live streaming (consistent bandwidth) |
| VBR (Variable Bitrate) | Bitrate varies by scene complexity. Better perceptual quality. | VOD (pre-recorded) |
| CRF ⭐ (Constant Rate Factor) | Target constant QUALITY, let bitrate vary. Best quality for given size. | VOD transcoding (YouTube, Netflix). Use CRF + maxrate for best of both worlds. |
Netflix's per-title encoding: Encode test segment at multiple CRFs → measure VMAF → pick optimal CRF per video. Animated content (CRF 28) vs action movie (CRF 20). Result: 20-40% bitrate savings over fixed-CRF.
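The "encode at multiple CRFs, measure VMAF, pick one" loop can be reduced to a selection over test-encode results. The sketch below is a simplified stand-in, not Netflix's method: it picks the lowest-bitrate sample that meets a target VMAF. The function name and the sample numbers are made up for illustration.

```python
def pick_crf(samples: list[tuple[int, int, float]], target_vmaf: float) -> int:
    """samples holds (crf, bitrate_kbps, vmaf) tuples from test encodes."""
    meets_target = [s for s in samples if s[2] >= target_vmaf]
    if not meets_target:
        # Nothing reaches the target: fall back to the best quality tried.
        return min(samples, key=lambda s: s[0])[0]
    # Cheapest encode that still meets the quality bar.
    return min(meets_target, key=lambda s: s[1])[0]


# Illustrative (invented) measurements for one title.
test_encodes = [(20, 4200, 96.0), (23, 3000, 93.0), (26, 2100, 89.0), (28, 1600, 84.0)]
chosen = pick_crf(test_encodes, target_vmaf=88.0)
```

With these numbers the function selects CRF 26: it clears the VMAF target at half the bitrate of CRF 20.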
Per-Title vs Per-Shot Encoding
Per-Title: For each video, run convex hull analysis across CRF values and resolutions. Select optimal CRF per resolution maximizing VMAF/bitrate ratio. Up to 40% bandwidth savings.
Per-Shot (state of the art): Split video into shots (scene changes). Each shot gets its own encoding parameters. Dialogue (static): CRF 28, car chase (motion): CRF 20, credits: CRF 30. Additional 10-20% savings over per-title.
Cost Optimization: When to Encode Which Codec
Tier 1 (all videos): H.264 at 480p, 720p, 1080p. Cost: ~$0.02/video.
Tier 2 (> 100 views in first hour): Add H.265. Cost: ~$0.08/video (GPU).
Tier 3 (> 10K views): Add AV1. Cost: ~$0.50/video but saves 30% more bandwidth than H.265. At 10K views, ROI: 330×.
Implementation: upload → immediate H.264; hourly cron checks view counts → queue H.265; daily cron queues AV1 for very popular videos.
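The tier rules map directly onto a small decision function that the hourly and daily cron jobs could call. This is a sketch under one added assumption: tiers are cumulative, so a video that reaches Tier 3 also gets H.265. The function name is hypothetical.

```python
def target_codecs(views_first_hour: int, total_views: int) -> list[str]:
    """Return the codecs a video should have, given its view counts."""
    codecs = ["h264"]                                     # Tier 1: every upload
    if views_first_hour > 100 or total_views > 10_000:
        codecs.append("h265")                             # Tier 2
    if total_views > 10_000:
        codecs.append("av1")                              # Tier 3
    return codecs


new_upload = target_codecs(views_first_hour=3, total_views=3)
trending = target_codecs(views_first_hour=450, total_views=900)
popular = target_codecs(views_first_hour=450, total_views=25_000)
```

The cron job would compare this list against the codecs already present for the video and queue only the missing ones.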
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core video transcoding pipeline flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
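An availability target becomes actionable once it is converted into an error budget. A minimal sketch of that conversion, assuming a 30-day window; `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60


budget_9995 = error_budget_minutes(0.9995)
budget_999 = error_budget_minutes(0.999)
```

At 99.95% the budget is about 21.6 minutes per 30 days, versus about 43.2 minutes at 99.9%, which is why a 5-minute automated rollback matters: a single slow manual rollback can consume most of the month's budget.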
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.