Design a Video Recommendation Engine

Interview Prompt

Design a video recommendation engine.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: two-stage retrieval + ranking, user/item embeddings, or real-time feature updates?	Forces scope negotiation: senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Two-stage retrieval + ranking
User/item embeddings
Real-time feature updates
Diversity injection
Capacity estimation with shown math

Out of scope (state explicitly)

GPU cluster training and hyperparameter tuning
Content moderation of recommended items
Ad auction / sponsored placement ranking

Assumptions

Index staleness of minutes is acceptable unless real-time is stated
Clarify query QPS vs index update rate early
Managed search/stream stack (Elasticsearch, Kafka) is fine to propose

Functional Requirements

Personalized recommendations: "Videos you might like" feed tailored to each user's watch history, likes, and preferences
"Up Next" recommendation: After watching a video, suggest what to watch next (autoplay)
Homepage feed: Curated mix of trending, personalized, and fresh content
Similar videos: "Because you watched X": find videos related to a specific video
Category/topic recommendations: "Trending in Technology", "Popular in Music"
New user cold start: Recommend popular/trending content for users with no history
Explain recommendations: "Recommended because you watched System Design Interview"
Feedback loop: Incorporate explicit (like/dislike) and implicit (watch time, skip) signals
Diversity: Avoid filter bubble; include exploratory recommendations
Real-time updates: Recommendation changes within minutes of new user behavior

Capacity Estimation

Metric	Calculation	Value
DAU	Given (product assumption)	500M
Videos in catalog	Given (product assumption)	500M
Watches / day	Given (product assumption)	5B
Recommendation requests / sec	Recommendation requests / day ÷ 86,400 s, times a peak factor (peak value assumed)	200K (homepage loads, up-next)
User feature vector size	256 floats × 4 B	1 KB
Video feature vector size	256 floats × 4 B	1 KB
User features total	1B users (assumed total user base; DAU is 500M) × 1 KB	1 TB
Video features total	500M × 1 KB	500 GB
Model inference latency budget	Given (product assumption)	< 50 ms
Pre-computed recs cache	500M users × 100 recs × 8B	400 GB
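
The arithmetic in the table can be checked with a few lines of Python. This is only a back-of-envelope sketch: every input is an assumption copied from the table, float32 (4 bytes) is assumed for the "256 floats", and the last figure is an upper bound that assumes the 200K peak rate were sustained all day.

```python
# Back-of-envelope check of the capacity table (inputs are the table's assumptions).
FLOAT_BYTES = 4                      # float32 assumed
EMB_DIM = 256
USERS_WITH_FEATURES = 1_000_000_000  # the user-features row uses 1B users, not the 500M DAU
DAU = 500_000_000
VIDEOS = 500_000_000
PEAK_QPS = 200_000
SECONDS_PER_DAY = 86_400

vector_bytes = EMB_DIM * FLOAT_BYTES                          # 1,024 B, about 1 KB
user_features_tb = USERS_WITH_FEATURES * vector_bytes / 1e12  # about 1 TB
video_features_gb = VIDEOS * vector_bytes / 1e9               # about 500 GB
recs_cache_gb = DAU * 100 * 8 / 1e9                           # 100 recs x 8-byte IDs per user

# Upper bound on requests per daily user if the peak rate held for 24 hours.
max_requests_per_user_per_day = PEAK_QPS * SECONDS_PER_DAY / DAU

print(vector_bytes, user_features_tb, video_features_gb, recs_cache_gb)
print(round(max_requests_per_user_per_day, 2))
```

The table rounds 1,024 B to 1 KB, which is why the script reports 512 GB where the table says 500 GB.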


Two-Stage Architecture: Candidate Generation → Ranking

Event Bus Design (Kafka)

Topic: video_recommendation_engine-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: user_id (preserves per-user event ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "video_recommendation_engine-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: video_recommendation_engine-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Write path (e.g. feedback events): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
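
The consumer-side contract above (at-least-once delivery, dedup by event_id, dead-letter after repeated failure) can be sketched without any Kafka client. This is a minimal in-memory illustration: the `Processor` class, the `seen` set, and the `dlq` list are hypothetical stand-ins for a durable dedup store and the DLQ topic, and treating "3 retries" as three total attempts is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "3 retries" is read as three attempts in total


@dataclass
class Processor:
    handler: Callable[[dict], None]
    seen: set = field(default_factory=set)   # stand-in for a durable dedup store
    dlq: list = field(default_factory=list)  # stand-in for the -events-dlq topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen:
            # Redelivery under at-least-once semantics: skip the side effect.
            return "duplicate"
        for _ in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue  # transient failure: try again
            self.seen.add(event_id)  # mark done only after the handler succeeds
            return "processed"
        self.dlq.append(event)  # poison message: park it instead of blocking the partition
        return "dead-lettered"
```

In a real deployment the dedup record and the handler's side effect would need to be committed together (or the handler itself made idempotent), otherwise a crash between the two can still produce a duplicate.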

Embedding-Based Retrieval & ANN Search

User and video embeddings in 256-dim vector space.

Two-Tower Model training:
  User tower: user_id embedding + avg watched video embeddings + demographics
  Video tower: video_id embedding + title embedding + category + channel
  Objective: Maximize dot(user_emb, pos_video_emb) - dot(user_emb, neg_video_emb)
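
As a rough illustration of the objective, the quantity being maximized is the gap between the positive and negative dot products. The sketch below uses plain Python lists; the hinge form and the `margin` parameter are assumptions added for illustration (the text only states the difference of dot products), and the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pairwise_gap(user_emb, pos_video_emb, neg_video_emb):
    """The quantity the objective maximizes: positive score minus negative score."""
    return dot(user_emb, pos_video_emb) - dot(user_emb, neg_video_emb)


def hinge_loss(user_emb, pos_video_emb, neg_video_emb, margin=1.0):
    """Assumed loss form: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - pairwise_gap(user_emb, pos_video_emb, neg_video_emb))


user = [1.0, 0.0]
watched = [2.0, 0.0]   # positive example
skipped = [0.5, 1.0]   # negative example
print(pairwise_gap(user, watched, skipped))  # 1.5
print(hinge_loss(user, watched, skipped))    # 0.0
```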

ANN Index (FAISS IVF-PQ):
  500M videos × 256 × 4 bytes = 512 GB raw → compressed to 32 GB with IVF-PQ (500M × 64 bytes)
  IVF: cluster into 4096 groups → search the nearest 50 groups per query
  PQ: compress each vector from 1,024 bytes to 64 bytes (16× compression)
  Query time: < 5 ms per user embedding; 200K QPS across 10 replicas is 20K QPS per replica, i.e. about 100 searches in flight per replica at 5 ms each
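
The IVF half of the index can be illustrated with a toy pure-Python version: vectors are bucketed by nearest centroid, and a query scans only the closest few buckets. This is a sketch of the idea, not FAISS itself; it skips k-means training and product quantization, takes centroids as given, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector ID to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]


centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [10.0, 9.0], "d": [9.0, 10.0]}
lists = build_ivf(vectors, centroids)
print(ivf_search([10.0, 9.5], vectors, centroids, lists, nprobe=1, k=2))  # ['c', 'd']

# Sizing from the text: 50 of 4096 lists is about 1.2% of the catalog scanned per query.
print(round(50 / 4096 * 100, 2))
```

Probing fewer lists lowers latency but can miss neighbors that fall in an unprobed list, which is the recall trade-off to raise in the interview.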

Near-Real-Time Feature Updates

User watches a video at 3:00 PM. By 3:05 PM, recommendations reflect this.

Pipeline:
  1. User finishes video → watch event to Kafka
  2. Flink streaming job:
     a. Update user's recent watch history (last 50)
     b. Update user interest vector (exponential moving average)
     c. Lightweight user embedding update (avg of last 50 watched)
     d. Update video stats (views, avg_watch_pct)
  3. Next recommendation request (< 5 min later) uses updated features
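
Steps 2a to 2c of the pipeline can be sketched as per-user state updated on each watch event. This is a minimal single-process illustration, not a Flink job: the `UserState` class and the 0.9 decay factor are assumptions, and real state would live in the feature store.

```python
from collections import deque

HISTORY_LEN = 50   # "last 50" from the pipeline description
EMA_DECAY = 0.9    # hypothetical smoothing factor


class UserState:
    def __init__(self, dim):
        self.history = deque(maxlen=HISTORY_LEN)  # embeddings of recent watches
        self.interest = [0.0] * dim               # exponential moving average

    def on_watch(self, video_emb):
        """Apply one watch event: append to history and update the EMA."""
        self.history.append(video_emb)  # the deque drops the oldest entry past 50
        self.interest = [
            EMA_DECAY * old + (1 - EMA_DECAY) * new
            for old, new in zip(self.interest, video_emb)
        ]

    def embedding(self):
        """Lightweight user embedding: mean of the recent watch embeddings."""
        n = len(self.history)
        if n == 0:
            return list(self.interest)
        return [sum(col) / n for col in zip(*self.history)]


state = UserState(dim=2)
state.on_watch([1.0, 0.0])
state.on_watch([0.0, 1.0])
print(state.embedding())  # [0.5, 0.5]
```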

Full model retraining (daily):
  Spark job: extract all watch events from last 30 days
  Train two-tower model on GPU cluster (4-8 hours)
  Generate new embeddings for ALL users and videos
  Build new FAISS index → blue-green deployment swap
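
The blue-green swap at the end of the retraining job amounts to validating the new index and then replacing a single reference. The sketch below models an index as any callable that maps a query to a list of results; `IndexHandle` and the smoke-query check are hypothetical names and policy, not part of FAISS.

```python
import threading


class IndexHandle:
    """Holds the live index; requests read one reference, swaps replace it."""

    def __init__(self, index):
        self._lock = threading.Lock()
        self._live = index

    def search(self, query):
        index = self._live  # take one reference so a concurrent swap cannot change it mid-request
        return index(query)

    def swap(self, new_index, smoke_queries):
        """Promote new_index only if every smoke query returns results."""
        for q in smoke_queries:
            if not new_index(q):
                raise ValueError(f"new index returned nothing for {q!r}")
        with self._lock:  # serialize concurrent swaps
            old, self._live = self._live, new_index
        return old  # keep the old (blue) index around for rollback
```

Keeping the previous index in memory until the new one has served traffic for a while makes rollback a second call to `swap`.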

Post-Ranking Filters — Diversity & Quality

After the ranking model outputs the top 100 scored candidates:

1. Diversity filter: at most 3 videos per channel and at most 30% of slots per category; re-rank with Maximal Marginal Relevance (MMR).
2. Freshness boost: multiply the score by 1.2 for videos < 24h old and by 1.1 for videos < 7 days old.
3. Quality filter: remove videos with < 40% average watch percentage, a high dislike ratio, or flagged content.
4. Explore vs exploit: 90% exploitation + 10% exploration (ε-greedy).
5. Business rules: insert ad slots (ad ranking itself is out of scope), boost premium content, suppress blocked creators.
6. Dedup: remove already-watched videos, near-duplicates (pHash), and reuploads.
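
The diversity step can be sketched as greedy MMR selection with a per-channel cap. This is an illustrative version under assumptions: candidates are dicts with hypothetical `id`, `score`, and `channel` keys, `similarity` is a caller-supplied function, and the trade-off weight `lam` is not specified in the text.

```python
def mmr_rerank(candidates, similarity, k, lam=0.7, max_per_channel=3):
    """Greedy MMR: trade relevance against similarity to already-selected items."""
    selected, per_channel = [], {}
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Enforce the channel cap before scoring.
        eligible = [c for c in remaining
                    if per_channel.get(c["channel"], 0) < max_per_channel]
        if not eligible:
            break

        def mmr(c):
            redundancy = max(
                (similarity(c["id"], s["id"]) for s in selected), default=0.0
            )
            return lam * c["score"] - (1 - lam) * redundancy

        best = max(eligible, key=mmr)
        selected.append(best)
        per_channel[best["channel"]] = per_channel.get(best["channel"], 0) + 1
        remaining.remove(best)
    return selected


cands = [
    {"id": "a", "score": 0.90, "channel": "X"},
    {"id": "b", "score": 0.85, "channel": "X"},
    {"id": "c", "score": 0.60, "channel": "Y"},
]
sim = lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0
print([c["id"] for c in mmr_rerank(cands, sim, k=3, lam=0.5)])  # ['a', 'c', 'b']
```

In the example, "b" scores higher than "c" on relevance but is a near-duplicate of "a", so MMR places "c" second.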

Cold Start for New Users & New Videos

New User:
  - Onboarding: select interests → initialize interest vector
  - Fallback: popular by country, language, time of day, referral source
  - Progressive: the share of popular content drops from 80% to 40% to 10% as watch history accumulates

New Video:
  - Content-based embedding from title/description/tags (BERT → 256-dim)
  - Channel prior: inherit channel's avg engagement metrics
  - Exploration allocation: 10% slots for < 24h old videos
  - Multi-armed bandit (Thompson Sampling): learn CTR after ~1000 impressions
  - Blend: blended_emb = α × content_emb + (1-α) × collab_emb (α decays from 1.0 to 0.2)
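
The new-video blend and the bandit step can be sketched as follows. The linear decay of α over 1,000 impressions is an assumed schedule (the text gives only the endpoints 1.0 and 0.2), the Beta(1, 1) prior is an assumption, and the function names are hypothetical.

```python
import random


def blend_alpha(impressions, ramp=1000, start=1.0, floor=0.2):
    """Assumed schedule: linear decay from start to floor over `ramp` impressions."""
    progress = min(impressions / ramp, 1.0)
    return start - (start - floor) * progress


def blended_embedding(content_emb, collab_emb, impressions):
    """blended_emb = alpha * content_emb + (1 - alpha) * collab_emb."""
    a = blend_alpha(impressions)
    return [a * c + (1 - a) * f for c, f in zip(content_emb, collab_emb)]


def thompson_pick(stats, rng=random):
    """stats maps video_id -> (clicks, impressions); sample a CTR per video
    from its Beta posterior and pick the video with the highest sample."""
    def sampled_ctr(item):
        clicks, impressions = item[1]
        return rng.betavariate(1 + clicks, 1 + impressions - clicks)
    return max(stats.items(), key=sampled_ctr)[0]


print(blended_embedding([1.0, 0.0], [0.0, 1.0], impressions=0))  # [1.0, 0.0]
print(thompson_pick({"hot": (900, 1000), "cold": (10, 1000)}))
```

With few impressions the Beta posteriors are wide, so new videos still win some draws; as evidence accumulates the sampling concentrates on the videos with the higher observed CTR.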

Failure Modes

Concern	Solution
Feature store (Redis) down	Serve from pre-computed cached recommendations; degrade to trending/popular
FAISS index unavailable	Fall back to pre-computed similar videos + trending; skip ANN retrieval
Ranking model error	Circuit breaker → serve candidates by score from candidate generation (skip ranking)
Cold user (no history)	Trending + popular + category-based recommendations
Cold video (just uploaded)	Use content-based features (title, description, category) for initial embedding
Embedding drift	Daily retrain corrects drift; monitor embedding quality metrics
A/B test regression	Auto-rollback if treatment decreases avg watch time by > 2%
Thundering herd (homepage)	Pre-compute recommendations for active users every 5 minutes
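
The degradation paths in the table share one shape: try the best source, and on failure or an empty result fall through to the next. A minimal sketch, assuming each source is a callable keyed by a hypothetical name:

```python
def recommend(user_id, sources):
    """sources: ordered list of (name, fetch) pairs, best first.
    Returns the first non-empty result and the name of the source that served it."""
    for name, fetch in sources:
        try:
            items = fetch(user_id)
        except Exception:
            continue  # source is down: degrade to the next one
        if items:
            return name, items
    return "empty", []


def personalized(user_id):
    raise ConnectionError("feature store unavailable")  # simulated outage


sources = [
    ("personalized", personalized),
    ("precomputed_cache", lambda user_id: []),           # cache miss
    ("trending", lambda user_id: ["v42", "v7", "v19"]),  # last-resort fallback
]
print(recommend("u1", sources))  # ('trending', ['v42', 'v7', 'v19'])
```

Returning the source name alongside the items makes it easy to emit a metric for how often each fallback tier is serving traffic.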

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Not fixed here; state a target early (the capacity table budgets < 50 ms for model inference) and tie it to the capacity math	Interview credibility comes from connecting the SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
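
The availability target translates into a concrete error budget. The sketch below assumes a 30-day window, and the burn-rate helper is a standard ratio (observed error ratio divided by the allowed ratio) rather than something stated in the table.

```python
def error_budget_minutes(slo, days=30):
    """Minutes of full outage allowed per window at the given availability SLO."""
    return (1 - slo) * days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """1.0 means the budget is being spent exactly as fast as the SLO allows."""
    return observed_error_ratio / (1 - slo)


print(round(error_budget_minutes(0.9995), 1))  # 21.6
print(round(burn_rate(0.001, 0.9995), 1))      # 2.0
```

At 99.95%, a 0.1% error ratio burns budget at twice the sustainable rate, so the "< 0.1%" error-rate target and the availability target should be alerted on together.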

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
