Interview Prompt
Design Video Recommendation Engine.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Two-stage retrieval + ranking (variant of Batch 1 #28), User/item embeddings, Real-time feature updates? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Two-stage retrieval + ranking (variant of Batch 1 #28)
- User/item embeddings
- Real-time feature updates
- Diversity injection
- Capacity estimation with shown math
Out of scope (state explicitly)
- GPU cluster training and hyperparameter tuning
- Content moderation of recommended items
- Ad auction / sponsored placement ranking
Assumptions
- Index staleness of minutes is acceptable unless real-time is stated
- Clarify query QPS vs index update rate early
- Managed search/stream stack (Elasticsearch, Kafka) is fine to propose
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Personalized recommendations: "Videos you might like" feed tailored to each user's watch history, likes, and preferences
- "Up Next" recommendation: After watching a video, suggest what to watch next (autoplay)
- Homepage feed: Curated mix of trending, personalized, and fresh content
- Similar videos: "Because you watched X": find videos related to a specific video
- Category/topic recommendations: "Trending in Technology", "Popular in Music"
- New user cold start: Recommend popular/trending content for users with no history
- Explain recommendations: "Recommended because you watched System Design Interview"
- Feedback loop: Incorporate explicit (like/dislike) and implicit (watch time, skip) signals
- Diversity: Avoid filter bubble; include exploratory recommendations
- Real-time updates: Recommendation changes within minutes of new user behavior
- Low Latency: Recommendations served in < 200 ms
- Scalability: 1B+ users, 500M+ videos, 5B+ watches/day
- Freshness: Newly uploaded videos appear in recommendations within 1 hour
- Quality: Increase average watch time per session (the north star metric)
- Diversity: No more than 30% of recommendations from same creator/category
- Fairness: New creators' content gets a fair chance (exploration vs exploitation)
- Availability: 99.99%: recommendations are the main UI; failure = blank homepage
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given (product assumption) | 500M |
| Videos in catalog | Given (assumption documented in value) | 500M |
| Watches / day | Given (assumption documented in value) | 5B |
| Recommendation requests / sec | From Recommendation requests / day ÷ 86400 (+ peak factor in value) | 200K (homepage loads, up-next) |
| User feature vector size | 256 floats | 1 KB |
| Video feature vector size | 256 floats | 1 KB |
| User features total | 1B × 1 KB | 1 TB |
| Video features total | 500M × 1 KB | 500 GB |
| Model inference latency budget | Given (assumption documented in value) | < 50 ms |
| Pre-computed recs cache | 500M users × 100 recs × 8B | 400 GB |
Two-Stage Architecture: Candidate Generation → Ranking
Event Bus Design (Kafka)
Topic: video_recommendation_engine-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "video_recommendation_engine-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: video_recommendation_engine-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Video Recommendation Engine: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Embedding-Based Retrieval & ANN Search
User and video embeddings in 256-dim vector space. Two-Tower Model training: User tower: user_id embedding + avg watched video embeddings + demographics Video tower: video_id embedding + title embedding + category + channel Objective: Maximize dot(user_emb, pos_video_emb) - dot(user_emb, neg_video_emb) ANN Index (FAISS IVF-PQ): 500M videos × 256 × 4 bytes = 512 GB raw → compressed to 32 GB with IVF-PQ IVF: cluster into 4096 groups → search nearest 50 groups PQ: compress each vector to 64 bytes (16× compression) Query time: < 5 ms per user embedding → serves 200K QPS across 10 replicas
Near-Real-Time Feature Updates
User watches a video at 3:00 PM. By 3:05 PM, recommendations reflect this.
Pipeline:
1. User finishes video → watch event to Kafka
2. Flink streaming job:
a. Update user's recent watch history (last 50)
b. Update user interest vector (exponential moving average)
c. Lightweight user embedding update (avg of last 50 watched)
d. Update video stats (views, avg_watch_pct)
3. Next recommendation request (< 5 min later) uses updated features
Full model retraining (daily):
Spark job: extract all watch events from last 30 days
Train two-tower model on GPU cluster (4-8 hours)
Generate new embeddings for ALL users and videos
Build new FAISS index → blue-green deployment swapPost-Ranking Filters — Diversity & Quality
After ranking model outputs top 100 scored candidates: 1. Diversity filter: Max 3 videos per channel, 30% per category. MMR algorithm. 2. Freshness boost: < 24h old → 1.2×, < 7 days → 1.1× 3. Quality filter: Remove < 40% avg watch pct, high dislike ratio, flagged content 4. Explore vs Exploit: 90% exploitation + 10% exploration (ε-greedy) 5. Business rules: Inject ads, boost premium, suppress blocked creators 6. Dedup: Remove watched, too-similar (pHash), reuploads
Cold Start for New Users & New Videos
New User: - Onboarding: select interests → initialize interest vector - Fallback: popular by country, language, time of day, referral source - Progressive: 80% popular → 40% → 10% as watch history accumulates New Video: - Content-based embedding from title/description/tags (BERT → 256-dim) - Channel prior: inherit channel's avg engagement metrics - Exploration allocation: 10% slots for < 24h old videos - Multi-armed bandit (Thompson Sampling): learn CTR after ~1000 impressions - Blend: blended_emb = α × content_emb + (1-α) × collab_emb (α decays from 1.0 to 0.2)
Get Home Feed Recommendations
GET /api/v1/recommendations/home?limit=20&cursor={last}
Headers: X-User-Id: user-uuid, X-Context: { "device": "mobile", "time_zone": "America/New_York" }
Response: 200 OK
{
"recommendations": [
{
"video_id": "vid-uuid",
"title": "System Design: URL Shortener",
"channel": "TechChannel",
"thumbnail": "https://cdn.example.com/thumb/vid-uuid.webp",
"duration": 1200,
"view_count": 1200000,
"published_at": "2025-03-10",
"score": 0.95,
"reason": "Because you watched 'System Design Interview Guide'",
"source": "collaborative_filtering"
}
]
}Get Up Next
GET /api/v1/recommendations/up-next?current_video={video_id}&watch_time=600&total_duration=1200
Response: 200 OK
{
"up_next": {
"video_id": "vid-next",
"title": "System Design: API Rate Limiter",
"reason": "Next in series",
"autoplay_in_seconds": 5
},
"related": [...]
}Get Similar Videos
GET /api/v1/recommendations/similar/{video_id}?limit=10
Response: 200 OK
{
"similar": [
{"video_id": "...", "title": "...", "similarity_score": 0.89, "reason": "Similar topic"}
]
}Send Feedback
POST /api/v1/recommendations/feedback
{
"video_id": "vid-uuid",
"action": "not_interested",
"reason": "already_watched"
}
Response: 200 OKCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Redis: Feature Store + Cache
# User features (updated in near-real-time by Flink)
user_features:{user_id} → Hash {
embedding: binary(1024),
preferred_categories: "tech,science,education",
avg_watch_duration: 480,
activity_level: "high",
country: "US",
language: "en",
last_updated: 1710400000
}
# User's recent watch history
user_watch_history:{user_id} → List [video_id_1, ..., video_id_50]
# Video features
video_features:{video_id} → Hash {
embedding: binary(1024), category: "technology", channel_id: "ch-uuid",
duration: 1200, view_count: 1200000, like_ratio: 0.95, avg_watch_pct: 0.72
}
TTL: 86400
# Pre-computed similar videos (item-item)
similar:{video_id} → List [vid_1, ..., vid_100]
TTL: 86400
# Cached recommendations per user
recs:{user_id}:{context} → List [vid_1, ..., vid_100]
TTL: 300
# Trending videos
trending:global → Sorted Set
trending:{category} → Sorted Set
TTL: 900Kafka Topics
Topic: watch-events (user watch behavior → training data + real-time features) Topic: recommendation-served (what was shown → for CTR computation) Topic: recommendation-clicked (what was clicked → for CTR computation) Topic: video-published (new video → needs embedding generation) Topic: model-events (model deployed, A/B test started/ended)
| Concern | Solution |
|---|---|
| Feature store (Redis) down | Serve from pre-computed cached recommendations; degrade to trending/popular |
| FAISS index unavailable | Fall back to pre-computed similar videos + trending; skip ANN retrieval |
| Ranking model error | Circuit breaker → serve candidates by score from candidate generation (skip ranking) |
| Cold user (no history) | Trending + popular + category-based recommendations |
| Cold video (just uploaded) | Use content-based features (title, description, category) for initial embedding |
| Embedding drift | Daily retrain corrects drift; monitor embedding quality metrics |
| A/B test regression | Auto-rollback if treatment decreases avg watch time by > 2% |
| Thundering herd (homepage) | Pre-compute recommendations for active users every 5 minutes |
Interview Walkthrough
- Frame as a two-stage funnel: candidate generation (billions → hundreds) then ranking (hundreds → 20) — never rank the full catalog per request.
- Explain the two-tower model for Stage 1: offline user/video embeddings with ANN search (FAISS/ScaNN) for sub-10 ms retrieval.
- Cover Stage 2 cross-network ranking on ~200 candidates with fresh features from a Flink-updated feature store in Redis.
- Discuss the hybrid pre-compute pattern: refresh base recs every 4 hours, re-rank on request with real-time watch signals.
- Mention feedback loop: implicit signals (watch time, skip) weighted over CTR to avoid clickbait — log everything for model retraining.
- Address filter bubble: ε-greedy exploration and diversity constraints (max 3 per channel, 30% per category).
- Common pitfall: using a cross-network model for candidate generation — it cannot ANN-search and requires per-pair inference at billion scale.
Two-Tower vs Cross-Network
Two-Tower Model (candidate generation): ✓ Offline computation, ANN-searchable, scales to billions ✗ Limited interaction modeling (no cross-features) Cross-Network (ranking): ✓ Models complex feature interactions ✗ Requires per-pair forward pass (1000 inferences/request) ✗ Can't do ANN search YouTube's architecture: Two-Tower (Stage 1) + Wide & Deep (Stage 2)
Watch Time vs CTR
CTR: Maximize P(click) → clickbait wins, low satisfaction Watch Time: Maximize E[watch_time] → rewards engaging content ✗ Bias toward long videos (60 min × 50% = 30 > 5 min × 100% = 5) Best: E[watch_time × watch_percentage] + multi-objective YouTube 2019+: w1×E[watch_time] + w2×P(like) + w3×P(share) - w4×P(dislike)
Batch Pre-Computation vs Real-Time Inference
Hybrid approach ⭐ (YouTube/Netflix):
1. Pre-compute "base" recs every 4 hours (recs_base:{user_id} → top 200)
2. Real-time personalization on request:
- Read base from Redis
- Fetch fresh user features (Flink-updated)
- Re-rank 200 candidates with latest features (fast)
- Apply post-rank filters
3. If base stale → trigger full pipeline, show cached while computing
Result: < 15 ms latency with near-real-time personalizationFilter Bubble Prevention
Strategies: 1. ε-greedy: 10% random high-quality content 2. Thompson Sampling: explore where model is uncertain 3. Contextual Bandits: learn when to explore 4. Interest expansion: suggest boundary topics 5. Diversity constraints: max 3 per channel, 30% per category, min 2 fresh
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core video recommendation engine flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.