Design a Recommendation System (Netflix / TikTok Style)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	ML systems design — nail the retrieval→ranking funnel, two-tower architecture, ANN candidate generation, feature store role, cold start, and explore/exploit (multi-armed bandit). Show you know the pipeline, not just collaborative filtering.
Arch 50	Add real-time feature freshness, A/B testing infrastructure, model serving latency budgets, and re-ranking for diversity.
Arch 75	Staff: discuss training/serving skew, embedding drift, feedback loops (filter bubble), and cost of billion-scale ANN.

Interview Prompt

Design a recommendation system for an e-commerce platform that suggests products to users. Support personalized recommendations for logged-in users, reasonable defaults for new users (cold start), and balance exploitation (show what we know works) with exploration (discover new preferences). Handle 100M users and 10M products.

Clarifying Questions (ask before designing)

Question	Why it matters
What surfaces need recommendations — homepage, product page, email?	Homepage needs diverse categories; product page needs similar items — different models/pipelines.
Real-time personalization or batch-updated?	Session-based (real-time clicks) vs daily batch (collaborative filtering matrix) — different latency budgets.
What signals do we have — clicks, purchases, ratings, dwell time?	Purchase signal is sparse but high-quality; click is dense but noisy. Drives training data design.
How do we measure success — CTR, conversion, revenue, diversity?	Optimizing CTR alone creates filter bubble; need multi-objective ranking.

Scope

In scope

Two-tower model for retrieval
ANN (Approximate Nearest Neighbor) candidate generation
Ranking model (LTR) with feature store
Cold start for new users and new items
Explore/exploit balance (multi-armed bandit)
Real-time and batch feature pipeline

Out of scope (state explicitly)

Full model training pipeline (GPU cluster, hyperparameter tuning)
Content moderation of recommended items
Ad auction / sponsored recommendations

Assumptions

100M users, 10M products, 1B interactions/day
100ms p99 latency budget for recommendation API
Return top-20 recommendations per request
Logged-in users (80%) + anonymous (20%)

Functional Requirements

Personalized recommendations: Show content tailored to each user's taste
Multiple surfaces: Home feed, "Because you watched X", "Trending", category pages
Real-time signals: Incorporate recent user actions (watch, like, skip) within minutes
Cold start handling: Recommendations for brand-new users with no history
Diversity: Avoid filter bubbles; expose users to varied content
Explainability: "Because you watched Stranger Things" or "Trending in your area"

Scale Estimates

Metric	Calculation	Value
DAU	Given	300M
Rec requests / sec	Derived from daily volume ÷ 86400 (+ peak factor)	100K
Items in catalog	Given	100M
User-item interactions / day	Given	10B
Model size	Given	10-50 GB
Feature store entries	Given	300M users + 100M items
Candidate generation latency	Given	< 50 ms
Ranking latency budget	Given	< 100 ms


Three-Stage Pipeline

1. Candidate Generation (ANN search) — 3K candidates in ~30ms
   Two-Tower Model: user_embedding · item_embeddings → ANN (HNSW in Faiss)

2. Ranking (DNN) — 5K items scored in ~50ms via GPU batch inference
   Deep neural network with ~200 features per (user, item) pair

3. Re-ranking (Business Rules) — Apply diversity, freshness, filters in ~10ms

Cold Start: New Users and New Items

New user (no watch history):
  1. Demographics-based: recommend popular items for age group + region
  2. Onboarding survey: "What genres do you like?" → seed preferences
  3. Exploration-heavy: show diverse content, learn from first 10 interactions
  4. Bandits (explore/exploit): Multi-Armed Bandit selects between diverse candidates

New item (no engagement data):
  1. Content-based features: genre, tags, description, cast → predict similar items
  2. Creator signal: if creator's past items performed well → boost new item
  3. Guaranteed exposure: every new item gets shown to N users (e.g., 1000)
  4. Freshness boost in re-ranking: new items get score multiplier for first 48 hours

Embedding Generation: Two-Tower Model

Tower 1 (User encoder):
  Input: user watch history, demographics, preferences
  Output: user_embedding (128-dim float vector)
  Architecture: Transformer encoder over watch sequence

Tower 2 (Item encoder):
  Input: item metadata (genre, tags, description), engagement stats
  Output: item_embedding (128-dim float vector)
  Architecture: MLP over concatenated features

Training: Contrastive learning (positive = watched to completion, negative = random)
Inference: dot_product(user_emb, item_embs) → HNSW ANN search → top 3K in ~30ms
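
A minimal sketch of the inference step, assuming both towers have already produced embeddings. An exact brute-force dot-product scan stands in for the HNSW index; a production system would replace the scan with an ANN library. The function name `retrieve_top_k` and the toy 3-dim vectors are hypothetical, chosen only for illustration.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    user_emb: Sequence[float],
    item_embs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k items with the highest dot-product score, best first.

    This is an exact O(N) scan. An ANN index such as HNSW approximates the
    same ranking in sub-linear time, which is what makes retrieval over
    millions of items fit a millisecond budget.
    """
    scored = ((item_id, dot(user_emb, emb)) for item_id, emb in item_embs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dim embeddings (the text above uses 128-dim vectors).
ITEMS = {
    "horror-1": [0.9, 0.1, 0.0],
    "comedy-1": [0.0, 0.8, 0.2],
    "horror-2": [0.7, 0.0, 0.3],
}
USER = [1.0, 0.0, 0.0]  # hypothetical user whose history is mostly horror
```

Calling `retrieve_top_k(USER, ITEMS, 2)` ranks the two horror items ahead of the comedy item, because their embeddings point in the same direction as the user's.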

Event Bus Design (Kafka)

Topic: recommendation_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: user_id (preserves per-user event ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "recommendation_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: recommendation_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers
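
The consumer-side guarantees above (at-least-once delivery, dedup by event_id, DLQ after repeated failures) can be sketched without a Kafka client. This is an in-memory illustration only: the class `EventProcessor`, the `count_watch` handler, and the `poison` flag are hypothetical, and "3 retries" is interpreted here as 3 attempts in total, which is an assumption.

```python
from typing import Callable, Dict, List, Set

MAX_ATTEMPTS = 3  # assumption: "3 retries" read as 3 attempts in total


class EventProcessor:
    """In-memory stand-in for one consumer in the consumer group."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self.handler = handler
        self.seen_event_ids: Set[str] = set()  # would live in Redis or a DB
        self.dlq: List[Dict] = []              # stand-in for the DLQ topic

    def process(self, event: Dict) -> str:
        """Handle one delivery; safe to call again for the same event."""
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"  # redelivery of an event already applied
        for _attempt in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue  # a real consumer would back off before retrying
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dlq.append(event)  # poison message: park it and move on
        return "dead_lettered"


# Hypothetical handler: counts watch events per user.
watch_counts: Dict[str, int] = {}


def count_watch(event: Dict) -> None:
    if event.get("poison"):
        raise ValueError("malformed event")
    watch_counts[event["user_id"]] = watch_counts.get(event["user_id"], 0) + 1
```

Because the event_id is recorded only after the handler succeeds, a crash between the two steps can still cause one reprocessing; the handler itself should therefore also be idempotent where that matters.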

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 202 Accepted
  Async path: consumers update caches, indexes, notifications, aggregates

HTTP

GET /api/v1/recommendations?surface=home&count=20&cursor=0
→ 200 OK
{
  "items": [
    { "item_id": "movie-123", "title": "...", "score": 0.95,
      "reason": "Because you watched Stranger Things" },
    ...
  ],
  "cursor": "20"
}

POST /api/v1/events
{ "user_id": "u1", "item_id": "movie-123", "event": "watch_complete",
  "duration_sec": 3600, "timestamp": "..." }
→ 202 Accepted

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

Failure Modes

Concern	Solution
Model serving failure	Fallback to previous model version (blue-green deploy)
Feature store down	Serve from cached feed; degrade to popularity-based recs
ANN index stale	Rebuild nightly; new items get guaranteed exposure via exploration
Training data poisoning	Filter bot/spam interactions before training
Cold start	Demographics + onboarding survey + exploration
Filter bubble	10% of recommendations are random exploration items

1. Feature Staleness

User watches "The Conjuring" at T=0. Refreshes feed at T=5s → features still show old interests.

Mitigations:
  1. Client-side context: Pass last_watched_item_id in request headers
     Ranking model uses this as real-time context feature (no pipeline delay)
  2. Session-level re-ranking: Boost items similar to recently watched
  3. Accept 30-second delay: for most users, this is invisible
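
Mitigations 1 and 2 can be combined in a small re-ranking step. The sketch below assumes the client sends `last_watched_item_id` and that "similar" means "same genre", which is a simplification; the multiplier `SESSION_BOOST` and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

SESSION_BOOST = 1.5  # hypothetical multiplier; would be tuned by experiment


def boost_recent_genre(
    ranked: List[Tuple[str, float]],
    item_genre: Dict[str, str],
    last_watched_item_id: str,
) -> List[Tuple[str, float]]:
    """Boost items sharing the last-watched item's genre, then re-sort.

    Genre equality is a stand-in for a real similarity measure such as
    embedding distance.
    """
    recent_genre = item_genre.get(last_watched_item_id)
    if recent_genre is None:
        return ranked  # unknown item: leave the ranking untouched
    boosted = [
        (
            item_id,
            score * SESSION_BOOST if item_genre.get(item_id) == recent_genre else score,
        )
        for item_id, score in ranked
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)
```

This uses only request-time context, so it works even while the streaming pipeline has not yet updated the stored features.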

2. Popularity Bias: Rich Get Richer

Solution: Exploration budget
  90% of feed: model-scored recommendations (exploitation)
  10% of feed: random items from underexposed pool (exploration)
  
  Thompson Sampling: Select exploration items using Bayesian bandits
    Each item has a Beta(successes + 1, failures + 1) posterior over its engagement rate
    Sample from each distribution → highest sample gets shown
    Balances exploration of uncertain items vs exploitation of known good ones
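
A minimal Thompson Sampling sketch using only the Python standard library. It assumes a uniform Beta(1, 1) prior and a binary engaged / not-engaged outcome; the function names and the `stats` layout are hypothetical.

```python
import random
from typing import Dict, Optional, Tuple


def thompson_select(
    stats: Dict[str, Tuple[int, int]],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one item by sampling each item's Beta posterior.

    stats maps item_id -> (successes, failures). With a uniform Beta(1, 1)
    prior, the posterior is Beta(successes + 1, failures + 1).
    """
    rng = rng or random.Random()
    samples = {
        item_id: rng.betavariate(successes + 1, failures + 1)
        for item_id, (successes, failures) in stats.items()
    }
    return max(samples, key=samples.get)


def record_outcome(
    stats: Dict[str, Tuple[int, int]], item_id: str, engaged: bool
) -> None:
    """Update the counts for an item after it was shown to a user."""
    successes, failures = stats.get(item_id, (0, 0))
    if engaged:
        stats[item_id] = (successes + 1, failures)
    else:
        stats[item_id] = (successes, failures + 1)
```

An item with no data has a wide posterior, so it occasionally draws a high sample and gets shown; as evidence accumulates the posterior narrows and weak items stop winning.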

Batch vs Real-Time Recommendations

Hybrid (Netflix/TikTok approach):
  Batch pre-computes candidate set (top 500) — updated every 30 min
  Real-time ranking re-scores candidates using latest features — per request
  Re-ranking applies business rules — per request
  
  Total latency: ~110 ms (feature fetch 20 + ANN 30 + ranking 50 + re-ranking 10)
  Freshness: ranking uses real-time features (last watched, time of day)

Model Serving Infrastructure: GPU Inference

Batching: Score 5K items in ONE forward pass through the DNN.
  GPU parallelism: 5K scores in ~50ms (vs 5K × 1ms = 5s sequential)

Latency budget:
  Feature fetch: 20ms (Redis)
  Candidate generation: 30ms (ANN search)
  Ranking inference: 50ms (GPU batch)
  Re-ranking: 10ms
  Total: ~110ms
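
The budget arithmetic can be checked in a few lines. This sketch assumes the four stages run strictly one after another; the names `STAGE_BUDGET_MS` and `headroom_ms` are hypothetical. Compared against the 100ms p99 target stated under Assumptions, the sequential sum leaves negative headroom, which is worth raising in the interview.

```python
from typing import Dict

# Per-stage budgets from the latency budget above, in milliseconds.
STAGE_BUDGET_MS: Dict[str, int] = {
    "feature_fetch": 20,
    "candidate_generation": 30,
    "ranking_inference": 50,
    "re_ranking": 10,
}
P99_TARGET_MS = 100  # from the Assumptions section


def total_latency_ms(stages: Dict[str, int]) -> int:
    """Sum of stage budgets, assuming strictly sequential execution."""
    return sum(stages.values())


def headroom_ms(stages: Dict[str, int], target_ms: int) -> int:
    """Positive means slack remains; negative means the budget is exceeded."""
    return target_ms - total_latency_ms(stages)
```

Here `total_latency_ms(STAGE_BUDGET_MS)` is 110 and `headroom_ms(STAGE_BUDGET_MS, P99_TARGET_MS)` is -10, so at least one stage must be tightened or overlapped with another to meet the target.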

Real-Time Feature Pipeline

User action → Kafka → Flink →
  1. Update user embedding (lightweight, < 5ms) → write to Redis
  2. Update engagement counters: INCR user:{uid}:genre_watch_count:horror
  3. Update session-level features: LPUSH user:{uid}:session_history {item_id}

Next request (at +1min): reads updated embedding → ANN returns horror-adjacent candidates
  → Ranking model uses updated genre counts → horror movies score higher

Feature freshness tiers:
  Real-time (< 1 min): last watched, session history, time of day
  Near-real-time (< 30 min): updated embeddings, genre preferences
  Batch (daily): user demographics, long-term preferences
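
The counter and session-history updates in steps 2 and 3 can be illustrated with an in-memory stand-in for Redis. The class `InMemoryFeatureStore` and the history cap are hypothetical; the key names mirror the ones used above, and no Redis client is involved.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

SESSION_HISTORY_LEN = 50  # hypothetical cap on stored session items


class InMemoryFeatureStore:
    """Dict-backed stand-in for the Redis feature store."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = defaultdict(int)
        self.lists: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=SESSION_HISTORY_LEN)
        )

    def apply_watch_event(self, user_id: str, item_id: str, genre: str) -> None:
        """Apply one watch event, as the stream processor would."""
        # Equivalent of: INCR user:{uid}:genre_watch_count:{genre}
        self.counters[f"user:{user_id}:genre_watch_count:{genre}"] += 1
        # Equivalent of: LPUSH user:{uid}:session_history {item_id}
        # (the deque's maxlen drops the oldest entry once the cap is reached)
        self.lists[f"user:{user_id}:session_history"].appendleft(item_id)

    def genre_count(self, user_id: str, genre: str) -> int:
        return self.counters[f"user:{user_id}:genre_watch_count:{genre}"]

    def session_history(self, user_id: str) -> List[str]:
        """Most recent item first."""
        return list(self.lists[f"user:{user_id}:session_history"])
```

The ranking service would read `genre_count` and `session_history` at request time, which is how a watch event can influence the next feed without a model retrain.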

Interview Walkthrough

Structure the answer as a multi-stage funnel: candidate retrieval (millions → hundreds) → ranking → business-rule re-ranking.
Generate candidates with two-tower embeddings and ANN search (FAISS/HNSW) — brute-force scoring every catalog item violates the latency budget.
Pre-compute top-500 candidates in batch (every 30 min); re-score per request with real-time features from Redis (last watched, session history).
Serve ranking inference on GPU with request batching — score 5K items in one forward pass (~50ms) instead of 5K sequential calls.
Inject explore/exploit in re-ranking: reserve 10–15% of slots for new items to solve cold-start and gather engagement data.
Stream user actions to Kafka → Flink to update embeddings and genre counters within 1 minute of a watch event.
State latency budget explicitly: feature fetch (20ms) + ANN (30ms) + GPU rank (50ms) + re-rank (10ms) ≈ 110ms total.
Common pitfall: proposing a single monolithic model that scores the full catalog per request — interviewers expect retrieval + rank separation.

Collaborative vs Content-Based vs Hybrid

Approach	How	Pros	Cons
Collaborative	"Users like you liked X"	Discovers unexpected interests	Cold start; popularity bias
Content-Based	"Similar to what you liked"	No cold start for items; transparent	Limited serendipity
Hybrid	Multiple candidate sources, unified ranker	Best of all worlds	System complexity

A/B Testing: How to Validate Model Changes

Control group (50%): current production model (v14)
Treatment group (50%): new candidate model (v15)
Users deterministically assigned: hash(user_id) % 100 < 50
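
A sketch of deterministic assignment. Python's built-in `hash()` is randomized per process for strings, so a stable digest from `hashlib` is used instead. Salting with an experiment name (so different experiments split users independently) and mapping buckets below the threshold to treatment are both assumptions, not something the text above specifies.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_group(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Return "treatment" for buckets below treatment_pct, else "control"."""
    return "treatment" if bucket(user_id, experiment) < treatment_pct else "control"
```

The same user always lands in the same group for a given experiment name, across processes and machines, which keeps the user experience consistent for the duration of the test.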

Key metrics:
  1. Primary: Watch time / session
  2. Secondary: CTR on recommendations
  3. Guardrail: Retention (did users come back next day?)
  4. Counter-metric: Diversity

Duration: Minimum 7 days
Statistical significance: p < 0.05, minimum detectable effect = 0.5%

Common pitfall: "Engagement trap" — model optimizes for clicks → clickbait
  Solution: Optimize for COMPLETION rate, not just clicks

Feedback Loops and Position Bias

Problem 1: Position bias
  Items at position 1 get 10× more clicks than position 5.
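  Solution: Train with position as an input feature, then set it to a fixed value (e.g., position 1) at serving time so scores reflect relevance rather than placement.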

Problem 2: Popularity feedback loop
  Popular items → recommended more → more engagement → even more popular
  Solution: Exploration budget + Inverse propensity weighting

Problem 3: "Rabbit hole" effect
  User watches one conspiracy theory → algorithm feeds more → radicalization
  Solution: Max 3 videos from same category in a row, no category > 40% of feed
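
The two constraints in Problem 3 can be enforced with a greedy pass over the ranked list. This is one possible sketch, not a definitive re-ranker: the function name `diversify` is hypothetical, and it may return fewer than `feed_size` items when no remaining candidate satisfies both rules.

```python
from collections import Counter
from typing import Dict, List

MAX_RUN = 3          # at most 3 items from the same category in a row
MAX_SHARE_PCT = 40   # no category may fill more than 40% of the feed


def diversify(
    ranked_items: List[str], category: Dict[str, str], feed_size: int
) -> List[str]:
    """Greedily fill the feed in rank order, skipping items that break a rule."""
    cap = feed_size * MAX_SHARE_PCT // 100  # per-category slot cap
    feed: List[str] = []
    counts: Counter = Counter()
    remaining = list(ranked_items)
    while remaining and len(feed) < feed_size:
        for item in remaining:
            cat = category[item]
            tail = [category[i] for i in feed[-MAX_RUN:]]
            extends_full_run = len(tail) == MAX_RUN and all(c == cat for c in tail)
            if counts[cat] < cap and not extends_full_run:
                feed.append(item)
                counts[cat] += 1
                remaining.remove(item)
                break
        else:
            break  # no remaining item satisfies both constraints
    return feed
```

Because the pass always takes the highest-ranked eligible item, the output stays as close to the model's ordering as the constraints allow.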

SLOs & Error Budgets

Metric	Target	Rationale
Recommendation API latency	< 100ms p99	Homepage blocks on recommendations — user-visible
Feature freshness (real-time)	< 30s	Session clicks must influence same-session recommendations
ANN index freshness	< 24h	New items appear in recommendations within 1 day
Recommendation CTR (online)	> baseline +5%	Must beat popularity baseline to justify ML complexity

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Feature store outage — ranker returns default scores	Feature fetch error rate > 50%; recommendation quality drops; CTR alert	Fallback to two-tower retrieval-only (no ranker features); serve popularity baseline; page on-call; feature store has Redis replica failover
ANN index corruption after bad model deploy	Recommendation diversity drops to zero; same items for all users; NDCG offline metric crashed	Rollback to previous index snapshot (S3 versioned); disable new model; serve popularity fallback; post-mortem on index validation gate
Feedback loop — filter bubble detected	Category diversity in top-20 drops below 2; user cohort analysis shows narrowing preferences	Increase explore rate (10% → 20%); enable diversity re-ranker; inject serendipity candidates; schedule model retrain with diversity loss term

Cost Drivers (Staff lens)

ANN index storage: 10M items × 128-dim × 4 bytes ≈ 5 GB (modest)
Feature store (Redis): 100M users × 50 features × 8 bytes ≈ 40 GB
GPU inference for ranker: 500 candidates × 100 RPS = dominant compute cost
Training pipeline (Spark/GPU): daily retrain ≈ $5-10K/run at this scale
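
The two storage estimates above are simple products and can be reproduced directly. The helper names are hypothetical, and the embedding figure covers raw float32 vectors only; an HNSW index also stores graph links, which this sketch does not count.

```python
GB = 10**9  # decimal gigabyte


def raw_embedding_bytes(num_items: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes for raw float32 embeddings, excluding ANN graph overhead."""
    return num_items * dim * bytes_per_value


def feature_store_bytes(
    num_users: int, features_per_user: int, bytes_per_feature: int = 8
) -> int:
    """Bytes for per-user feature values, excluding key and metadata overhead."""
    return num_users * features_per_user * bytes_per_feature


embedding_gb = raw_embedding_bytes(10_000_000, 128) / GB      # 5.12
feature_store_gb = feature_store_bytes(100_000_000, 50) / GB  # 40.0
```

Both figures fit comfortably in memory on a small cluster, which supports the point that GPU ranking inference, not storage, dominates cost.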

Multi-Region & DR

Recommendations are regional: user history and inventory differ by market. Run a per-region ANN index and feature store. Train a global model on all data, with optional regional fine-tuning. Do not serve US recommendations to EU users, because inventory and regulatory requirements differ.