This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | ML systems design — nail the retrieval→ranking funnel, two-tower architecture, ANN candidate generation, feature store role, cold start, and explore/exploit (multi-armed bandit). Show you know the pipeline, not just collaborative filtering. |
| Arch 50 | Add real-time feature freshness, A/B testing infrastructure, model serving latency budgets, and re-ranking for diversity. |
| Arch 75 | Staff: discuss training/serving skew, embedding drift, feedback loops (filter bubble), and cost of billion-scale ANN. |
Interview Prompt
Design a recommendation system for an e-commerce platform that suggests products to users. Support personalized recommendations for logged-in users, reasonable defaults for new users (cold start), and balance exploitation (show what we know works) with exploration (discover new preferences). Handle 100M users and 10M products.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What surfaces need recommendations — homepage, product page, email? | Homepage needs diverse categories; product page needs similar items — different models/pipelines. |
| Real-time personalization or batch-updated? | Session-based (real-time clicks) vs daily batch (collaborative filtering matrix) — different latency budgets. |
| What signals do we have — clicks, purchases, ratings, dwell time? | Purchase signal is sparse but high-quality; click is dense but noisy. Drives training data design. |
| How do we measure success — CTR, conversion, revenue, diversity? | Optimizing CTR alone creates filter bubble; need multi-objective ranking. |
Scope
In scope
- Two-tower model for retrieval
- ANN (Approximate Nearest Neighbor) candidate generation
- Ranking model (LTR) with feature store
- Cold start for new users and new items
- Explore/exploit balance (multi-armed bandit)
- Real-time and batch feature pipeline
Out of scope (state explicitly)
- Full model training pipeline (GPU cluster, hyperparameter tuning)
- Content moderation of recommended items
- Ad auction / sponsored recommendations
Assumptions
- 100M users, 10M products, 1B interactions/day
- 100ms p99 latency budget for recommendation API
- Return top-20 recommendations per request
- Logged-in users (80%) + anonymous (20%)
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Personalized recommendations: Show content tailored to each user's taste
- Multiple surfaces: Home feed, "Because you watched X", "Trending", category pages
- Real-time signals: Incorporate recent user actions (watch, like, skip) within minutes
- Cold start handling: Recommendations for brand-new users with no history
- Diversity: Avoid filter bubbles; expose users to varied content
- Explainability: "Because you watched Stranger Things" or "Trending in your area"
- Low Latency: Recommendations served in < 200 ms
- High Throughput: Serve 100K+ recommendation requests/sec
- Freshness: New content surfaced within hours of upload
- Scalability: 500M+ users, 100M+ items (videos/movies/songs)
- Offline + Online: Batch training (daily) + real-time feature updates
- A/B Testable: Every model change tested via controlled experiments
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given | 300M |
| Rec requests / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 100K |
| Items in catalog | Given | 100M |
| User-item interactions / day | Given | 10B |
| Model size | Given | 10-50 GB |
| Feature store entries | Given | 300M users + 100M items |
| Candidate generation latency | Given | < 50 ms |
| Ranking latency budget | Given | < 100 ms |
Three-Stage Pipeline
1. Candidate Generation (ANN search) — 3K candidates in ~30ms Two-Tower Model: user_embedding · item_embeddings → ANN (HNSW in Faiss) 2. Ranking (DNN) — 5K items scored in ~50ms via GPU batch inference Deep neural network with ~200 features per (user, item) pair 3. Re-ranking (Business Rules) — Apply diversity, freshness, filters in ~10ms
Cold Start: New Users and New Items
New user (no watch history): 1. Demographics-based: recommend popular items for age group + region 2. Onboarding survey: "What genres do you like?" → seed preferences 3. Exploration-heavy: show diverse content, learn from first 10 interactions 4. Bandits (explore/exploit): Multi-Armed Bandit selects between diverse candidates New item (no engagement data): 1. Content-based features: genre, tags, description, cast → predict similar items 2. Creator signal: if creator's past items performed well → boost new item 3. Guaranteed exposure: every new item gets shown to N users (e.g., 1000) 4. Freshness boost in re-ranking: new items get score multiplier for first 48 hours
Embedding Generation: Two-Tower Model
Tower 1 (User encoder): Input: user watch history, demographics, preferences Output: user_embedding (128-dim float vector) Architecture: Transformer encoder over watch sequence Tower 2 (Item encoder): Input: item metadata (genre, tags, description), engagement stats Output: item_embedding (128-dim float vector) Architecture: MLP over concatenated features Training: Contrastive learning (positive = watched to completion, negative = random) Inference: dot_product(user_emb, item_embs) → HNSW ANN search → top 3K in ~10ms
Event Bus Design (Kafka)
Topic: recommendation_system-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "recommendation_system-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: recommendation_system-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Recommendation System (Netflix / TikTok Style): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
GET /api/v1/recommendations?surface=home&count=20&cursor=0
→ 200 OK
{
"items": [
{ "item_id": "movie-123", "title": "...", "score": 0.95,
"reason": "Because you watched Stranger Things" },
...
],
"cursor": "20"
}
POST /api/v1/events
{ "user_id": "u1", "item_id": "movie-123", "event": "watch_complete",
"duration_sec": 3600, "timestamp": "..." }
→ 202 AcceptedCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Feature Store (Redis)
user:{uid}:embedding → Binary (128 floats × 4B = 512 bytes)
user:{uid}:history → List (last 100 item_ids)
user:{uid}:profile → Hash {country, age_group, language, signup_date}
item:{iid}:embedding → Binary (512 bytes)
item:{iid}:stats → Hash {completion_rate, like_rate, view_count}
watched:{uid} → Set (item_ids watched, TTL: 30 days)
feed:{uid} → List (pre-computed recommendations, TTL: 30 min)Model Artifacts (S3)
s3://models/collaborative-filtering/v42/ ├── user_embeddings.npy (300M × 128 × 4 = ~150 GB) ├── item_embeddings.npy (100M × 128 × 4 = ~50 GB) └── model_metadata.json s3://models/ranking-dnn/v15/ ├── saved_model.pb └── model_config.json
Vector Index (Faiss/Milvus)
HNSW index over 100M item embeddings (128-dim) Build time: ~2 hours on GPU Search time: ~5ms for top-3K nearest neighbors Memory: ~55 GB (embeddings + graph structure) Update: nightly rebuild or real-time incremental insert
| Concern | Solution |
|---|---|
| Model serving failure | Fallback to previous model version (blue-green deploy) |
| Feature store down | Serve from cached feed; degrade to popularity-based recs |
| ANN index stale | Rebuild nightly; new items get guaranteed exposure via exploration |
| Training data poisoning | Filter bot/spam interactions before training |
| Cold start | Demographics + onboarding survey + exploration |
| Filter bubble | 10% of recommendations are random exploration items |
1. Feature Staleness
User watches "The Conjuring" at T=0. Refreshes feed at T=5s → features still show old interests.
Mitigations:
1. Client-side context: Pass last_watched_item_id in request headers
Ranking model uses this as real-time context feature (no pipeline delay)
2. Session-level re-ranking: Boost items similar to recently watched
3. Accept 30-second delay: for most users, this is invisible2. Popularity Bias: Rich Get Richer
Solution: Exploration budget
90% of feed: model-scored recommendations (exploitation)
10% of feed: random items from underexposed pool (exploration)
Thompson Sampling: Select exploration items using Bayesian bandits
Each item has a Beta distribution of (successes, failures)
Sample from each distribution → highest sample gets shown
Balances exploration of uncertain items vs exploitation of known good onesBatch vs Real-Time Recommendations
Hybrid (Netflix/TikTok approach): Batch pre-computes candidate set (top 500) — updated every 30 min Real-time ranking re-scores candidates using latest features — per request Re-ranking applies business rules — per request Total latency: ~150 ms Freshness: ranking uses real-time features (last watched, time of day)
Model Serving Infrastructure: GPU Inference
Batching: Score 5K items in ONE forward pass through the DNN. GPU parallelism: 5K scores in ~50ms (vs 5K × 1ms = 5s sequential) Latency budget: Feature fetch: 20ms (Redis) Candidate generation: 30ms (ANN search) Ranking inference: 50ms (GPU batch) Re-ranking: 10ms Total: ~110ms
Real-Time Feature Pipeline
User action → Kafka → Flink →
1. Update user embedding (lightweight, < 5ms) → write to Redis
2. Update engagement counters: INCR user:{uid}:genre_watch_count:horror
3. Update session-level features: LPUSH user:{uid}:session_history {item_id}
Next request (at +1min): reads updated embedding → ANN returns horror-adjacent candidates
→ Ranking model uses updated genre counts → horror movies score higher
Feature freshness tiers:
Real-time (< 1 min): last watched, session history, time of day
Near-real-time (< 30 min): updated embeddings, genre preferences
Batch (daily): user demographics, long-term preferencesInterview Walkthrough
- Structure the answer as a multi-stage funnel: candidate retrieval (millions → hundreds) → ranking → business-rule re-ranking.
- Generate candidates with two-tower embeddings and ANN search (FAISS/HNSW) — brute-force scoring every catalog item violates the latency budget.
- Pre-compute top-500 candidates in batch (every 30 min); re-score per request with real-time features from Redis (last watched, session history).
- Serve ranking inference on GPU with request batching — score 5K items in one forward pass (~50ms) instead of 5K sequential calls.
- Inject explore/exploit in re-ranking: reserve 10–15% of slots for new items to solve cold-start and gather engagement data.
- Stream user actions to Kafka → Flink to update embeddings and genre counters within 1 minute of a watch event.
- State latency budget explicitly: feature fetch (20ms) + ANN (30ms) + GPU rank (50ms) + re-rank (10ms) ≈ 110ms total.
- Common pitfall: proposing a single monolithic model that scores the full catalog per request — interviewers expect retrieval + rank separation.
Collaborative vs Content-Based vs Hybrid
| Approach | How | Pros | Cons |
|---|---|---|---|
| Collaborative | "Users like you liked X" | Discovers unexpected interests | Cold start; popularity bias |
| Content-Based | "Similar to what you liked" | No cold start for items; transparent | Limited serendipity |
| Hybrid | Multiple candidate sources, unified ranker | Best of all worlds | System complexity |
A/B Testing: How to Validate Model Changes
Control group (50%): current production model (v14) Treatment group (50%): new candidate model (v15) Users deterministically assigned: hash(user_id) % 100 < 50 Key metrics: 1. Primary: Watch time / session 2. Secondary: CTR on recommendations 3. Guardrail: Retention (did users come back next day?) 4. Counter-metric: Diversity Duration: Minimum 7 days Statistical significance: p < 0.05, minimum detectable effect = 0.5% Common pitfall: "Engagement trap" — model optimizes for clicks → clickbait Solution: Optimize for COMPLETION rate, not just clicks
Feedback Loops and Position Bias
Problem 1: Position bias Items at position 1 get 10× more clicks than position 5. Solution: Remove position feature during training. Problem 2: Popularity feedback loop Popular items → recommended more → more engagement → even more popular Solution: Exploration budget + Inverse propensity weighting Problem 3: "Rabbit hole" effect User watches one conspiracy theory → algorithm feeds more → radicalization Solution: Max 3 videos from same category in a row, no category > 40% of feed
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Popularity + content-based baseline
Batch job computes trending products per category (Spark). Content-based: product category/tag matching for logged-in users with preferences. Redis cache for top-100 per category. No ML model — rule-based.
Key components: Spark batch jobs · Redis cache · Category trending · Content-based matching · A/B test framework
Move to next phase when: CTR plateau on popularity; need personalization from interaction data
Phase 2 — Two-tower retrieval + LTR ranking
Two-tower model trained on click/purchase data. Faiss ANN index for candidate generation. XGBoost ranker with feature store (Feast). Explore/exploit via ε-greedy. Daily model retrain + index rebuild.
Key components: Two-tower model · Faiss ANN index · XGBoost ranker · Feast feature store · ε-greedy explore
Move to next phase when: 100ms latency budget exceeded; need real-time features and incremental index updates
Phase 3 — Real-time personalization at scale
Streaming feature pipeline (Flink) for session features. Incremental ANN index updates for new items. Multi-objective ranker (CTR + revenue + diversity). Contextual bandit for explore/exploit. Shadow scoring for skew detection. GPU inference for ranker.
Key components: Flink streaming features · Incremental ANN · Multi-objective ranker · Contextual bandit · GPU serving · Shadow scoring
Move to next phase when: Model complexity exceeds CPU serving capacity; need GPU inference cluster
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Recommendation API latency | < 100ms p99 | Homepage blocks on recommendations — user-visible |
| Feature freshness (real-time) | < 30s | Session clicks must influence same-session recommendations |
| ANN index freshness | < 24h | New items appear in recommendations within 1 day |
| Recommendation CTR (online) | > baseline +5% | Must beat popularity baseline to justify ML complexity |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Feature store outage — ranker returns default scores | Feature fetch error rate > 50%; recommendation quality drops; CTR alert | Fallback to two-tower retrieval-only (no ranker features); serve popularity baseline; page on-call; feature store has Redis replica failover |
| ANN index corruption after bad model deploy | Recommendation diversity drops to zero; same items for all users; NDCG offline metric crashed | Rollback to previous index snapshot (S3 versioned); disable new model; serve popularity fallback; post-mortem on index validation gate |
| Feedback loop — filter bubble detected | Category diversity in top-20 drops below 2; user cohort analysis shows narrowing preferences | Increase explore rate (10% → 20%); enable diversity re-ranker; inject serendipity candidates; schedule model retrain with diversity loss term |
Cost Drivers (Staff lens)
- ANN index storage: 10M items × 128-dim × 4 bytes ≈ 5 GB (modest)
- Feature store (Redis): 100M users × 50 features × 8 bytes ≈ 40 GB
- GPU inference for ranker: 500 candidates × 100 RPS = dominant compute cost
- Training pipeline (Spark/GPU): daily retrain ≈ $5-10K/run at this scale
Multi-Region & DR
Recommendations are regional — user history and inventory differ by market. Per-region ANN index and feature store. Global model training on all data; regional fine-tuning optional. Cross-region: don't serve US recommendations to EU users (inventory/regulatory differences).