This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design a search ranking system that takes a user query and returns the most relevant documents from billions of indexed pages. The system should improve over time using click feedback.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What's the corpus size and query QPS? | 100K QPS with 100B docs drives ANN retrieval + two-stage ranking, not brute-force scoring. |
| Offline metric vs online metric — which do we optimize? | NDCG@10 offline; CTR@1 online. They can diverge — position bias in click data. |
| Personalization required or global ranking? | Personalization adds user feature store latency budget (~5ms) and cold-start complexity. |
| Latency budget for full pipeline? | 200ms total: ~50ms retrieval, ~100ms ranking, ~50ms re-ranking + rendering. |
Scope
In scope
- Multi-stage ranking (retrieval → coarse → fine → re-rank)
- Feature store for online inference
- Offline training with NDCG optimization
- Safe model deployment pipeline
Out of scope (state explicitly)
- Full index construction (covered in Search Engine #12)
- Crawling and document ingestion
- Ad auction / monetization ranking
Assumptions
- 100K queries/sec, 100B indexed documents
- 200ms p99 latency budget for full pipeline
- Daily model retraining; index updated hourly
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Multi-stage ranking pipeline: retrieval → coarse ranking → fine ranking → re-ranking
- Feature extraction: query, document, and query-document interaction features
- Support multiple ranking objectives: relevance, freshness, diversity, personalization
- Online A/B testing of ranking models
- Click-through feedback loop: use user clicks to improve ranking
- Support for different verticals: web, images, videos, news, shopping
- Low Latency: Full ranking pipeline < 200ms p99
- High Throughput: 100K+ queries/sec
- Freshness: Model updated daily; index updated hourly
- Quality: NDCG@10 primary offline metric; CTR@1 online
| Metric | Calculation | Value |
|---|---|---|
| Queries / sec | Given | 100K |
| Index size | Given | 100B documents |
| Retrieval candidates | ANN top-K | 10K per query |
| Coarse rank input | Bi-encoder scores 10K | ~30ms GPU batch |
| Fine rank input | Cross-encoder on top 500 | ~100ms |
| Feature store reads | 100K QPS × 20 features | 2M feature lookups/sec |
| Daily training data | 100K × 86400 × 10 results | ~86B impression logs/day |
Learning to Rank Approaches
Pointwise: Treat ranking as regression/classification Input: (query, document) → Output: relevance score (0-4) Loss: MSE or cross-entropy Problem: Doesn't optimize for ranking order directly Pairwise: Predict which of two documents is more relevant Input: (query, doc_A, doc_B) → Output: P(A better than B) Better than pointwise: learns relative ordering Listwise: Optimize ranking metric directly Input: (query, [doc_1, ..., doc_n]) → Output: ranked list Loss: approximate NDCG/MAP gradient Example: LambdaMART — industry standard Best for search: directly optimizes what you measure
Relevance Labels & Training Data
Human judgments (gold standard, expensive): Raters judge (query, document) pairs on 0-4 scale. Click data (cheap, biased): Click = positive signal, but position-biased. De-biasing via inverse propensity scoring (IPS). Engagement metrics: Time on page, scroll depth, share, bookmark.
Key Features for Search Ranking
Query Features: - Query length, intent (navigational/informational/transactional) - Named entity type, query frequency, spell-corrected query Document Features: - PageRank / domain authority, content freshness - Document length, historical CTR, spam score Query-Document Interaction Features: - BM25 score (title, body, anchor text), TF-IDF cosine similarity - BERT cross-encoder score (query × title) - Embedding cosine similarity (bi-encoder) User Features (personalization): - Search history, click history, location, language, device type
Feature Store Architecture
Feature Store Architecture: - Feature Pipeline — Processes raw event logs via Apache Beam/Flink. Computes aggregations defined in features.yaml ? writes to online + offline stores. - Online Feature Store (Redis) — Low-latency for real-time inference. Keyed by entity_id, returns feature vectors in < 5ms. - Offline Feature Store (S3/Parquet) — Historical feature snapshots. Point-in-time correct values alongside training labels. - Feature Versioning — version: 3 means pipeline updated 3 times. Models pin to specific version; old versions maintained until all models migrate.
Model Deployment Pipeline
Stage 1: OFFLINE EVALUATION Train challenger, evaluate: NDCG@10 improved = 0.5%? Latency p99 = 150ms? Stage 2: SHADOW MODE (1-3 days) Both models score EVERY query. Only champion's results shown. Compare: predicted NDCG, latency, error rates. Stage 3: INTERLEAVING EXPERIMENT (3-7 days) Mix results from both models in same SERP (odd=champion, even=challenger). Measure: which model's results get more clicks. Needs 10× fewer queries than A/B test for significance. Stage 4: CANARY (1% traffic, 2-3 days) Monitor: NDCG, CTR@1, abandonment rate, latency p99. Auto-rollback on: NDCG drop > 1%, latency p99 > 200ms, error rate > 0.1%. Stage 5: GRADUAL RAMP (1-2 weeks) 1% ? 5% ? 25% ? 50% ? 100%. Each step: 2-day bake time. Rollback at any stage: config change, takes effect in < 60 seconds.
Search API
POST /api/v1/search
Content-Type: application/json
{
"query": "best noise cancelling headphones",
"user_id": "u-123",
"vertical": "web",
"limit": 10
}
Response: 200 OK
{
"results": [
{ "doc_id": "d-456", "title": "...", "score": 0.92, "snippet": "..." }
],
"latency_ms": 142,
"model_version": "ranker-v47"
}Click Feedback (async)
POST /api/v1/feedback/click
{ "query_id": "q-789", "doc_id": "d-456", "position": 1, "dwell_ms": 4200 }Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Feature Store (Redis: online)
user_features:{user_id} → Hash (embedding, history, locale)
doc_features:{doc_id} → Hash (pagerank, freshness, spam_score)
query_doc:{query_hash}:{doc} → Float (bm25, cross_encoder_score)Training Data (S3 / Parquet: offline)
Columns: query, doc_id, label (0-4), features[], timestamp, model_version Partitioned by: date/hour Point-in-time joins: feature snapshot at impression time (avoid leakage)
- Model failure: fall back to BM25 ranking
- Feature store unavailable: serve with default/missing features
- Latency spike: skip fine ranking stage, serve coarse ranking results
- Index failure: serve from replica index; rebuild from source data
Offline Metrics: NDCG@K (THE primary metric for search), MAP (Mean Average Precision), MRR (Mean Reciprocal Rank).
Online Metrics: CTR@1, abandonment rate, time-to-first-click, pogo-sticking rate, sessions per search.
Interview Walkthrough
- Separate retrieval (BM25/ANN gets top 1000) from learning-to-rank (ML scores top 50) — two-stage funnel.
- Explain training data from click logs with position debiasing — top results get clicks unfairly.
- Cover feature categories: query-document match, page quality, freshness, personalization signals.
- Discuss offline NDCG evaluation before online A/B testing any ranker change.
- Mention semantic search (embeddings) vs keyword search hybrid with score fusion.
- Common pitfall: optimizing click-through rate alone — users click sensational results that hurt long-term satisfaction.
Query: "best noise cancelling headphones" 5 results, human relevance judgments (0-4 scale): Position 1: Result A → relevance = 3 Position 2: Result B → relevance = 2 Position 3: Result C → relevance = 0 Position 4: Result D → relevance = 1 Position 5: Result E → relevance = 0 Step 1: DCG = S (2^rel_i - 1) / log2(i + 1) Pos 1: (2³-1)/log2(2) = 7.000 Pos 2: (2²-1)/log2(3) = 1.893 Pos 3: (2°-1)/log2(4) = 0.000 Pos 4: (2¹-1)/log2(5) = 0.431 Pos 5: (2°-1)/log2(6) = 0.000 DCG@5 = 9.324 Step 2: IDCG (ideal order: [3, 2, 1, 0, 0]) = 9.393 Step 3: NDCG = 9.324 / 9.393 = 0.993 LambdaMART: lambda for swapping C and D = ?NDCG = 0.007 For a bigger relevance gap, lambda would be much larger
Semantic Search vs Keyword Search
Hybrid (Best practice): score = a × BM25_score + ß × embedding_similarity. Tune a, ß with LTR model: both are features for the ranker.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — BM25 + manual rules
Inverted index with BM25 scoring. Manual boosts for freshness and domain authority. No ML.
Key components: Elasticsearch · BM25 · Manual boost rules
Move to next phase when: Relevance plateaus; product wants personalization
Phase 2 — Learning to Rank (LambdaMART)
Feature extraction pipeline. Offline training on human judgments + click data. Two-stage: BM25 retrieval → LambdaMART on top 1000.
Key components: Feature pipeline · LambdaMART · Offline eval (NDCG@10) · A/B testing framework
Move to next phase when: Latency budget exceeded at 1000 candidates; need ANN retrieval
Phase 3 — Neural + real-time features
Two-tower bi-encoder for retrieval. Cross-encoder fine ranker. Online feature store (Redis). Shadow → interleaving → canary deploy pipeline.
Key components: Two-tower ANN · Cross-encoder · Feature store · Interleaving experiments · Auto-rollback
Move to next phase when: Query QPS exceeds 100K; need sub-200ms with personalization
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Search p99 latency (full pipeline) | < 200ms | User abandonment spikes above 300ms |
| NDCG@10 (offline, weekly) | > baseline + 0.5% per quarter | Primary quality metric |
| Zero-result rate | < 2% | Spell correction + query expansion fallback |
| Feature store p99 | < 5ms | Budget allocation within 200ms total |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Ranking model serves irrelevant results after deploy | NDCG drop in shadow mode; CTR@1 drops 3% in canary | Automatic rollback within 60s; revert to previous model version; post-mortem on feature skew |
| Index replica lag causes stale results | Freshness feature shows docs > 1 hour old in top 10 | Route to caught-up replica; increase refresh frequency; boost freshness weight temporarily |
| Adversarial SEO spam in top results | Spam score feature spike; user report rate increases | Emergency demotion rule; retrain with spam labels; manual review queue for flagged domains |
Cost Drivers (Staff lens)
- GPU inference for cross-encoder on top 500 candidates per query
- Feature store memory (user + doc embeddings at scale)
- Index storage: 100B docs × inverted index + embeddings
Multi-Region & DR
Index replicated per region for latency. Feature store: user features follow user (global); doc features local to index shard. Model serving: same model version globally; deploy synchronized to prevent ranking inconsistency.