Design a Search Ranking System (Learning to Rank)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Understand retrieval → ranking → re-ranking stages and why you can't run a cross-encoder on 10M docs.
Arch 75	ML systems design — expect multi-stage ranking pipeline, offline vs online metrics, and model deployment safety (shadow → interleaving → canary).

Interview Prompt

Design a search ranking system that takes a user query and returns the most relevant documents from billions of indexed pages. The system should improve over time using click feedback.

Clarifying Questions (ask before designing)

Question	Why it matters
What's the corpus size and query QPS?	100K QPS with 100B docs drives ANN retrieval + two-stage ranking, not brute-force scoring.
Offline metric vs online metric — which do we optimize?	NDCG@10 offline; CTR@1 online. They can diverge — position bias in click data.
Personalization required or global ranking?	Personalization adds user feature store latency budget (~5ms) and cold-start complexity.
Latency budget for full pipeline?	200ms total: ~50ms retrieval, ~100ms ranking, ~50ms re-ranking + rendering.

Scope

In scope

Multi-stage ranking (retrieval → coarse → fine → re-rank)
Feature store for online inference
Offline training with NDCG optimization
Safe model deployment pipeline

Out of scope (state explicitly)

Full index construction (covered in Search Engine #12)
Crawling and document ingestion
Ad auction / monetization ranking

Assumptions

100K queries/sec, 100B indexed documents
200ms p99 latency budget for full pipeline
Daily model retraining; index updated hourly

Multi-stage ranking pipeline: retrieval → coarse ranking → fine ranking → re-ranking
Feature extraction: query, document, and query-document interaction features
Support multiple ranking objectives: relevance, freshness, diversity, personalization
Online A/B testing of ranking models
Click-through feedback loop: use user clicks to improve ranking
Support for different verticals: web, images, videos, news, shopping

Metric	Calculation	Value
Queries / sec	Given	100K
Index size	Given	100B documents
Retrieval candidates	ANN top-K	10K per query
Coarse rank input	Bi-encoder scores 10K	~30ms GPU batch
Fine rank input	Cross-encoder on top 500	~100ms
Feature store reads	100K QPS × 20 features	2M feature lookups/sec
Daily training data	100K × 86400 × 10 results	~86B impression logs/day

Loading...

Learning to Rank Approaches

Pointwise: Treat ranking as regression/classification
  Input: (query, document) → Output: relevance score (0-4)
  Loss: MSE or cross-entropy
  Problem: Doesn't optimize for ranking order directly

Pairwise: Predict which of two documents is more relevant
  Input: (query, doc_A, doc_B) → Output: P(A better than B)
  Better than pointwise: learns relative ordering

Listwise: Optimize ranking metric directly
  Input: (query, [doc_1, ..., doc_n]) → Output: ranked list
  Loss: approximate NDCG/MAP gradient
  Example: LambdaMART — industry standard
  Best for search: directly optimizes what you measure

Relevance Labels & Training Data

Human judgments (gold standard, expensive): Raters judge (query, document) pairs on 0-4 scale. Click data (cheap, biased): Click = positive signal, but position-biased. De-biasing via inverse propensity scoring (IPS). Engagement metrics: Time on page, scroll depth, share, bookmark.

Key Features for Search Ranking

Query Features:
  - Query length, intent (navigational/informational/transactional)
  - Named entity type, query frequency, spell-corrected query

Document Features:
  - PageRank / domain authority, content freshness
  - Document length, historical CTR, spam score

Query-Document Interaction Features:
  - BM25 score (title, body, anchor text), TF-IDF cosine similarity
  - BERT cross-encoder score (query × title)
  - Embedding cosine similarity (bi-encoder)

User Features (personalization):
  - Search history, click history, location, language, device type

Feature Store Architecture

Feature Store Architecture:

- Feature Pipeline — Processes raw event logs via Apache Beam/Flink.
  Computes aggregations defined in features.yaml ? writes to online + offline stores.

- Online Feature Store (Redis) — Low-latency for real-time inference.
  Keyed by entity_id, returns feature vectors in < 5ms.

- Offline Feature Store (S3/Parquet) — Historical feature snapshots.
  Point-in-time correct values alongside training labels.

- Feature Versioning — version: 3 means pipeline updated 3 times.
  Models pin to specific version; old versions maintained until all models migrate.

Model Deployment Pipeline

Stage 1: OFFLINE EVALUATION
  Train challenger, evaluate: NDCG@10 improved = 0.5%? Latency p99 = 150ms?

Stage 2: SHADOW MODE (1-3 days)
  Both models score EVERY query. Only champion's results shown.
  Compare: predicted NDCG, latency, error rates.

Stage 3: INTERLEAVING EXPERIMENT (3-7 days)
  Mix results from both models in same SERP (odd=champion, even=challenger).
  Measure: which model's results get more clicks.
  Needs 10× fewer queries than A/B test for significance.

Stage 4: CANARY (1% traffic, 2-3 days)
  Monitor: NDCG, CTR@1, abandonment rate, latency p99.
  Auto-rollback on: NDCG drop > 1%, latency p99 > 200ms, error rate > 0.1%.

Stage 5: GRADUAL RAMP (1-2 weeks)
  1% ? 5% ? 25% ? 50% ? 100%. Each step: 2-day bake time.

Rollback at any stage: config change, takes effect in < 60 seconds.

Search API

HTTP

POST /api/v1/search
Content-Type: application/json

{
  "query": "best noise cancelling headphones",
  "user_id": "u-123",
  "vertical": "web",
  "limit": 10
}

Response: 200 OK
{
  "results": [
    { "doc_id": "d-456", "title": "...", "score": 0.92, "snippet": "..." }
  ],
  "latency_ms": 142,
  "model_version": "ranker-v47"
}

Click Feedback (async)

HTTP

POST /api/v1/feedback/click
{ "query_id": "q-789", "doc_id": "d-456", "position": 1, "dwell_ms": 4200 }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

SLOs & Error Budgets

Metric	Target	Rationale
Search p99 latency (full pipeline)	< 200ms	User abandonment spikes above 300ms
NDCG@10 (offline, weekly)	> baseline + 0.5% per quarter	Primary quality metric
Zero-result rate	< 2%	Spell correction + query expansion fallback
Feature store p99	< 5ms	Budget allocation within 200ms total

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Ranking model serves irrelevant results after deploy	NDCG drop in shadow mode; CTR@1 drops 3% in canary	Automatic rollback within 60s; revert to previous model version; post-mortem on feature skew
Index replica lag causes stale results	Freshness feature shows docs > 1 hour old in top 10	Route to caught-up replica; increase refresh frequency; boost freshness weight temporarily
Adversarial SEO spam in top results	Spam score feature spike; user report rate increases	Emergency demotion rule; retrain with spam labels; manual review queue for flagged domains

Cost Drivers (Staff lens)

GPU inference for cross-encoder on top 500 candidates per query
Feature store memory (user + doc embeddings at scale)
Index storage: 100B docs × inverted index + embeddings

Multi-Region & DR

Index replicated per region for latency. Feature store: user features follow user (global); doc features local to index shard. Model serving: same model version globally; deploy synchronized to prevent ranking inconsistency.