This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | The indexing + retrieval core. Draw inverted index structure, explain BM25 scoring with intuition (not just formula), sharding by term/document, and the full query pipeline from parse → retrieve → rank → snippet. |
| Arch 50 | Add query parsing (boolean, phrase, operators), spell correction, and two-stage ranking (cheap retrieval → expensive rerank). |
| Arch 75 | Staff: index freshness vs rebuild, tail query handling, adversarial SEO spam, and cost of serving 1B queries/day. |
Interview Prompt
Design a web search engine like Google. Crawl and index billions of web pages. Given a search query, return the top 10 most relevant results with snippets in under 200ms.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| How many documents and what's the query QPS? | 1B docs × 10K QPS drives sharding strategy; 100M docs fits on fewer shards. |
| Full web crawl or search over a fixed corpus? | Full crawl adds crawler, index freshness, and incremental update — assume we have crawled docs. |
| Simple keyword search or also phrase/boolean operators? | Query parser complexity — 'apple AND juice' vs 'apple juice' need different index lookups. |
| Personalized ranking or global relevance only? | Personalization adds user profile lookup in ranking stage — separate from core retrieval. |
Scope
In scope
- Inverted index design and sharding
- TF-IDF / BM25 scoring
- Query parsing and execution
- Ranking pipeline (retrieval + rerank)
- Snippet generation
- Capacity estimation
Out of scope (state explicitly)
- Web crawler internals (separate problem)
- Ad auction and sponsored results
- Full PageRank iterative computation
- Image/video search
Assumptions
- 1B indexed documents, avg 10 KB text each
- 10K queries/sec peak, 1K avg
- Return top 10 results per query
- Index update latency: new docs searchable within 1 hour
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Accept a text query and return a ranked list of relevant web pages
- Support keyword matching, phrase matching, and boolean queries
- Rank results by relevance (PageRank + text relevance + freshness + personalization)
- Display result snippets (title, URL, description with highlighted keywords)
- Support autocomplete/typeahead (see #11)
- Support image, video, and news search (vertical search)
- Spell correction: "systm desgn" → "Did you mean: system design?"
- Knowledge panels for entities (people, places, companies)
- Pagination of results
- Low Latency: Results in < 500 ms (ideally < 200 ms)
- High Availability: 99.99%
- Scalability: 100B+ indexed pages, 10B+ queries/day
- Freshness: New/updated pages indexed within minutes (for news) to hours
- Relevance: Results must be useful and accurate (quality is king)
- Spam Resistant: Resist SEO manipulation and web spam
| Metric | Calculation | Value |
|---|---|---|
| Indexed pages | Given (assumption documented in value) | 100B |
| Avg page size (compressed) | Given (typical workload assumption) | 100 KB |
| Raw index size | 100B × 100 KB | 10 PB |
| Inverted index size | ~10% of raw | 1 PB |
| Queries / day | Given | 10B |
| Queries / sec | 10B ÷ 86400 | ~115K (peak ~500K) |
| Index servers (1TB per server) | Given | ~1,000 index servers |
| Crawl rate | Given (assumption documented in value) | 1B pages/day |
Query Service (Query Understanding)
- Query parsing: Tokenize, lowercase, remove stop words, stem/lemmatize
- Spell correction: Edit distance + language model. "systm" → candidate corrections → pick highest probability correction
- Query expansion: "NYC restaurants" → also search "New York City restaurants"
- Intent detection: Is this a navigational query ("facebook login"), informational ("how to cook pasta"), or transactional ("buy iPhone 15")?
- Synonym handling: "car" → also search "automobile", "vehicle"
Inverted Index: Core Data Structure
An inverted index maps terms → list of documents containing that term.
"system" → [doc1:3, doc5:1, doc15:7, doc22:2, ...] (doc_id:term_frequency) "design" → [doc1:2, doc3:5, doc15:4, ...] "interview" → [doc1:1, doc15:3, doc42:2, ...]
Index Structure per term (posting list):
{
term: "system",
document_frequency: 50000000, // how many docs contain this term
posting_list: [
{doc_id: 1, tf: 3, positions: [15, 42, 78]},
{doc_id: 5, tf: 1, positions: [201]},
...
]
}Index Sharding: Term-based sharding (each shard holds a subset of terms) vs Document-based sharding (⭐ recommended) where each shard holds all terms for a subset of documents. Query is broadcast to ALL shards → each shard returns local top-K → merge results. Google uses this approach (called "index tiers").
Index Compression: Posting lists are compressed using Variable-Byte Encoding or PForDelta. Doc IDs are delta-encoded: [1, 5, 15, 22] → [1, 4, 10, 7] (smaller values compress better). Reduces index size by 5-10x.
Ranking Service: The Scoring Pipeline
Stage 1: Initial Retrieval (cheap, coarse): Boolean matching: Find documents containing ALL query terms (AND query). Use inverted index to intersect posting lists efficiently. Return candidate set (e.g., top 10,000 documents).
Stage 2: Scoring (TF-IDF / BM25): BM25 (Best Matching 25) is the industry standard text relevance formula using term frequency, document length, and inverse document frequency.
Stage 3: PageRank (Link Analysis): Measures page importance based on incoming links. Computed offline via iterative MapReduce (20-50 iterations until convergence). Stored per document, used as a static quality signal.
Stage 4: ML Re-ranking (expensive, precise): Learning-to-Rank model (LambdaMART, neural models). Features: BM25 score, PageRank, click-through rate, freshness, domain authority, user location. Re-ranks top 1,000 candidates from previous stages. Returns top 10 for display.
Snippet Generator
- For each result, generate a relevant snippet showing query terms in context
- Find the passage in the document with highest query term density
- Highlight matching terms in bold
- Truncate to ~160 characters
GET /api/v1/search?q=system+design+interview&page=1&lang=en&country=US
Response: 200 OK
{
"query": "system design interview",
"spell_correction": null,
"results_count": 125000000,
"results": [
{
"title": "System Design Interview Guide - ByteByteGo",
"url": "https://bytebytego.com/system-design",
"snippet": "A comprehensive guide to <b>system design interview</b> preparation...",
"favicon": "https://bytebytego.com/favicon.ico",
"cached_url": "...",
"rank": 1
},
...
],
"related_searches": ["system design interview questions", "system design primer"],
"knowledge_panel": null,
"pagination": {"page": 1, "total_pages": 100}
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Inverted Index (SSTable-like format on disk)
Term Dictionary (in-memory):
"system" → {offset: 0x4A2F, doc_freq: 50000000}
"design" → {offset: 0x8B1C, doc_freq: 35000000}
Posting List (on disk, compressed):
Offset 0x4A2F:
[doc_id_delta: 1, tf: 3, positions: [15, 42, 78]]
[doc_id_delta: 4, tf: 1, positions: [201]]
...Document Store (Bigtable / GFS)
Row Key: doc_id (hash of URL)
Column Families:
content: {title, body_text, meta_description, language}
metadata: {url, domain, crawl_date, content_hash, robots_directives}
links: {outgoing_urls[], incoming_count}
scores: {pagerank, spam_score, domain_authority}URL Frontier (for Crawler)
Priority Queue:
{url, priority, last_crawled, crawl_interval, domain}PageRank Store
Key: doc_id
Value: {pagerank_score: 0.00042, last_computed: timestamp}| Concern | Solution |
|---|---|
| Index server failure | Each index shard replicated 3×; query routed to healthy replica |
| Index corruption | Checksum verification; rebuild from content store |
| Query overload | Circuit breaker + graceful degradation (skip ML re-ranking, serve BM25 only) |
| Stale index | Incremental index updates; full rebuild weekly |
| Crawler politeness | Respect robots.txt, rate limit per domain (see #13) |
Specific: Index Serving Architecture: Index Tiers
- Tier 0: Most important pages (top 1B): always searched
- Tier 1: Important pages (top 10B): searched if Tier 0 insufficient
- Tier 2: Long tail (100B): searched only for rare queries
- Tiering reduces latency: most queries are answered by Tier 0 alone
Index Update Strategy
- Full rebuild: MapReduce job rebuilds entire index weekly (handles deletions, re-scoring)
- Incremental updates: Real-time pipeline adds new/updated pages to a "supplement index"
- Query searches both main index + supplement index; results merged
- Supplement index is periodically merged into main index
Anti-Spam / Web Spam Detection
- Link spam: Detect link farms (clusters of pages linking to each other artificially)
- Content spam: Keyword stuffing, hidden text, cloaking detection
- Click spam: Click-through rate manipulation detection
- Signals: Domain age, SSL certificate, content uniqueness, link velocity
Query Result Caching
- Cache top results for popular queries in Redis/Memcached
- 25% of queries are repeated within an hour → high cache hit ratio
- Cache key: normalized query + location + language
- TTL: 1 hour for most queries, 5 min for news queries
Semantic Search
- Beyond keyword matching: understand query meaning
- BERT / Transformer models: Encode query and documents into embedding vectors
- Vector similarity search (ANN: Approximate Nearest Neighbor) using FAISS or ScaNN
- Hybrid: BM25 keyword matching + semantic similarity → combined score
Interview Walkthrough
- Sketch the three-stage pipeline first — crawl, index, query — so the interviewer sees you understand the full lifecycle, not just the query box.
- Center the deep dive on the inverted index: term → posting list of (doc_id, positions, frequency) — this is the core data structure.
- Shard the index by document ID using Sharding and Partitioning; scatter-gather queries across shards in parallel with a strict latency budget.
- Allocate your 200 ms query budget explicitly: query parsing (10 ms), shard fan-out (5 ms), per-shard top-K (25 ms), merge + rerank (50 ms).
- Apply Caching Patterns to popular queries — roughly 25% of searches repeat within an hour, making result caching high-impact.
- Describe the ranking stack: BM25 for recall, then ML reranking on the top 1,000 candidates for precision.
- Mention incremental indexing (supplement index merged periodically) to handle fresh content without full rebuilds.
- Common pitfall: proposing to run
LIKE '%keyword%'scans on a relational DB — interviewers expect an inverted index.
End-to-End Query Execution Flow with Timing Budget
To serve search queries within an ultra-strict < 200 ms total budget, Google structures execution across parallel stages with fine-grained millisecond caps:
- Step 1: Query Understanding (10 ms): Tokenization, spell-checking, synonym query expansion (e.g. "NYC" → "New York City"), and intent classification.
- Step 2: Scatter to Index Shards (5 ms network): Stateless Query Coordinator broadcasts query to 100 document-sharded Tier-0 servers in parallel. Enforces a 150 ms deadline before skipping tardy nodes.
- Step 3: Per-Shard Processing (25 ms): Each shard performs compressed posting-list intersection (using fast two-pointer merges on sorted lists) + initial BM25 local scoring and local Top-100 heap selection.
- Step 4: Gather + Merge (5 ms): Coordinator receives up to 10,000 local candidates and performs a fast K-Way max-heap merge to fetch the global top 1,000.
- Step 5: ML Re-ranking (50 ms): Evaluates the top 1,000 candidates across hundreds of features (CTR, freshness, domain authority, PageRank) using LambdaMART + heavy BERT cross-encoders for the top 200.
- Step 6: Snippet Generation (10 ms): Fetches high-density passages for the final top 10 documents, highlights matching terms, and truncates snippets.
- Step 7: Response Assembly (5 ms): Inject related search queries, dynamic knowledge cards, and serialize the JSON payload.
Timing Summary: 10 + 5 + 25 + 5 + 50 + 10 + 5 = 110 ms total, well within target limits.
Scatter-Gather: Index Shard Coordination
Stateless query coordinators scatter queries to all document shards in parallel, presenting a major tail-latency issue.
The Tail Latency (Slow Shard) Problem:
- With 100 index shards, the aggregate query latency is bound by the slowest single shard.
- At 100 shards, the p99 latency of any single shard is approximately 3-5 times its median (p50) latency (due to garbage collection pauses, background indexing contention, or physical hardware issues).
- The likelihood of the entire query being slowed down by at least one laggy shard scales exponentially:
P(at least one shard is slow) = 1 - (1 - 0.01)¹⁰⁰ = 63% - Without software-level mitigations, 63% of all user search queries would suffer from high tail latency.
Technical Solutions:
- Hedged Requests (⭐ Recommended): Send the query to the primary replica of a shard. If it does not respond within 30 ms, send an identical speculative request to the secondary replica. Take the result from whichever replica responds first and cancel the other. This reduces tail latency drastically while adding only a small (~5%) increase in overall throughput load.
- Speculative Execution / Partial Returns: If 95 out of 100 shards have returned within 100 ms, complete the query using only those 95 shards. Missing a minor 5% of candidate documents is visually and qualitatively invisible to the user.
- Index Tiers: Split the index into Tier 0 (viral/popular 1B documents), Tier 1 (10B documents), and Tier 2 (100B documents). Only fallback to search Tier 1/2 if Tier 0 yields insufficient results, avoiding massive coordination fan-outs.
Shard Routing Trade-off: Document vs. Term Sharding
- Document Sharding (Google's choice): Each shard stores the entire inverted index dictionary for a subset of documents. The query coordinator must broadcast queries to all shards (fixed 100 RPCs), but execution is highly uniform and simple.
- Term Sharding: Shards are partitioned by lexicographical term ranges (e.g., Shard A stores terms A-C). While a query only requires querying a few shards (e.g., query "system design" needs shards for 's' and 'd' → 2 RPCs), intersecting lists across nodes is incredibly network-expensive, and popular terms create massive load imbalances.
Result Merging Across Document Shards
To accurately merge Top-100 scores returned by independent document shards, we must solve BM25 score comparability and execute sorting efficiently.
The Global Score Comparability Problem:
- The BM25 formula depends heavily on **Inverse Document Frequency (IDF)**:
IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5))
WhereNis the total documents in the corpus andn(q)is the count of documents containing the term. - If shards computed IDF using only their local document subset, the same term would carry different IDF values across shards, making the calculated BM25 scores incomparable.
- Solution: Global IDF Precomputation during Offline Index Build (MapReduce)
- Count total documents
Nacross ALL shards. - Count document frequency
n(q)for each term across ALL shards. - Compute the absolute global IDF value for each term.
- Store these IDF values in each shard's term dictionary.
- Count total documents
- At query time, every shard uses the exact same global IDF value, making BM25 scores globally comparable and ensuring the K-way merge produces the mathematically correct global Top-K.
K-Way Merge Algorithm:
With 100 shards returning their local Top-100 results sorted by score descending, the coordinator must merge 10,000 candidates:
- Create a Max-Heap of size 100 (containing one entry per shard).
- Insert the first (highest scoring) result from each of the 100 shards into the heap.
- Pop the maximum item from the heap → this is the global #1 result.
- Push the next highest scoring result from that popped shard into the heap.
- Repeat steps 3-4 until the global top-K (e.g., top 1,000) results are fully extracted.
This algorithm runs in extremely fast O(K × log(num_shards)) time, executing only around 7,000 comparisons (O(1000 × log(100))) to merge 10,000 candidates.
- Tie-Breaking: If text relevance scores are identical, static PageRank authority scores are used as a secondary sort.
- Near-Duplicate Deduping: Since document-sharding guarantees a document resides on exactly one shard, duplicates are impossible. For near-duplicates across different URLs, SimHash values are calculated on page text offline, and near-duplicates are filtered out in the ML re-ranking stage.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — MVP (single-node Lucene)
Single Elasticsearch/Lucene instance. 10M documents. BM25 scoring. No query parser (simple AND). Precomputed snippets. Serves 100 QPS.
Key components: Lucene/Elasticsearch · Simple tokenizer · BM25 · Static snippets
Move to next phase when: Index exceeds 64 GB RAM; query latency >500ms; need >100 QPS
Phase 2 — Sharded index + ranking pipeline
100 document-sharded index servers. Query coordinator broadcasts and merges. Two-stage rank: BM25 retrieval → LambdaMART rerank on top-100. Query parser with phrase/operator support. Kafka incremental indexer.
Key components: Sharded Lucene · Query coordinator · ML reranker · Kafka indexer · Spell checker
Move to next phase when: 1B documents; need global deployment; ML ranking requires feature store
Phase 3 — Global scale (1B docs, 10K QPS)
Multi-region index replicas. Term-based routing for single-term queries. Feature store (PageRank, click logs, freshness). Personalized rerank layer. Snippet cache in Redis. Query understanding (intent classification) pre-parser.
Key components: Multi-region shards · Feature store · Personalization · Query understanding · Snippet cache
Move to next phase when: Tail latency SLO breach; or need sub-100ms p99 globally
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Search p99 latency | < 200ms | Users abandon after 300ms — Google internal benchmark |
| Index freshness | < 1 hour | News and ecommerce need recent content |
| Query success rate (non-empty results) | > 95% | Empty results = bad experience; spell correction helps |
| Index availability | 99.99% | Search downtime is total product outage |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Hot term shard overload ('superbowl' during event) | Single shard p99 >2s; posting list for term exceeds memory cache; other shards normal | Replicate hot term posting list to all shards (pre-warmed for known events); enable query result cache for exact query string; rate-limit autocomplete separately from search |
| Bad segment merge corrupts index shard | Query results missing known docs; shard N doc count drops 30%; merge job error logs | Take shard offline; restore from replica; replay Kafka index events from last checkpoint; disable auto-merge until root cause fixed |
| SEO spam flood indexes 10M junk pages in 1 hour | Index growth rate 100× normal; avg BM25 scores drop; user click-through rate plummets | Emergency crawl policy tightening; bulk delete by domain pattern; reindex affected shards; add domain authority threshold to indexer gate |
Cost Drivers (Staff lens)
- Index storage: 1B docs × 10 KB text × 3× index overhead ≈ 30 TB (compressed posting lists + stored fields)
- Query serving: 10K QPS × 100 shards × 50 ms CPU = ~500 core-shards always active
- ML reranker: GPU inference on top-100 × 10K QPS = significant if not batched (batch to 50ms windows)
- Kafka indexer pipeline: proportional to crawl rate, not query rate
Multi-Region & DR
Full index replicated to 3+ regions (read-only replicas). Query served locally — no cross-region index lookup. Incremental updates originate from primary crawl region, replicated via Kafka mirroring. User-specific ranking features (click history) stored regionally. Global signals (PageRank) synced daily.