Design a Search Engine (Google)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	The indexing + retrieval core. Draw inverted index structure, explain BM25 scoring with intuition (not just formula), sharding by term/document, and the full query pipeline from parse → retrieve → rank → snippet.
Arch 50	Add query parsing (boolean, phrase, operators), spell correction, and two-stage ranking (cheap retrieval → expensive rerank).
Arch 75	Staff: index freshness vs rebuild, tail query handling, adversarial SEO spam, and cost of serving 1B queries/day.

Interview Prompt

Design a web search engine like Google. Crawl and index billions of web pages. Given a search query, return the top 10 most relevant results with snippets in under 200ms.

Clarifying Questions (ask before designing)

Question	Why it matters
How many documents and what's the query QPS?	1B docs × 10K QPS drives sharding strategy; 100M docs fits on fewer shards.
Full web crawl or search over a fixed corpus?	Full crawl adds crawler, index freshness, and incremental update — assume we have crawled docs.
Simple keyword search or also phrase/boolean operators?	Query parser complexity — 'apple AND juice' vs 'apple juice' need different index lookups.
Personalized ranking or global relevance only?	Personalization adds user profile lookup in ranking stage — separate from core retrieval.

Scope

In scope

Inverted index design and sharding
TF-IDF / BM25 scoring
Query parsing and execution
Ranking pipeline (retrieval + rerank)
Snippet generation
Capacity estimation

Out of scope (state explicitly)

Web crawler internals (separate problem)
Ad auction and sponsored results
Full PageRank iterative computation
Image/video search

Assumptions

1B indexed documents, avg 10 KB text each
10K queries/sec peak, 1K avg
Return top 10 results per query
Index update latency: new docs searchable within 1 hour

Accept a text query and return a ranked list of relevant web pages
Support keyword matching, phrase matching, and boolean queries
Rank results by relevance (PageRank + text relevance + freshness + personalization)
Display result snippets (title, URL, description with highlighted keywords)
Support autocomplete/typeahead (see #11)
Support image, video, and news search (vertical search)
Spell correction: "systm desgn" → "Did you mean: system design?"
Knowledge panels for entities (people, places, companies)
Pagination of results

Metric	Calculation	Value
Indexed pages	Given (assumption documented in value)	100B
Avg page size (compressed)	Given (typical workload assumption)	100 KB
Raw index size	100B × 100 KB	10 PB
Inverted index size	~10% of raw	1 PB
Queries / day	Given	10B
Queries / sec	10B ÷ 86400	~115K (peak ~500K)
Index servers (1TB per server)	Given	~1,000 index servers
Crawl rate	Given (assumption documented in value)	1B pages/day

Loading...

Query Service (Query Understanding)

Query parsing: Tokenize, lowercase, remove stop words, stem/lemmatize
Spell correction: Edit distance + language model. "systm" → candidate corrections → pick highest probability correction
Query expansion: "NYC restaurants" → also search "New York City restaurants"
Intent detection: Is this a navigational query ("facebook login"), informational ("how to cook pasta"), or transactional ("buy iPhone 15")?
Synonym handling: "car" → also search "automobile", "vehicle"

Inverted Index: Core Data Structure

An inverted index maps terms → list of documents containing that term.

"system"     → [doc1:3, doc5:1, doc15:7, doc22:2, ...]   (doc_id:term_frequency)
"design"     → [doc1:2, doc3:5, doc15:4, ...]
"interview"  → [doc1:1, doc15:3, doc42:2, ...]

Index Structure per term (posting list):

{
  term: "system",
  document_frequency: 50000000,     // how many docs contain this term
  posting_list: [
    {doc_id: 1, tf: 3, positions: [15, 42, 78]},
    {doc_id: 5, tf: 1, positions: [201]},
    ...
  ]
}

Index Sharding: Term-based sharding (each shard holds a subset of terms) vs Document-based sharding (⭐ recommended) where each shard holds all terms for a subset of documents. Query is broadcast to ALL shards → each shard returns local top-K → merge results. Google uses this approach (called "index tiers").

Index Compression: Posting lists are compressed using Variable-Byte Encoding or PForDelta. Doc IDs are delta-encoded: [1, 5, 15, 22] → [1, 4, 10, 7] (smaller values compress better). Reduces index size by 5-10x.

Ranking Service: The Scoring Pipeline

Stage 1: Initial Retrieval (cheap, coarse): Boolean matching: Find documents containing ALL query terms (AND query). Use inverted index to intersect posting lists efficiently. Return candidate set (e.g., top 10,000 documents).

Stage 2: Scoring (TF-IDF / BM25): BM25 (Best Matching 25) is the industry standard text relevance formula using term frequency, document length, and inverse document frequency.

Stage 3: PageRank (Link Analysis): Measures page importance based on incoming links. Computed offline via iterative MapReduce (20-50 iterations until convergence). Stored per document, used as a static quality signal.

Stage 4: ML Re-ranking (expensive, precise): Learning-to-Rank model (LambdaMART, neural models). Features: BM25 score, PageRank, click-through rate, freshness, domain authority, user location. Re-ranks top 1,000 candidates from previous stages. Returns top 10 for display.

Snippet Generator

For each result, generate a relevant snippet showing query terms in context
Find the passage in the document with highest query term density
Highlight matching terms in bold
Truncate to ~160 characters

HTTP

GET /api/v1/search?q=system+design+interview&page=1&lang=en&country=US
Response: 200 OK
{
  "query": "system design interview",
  "spell_correction": null,
  "results_count": 125000000,
  "results": [
    {
      "title": "System Design Interview Guide - ByteByteGo",
      "url": "https://bytebytego.com/system-design",
      "snippet": "A comprehensive guide to <b>system design interview</b> preparation...",
      "favicon": "https://bytebytego.com/favicon.ico",
      "cached_url": "...",
      "rank": 1
    },
    ...
  ],
  "related_searches": ["system design interview questions", "system design primer"],
  "knowledge_panel": null,
  "pagination": {"page": 1, "total_pages": 100}
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

Concern	Solution
Index server failure	Each index shard replicated 3×; query routed to healthy replica
Index corruption	Checksum verification; rebuild from content store
Query overload	Circuit breaker + graceful degradation (skip ML re-ranking, serve BM25 only)
Stale index	Incremental index updates; full rebuild weekly
Crawler politeness	Respect robots.txt, rate limit per domain (see #13)

Specific: Index Serving Architecture: Index Tiers

Tier 0: Most important pages (top 1B): always searched
Tier 1: Important pages (top 10B): searched if Tier 0 insufficient
Tier 2: Long tail (100B): searched only for rare queries
Tiering reduces latency: most queries are answered by Tier 0 alone

Index Update Strategy

Full rebuild: MapReduce job rebuilds entire index weekly (handles deletions, re-scoring)
Incremental updates: Real-time pipeline adds new/updated pages to a "supplement index"
Query searches both main index + supplement index; results merged
Supplement index is periodically merged into main index

Anti-Spam / Web Spam Detection

Link spam: Detect link farms (clusters of pages linking to each other artificially)
Content spam: Keyword stuffing, hidden text, cloaking detection
Click spam: Click-through rate manipulation detection
Signals: Domain age, SSL certificate, content uniqueness, link velocity

Query Result Caching

Cache top results for popular queries in Redis/Memcached
25% of queries are repeated within an hour → high cache hit ratio
Cache key: normalized query + location + language
TTL: 1 hour for most queries, 5 min for news queries

Semantic Search

Beyond keyword matching: understand query meaning
BERT / Transformer models: Encode query and documents into embedding vectors
Vector similarity search (ANN: Approximate Nearest Neighbor) using FAISS or ScaNN
Hybrid: BM25 keyword matching + semantic similarity → combined score

Interview Walkthrough

Sketch the three-stage pipeline first — crawl, index, query — so the interviewer sees you understand the full lifecycle, not just the query box.
Center the deep dive on the inverted index: term → posting list of (doc_id, positions, frequency) — this is the core data structure.
Shard the index by document ID using Sharding and Partitioning; scatter-gather queries across shards in parallel with a strict latency budget.
Allocate your 200 ms query budget explicitly: query parsing (10 ms), shard fan-out (5 ms), per-shard top-K (25 ms), merge + rerank (50 ms).
Apply Caching Patterns to popular queries — roughly 25% of searches repeat within an hour, making result caching high-impact.
Describe the ranking stack: BM25 for recall, then ML reranking on the top 1,000 candidates for precision.
Mention incremental indexing (supplement index merged periodically) to handle fresh content without full rebuilds.
Common pitfall: proposing to run LIKE '%keyword%' scans on a relational DB — interviewers expect an inverted index.

End-to-End Query Execution Flow with Timing Budget

To serve search queries within an ultra-strict < 200 ms total budget, Google structures execution across parallel stages with fine-grained millisecond caps:

Step 1: Query Understanding (10 ms): Tokenization, spell-checking, synonym query expansion (e.g. "NYC" → "New York City"), and intent classification.
Step 2: Scatter to Index Shards (5 ms network): Stateless Query Coordinator broadcasts query to 100 document-sharded Tier-0 servers in parallel. Enforces a 150 ms deadline before skipping tardy nodes.
Step 3: Per-Shard Processing (25 ms): Each shard performs compressed posting-list intersection (using fast two-pointer merges on sorted lists) + initial BM25 local scoring and local Top-100 heap selection.
Step 4: Gather + Merge (5 ms): Coordinator receives up to 10,000 local candidates and performs a fast K-Way max-heap merge to fetch the global top 1,000.
Step 5: ML Re-ranking (50 ms): Evaluates the top 1,000 candidates across hundreds of features (CTR, freshness, domain authority, PageRank) using LambdaMART + heavy BERT cross-encoders for the top 200.
Step 6: Snippet Generation (10 ms): Fetches high-density passages for the final top 10 documents, highlights matching terms, and truncates snippets.
Step 7: Response Assembly (5 ms): Inject related search queries, dynamic knowledge cards, and serialize the JSON payload.

Timing Summary: 10 + 5 + 25 + 5 + 50 + 10 + 5 = 110 ms total, well within target limits.

Loading...

Scatter-Gather: Index Shard Coordination

Stateless query coordinators scatter queries to all document shards in parallel, presenting a major tail-latency issue.

Loading...

The Tail Latency (Slow Shard) Problem:

With 100 index shards, the aggregate query latency is bound by the slowest single shard.
At 100 shards, the p99 latency of any single shard is approximately 3-5 times its median (p50) latency (due to garbage collection pauses, background indexing contention, or physical hardware issues).
The likelihood of the entire query being slowed down by at least one laggy shard scales exponentially:
P(at least one shard is slow) = 1 - (1 - 0.01)¹⁰⁰ = 63%
Without software-level mitigations, 63% of all user search queries would suffer from high tail latency.

Technical Solutions:

Hedged Requests (⭐ Recommended): Send the query to the primary replica of a shard. If it does not respond within 30 ms, send an identical speculative request to the secondary replica. Take the result from whichever replica responds first and cancel the other. This reduces tail latency drastically while adding only a small (~5%) increase in overall throughput load.
Speculative Execution / Partial Returns: If 95 out of 100 shards have returned within 100 ms, complete the query using only those 95 shards. Missing a minor 5% of candidate documents is visually and qualitatively invisible to the user.
Index Tiers: Split the index into Tier 0 (viral/popular 1B documents), Tier 1 (10B documents), and Tier 2 (100B documents). Only fallback to search Tier 1/2 if Tier 0 yields insufficient results, avoiding massive coordination fan-outs.

Shard Routing Trade-off: Document vs. Term Sharding

Document Sharding (Google's choice): Each shard stores the entire inverted index dictionary for a subset of documents. The query coordinator must broadcast queries to all shards (fixed 100 RPCs), but execution is highly uniform and simple.
Term Sharding: Shards are partitioned by lexicographical term ranges (e.g., Shard A stores terms A-C). While a query only requires querying a few shards (e.g., query "system design" needs shards for 's' and 'd' → 2 RPCs), intersecting lists across nodes is incredibly network-expensive, and popular terms create massive load imbalances.

Result Merging Across Document Shards

To accurately merge Top-100 scores returned by independent document shards, we must solve BM25 score comparability and execute sorting efficiently.

Loading...

The Global Score Comparability Problem:

The BM25 formula depends heavily on **Inverse Document Frequency (IDF)**:
IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5))
Where N is the total documents in the corpus and n(q) is the count of documents containing the term.
If shards computed IDF using only their local document subset, the same term would carry different IDF values across shards, making the calculated BM25 scores incomparable.
Solution: Global IDF Precomputation during Offline Index Build (MapReduce)
1. Count total documents N across ALL shards.
2. Count document frequency n(q) for each term across ALL shards.
3. Compute the absolute global IDF value for each term.
4. Store these IDF values in each shard's term dictionary.
At query time, every shard uses the exact same global IDF value, making BM25 scores globally comparable and ensuring the K-way merge produces the mathematically correct global Top-K.

K-Way Merge Algorithm:

With 100 shards returning their local Top-100 results sorted by score descending, the coordinator must merge 10,000 candidates:

Create a Max-Heap of size 100 (containing one entry per shard).
Insert the first (highest scoring) result from each of the 100 shards into the heap.
Pop the maximum item from the heap → this is the global #1 result.
Push the next highest scoring result from that popped shard into the heap.
Repeat steps 3-4 until the global top-K (e.g., top 1,000) results are fully extracted.

This algorithm runs in extremely fast O(K × log(num_shards)) time, executing only around 7,000 comparisons (O(1000 × log(100))) to merge 10,000 candidates.

Tie-Breaking: If text relevance scores are identical, static PageRank authority scores are used as a secondary sort.
Near-Duplicate Deduping: Since document-sharding guarantees a document resides on exactly one shard, duplicates are impossible. For near-duplicates across different URLs, SimHash values are calculated on page text offline, and near-duplicates are filtered out in the ML re-ranking stage.

SLOs & Error Budgets

Metric	Target	Rationale
Search p99 latency	< 200ms	Users abandon after 300ms — Google internal benchmark
Index freshness	< 1 hour	News and ecommerce need recent content
Query success rate (non-empty results)	> 95%	Empty results = bad experience; spell correction helps
Index availability	99.99%	Search downtime is total product outage

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Hot term shard overload ('superbowl' during event)	Single shard p99 >2s; posting list for term exceeds memory cache; other shards normal	Replicate hot term posting list to all shards (pre-warmed for known events); enable query result cache for exact query string; rate-limit autocomplete separately from search
Bad segment merge corrupts index shard	Query results missing known docs; shard N doc count drops 30%; merge job error logs	Take shard offline; restore from replica; replay Kafka index events from last checkpoint; disable auto-merge until root cause fixed
SEO spam flood indexes 10M junk pages in 1 hour	Index growth rate 100× normal; avg BM25 scores drop; user click-through rate plummets	Emergency crawl policy tightening; bulk delete by domain pattern; reindex affected shards; add domain authority threshold to indexer gate

Cost Drivers (Staff lens)

Index storage: 1B docs × 10 KB text × 3× index overhead ≈ 30 TB (compressed posting lists + stored fields)
Query serving: 10K QPS × 100 shards × 50 ms CPU = ~500 core-shards always active
ML reranker: GPU inference on top-100 × 10K QPS = significant if not batched (batch to 50ms windows)
Kafka indexer pipeline: proportional to crawl rate, not query rate

Multi-Region & DR

Full index replicated to 3+ regions (read-only replicas). Query served locally — no cross-region index lookup. Incremental updates originate from primary crawl region, replicated via Kafka mirroring. User-specific ranking features (click history) stored regionally. Global signals (PageRank) synced daily.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Query Service (Query Understanding)

Inverted Index: Core Data Structure

Ranking Service: The Scoring Pipeline

Snippet Generator

Common Error Responses

Inverted Index (SSTable-like format on disk)

Document Store (Bigtable / GFS)

URL Frontier (for Crawler)

PageRank Store

Specific: Index Serving Architecture: Index Tiers

Index Update Strategy

Anti-Spam / Web Spam Detection

Query Result Caching

Semantic Search

Interview Walkthrough

End-to-End Query Execution Flow with Timing Budget

Scatter-Gather: Index Shard Coordination

The Tail Latency (Slow Shard) Problem:

Technical Solutions:

Shard Routing Trade-off: Document vs. Term Sharding

Result Merging Across Document Shards

The Global Score Comparability Problem:

K-Way Merge Algorithm:

Phase 1 — MVP (single-node Lucene)

Phase 2 — Sharded index + ranking pipeline

Phase 3 — Global scale (1B docs, 10K QPS)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR