Design Google Typeahead / Autocomplete

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Latency is everything — p99 <50ms. Nail trie + top-K per prefix, how prefix sharding works, and why you can't just query Elasticsearch. Show capacity math for query QPS vs index update rate.
Arch 50	Add personalization (recent searches boost), fuzzy matching for typos, and multi-tenant isolation.
Arch 75	Staff: sampling-based frequency estimation at scale, index rebuild without downtime, and adversarial query patterns.

Interview Prompt

Design a typeahead/autocomplete system like Google Search suggestions or Amazon product search. As the user types, return the top 5-10 suggestions ranked by popularity within 50ms.

Clarifying Questions (ask before designing)

Question	Why it matters
What's the query volume and corpus size?	10K QPS with 10M terms is trie-in-memory; 100K QPS with 1B terms needs sharded prefix index.
Static corpus or dynamic (updates from user queries)?	Dynamic requires streaming frequency updates — trie with mutable top-K is harder than rebuild-from-batch.
Exact prefix match only, or fuzzy/typo tolerance?	Fuzzy adds edit-distance computation — usually a separate spell-check path, not the hot trie lookup.
Personalized suggestions or global top-K?	Personalization adds user context to ranking — second lookup layer after prefix match.

Scope

In scope

Prefix-based autocomplete with top-K results
Trie or equivalent prefix index structure
Frequency-based ranking with approximate updates
Low-latency serving architecture (<50ms p99)
Prefix sharding for horizontal scale

Out of scope (state explicitly)

Full web search results (only suggestions)
Natural language understanding / intent detection
Multi-language morphology (stemming per language)
Frontend UI implementation

Assumptions

100K autocomplete QPS peak, 10K avg
Corpus: 100M unique terms/phrases, avg 20 chars
Top-K = 10 suggestions per prefix
Index updates: 1M new query events/day for frequency refresh

As the user types, suggest top 5-10 matching search queries in real-time
Suggestions ranked by popularity (search frequency)
Support prefix matching (typing "sys" → "system design", "system architecture")
Update suggestions based on new/trending search data
Handle multi-language queries
Personalized suggestions (based on user's search history: optional)

Metric	Calculation	Value
DAU	Given	500M
Searches / day	500M DAU × 10	5B
Autocomplete requests / search	Given	6 (avg 6 keystrokes before selecting)
Autocomplete requests / day	5B searches × 6	30B
Autocomplete requests / sec	Derived from daily volume ÷ 86400 (+ peak factor)	~350K
Unique query terms	Given	1B
Avg query length	Given	20 characters
Trie storage (with counts)	Given	~50 GB

Loading...

Trie Data Structure: The Core Algorithm

Standard Trie:

A trie node stores a character, a count of how many times the full path forms a top query, and child pointers. For the prefix "sy", the path traverses root → s → y and exposes completions like:

"system": count 50,000
"system design": count 20,000
"system architecture": count 5,000

Each node caches its top-3 completions so a prefix lookup is O(prefix length), not a subtree traversal. On write, the counts propagate upward and the cached top-3 lists are refreshed asynchronously.

Root
└── s
    └── y
        └── s
            └── t
                └── e
                    └── m (★ "system" count=50000)
                        └── d → e → s → i → g → n (★ "system design" count=20000)
                        └── a → r → c → h (★ "system architecture" count=5000)

Optimization 1: Store Top-K results at each node: At each trie node, pre-compute and cache the top 10 most popular completions. When user types "sys" → go to node 's' → 'y' → 's' → return pre-cached top 10. Avoids traversing the entire subtree at query time → O(prefix_length) lookup.

Node 'sys':
  top_10: [
    ("system design", 20000),
    ("system architecture", 5000),
    ("system requirements", 3000),
    ...
  ]

Optimization 2: Compressed Trie (Radix Tree / Patricia Trie): Merge nodes with single children: s → y → s → t → e → m becomes "system" as one node. Reduces memory by 50-70%.

Optimization 3: Trie Serialization: Serialize trie as a flat byte array for compact storage and fast loading. Use memory-mapped files for instant startup.

Offline Data Pipeline

Search Logs: Every search query logged with timestamp, user_id (anonymized), location
Kafka: Buffers search events
Flink/Spark Aggregation: Count query frequency in sliding windows (hourly, daily, weekly). Apply decay function: score = Σ (count_i × decay^(age_in_hours)). Recent queries weigh more than old ones. Filter: Remove queries with < 5 occurrences (noise), profanity, PII. Output: query_string → popularity_score to Aggregated Query DB
Trie Builder: Read aggregated query-score pairs. Build trie in memory. At each node, compute top-K descendants. Serialize to binary blob → upload to S3. Frequency: Rebuild every 15 minutes (trending) or hourly (stable)

Autocomplete Service

Stateless: Each server loads the trie blob into memory on startup
Trie refresh: New trie blob published to S3 → servers poll every N minutes → atomically swap old trie with new trie (double-buffering)
Lookup: trie.search(prefix) → returns pre-cached top 10 suggestions in O(prefix_length)
Scaling: Horizontally scale by adding servers (each holds a full copy of the trie). For very large tries: shard by first character(s) of the query

CDN Caching

Popular prefixes ("how to", "what is", "why") are cached at CDN edge
TTL: 1 hour (popularity is relatively stable for common prefixes)
Cache key: autocomplete:{prefix}
Impact: 30-50% of autocomplete requests served from CDN

Client-Side Optimizations

Debouncing: Only send request after 100-200 ms of no typing (reduces 70% of requests)
Local cache: Cache previous responses. "syst" → "syste" → reuse results from "syst" client-side
Prefetch: After showing suggestions for "sys", prefetch "syst", "sysa", etc. in background

Autocomplete

HTTP

GET /api/v1/autocomplete?q=system+des&limit=10&lang=en
Response: 200 OK
{
  "suggestions": [
    {"query": "system design", "score": 20000},
    {"query": "system design interview", "score": 15000},
    {"query": "system design primer", "score": 8000},
    {"query": "system design patterns", "score": 5000},
    {"query": "system design basics", "score": 3000}
  ]
}

Log Search (Internal)

HTTP

POST /api/v1/search-log (internal)
{
  "query": "system design",
  "user_id": "anon-hash",
  "timestamp": "2026-03-13T10:00:00Z",
  "location": "US"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

In-Memory Trie Node

JAVA

class TrieNode {
    Map<Character, TrieNode> children;   // or array[26] for ASCII lowercase
    boolean isEnd;                        // marks a complete query
    List<Pair<String, Long>> topK;       // pre-cached top-K suggestions with scores
}

Aggregated Query DB (Cassandra / DynamoDB)

SQL

-- Stores aggregated query frequencies
Key:    query_string (partition key)
Value:  {
          hourly_count: 500,
          daily_count: 5000,
          weekly_count: 25000,
          score: 15234.5,         -- weighted score with decay
          last_updated: timestamp
        }

Kafka Topic: search-events

JSON

{
  "query": "system design",
  "user_id": "anon-hash-uuid",
  "timestamp": "2026-03-13T10:00:00Z",
  "country": "US",
  "language": "en",
  "result_count": 1250000
}

Trie Blob (Serialized Binary Format)

Header: { version, node_count, total_queries, built_at }
Nodes:  [ { char, is_end, child_count, child_offsets[], topK[] } ]
Strings: [ offset → query_string ]

Stored in S3, ~5-50 GB depending on query corpus
Memory-mapped for fast loading

Concern	Solution
Trie server crash	Stateless; LB routes to healthy instances. New instance loads trie from S3
Trie build failure	Servers keep serving old trie; alert ops. Fallback: serve last known good trie
Stale suggestions	Acceptable: users get suggestions from 15-60 min ago
S3 unavailable	Trie blob cached locally on each server; survives S3 outage
Data pipeline lag	Trending topics appear with slight delay; fallback to existing trie data

Specific: Trie Hot-Swap Without Downtime

New trie blob built and uploaded to S3
Each server has two trie slots (A and B): double buffering
Server loads new trie into inactive slot (B) while serving from active slot (A)
Atomic pointer swap: active = B. Old trie (A) is garbage collected
Zero downtime, zero latency spike during refresh

Handling Trending / Breaking News

Standard trie rebuild (hourly) is too slow for breaking news
Solution: Maintain a small "trending overlay" trie updated every 5 minutes
At query time: merge results from main trie + trending trie
Trending trie is built from Flink real-time stream (last 1-hour window)

Personalized Suggestions

Augment global suggestions with user's own search history
Client stores recent 100 queries locally → searched client-side first
Merge: Show 2-3 personalized + 7-8 global suggestions
Privacy: Personalized suggestions computed client-side (no server-side history needed)

Multi-Language Support

Separate trie per language (detected from user's locale or keyboard input)
CJK (Chinese, Japanese, Korean): Tokenization differs: use n-gram based approach instead of character trie

Handling Offensive / Sensitive Queries

Blocklist of offensive terms: filtered out of suggestions
PII detection: Don't suggest queries containing email addresses, phone numbers, SSNs
Legal/DMCA: Remove specific queries upon legal request

Sampling for Scale

Don't log/count every single search query
Sample 1 in 10 (or 1 in 100) queries for frequency estimation
Multiply counts accordingly: statistically valid for large-scale trends
Reduces data pipeline cost by 10-100x

Spell Correction Integration

If no suggestions found for a prefix, try spelling correction
Use edit distance (Levenshtein) to find closest matching prefix in trie
"systm desi" → "system desi" → suggestions for "system design"

Interview Walkthrough

State latency target first (<100ms p99) — this forces an in-memory trie over database prefix queries.
Describe the trie structure: each node stores top-K suggestions for its prefix, ranked by popularity or click-through rate.
Shard tries by first character or prefix range across servers; replicate hot shards for read scaling.
Update suggestions asynchronously via a stream of search logs — batch rebuilds nightly, incremental updates for trending queries.
Add a CDN or edge cache layer for the most common prefixes ("a", "th", "the") to absorb global traffic spikes.
Integrate personalization as a second signal blended with global popularity — keep the trie as fallback for cold users.
Common pitfall: querying the database with LIKE 'prefix%' on every keystroke — full table scans cannot meet sub-100ms at scale.

SLOs & Error Budgets

Metric	Target	Rationale
Autocomplete p99 latency	< 50ms	User types fast — lag feels broken
Autocomplete p50 latency	< 10ms	Most queries hit hot-prefix cache
Index freshness	< 1 hour	Frequency updates don't need real-time
Availability	99.99%	Autocomplete is on every search page — outage is visible

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Hot prefix shard ('co' for COVID-related surge) overloaded	Shard 'co' p99 latency >200ms; CPU 95%; other shards normal	Emergency replicate 'co' shard to 10× replicas; enable CDN edge cache for top-100 'co*' prefixes; rate-limit autocomplete to 50 req/sec per IP as circuit breaker
Bad batch index deploys offensive suggestions	User reports + automated blocklist scan flags terms in top-K	Instant rollback to previous index version (blue-green swap <30s); add terms to blocklist filter (post-trie lookup safety net); root cause: poisoned query log data in frequency aggregation
Flink aggregation lag causes stale trending suggestions	Known trending term missing from autocomplete; Flink consumer lag >2 hours	Bypass Flink — inject trending terms via manual override API (Redis sorted set with TTL); scale Flink workers; accept 2h stale global index while override covers gap

Cost Drivers (Staff lens)

Trie shard fleet: 4096 shards × 3 replicas × 8 GB RAM ≈ 100 TB aggregate RAM (pruned trie)
Kafka query log ingestion: 100K QPS × 50 bytes = 5 MB/sec — modest
Batch rebuild compute: nightly Spark job, 500 executor-hours for 100M terms
Personalization Redis: 300M users × 20 recent searches × 30 bytes ≈ 180 GB

Multi-Region & DR

Trie index is read-only and replicated globally — same snapshot deployed to all regions. Query logs aggregated to single region for frequency updates (avoids double-counting). User personalization data stays regional. Prefix router directs to nearest shard replica via geo-DNS. Index deploy is global coordinated — all regions swap within 5-min window.