This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Latency is everything — p99 <50ms. Nail trie + top-K per prefix, how prefix sharding works, and why you can't just query Elasticsearch. Show capacity math for query QPS vs index update rate. |
| Arch 50 | Add personalization (recent searches boost), fuzzy matching for typos, and multi-tenant isolation. |
| Arch 75 | Staff: sampling-based frequency estimation at scale, index rebuild without downtime, and adversarial query patterns. |
Interview Prompt
Design a typeahead/autocomplete system like Google Search suggestions or Amazon product search. As the user types, return the top 5-10 suggestions ranked by popularity within 50ms.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What's the query volume and corpus size? | 10K QPS with 10M terms is trie-in-memory; 100K QPS with 1B terms needs sharded prefix index. |
| Static corpus or dynamic (updates from user queries)? | Dynamic requires streaming frequency updates — trie with mutable top-K is harder than rebuild-from-batch. |
| Exact prefix match only, or fuzzy/typo tolerance? | Fuzzy adds edit-distance computation — usually a separate spell-check path, not the hot trie lookup. |
| Personalized suggestions or global top-K? | Personalization adds user context to ranking — second lookup layer after prefix match. |
Scope
In scope
- Prefix-based autocomplete with top-K results
- Trie or equivalent prefix index structure
- Frequency-based ranking with approximate updates
- Low-latency serving architecture (<50ms p99)
- Prefix sharding for horizontal scale
Out of scope (state explicitly)
- Full web search results (only suggestions)
- Natural language understanding / intent detection
- Multi-language morphology (stemming per language)
- Frontend UI implementation
Assumptions
- 100K autocomplete QPS peak, 10K avg
- Corpus: 100M unique terms/phrases, avg 20 chars
- Top-K = 10 suggestions per prefix
- Index updates: 1M new query events/day for frequency refresh
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- As the user types, suggest top 5-10 matching search queries in real-time
- Suggestions ranked by popularity (search frequency)
- Support prefix matching (typing "sys" → "system design", "system architecture")
- Update suggestions based on new/trending search data
- Handle multi-language queries
- Personalized suggestions (based on user's search history: optional)
- Ultra-Low Latency: Suggestions must appear within 50-100 ms
- High Availability: 99.99%
- Scalability: Handle 100K+ autocomplete requests/sec
- Freshness: New trending terms appear within minutes to hours
- Consistency: Eventual consistency acceptable (slightly stale suggestions OK)
- Fault Tolerant: System degrades gracefully (show slightly stale suggestions)
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given | 500M |
| Searches / day | 500M DAU × 10 | 5B |
| Autocomplete requests / search | Given | 6 (avg 6 keystrokes before selecting) |
| Autocomplete requests / day | 5B searches × 6 | 30B |
| Autocomplete requests / sec | Derived from daily volume ÷ 86400 (+ peak factor) | ~350K |
| Unique query terms | Given | 1B |
| Avg query length | Given | 20 characters |
| Trie storage (with counts) | Given | ~50 GB |
Trie Data Structure: The Core Algorithm
Standard Trie:
A trie node stores a character, a count of how many times the full path forms a top query, and child pointers. For the prefix "sy", the path traverses root → s → y and exposes completions like:
- "system": count 50,000
- "system design": count 20,000
- "system architecture": count 5,000
Each node caches its top-3 completions so a prefix lookup is O(prefix length), not a subtree traversal. On write, the counts propagate upward and the cached top-3 lists are refreshed asynchronously.
Root
└── s
└── y
└── s
└── t
└── e
└── m (★ "system" count=50000)
└── d → e → s → i → g → n (★ "system design" count=20000)
└── a → r → c → h (★ "system architecture" count=5000)Optimization 1: Store Top-K results at each node: At each trie node, pre-compute and cache the top 10 most popular completions. When user types "sys" → go to node 's' → 'y' → 's' → return pre-cached top 10. Avoids traversing the entire subtree at query time → O(prefix_length) lookup.
Node 'sys':
top_10: [
("system design", 20000),
("system architecture", 5000),
("system requirements", 3000),
...
]Optimization 2: Compressed Trie (Radix Tree / Patricia Trie): Merge nodes with single children: s → y → s → t → e → m becomes "system" as one node. Reduces memory by 50-70%.
Optimization 3: Trie Serialization: Serialize trie as a flat byte array for compact storage and fast loading. Use memory-mapped files for instant startup.
Offline Data Pipeline
- Search Logs: Every search query logged with timestamp, user_id (anonymized), location
- Kafka: Buffers search events
- Flink/Spark Aggregation: Count query frequency in sliding windows (hourly, daily, weekly). Apply decay function: score = Σ (count_i × decay^(age_in_hours)). Recent queries weigh more than old ones. Filter: Remove queries with < 5 occurrences (noise), profanity, PII. Output:
query_string → popularity_scoreto Aggregated Query DB - Trie Builder: Read aggregated query-score pairs. Build trie in memory. At each node, compute top-K descendants. Serialize to binary blob → upload to S3. Frequency: Rebuild every 15 minutes (trending) or hourly (stable)
Autocomplete Service
- Stateless: Each server loads the trie blob into memory on startup
- Trie refresh: New trie blob published to S3 → servers poll every N minutes → atomically swap old trie with new trie (double-buffering)
- Lookup: trie.search(prefix) → returns pre-cached top 10 suggestions in O(prefix_length)
- Scaling: Horizontally scale by adding servers (each holds a full copy of the trie). For very large tries: shard by first character(s) of the query
CDN Caching
- Popular prefixes ("how to", "what is", "why") are cached at CDN edge
- TTL: 1 hour (popularity is relatively stable for common prefixes)
- Cache key:
autocomplete:{prefix} - Impact: 30-50% of autocomplete requests served from CDN
Client-Side Optimizations
- Debouncing: Only send request after 100-200 ms of no typing (reduces 70% of requests)
- Local cache: Cache previous responses. "syst" → "syste" → reuse results from "syst" client-side
- Prefetch: After showing suggestions for "sys", prefetch "syst", "sysa", etc. in background
Autocomplete
GET /api/v1/autocomplete?q=system+des&limit=10&lang=en
Response: 200 OK
{
"suggestions": [
{"query": "system design", "score": 20000},
{"query": "system design interview", "score": 15000},
{"query": "system design primer", "score": 8000},
{"query": "system design patterns", "score": 5000},
{"query": "system design basics", "score": 3000}
]
}Log Search (Internal)
POST /api/v1/search-log (internal)
{
"query": "system design",
"user_id": "anon-hash",
"timestamp": "2026-03-13T10:00:00Z",
"location": "US"
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
In-Memory Trie Node
class TrieNode {
Map<Character, TrieNode> children; // or array[26] for ASCII lowercase
boolean isEnd; // marks a complete query
List<Pair<String, Long>> topK; // pre-cached top-K suggestions with scores
}Aggregated Query DB (Cassandra / DynamoDB)
-- Stores aggregated query frequencies
Key: query_string (partition key)
Value: {
hourly_count: 500,
daily_count: 5000,
weekly_count: 25000,
score: 15234.5, -- weighted score with decay
last_updated: timestamp
}Kafka Topic: search-events
{
"query": "system design",
"user_id": "anon-hash-uuid",
"timestamp": "2026-03-13T10:00:00Z",
"country": "US",
"language": "en",
"result_count": 1250000
}Trie Blob (Serialized Binary Format)
Header: { version, node_count, total_queries, built_at }
Nodes: [ { char, is_end, child_count, child_offsets[], topK[] } ]
Strings: [ offset → query_string ]- Stored in S3, ~5-50 GB depending on query corpus
- Memory-mapped for fast loading
| Concern | Solution |
|---|---|
| Trie server crash | Stateless; LB routes to healthy instances. New instance loads trie from S3 |
| Trie build failure | Servers keep serving old trie; alert ops. Fallback: serve last known good trie |
| Stale suggestions | Acceptable: users get suggestions from 15-60 min ago |
| S3 unavailable | Trie blob cached locally on each server; survives S3 outage |
| Data pipeline lag | Trending topics appear with slight delay; fallback to existing trie data |
Specific: Trie Hot-Swap Without Downtime
- New trie blob built and uploaded to S3
- Each server has two trie slots (A and B): double buffering
- Server loads new trie into inactive slot (B) while serving from active slot (A)
- Atomic pointer swap: active = B. Old trie (A) is garbage collected
- Zero downtime, zero latency spike during refresh
Handling Trending / Breaking News
- Standard trie rebuild (hourly) is too slow for breaking news
- Solution: Maintain a small "trending overlay" trie updated every 5 minutes
- At query time: merge results from main trie + trending trie
- Trending trie is built from Flink real-time stream (last 1-hour window)
Personalized Suggestions
- Augment global suggestions with user's own search history
- Client stores recent 100 queries locally → searched client-side first
- Merge: Show 2-3 personalized + 7-8 global suggestions
- Privacy: Personalized suggestions computed client-side (no server-side history needed)
Multi-Language Support
- Separate trie per language (detected from user's locale or keyboard input)
- CJK (Chinese, Japanese, Korean): Tokenization differs: use n-gram based approach instead of character trie
Handling Offensive / Sensitive Queries
- Blocklist of offensive terms: filtered out of suggestions
- PII detection: Don't suggest queries containing email addresses, phone numbers, SSNs
- Legal/DMCA: Remove specific queries upon legal request
Sampling for Scale
- Don't log/count every single search query
- Sample 1 in 10 (or 1 in 100) queries for frequency estimation
- Multiply counts accordingly: statistically valid for large-scale trends
- Reduces data pipeline cost by 10-100x
Spell Correction Integration
- If no suggestions found for a prefix, try spelling correction
- Use edit distance (Levenshtein) to find closest matching prefix in trie
- "systm desi" → "system desi" → suggestions for "system design"
Interview Walkthrough
- State latency target first (<100ms p99) — this forces an in-memory trie over database prefix queries.
- Describe the trie structure: each node stores top-K suggestions for its prefix, ranked by popularity or click-through rate.
- Shard tries by first character or prefix range across servers; replicate hot shards for read scaling.
- Update suggestions asynchronously via a stream of search logs — batch rebuilds nightly, incremental updates for trending queries.
- Add a CDN or edge cache layer for the most common prefixes ("a", "th", "the") to absorb global traffic spikes.
- Integrate personalization as a second signal blended with global popularity — keep the trie as fallback for cold users.
- Common pitfall: querying the database with
LIKE 'prefix%'on every keystroke — full table scans cannot meet sub-100ms at scale.
Flink Streaming Aggregation for Trending Detection
Problem: Hourly trie rebuild is too slow for breaking news. "earthquake in Tokyo" happens at 2:15 PM → must appear in suggestions by 2:20 PM.
Exactly-once guarantees: Flink checkpointing (every 30s) + Kafka transactional producer. On failure: Flink restores from checkpoint, replays Kafka from offset: no duplicate counts, no lost events.
Decay Function Scoring: Worked Example
Recent searches should weigh more than old searches. Exponential decay: score = Σ count_i × decay^(age_in_hours) with decay = 0.97 (lose 3% weight per hour). Smoother than linear decay which has an abrupt cutoff at T hours.
Tuning the decay rate: 0.99 = very slow, 0.97 = moderate (default), 0.90 = fast. Different rates for different verticals: news queries decay=0.90, product queries decay=0.99.
Trie Sharding for Very Large Query Corpuses
Approach 1: Shard by first character(s): simple routing but uneven shards and can't serve cross-shard suggestions.
Approach 2: Full trie replicated on every server: each server can answer any query. 50 GB × 20 servers = 1 TB RAM total. This is the preferred approach: 50 GB RAM is cheap and the operational simplicity is worth it.
Approach 3: Two-level trie: Level 1 (all servers): top 100K queries (fits in 1 GB) covers 80%+ of requests. Level 2 (sharded): long-tail queries. Fast for common queries but higher latency for rare queries.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — MVP (single trie in memory)
Single server loads pre-built trie from batch job (nightly). Top-10 per node. Redis cache for hot prefixes ('a', 's', 't'). Serves 1K QPS.
Key components: In-memory trie · Nightly batch builder · Redis hot-prefix cache
Move to next phase when: Memory exceeds 64 GB; or QPS exceeds 10K on single host
Phase 2 — Sharded prefix index
4096 shards by 2-char prefix, 3 replicas each. Prefix router with consistent hashing. Kafka + Flink for streaming frequency updates. Hourly micro-rebuild of dirty subtrees.
Key components: Sharded trie fleet · Prefix router · Kafka + Flink · Blue-green deploy
Move to next phase when: Personalization required; or corpus grows to 1B terms
Phase 3 — Personalized + fuzzy
Two-stage: global trie top-20 → personalized rerank using user's recent searches (Redis sorted set). Fuzzy fallback via separate edit-distance index (SymSpell, 1-edit neighbors precomputed). A/B test framework for ranking tweaks.
Key components: Personalization layer · SymSpell fuzzy index · A/B testing · User history store
Move to next phase when: Multi-tenant SaaS (each customer own corpus); or sub-10ms p99 SLA
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Autocomplete p99 latency | < 50ms | User types fast — lag feels broken |
| Autocomplete p50 latency | < 10ms | Most queries hit hot-prefix cache |
| Index freshness | < 1 hour | Frequency updates don't need real-time |
| Availability | 99.99% | Autocomplete is on every search page — outage is visible |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Hot prefix shard ('co' for COVID-related surge) overloaded | Shard 'co' p99 latency >200ms; CPU 95%; other shards normal | Emergency replicate 'co' shard to 10× replicas; enable CDN edge cache for top-100 'co*' prefixes; rate-limit autocomplete to 50 req/sec per IP as circuit breaker |
| Bad batch index deploys offensive suggestions | User reports + automated blocklist scan flags terms in top-K | Instant rollback to previous index version (blue-green swap <30s); add terms to blocklist filter (post-trie lookup safety net); root cause: poisoned query log data in frequency aggregation |
| Flink aggregation lag causes stale trending suggestions | Known trending term missing from autocomplete; Flink consumer lag >2 hours | Bypass Flink — inject trending terms via manual override API (Redis sorted set with TTL); scale Flink workers; accept 2h stale global index while override covers gap |
Cost Drivers (Staff lens)
- Trie shard fleet: 4096 shards × 3 replicas × 8 GB RAM ≈ 100 TB aggregate RAM (pruned trie)
- Kafka query log ingestion: 100K QPS × 50 bytes = 5 MB/sec — modest
- Batch rebuild compute: nightly Spark job, 500 executor-hours for 100M terms
- Personalization Redis: 300M users × 20 recent searches × 30 bytes ≈ 180 GB
Multi-Region & DR
Trie index is read-only and replicated globally — same snapshot deployed to all regions. Query logs aggregated to single region for frequency updates (avoids double-counting). User personalization data stays regional. Prefix router directs to nearest shard replica via geo-DNS. Index deploy is global coordinated — all regions swap within 5-min window.