Interview Prompt
Design Proximity Server (Yelp / Nearby Friends).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Geohash vs QuadTree vs S2, Spatial indexing in Redis, Radius search? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Geohash vs QuadTree vs S2
- Spatial indexing in Redis
- Radius search
- Ranking by distance + relevance
- Geo-sharding
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full payment processing (#24)
- Turn-by-turn map rendering (#54)
- Driver/rider identity verification and background checks
Assumptions
- Single metro / region unless interviewer asks for multi-city
- Mobile clients with intermittent connectivity — server is source of truth
- Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Search nearby places: Given user's location and a radius, return nearby businesses/places (restaurants, gas stations, etc.)
- Filter by category: Filter results by type (restaurant, hospital, ATM, etc.)
- Sort by: Distance, rating, popularity, relevance
- Business details: Name, address, phone, hours, photos, reviews
- Add/update business: Business owners can add and update their listings
- Reviews and ratings: Users can write reviews and rate businesses
- Nearby Friends (variant): Show friends who are nearby in real-time
- Low Latency: Nearby search results in < 100 ms
- High Availability: 99.99%
- Read-Heavy: 100:1 read to write (business data changes infrequently)
- Scalability: 200M+ businesses, 500M+ DAU
- Geospatial Efficiency: Efficient spatial queries over large datasets
- Real-time (for Nearby Friends): Location updates within seconds
| Metric | Calculation | Value |
|---|---|---|
| Total businesses | Given | 200M |
| Business record size | Given | 1 KB |
| Total business data | Given | 200 GB |
| Search QPS | Given | 100K |
| Business updates / day | Given | 1M (low frequency) |
| Nearby Friends: Active users streaming location | Given | 10M |
| Location updates / sec | 10M users ÷ 4s | 2.5M (10M users × 1 update per 4s) |
Core Algorithm: Geospatial Indexing
Option 1: GeoHash
Divide the world into 2 halves alternately (longitude, latitude), encode as Base32. Nearby locations share a common GeoHash prefix.
GeoHash Precision: Length 1: ~5,000 km cell Length 4: ~39 km cell Length 5: ~4.9 km cell ← Good for nearby search Length 6: ~1.2 km cell ← Good for walking distance Length 7: ~150 m cell Length 8: ~38 m cell Key property: Nearby locations share a common GeoHash prefix. "9q8yy" and "9q8yz" are adjacent cells. Search algorithm: 1. Compute user's GeoHash at precision 5 (4.9 km cells) 2. Find all businesses in user's cell: WHERE geohash LIKE '9q8yy%' 3. Also search the 8 neighboring cells (edge cases: user near cell boundary) 4. Filter by actual distance (GeoHash cells are rectangular approximations) 5. Sort by distance
Option 2: QuadTree ⭐ (recommended for in-memory)
Start with the entire world as the root node. Recursively divide each node into 4 quadrants (NW, NE, SW, SE) until a node has ≤ K businesses (e.g., K=100). Leaf nodes contain the actual business locations.
Root (World) ├── NW (Americas) │ ├── NW (Canada) │ ├── NE (NE US) → [biz1, biz2, ..., biz87] (leaf: 87 businesses) │ ├── SW (SW US) │ │ ├── NW (Bay Area) → [biz1, ..., biz95] (leaf) │ │ ├── NE (Central CA) │ │ └── ... │ └── SE (SE US) ├── NE (Europe/Asia) └── ... Search algorithm: 1. Start at root, traverse to leaf containing user's location 2. Collect all businesses in that leaf 3. If not enough results (radius extends beyond leaf) → check parent's sibling leaves 4. Continue expanding until enough results found or radius exceeded 5. Filter by actual distance Building QuadTree: O(N log N), where N = number of businesses Search: O(log N + M), where M = businesses in result Memory: 200M businesses × 32 bytes = ~6.4 GB (fits in memory on a single server) Rebuild: Every night. Incremental updates during the day
Option 3: Google S2 Geometry
- Uses a space-filling curve (Hilbert curve) to map 2D space to 1D
- Each cell has a 64-bit Cell ID
- Minimal cell covering set logic for arbitrary shapes and polygons
- More uniform cell sizes than GeoHash
- Used by: Google Maps, Foursquare
Option 4: Uber H3 Index
- Hexagonal grid system rather than square/rectangular
- Hexagons have uniform distance to all 6 immediate neighbors (unlike squares, where diagonal distance is different)
- Resolution 9 (~174m edge) for precise tracking; Resolution 7 (~1.2km edge) for surge pricing zones
- Finding neighbors is a fast O(1) memory lookup
Proximity Search Service
- Receive query:
{lat, lng, radius, category, sort_by} - Query geospatial index (QuadTree) for businesses in radius
- Filter by category
- Rank by sort criteria (distance, rating, relevance)
- Enrich with business details from Redis cache (or fallback to MySQL)
- Return paginated results
Business Service
- CRUD operations for business listings
- On create/update → update MySQL → async update geospatial index
- Hot business profiles cached in Redis to minimize database loads
Kafka CDC: Incremental QuadTree Updates
- Full QuadTree rebuild from MySQL runs nightly (~5 minutes for 200M businesses)
- Debezium CDC on MySQL captures row-level changes → publishes to Kafka topic
business-changes - QuadTree server consumes changes:
- Insert: Add business to leaf node; if leaf exceeds K=100, split into 4 quadrants
- Update location: Remove from old leaf, insert into new leaf
- Update metadata: Update in-place
- Delete: Remove from leaf node
- CDC stream also updates Redis business cache and Redis GeoIndex
- Lag: business changes reflected in search within ~2 seconds
Nearby Friends: Real-Time Architecture
- Different from business search (businesses are static; friends move)
- How it works:
- Each user streams location every 4 seconds via WebSocket
- Location stored in Redis Geo Set (user_locations)
- When user opens "Nearby Friends": query Redis for each friend's distance
- Optimization: Use GeoHash to pre-filter and only check friends in same/adjacent GeoHash cells
Event Bus Design (Kafka)
Topic: proximity_server-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "proximity_server-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: proximity_server-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Proximity Server (Yelp / Nearby Friends): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Search Nearby
GET /api/v1/search/nearby?lat=37.7749&lng=-122.4194&radius=5000&category=restaurant&sort=distance&limit=20
Response: 200 OK
{
"businesses": [
{
"business_id": "biz-uuid",
"name": "Joe's Pizza",
"category": "restaurant",
"address": "123 Main St, San Francisco, CA",
"lat": 37.7751,
"lng": -122.4190,
"distance_meters": 45,
"rating": 4.5,
"review_count": 234,
"price_level": "$$",
"is_open": true,
"photo_url": "..."
}
],
"total": 85
}Get Business Details
GET /api/v1/businesses/{business_id}Add/Update Business
POST /api/v1/businesses
{ "name": "Joe's Pizza", "lat": 37.7751, "lng": -122.4190, ... }Nearby Friends (Real-Time Variant)
WebSocket /ws/v1/nearby-friends
{
"type": "location_update",
"lat": 37.7749,
"lng": -122.4194
}
Server Push:
{
"type": "nearby_friends",
"friends": [
{"user_id": "...", "name": "Alice", "distance_meters": 120, "last_updated": "10s ago"}
]
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 440 Login Timeout: WebSocket session expired; reconnect required
MySQL: Business Listings
CREATE TABLE businesses (
business_id BIGINT PRIMARY KEY,
name VARCHAR(256),
category VARCHAR(64),
address TEXT,
city VARCHAR(128),
state VARCHAR(64),
country VARCHAR(64),
lat DECIMAL(10,7),
lng DECIMAL(10,7),
geohash VARCHAR(12),
phone VARCHAR(20),
website TEXT,
price_level TINYINT,
rating DECIMAL(2,1),
review_count INT DEFAULT 0,
hours JSON,
photos JSON,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP,
updated_at TIMESTAMP,
INDEX idx_geohash (geohash),
INDEX idx_category_geohash (category, geohash),
INDEX idx_city (city)
);In-Memory QuadTree Node
class QuadTreeNode {
double x1, y1, x2, y2; // bounding box
List<Business> businesses; // non-null only for leaf nodes (max 100)
QuadTreeNode NW, NE, SW, SE; // children (null for leaves)
}
class Business {
long businessId;
double lat, lng;
String category;
float rating;
}Redis: Business Cache & GeoIndex
# Business details cache
Key: biz:{business_id}
Value: Hash { name, category, lat, lng, rating, ... }
TTL: 3600
# Redis GeoIndex (alternative to in-memory QuadTree)
GEOADD businesses:restaurant {lng} {lat} {business_id}
GEORADIUS businesses:restaurant {lng} {lat} 5 km WITHCOORD WITHDIST COUNT 20 ASCRedis: Nearby Friends (Real-Time)
# User's current location
GEOADD user_locations {lng} {lat} {user_id}
# Query friends nearby
for friend_id in user.friends:
GEODIST user_locations {user_id} {friend_id} m
if distance < 5000:
add to result| Concern | Solution |
|---|---|
| QuadTree server failure | Multiple replicas; rebuild from MySQL in < 5 minutes |
| MySQL read failure | Read replicas in multiple AZs |
| Stale geospatial index | Incremental updates during day; full rebuild nightly |
| Cache miss storm | Request coalescing + stale-while-revalidate |
Specific: GeoHash / Cell Boundary Problem
- A user at the edge of a cell might miss businesses just across the border
- Solution: Always search the 8 neighboring cells in addition to the user's cell
- For QuadTree: expand search to sibling nodes when results near the boundary
Scaling the QuadTree
- 200M businesses × 32 bytes ≈ 6.4 GB → fits easily on a single server
- Replicate to multiple servers (read replicas) for high QPS
- For extremely large datasets: shard by geographic region (US, Europe, Asia)
Density-Based Search Radius
- In Manhattan (dense): default radius = 500 m
- In rural areas: default radius = 50 km
- Adaptive: start with small radius, expand until >= 10 results
Business Hours and Open/Closed Status
- Store hours per day as JSON:
{"mon": {"open": "09:00", "close": "22:00"}, ...} - On search: filter by "open now" using user's timezone (determined from location via timezone lookup table)
Interview Walkthrough
- Frame as a geospatial nearest-neighbor problem with strict latency SLO — index choice (geohash grid vs quadtree) comes first.
- Explain location update ingestion via WebSocket and write to a spatial index sharded by geohash prefix.
- Cover privacy: quantize coordinates to grid cells, never expose exact GPS to friends without consent.
- Discuss Caching Patterns for hot urban cells where query density exceeds single-shard capacity.
- Mention ephemeral location TTL — stale entries must expire or friends see phantom nearby users.
- Common pitfall: haversine distance on every user pair — O(n²) brute force fails; always index first, filter second.
GeoHash vs QuadTree vs Redis GeoSet vs S2 vs H3
Approach Query Type Scale Pros Cons Best For
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
GeoHash (DB index) Radius 200M+ Simple, SQL compatible, B-tree index Cell boundary edge cases, need 8-neighbor Small datasets, MySQL/Postgres geo
search
QuadTree (in-mem) Radius + arb 200M+ Fast, <6.4 GB for 200M biz, flexible Requires dedicated servers, rebuild Yelp-scale static data
query overhead
Redis GeoSet Radius only 100M Zero infrastructure, built-in No polygon queries, precision degrades Simple proximity (Tinder, Uber)
GEORADIUS near poles
Google S2 Polygon/rad Unlimited Most accurate, hierarchical, Application-level library, no built-in DB Google Maps, complex geo queries
Hilbert curve
PostGIS Any 100M Full spatial SQL, arbitrary shapes Harder to scale horizontally Complex shapes, GIS appsRule of thumb: Static data (businesses, POIs) → QuadTree or GeoHash. Moving data (users, vehicles, drivers) → Redis GeoSet.
Grid Search vs Pure Radius Trade-off
Two common implementations for "find businesses within 5 km":
Approach 1: Cell-based (GeoHash / QuadTree)
Find the cell(s) covering the search area → return all businesses in those cells
Pros: Fast! Single index lookup
Cons: Returns businesses in rectangular area, not a circle
→ Must post-filter: calculate actual distance for each result
→ More results than needed (corner of rectangle extends beyond radius)
Optimization: use tighter cells (resolution 6: ~1.2 km)
More cells to query (25 cells for 5 km radius at res 6 vs 9 at res 5)
But fewer false positives → less post-filtering
Approach 2: Sorted distance (brute force in memory)
Load all nearby businesses, compute exact Haversine distance, sort
Only works if "nearby" is well-bounded (< 100K candidates)
For dense cities: 100K businesses in 5 km → too slow
For sparse areas: 50 businesses in 50 km → trivial
Best practice:
1. Cell query to get candidate set (~1K businesses)
2. Exact Haversine distance calculation to filter to actual radius
3. Sort by distance, paginate (no need to return all 1K)
Flat-earth approximation (fast, no trig, accurate to 0.1% for < 50 km):
d = sqrt((Δlat)² + (Δlng × cos(lat))²) × 111 km/degreeNearby Friends: Location Privacy and Precision Levels
Privacy concern: sharing exact GPS coordinates exposes home/work locations. Solution: Use precision-reduced location sharing Level 1 — Exact (high privacy risk): Share full GPS coordinates: lat=37.774929, lng=-122.419418 Distance accurate to meters → reveals exact building AVOID for social features Level 2 — Fuzzy location (recommended for Nearby Friends) ⭐: Truncate coordinates to 3 decimal places: lat=37.774, lng=-122.419 Precision: ~100 meter radius "Alice is about 500m away" — can't pinpoint which house Still accurate enough for "Alice is in this neighborhood" Level 3 — GeoHash cell (for "in the area"): Share GeoHash of precision 5 (4.9 km × 4.9 km cells) "Alice is in this part of downtown" — approximate zone, not location Works for: "Friends in the same neighborhood" features Implementation: User shares precise GPS with backend (needed for computation) Backend stores precise location in Redis (TTL: 5 min) API response: returns ONLY distance in meters, never raw coordinates Frontend: shows "Alice is 800m away" → cannot triangulate exact position Privacy toggle: user can set "share location with friends" ON/OFF When OFF: friends see "Alice is in San Francisco" (city-level only)
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core proximity server flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.