This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design Geofencing Service.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Point-in-polygon tests, Geofence event triggers, R-tree indexing? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Point-in-polygon tests
- Geofence event triggers
- R-tree indexing
- Batch vs streaming detection
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full payment processing (#24)
- Turn-by-turn map rendering (#54)
- Driver/rider identity verification and background checks
Assumptions
- Single metro / region unless interviewer asks for multi-city
- Mobile clients with intermittent connectivity — server is source of truth
- Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Create geofences: Define virtual geographic boundaries with metadata
- Real-time trigger: Detect enter, exit, or dwell events
- Event notifications: Fire webhooks/push on geofence triggers
- Bulk management: Support millions of active geofences
- Geofence types: Static, dynamic, time-based
- Dwell detection: Trigger after user stays inside for N seconds
- Geofence groups: Group by category
- Analytics: Track entry/exit counts, dwell time
- Multi-tenant: Support multiple clients
- Low Latency: Geofence check in < 10 ms per device update
- Real-time: Event triggered within 5 seconds
- Scale: 100M+ active geofences, 10M+ devices
- Throughput: 5M location updates/sec at peak
- Accuracy: Handle geofences as small as 50m radius
- Availability: 99.99%
- Durability: No trigger events lost (at-least-once delivery)
- Battery Aware: Minimize GPS polling frequency
| Metric | Calculation | Value |
|---|---|---|
| Active geofences | Given (assumption documented in value) | 100M |
| Active devices | Given (assumption documented in value) | 10M concurrently |
| Location updates / sec | From Location updates / day ÷ 86400 (+ peak factor in value) | 5M peak |
| Avg geofence polygon vertices | Given (typical workload assumption) | 6 |
| Geofence storage per fence | Given | ~500 bytes |
| Total geofence data | 100M × 500B | 50 GB |
| Trigger events / sec | Given | ~50K |
| Event notification delivery | Given (assumption documented in value) | < 5 seconds |
Spatial Index: Two-Phase Filtering
Phase 1: Coarse filtering with spatial index (R-tree or Geohash) Each geofence has a bounding box → R-tree indexes these Query: "Which bounding boxes contain (lat, lng)?" Result: ~10-50 candidate geofences (from 100M → 50) Time: O(log N) Phase 2: Precise point-in-polygon test (ray casting) For each candidate, ray from point to infinity Count intersections with polygon edges Odd → inside; Even → outside Time: O(V) per polygon (~300 operations → microseconds) Total: < 1 ms per location update
Geofence Evaluation Engine: Processing 5M Updates/Sec
Architecture: Flink streaming application Kafka Topic: location-updates (256 partitions by device_id) Flink Job: 1. Consume location updates 2. Query spatial index (R-tree via sidecar) 3. Get candidate geofences (10-50) 4. Point-in-polygon check for each 5. Compare with previous state (Redis lookup) 6. If state changed → emit trigger event to Kafka Scaling: Partition by geohash region, each worker loads only its region's fences
Dwell Detection: Preventing False Triggers
On ENTER: Start a dwell timer (e.g., 60 seconds)
SET dwell_timer:{device}:{fence} 1 EX 60
On each update inside: check if timer expired → trigger DWELL event
On EXIT before expiry: delete timer → pass-through ignored
GPS Jitter handling: Hysteresis buffer
Enter threshold: > 10m inside boundary
Exit threshold: > 10m outside boundary
Dead zone: within 10m → maintain current stateEvent Bus Design (Kafka)
Topic: geofencing_service-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "geofencing_service-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: geofencing_service-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Geofencing Service: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Create Geofence
POST /api/v1/geofences
{
"name": "Downtown Store #42",
"type": "polygon",
"coordinates": [
{"lat": 37.7749, "lng": -122.4194},
{"lat": 37.7759, "lng": -122.4184},
{"lat": 37.7739, "lng": -122.4174},
{"lat": 37.7729, "lng": -122.4184}
],
"triggers": ["enter", "exit", "dwell"],
"dwell_time_seconds": 120,
"metadata": {"store_id": "s-42"},
"webhook_url": "https://client.example.com/geofence-events"
}
Response: 201 Created { "fence_id": "fence-uuid", "status": "active" }Batch Location Update
POST /api/v1/locations/batch
{
"device_id": "device-uuid",
"locations": [
{"lat": 37.7749, "lng": -122.4194, "accuracy": 10, "timestamp": 1710400000},
{"lat": 37.7751, "lng": -122.4190, "accuracy": 8, "timestamp": 1710400005}
]
}Get Geofence Events
GET /api/v1/geofences/{fence_id}/events?start=2025-03-14T00:00:00Z&end=2025-03-14T23:59:59ZQuery Active Devices
GET /api/v1/geofences/{fence_id}/devices?status=inside
→ { "devices": [...], "count": 42 }Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 440 Login Timeout: WebSocket session expired; reconnect required
PostgreSQL + PostGIS — Geofence Definitions
CREATE TABLE geofences (
fence_id UUID PRIMARY KEY,
app_id UUID NOT NULL,
name VARCHAR(255),
fence_type ENUM('polygon', 'circle') NOT NULL,
geometry GEOMETRY(Polygon, 4326),
center_lat DECIMAL(10,7),
center_lng DECIMAL(10,7),
radius_meters INT,
triggers JSONB,
dwell_seconds INT DEFAULT 0,
metadata JSONB,
group_id UUID,
webhook_url TEXT,
active BOOLEAN DEFAULT TRUE,
active_hours JSONB,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
CREATE INDEX idx_geometry ON geofences USING GIST(geometry);Redis: Device State + Dwell Timers
device_fence:{device_id}:{fence_id} → Hash { state, entered_at }
dwell_timer:{device_id}:{fence_id} → "1" (TTL: dwell_seconds)
device_loc:{device_id} → Hash { lat, lng, accuracy, timestamp }
fence_device_count:{fence_id} → INTClickHouse — Event Analytics
CREATE TABLE geofence_events (
event_id UUID,
fence_id UUID,
device_id UUID,
app_id UUID,
event_type Enum8('enter'=1, 'exit'=2, 'dwell'=3),
timestamp DateTime,
lat Float64,
lng Float64,
dwell_seconds UInt32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (app_id, fence_id, timestamp);| Concern | Solution |
|---|---|
| Evaluation engine failure | Flink checkpointing; consumer group rebalance |
| Spatial index crash | Rebuild from PostGIS (2-3 min for 100M fences); replicas serve during rebuild |
| Redis state loss | Redis Cluster with AOF persistence |
| Webhook delivery failure | Exponential backoff retry (max 5); DLQ |
| Duplicate triggers | Idempotent processing: dedup by (device, fence, type, 5-min window) |
| GPS accuracy poor | Ignore updates with accuracy > 100m |
Battery Optimization: Adaptive Location Frequency
Far from geofence (>5km): update every 5 min (cell tower) → negligible battery Approaching (1-5km): update every 60 sec (WiFi-assisted) → ~1%/hr Near boundary (<1km): update every 10 sec (GPS) → ~3%/hr Inside fence (dwell): update every 30 sec → ~2%/hr
Client-Side vs Server-Side vs Hybrid
Hybrid Approach ⭐: 1. Server manages all 100M geofences 2. For each device, server identifies 20 nearest fences 3. Push these 20 to client's native geofencing API 4. Client triggers instantly (offline-capable) 5. Client also sends location updates for fences beyond the 20
Interview Walkthrough
- Frame as point-in-polygon or radius check against geofence definitions — batch pre-filter before exact geometry.
- Explain geohash grid indexing to narrow candidate fences before precise intersection test.
- Cover event-driven triggers: location update → match fences → publish enter/exit events to a queue.
- Discuss sync inline evaluation (low latency, limited fences) vs async (millions of fences, seconds delay).
- Mention fence versioning when businesses update boundaries — in-flight trips need consistent fence set.
- Common pitfall: brute-force checking every fence globally — spatial index is non-negotiable at scale.
R-tree vs Geohash Grid vs Quadtree
| Index | Characteristics | Memory | Best For |
|---|---|---|---|
| R-tree ⭐ | Balanced tree of bounding rectangles, great for overlapping polygons | 10 GB for 100M | General purpose, PostGIS |
| Geohash Grid | Divide world into grid cells, simpler, works with Redis | Depends on precision | Distributed cache, simpler deployments |
| Quadtree | Recursive subdivision, adapts to density | Dynamic | Non-uniform distribution (dense + sparse) |
In-Line vs Async Evaluation
In-Line: Synchronous check in request path → lowest latency but can't handle 5M/sec Async (Kafka + Flink) ⭐: Decouple ingestion from evaluation → handles 5M/sec easily For safety-critical: use in-line path for high-priority fences
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core geofencing service flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.