Design a Geofencing Service

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design Geofencing Service.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Point-in-polygon tests, Geofence event triggers, R-tree indexing?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Point-in-polygon tests
Geofence event triggers
R-tree indexing
Batch vs streaming detection
Capacity estimation with shown math

Out of scope (state explicitly)

Full payment processing (#24)
Turn-by-turn map rendering (#54)
Driver/rider identity verification and background checks

Assumptions

Single metro / region unless interviewer asks for multi-city
Mobile clients with intermittent connectivity — server is source of truth
Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable

Create geofences: Define virtual geographic boundaries with metadata
Real-time trigger: Detect enter, exit, or dwell events
Event notifications: Fire webhooks/push on geofence triggers
Bulk management: Support millions of active geofences
Geofence types: Static, dynamic, time-based
Dwell detection: Trigger after user stays inside for N seconds
Geofence groups: Group by category
Analytics: Track entry/exit counts, dwell time
Multi-tenant: Support multiple clients

Metric	Calculation	Value
Active geofences	Given (assumption documented in value)	100M
Active devices	Given (assumption documented in value)	10M concurrently
Location updates / sec	From Location updates / day ÷ 86400 (+ peak factor in value)	5M peak
Avg geofence polygon vertices	Given (typical workload assumption)	6
Geofence storage per fence	Given	~500 bytes
Total geofence data	100M × 500B	50 GB
Trigger events / sec	Given	~50K
Event notification delivery	Given (assumption documented in value)	< 5 seconds

Loading...

Spatial Index: Two-Phase Filtering

Phase 1: Coarse filtering with spatial index (R-tree or Geohash)
  Each geofence has a bounding box → R-tree indexes these
  Query: "Which bounding boxes contain (lat, lng)?"
  Result: ~10-50 candidate geofences (from 100M → 50)
  Time: O(log N)

Phase 2: Precise point-in-polygon test (ray casting)
  For each candidate, ray from point to infinity
  Count intersections with polygon edges
  Odd → inside; Even → outside
  Time: O(V) per polygon (~300 operations → microseconds)

Total: < 1 ms per location update

Geofence Evaluation Engine: Processing 5M Updates/Sec

Architecture: Flink streaming application

Kafka Topic: location-updates (256 partitions by device_id)

Flink Job:
  1. Consume location updates
  2. Query spatial index (R-tree via sidecar)
  3. Get candidate geofences (10-50)
  4. Point-in-polygon check for each
  5. Compare with previous state (Redis lookup)
  6. If state changed → emit trigger event to Kafka

Scaling: Partition by geohash region, each worker loads only its region's fences

Dwell Detection: Preventing False Triggers

On ENTER: Start a dwell timer (e.g., 60 seconds)
  SET dwell_timer:{device}:{fence} 1 EX 60

On each update inside: check if timer expired → trigger DWELL event
On EXIT before expiry: delete timer → pass-through ignored

GPS Jitter handling: Hysteresis buffer
  Enter threshold: > 10m inside boundary
  Exit threshold: > 10m outside boundary
  Dead zone: within 10m → maintain current state

Event Bus Design (Kafka)

Topic: geofencing_service-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "geofencing_service-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: geofencing_service-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a Geofencing Service: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Create Geofence

HTTP

POST /api/v1/geofences
{
  "name": "Downtown Store #42",
  "type": "polygon",
  "coordinates": [
    {"lat": 37.7749, "lng": -122.4194},
    {"lat": 37.7759, "lng": -122.4184},
    {"lat": 37.7739, "lng": -122.4174},
    {"lat": 37.7729, "lng": -122.4184}
  ],
  "triggers": ["enter", "exit", "dwell"],
  "dwell_time_seconds": 120,
  "metadata": {"store_id": "s-42"},
  "webhook_url": "https://client.example.com/geofence-events"
}
Response: 201 Created { "fence_id": "fence-uuid", "status": "active" }

Batch Location Update

HTTP

POST /api/v1/locations/batch
{
  "device_id": "device-uuid",
  "locations": [
    {"lat": 37.7749, "lng": -122.4194, "accuracy": 10, "timestamp": 1710400000},
    {"lat": 37.7751, "lng": -122.4190, "accuracy": 8, "timestamp": 1710400005}
  ]
}

Get Geofence Events

HTTP

GET /api/v1/geofences/{fence_id}/events?start=2025-03-14T00:00:00Z&end=2025-03-14T23:59:59Z

Query Active Devices

HTTP

GET /api/v1/geofences/{fence_id}/devices?status=inside
→ { "devices": [...], "count": 42 }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
440 Login Timeout: WebSocket session expired; reconnect required

PostgreSQL + PostGIS — Geofence Definitions

SQL

CREATE TABLE geofences (
    fence_id        UUID PRIMARY KEY,
    app_id          UUID NOT NULL,
    name            VARCHAR(255),
    fence_type      ENUM('polygon', 'circle') NOT NULL,
    geometry        GEOMETRY(Polygon, 4326),
    center_lat      DECIMAL(10,7),
    center_lng      DECIMAL(10,7),
    radius_meters   INT,
    triggers        JSONB,
    dwell_seconds   INT DEFAULT 0,
    metadata        JSONB,
    group_id        UUID,
    webhook_url     TEXT,
    active          BOOLEAN DEFAULT TRUE,
    active_hours    JSONB,
    created_at      TIMESTAMP,
    updated_at      TIMESTAMP
);
CREATE INDEX idx_geometry ON geofences USING GIST(geometry);

Redis: Device State + Dwell Timers

device_fence:{device_id}:{fence_id} → Hash { state, entered_at }
dwell_timer:{device_id}:{fence_id} → "1" (TTL: dwell_seconds)
device_loc:{device_id} → Hash { lat, lng, accuracy, timestamp }
fence_device_count:{fence_id} → INT

ClickHouse — Event Analytics

SQL

CREATE TABLE geofence_events (
    event_id        UUID,
    fence_id        UUID,
    device_id       UUID,
    app_id          UUID,
    event_type      Enum8('enter'=1, 'exit'=2, 'dwell'=3),
    timestamp       DateTime,
    lat             Float64,
    lng             Float64,
    dwell_seconds   UInt32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (app_id, fence_id, timestamp);

Concern	Solution
Evaluation engine failure	Flink checkpointing; consumer group rebalance
Spatial index crash	Rebuild from PostGIS (2-3 min for 100M fences); replicas serve during rebuild
Redis state loss	Redis Cluster with AOF persistence
Webhook delivery failure	Exponential backoff retry (max 5); DLQ
Duplicate triggers	Idempotent processing: dedup by (device, fence, type, 5-min window)
GPS accuracy poor	Ignore updates with accuracy > 100m

R-tree vs Geohash Grid vs Quadtree

Index	Characteristics	Memory	Best For
R-tree ⭐	Balanced tree of bounding rectangles, great for overlapping polygons	10 GB for 100M	General purpose, PostGIS
Geohash Grid	Divide world into grid cells, simpler, works with Redis	Depends on precision	Distributed cache, simpler deployments
Quadtree	Recursive subdivision, adapts to density	Dynamic	Non-uniform distribution (dense + sparse)

In-Line vs Async Evaluation

In-Line: Synchronous check in request path → lowest latency but can't handle 5M/sec
Async (Kafka + Flink) ⭐: Decouple ingestion from evaluation → handles 5M/sec easily
  For safety-critical: use in-line path for high-priority fences

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Spatial Index: Two-Phase Filtering

Geofence Evaluation Engine: Processing 5M Updates/Sec

Dwell Detection: Preventing False Triggers

Event Bus Design (Kafka)

Create Geofence

Batch Location Update

Get Geofence Events

Query Active Devices

Common Error Responses

PostgreSQL + PostGIS — Geofence Definitions

Redis: Device State + Dwell Timers

ClickHouse — Event Analytics

Battery Optimization: Adaptive Location Frequency

Client-Side vs Server-Side vs Hybrid

Interview Walkthrough

R-tree vs Geohash Grid vs Quadtree

In-Line vs Async Evaluation

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR