Design a Real-time Vehicle Tracking System

Interview Prompt

Design a real-time vehicle tracking system.

Clarifying Questions (ask before designing)

Question	Why it matters
What fleet scale: active vehicles, location update rate, concurrent dashboard viewers?	Drives geo-index choice, connection counts, and streaming ingestion throughput.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

High-frequency location ingestion
Geospatial pub/sub
Trajectory compression
Map matching
Historical replay
Capacity estimation with shown math

Out of scope (state explicitly)

Full payment processing (#24)
Turn-by-turn map rendering (#54)
Driver/rider identity verification and background checks

Assumptions

Single metro / region unless interviewer asks for multi-city
Mobile clients with intermittent connectivity — server is source of truth
Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable

Functional Requirements

Live tracking: Display real-time location of vehicles on a map (fleet management, delivery tracking)
Location ingestion: Ingest GPS coordinates from thousands/millions of vehicles at configurable intervals
Trip tracking: Track active trips with start, waypoints, and end; record full route trail
Geofence alerts: Trigger alerts when vehicles enter/exit defined zones
Historical playback: Replay a vehicle's route over any past time period
Speed/idle alerts: Detect speeding, excessive idling, harsh braking events
Fleet dashboard: Real-time overview of entire fleet: active, idle, offline vehicles
ETA for deliveries: Show customer the live position and ETA of their delivery
Multi-tenant: Support multiple fleet operators on the same platform

Capacity Estimation

Metric	Calculation	Value
Active vehicles	Given (assumption)	10M
Avg location update interval	Given (typical workload assumption)	5 seconds
Location updates / sec	10M vehicles ÷ 5 s	2M
Location point size	Given (assumption)	100 bytes
Raw data / day	2M/s × 100 B × 86,400 s	~17 TB
Concurrent dashboard viewers	Given (peak load assumption)	500K
Persistent connections	Given (one per viewer, one per vehicle)	500K WebSocket (dashboard) + 10M MQTT (vehicles)
Historical queries / sec	Given (peak load assumption)	10K
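
The ingestion figures in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the table's stated assumptions; the constant names are illustrative.

```python
# Back-of-the-envelope ingestion math; every input is an assumption from the table.
ACTIVE_VEHICLES = 10_000_000
UPDATE_INTERVAL_S = 5
POINT_SIZE_BYTES = 100
SECONDS_PER_DAY = 86_400

# One update per vehicle per interval.
updates_per_sec = ACTIVE_VEHICLES // UPDATE_INTERVAL_S

# Raw (uncompressed) location data written per day, in decimal terabytes.
raw_tb_per_day = updates_per_sec * POINT_SIZE_BYTES * SECONDS_PER_DAY / 1e12

print(f"updates/sec: {updates_per_sec:,}")
print(f"raw data/day: {raw_tb_per_day:.2f} TB")
```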

Vehicles stream GPS data via MQTT to a broker cluster, which bridges to Kafka. Flink processes the stream, updating Redis (latest position) and TimescaleDB (trail). Dashboards receive real-time updates via WebSocket subscribed to Redis Pub/Sub.

Connection Gateway: Handling 10M Persistent Connections

MQTT vs HTTP for vehicle GPS: HTTP has ~4 KB overhead per update; × 12 updates/min × 1,440 min/day × 10M vehicles ≈ 690 TB/day of wasted bandwidth. MQTT ⭐: a persistent connection with ~20 bytes header + 100 bytes payload = 120 bytes per update, or ~21 TB/day for 10M vehicles (~33× less). QoS 1 gives at-least-once delivery. Built-in reconnection handling covers tunnels and rural areas. Last Will and Testament (LWT) detects a vehicle going offline.
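
The bandwidth comparison can be checked as follows. This sketch treats "~4 KB" as 4,000 bytes and uses the per-update sizes quoted above; both are rough assumptions, so the ratio is only indicative.

```python
# Daily bandwidth: HTTP per-update overhead vs a full MQTT publish.
VEHICLES = 10_000_000
UPDATES_PER_MIN = 12           # one update every 5 seconds
MINUTES_PER_DAY = 1_440
HTTP_OVERHEAD_BYTES = 4_000    # "~4 KB" per update (assumed decimal KB)
MQTT_BYTES = 20 + 100          # ~20 B header + 100 B payload

updates_per_day = VEHICLES * UPDATES_PER_MIN * MINUTES_PER_DAY
http_tb_per_day = updates_per_day * HTTP_OVERHEAD_BYTES / 1e12
mqtt_tb_per_day = updates_per_day * MQTT_BYTES / 1e12
ratio = http_tb_per_day / mqtt_tb_per_day

print(f"HTTP overhead: {http_tb_per_day:.1f} TB/day")
print(f"MQTT total:    {mqtt_tb_per_day:.1f} TB/day")
print(f"ratio:         {ratio:.1f}x")
```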

MQTT Broker Cluster (EMQX/VerneMQ): Each broker handles ~200K connections. 10M vehicles → 50 brokers. Clustered with shared subscriptions and session persistence.

Location Processor: Updating Latest Position in Redis

Flink streaming job consumes from Kafka "vehicle-locations": validates (lat/lng range, speed < 300 km/h), de-duplicates, enriches with fleet info from the device registry, updates Redis (HSET, followed by EXPIRE to set the 5-min TTL), and publishes to a Redis Pub/Sub channel for real-time dashboard updates.
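
The validate and de-duplicate steps could look roughly like this. It is a plain-Python sketch, not Flink code: `LocationUpdate`, `is_valid`, and `Deduplicator` are hypothetical names, and de-duplicating on (vehicle_id, timestamp) is an assumption about what counts as a duplicate.

```python
from dataclasses import dataclass

MAX_SPEED_KMH = 300.0  # validation bound from the text above


@dataclass(frozen=True)
class LocationUpdate:
    vehicle_id: str
    lat: float
    lng: float
    speed_kmh: float
    timestamp: int  # Unix seconds


def is_valid(u: LocationUpdate) -> bool:
    """Range checks: latitude, longitude, and a sane speed."""
    return (
        -90.0 <= u.lat <= 90.0
        and -180.0 <= u.lng <= 180.0
        and 0.0 <= u.speed_kmh < MAX_SPEED_KMH
    )


class Deduplicator:
    """Drops repeats of (vehicle_id, timestamp).

    Unbounded in-memory set for illustration only; a real job would keep
    this in keyed state with expiry.
    """

    def __init__(self) -> None:
        self._seen: set[tuple[str, int]] = set()

    def is_new(self, u: LocationUpdate) -> bool:
        key = (u.vehicle_id, u.timestamp)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```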

Movement Status Detection: speed > 5 km/h → "moving"; speed < 5 km/h for < 5 min → "idle"; speed < 2 km/h for > 5 min → "parked"; no update for > 5 min → "offline" (TTL expiry).
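
A minimal sketch of these rules as a function. The rules as written leave some cases open (exactly 5 km/h, or 2 to 5 km/h for more than 5 minutes); this sketch assumes those fall back to "idle". The function and parameter names are illustrative.

```python
MOVING_ABOVE_KMH = 5.0
PARKED_BELOW_KMH = 2.0
DWELL_THRESHOLD_S = 300  # 5 minutes
OFFLINE_AFTER_S = 300    # matches the Redis TTL


def classify_status(speed_kmh: float,
                    low_speed_for_s: float,
                    since_last_update_s: float) -> str:
    """Map the latest reading to moving / idle / parked / offline.

    low_speed_for_s: how long the vehicle has stayed below MOVING_ABOVE_KMH.
    since_last_update_s: age of the most recent update.
    """
    if since_last_update_s > OFFLINE_AFTER_S:
        return "offline"
    if speed_kmh > MOVING_ABOVE_KMH:
        return "moving"
    if speed_kmh < PARKED_BELOW_KMH and low_speed_for_s > DWELL_THRESHOLD_S:
        return "parked"
    # Assumption: everything else (including the cases the rules do not
    # cover) is treated as idle.
    return "idle"
```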

Trail Writer: Persisting Location History

Storage choice: TimescaleDB for hot (90 days): time-based partitioning, 10-20× compression, SQL queries, PostGIS integration, continuous aggregates. ClickHouse for cold analytics (2+ years): daily summaries.

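Batch writing: Flink accumulates updates → batch write to TimescaleDB every 5 seconds. 2M/sec × 5s = 10M rows per batch, sharded across 4 nodes = 2.5M rows per shard, i.e. a sustained 500K rows/sec per shard. The target is for each 2.5M-row bulk insert to finish in < 1 second; that is aggressive, so benchmark it and add shards if a batch cannot complete well inside the 5-second window.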
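
Splitting one accumulated batch across the four shards might be sketched as below. Routing by CRC32 of the vehicle ID is an assumption chosen for illustration (it keeps each vehicle on one shard); it is not how TimescaleDB's own space partitioning hashes rows, and `shard_for` / `group_by_shard` are hypothetical helpers.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4


def shard_for(vehicle_id: str) -> int:
    """Stable shard index for a vehicle (CRC32 is deterministic across runs)."""
    return zlib.crc32(vehicle_id.encode("utf-8")) % NUM_SHARDS


def group_by_shard(rows: list[dict]) -> dict[int, list[dict]]:
    """Group a batch of rows into one bulk-insert list per shard."""
    batches: dict[int, list[dict]] = defaultdict(list)
    for row in rows:
        batches[shard_for(row["vehicle_id"])].append(row)
    return dict(batches)
```

Each per-shard list would then be written with a single bulk insert rather than row by row.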

Dashboard: Real-Time Fleet View

Fleet manager connects via WebSocket → server subscribes to Redis Pub/Sub for the fleet. Initial load: read the fleet's vehicle IDs from the fleet_vehicles:{fleet_id} set (SSCAN for large fleets) and fetch each vehicle hash, rather than SCAN-ing the whole keyspace. Ongoing: Redis Pub/Sub pushes updates → server forwards to client. Server-side viewport filtering sends only vehicles in the current map bounds. Client-side clustering at low zoom levels (Supercluster library).
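
Server-side viewport filtering can be as simple as a bounding-box test per update. The sketch below assumes the client reports its map bounds as south/west/north/east degrees; `Bounds`, `in_viewport`, and `filter_updates` are illustrative names.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    south: float
    west: float
    north: float
    east: float


def in_viewport(lat: float, lng: float, b: Bounds) -> bool:
    """True if the point lies inside the map bounds."""
    if not (b.south <= lat <= b.north):
        return False
    if b.west <= b.east:
        return b.west <= lng <= b.east
    # Viewport crosses the antimeridian (e.g. west=170, east=-170).
    return lng >= b.west or lng <= b.east


def filter_updates(updates: list[dict], b: Bounds) -> list[dict]:
    """Keep only the updates worth forwarding to this client."""
    return [u for u in updates if in_viewport(u["lat"], u["lng"], b)]
```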

Event Bus Design (Kafka)

Topic: realtime_vehicle_tracking-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: vehicle_id (preserves per-vehicle ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "realtime_vehicle_tracking-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: realtime_vehicle_tracking-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
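
The consumer contract above (at-least-once delivery, dedup by event_id, DLQ for poison messages) can be sketched in memory as follows. This is not Kafka client code: `IdempotentConsumer` is a hypothetical class, and reading "after 3 retries" as three total attempts is an assumption.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: three total attempts before dead-lettering


class IdempotentConsumer:
    """Processes each event_id at most once; failed events go to a DLQ."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._processed: set[str] = set()  # stand-in for a durable dedup store
        self.dlq: list[dict] = []          # stand-in for the DLQ topic

    def consume(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self._processed:
            return "duplicate"  # redelivery under at-least-once semantics
        for _ in range(MAX_ATTEMPTS):
            try:
                self._handler(event)
            except Exception:
                continue  # retry the same event
            self._processed.add(event_id)
            return "processed"
        self.dlq.append(event)
        return "dead-lettered"
```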

Send Location Update (MQTT)

MQTT Topic: vehicles/{vehicle_id}/location  QoS: 1
Payload (protobuf; shown here as JSON for readability):
{
  "vehicle_id": "v-uuid",
  "lat": 37.7749, "lng": -122.4194,
  "speed": 45.3, "heading": 180,
  "timestamp": 1710400000,
  "fuel_level": 0.65,
  "engine_status": "on",
  "events": ["harsh_brake"]
}

Get Vehicle Current Location

HTTP

GET /api/v1/vehicles/{vehicle_id}/location
Response: 200 OK
{
  "vehicle_id": "v-uuid",
  "lat": 37.7749, "lng": -122.4194,
  "speed": 45.3, "heading": 180,
  "status": "moving",
  "last_updated": "2025-03-14T10:23:45Z",
  "driver": {"name": "John", "phone": "+1..."}
}

Get Vehicle Route History

HTTP

GET /api/v1/vehicles/{vehicle_id}/trail?start=2025-03-14T08:00:00Z&end=2025-03-14T18:00:00Z&simplify=true

Fleet Overview

HTTP

GET /api/v1/fleets/{fleet_id}/overview
Response: 200 OK
{
  "fleet_id": "f-uuid",
  "total_vehicles": 5000,
  "moving": 3200,
  "idle": 800,
  "parked": 700,
  "offline": 300,
  "alerts_active": 12
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
440 Login Timeout (non-standard status code): WebSocket session expired; reconnect required

Redis: Latest Vehicle State

vehicle:{vehicle_id}  → Hash { lat, lng, speed, heading, status, fleet_id, driver_id, fuel_level, last_updated }
TTL: 300 seconds, refreshed on each update (key expiry drives offline detection)
fleet_vehicles:{fleet_id}  → SET of vehicle_ids
fleet_stats:{fleet_id}:moving  → INT (atomic counter)
alerts:{vehicle_id}  → LIST of active alert JSONs

TimescaleDB: Location Trail (90 Days)

SQL

CREATE TABLE location_points (
    vehicle_id      UUID NOT NULL,
    timestamp       TIMESTAMPTZ NOT NULL,
    lat             DOUBLE PRECISION,
    lng             DOUBLE PRECISION,
    speed           REAL,
    heading         SMALLINT,
    altitude        REAL,
    status          TEXT,
    odometer        REAL,
    fuel_level      REAL,
    events          JSONB
);

SELECT create_hypertable('location_points', 'timestamp',
    chunk_time_interval => INTERVAL '1 day',
    partitioning_column => 'vehicle_id',
    number_partitions => 4);

ALTER TABLE location_points SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'vehicle_id',
    timescaledb.compress_orderby = 'timestamp DESC'
);

Kafka Topics

Topic: vehicle-locations  (512 partitions, 48h retention)
  Key: vehicle_id  Value: protobuf { lat, lng, speed, heading, timestamp, events[] }

Topic: vehicle-alerts  (64 partitions)
Topic: vehicle-status-changes  (64 partitions)

Concern	Solution
MQTT broker failure	Cluster with session handoff; QoS 1 redelivers unacknowledged messages (at-least-once, so consumers must de-duplicate)
Kafka lag	Consumer group rebalancing; auto-scale Flink parallelism
Redis failure	Redis Cluster (6+ nodes); AOF persistence
TimescaleDB write failure	Kafka retains data (48h); replay after DB recovery
Vehicle goes offline	TTL-based detection; alert fleet manager; last known position preserved
GPS drift/spoofing	Validate implied speed (distance ÷ time between consecutive points); reject physically impossible jumps
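
The GPS drift/spoofing check in the last row can be sketched with a haversine distance and an implied-speed bound. The 300 km/h limit reuses the validation threshold from the Location Processor; the function names and the dict shape are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_plausible(prev: dict, curr: dict, max_speed_kmh: float = 300.0) -> bool:
    """Reject a point if reaching it from prev would exceed max_speed_kmh."""
    dt_s = curr["timestamp"] - prev["timestamp"]
    if dt_s <= 0:
        return False  # not comparable; out-of-order points are handled separately
    implied_kmh = haversine_km(prev["lat"], prev["lng"], curr["lat"], curr["lng"]) / (dt_s / 3600)
    return implied_kmh <= max_speed_kmh
```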

Handling Vehicle GPS Blackout (Tunnel)

During signal loss the vehicle buffers timestamped readings, leaving gaps where no fix was available. When GPS recovers, it sends the buffered batch with a gap indicator. The server detects a 5+ minute gap, marks status = "GPS_LOST" (not offline: the MQTT connection may still be alive), and shows the vehicle icon with a "?". When GPS resumes, the server interpolates the gap (straight-line or map-matched). Edge case: an OBD-II tracker can fall back to cell tower triangulation (~200m accuracy).
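
Straight-line interpolation of a gap might look like the sketch below. It interpolates linearly in latitude/longitude, which is only reasonable for short gaps away from the antimeridian; the 5-second step, the `interpolated` flag, and the function name are assumptions. Map-matched interpolation is not shown.

```python
def interpolate_gap(start: dict, end: dict, step_s: int = 5) -> list[dict]:
    """Synthesize points strictly between two known fixes.

    start/end: dicts with "timestamp" (Unix seconds), "lat", "lng".
    Returned points are flagged so the UI can render them differently.
    """
    t0, t1 = start["timestamp"], end["timestamp"]
    points = []
    t = t0 + step_s
    while t < t1:
        f = (t - t0) / (t1 - t0)  # fraction of the gap elapsed
        points.append({
            "timestamp": t,
            "lat": start["lat"] + f * (end["lat"] - start["lat"]),
            "lng": start["lng"] + f * (end["lng"] - start["lng"]),
            "interpolated": True,
        })
        t += step_s
    return points
```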

Out-of-Order GPS Points

Redis: use a Lua script to update only if the incoming timestamp > stored timestamp. TimescaleDB: insert all points regardless of arrival order, use ORDER BY timestamp for chronological queries, and use ON CONFLICT DO NOTHING for idempotency (this needs a unique index, e.g. on (vehicle_id, timestamp)). Flink: use event-time processing with watermarks for speed alerts.
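
The timestamp guard for the latest-position store is shown here against a plain dict rather than Redis, as an illustration of the comparison only. In Redis the same check would run inside a Lua script so that the read and the write are atomic; `apply_if_newer` is a hypothetical name.

```python
def apply_if_newer(store: dict, vehicle_id: str, update: dict) -> bool:
    """Overwrite the stored position only if the update is strictly newer.

    Returns True if the store changed, False if the update was stale.
    """
    current = store.get(vehicle_id)
    if current is not None and update["timestamp"] <= current["timestamp"]:
        return False  # late or duplicate point: keep the newer stored state
    store[vehicle_id] = update
    return True
```

A stale point rejected here is still written to the trail table, so history stays complete while the live view never moves backwards.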

MQTT vs WebSocket vs gRPC for Vehicle Communication

Protocol	Overhead	Best For
MQTT ⭐	2-5 byte fixed header, plus topic name on each publish	IoT devices with constrained bandwidth
WebSocket	2-14 bytes per frame	Browser-based dashboards
gRPC	HTTP/2 + protobuf	Service-to-service, mobile apps

Decision: Vehicles → Server: MQTT. Dashboard → Server: WebSocket. Internal services: gRPC.

Location Storage: Raw Points vs Compressed Trails

Raw points: 17 TB/day.

Douglas-Peucker simplification: 50% reduction (highway: 70%, city: 30%).
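
A compact recursive version of Douglas-Peucker is sketched below. It treats points as planar (x, y) pairs and measures perpendicular distance to the line through the segment endpoints, so `epsilon` is in the same units as the coordinates; applying it directly to lat/lng degrees is an approximation, and the reduction percentages quoted above depend on the tolerance chosen.

```python
import math

Point = tuple[float, float]


def _perp_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the infinite line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)


def douglas_peucker(points: list[Point], epsilon: float) -> list[Point]:
    """Drop points that deviate from the simplified path by at most epsilon."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord first -> last.
    max_d, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _perp_distance(points[i], points[0], points[-1])
        if d > max_d:
            max_d, idx = d, i
    if max_d <= epsilon:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify both halves around it.
    left = douglas_peucker(points[: idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right
```

Very long trails would need an iterative version to avoid deep recursion.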

Delta encoding: 10-15× compression on top of TimescaleDB.

Hybrid ⭐: Real-time (last 24h): raw points. Recent (1-90 days): TimescaleDB compression (10×). Cold (90+ days): ClickHouse aggregated segments (100×). Total for the 90-day hot window: ~170 TB (17 TB raw + ~153 TB compressed) vs 1.5 PB uncompressed (17 TB/day × 90 days).
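
One reading that reproduces the ~170 TB and 1.5 PB figures is sketched below. It assumes the total covers the 90-day hot window only (one day raw plus 90 days at 10× compression) and excludes the cold tier; that interpretation is an assumption, not something the text states.

```python
RAW_TB_PER_DAY = 17
HOT_DAYS = 90
HOT_COMPRESSION = 10  # TimescaleDB compression ratio used above

uncompressed_tb = RAW_TB_PER_DAY * HOT_DAYS          # 90 days kept raw
compressed_tb = uncompressed_tb / HOT_COMPRESSION    # same data at 10x
raw_last_24h_tb = RAW_TB_PER_DAY                     # newest day kept raw
hybrid_total_tb = compressed_tb + raw_last_24h_tb

print(f"uncompressed: {uncompressed_tb / 1000:.2f} PB")
print(f"hybrid total: {hybrid_total_tb:.0f} TB")
```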

Scalability: 10M Vehicles at 5-Second Intervals

2M updates/sec: 50 MQTT broker nodes, 18 Kafka brokers (RF=3, 512 partitions), 25 Flink TaskManagers, 20 Redis shards (5 GB memory), 4 TimescaleDB shards, 10 dashboard WebSocket servers.
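
The per-node load implied by this sizing can be derived from numbers already given in the document. The sketch below only divides totals by node counts; it assumes load is spread evenly, which real traffic will not quite do.

```python
VEHICLES = 10_000_000
CONNS_PER_BROKER = 200_000
UPDATES_PER_SEC = 2_000_000
KAFKA_PARTITIONS = 512
REDIS_SHARDS = 20
DASHBOARD_VIEWERS = 500_000
WS_SERVERS = 10

mqtt_brokers = -(-VEHICLES // CONNS_PER_BROKER)          # ceiling division
msgs_per_partition = UPDATES_PER_SEC / KAFKA_PARTITIONS  # per second
vehicles_per_redis_shard = VEHICLES // REDIS_SHARDS
conns_per_ws_server = DASHBOARD_VIEWERS // WS_SERVERS

print(f"MQTT brokers needed:      {mqtt_brokers}")
print(f"msgs/sec per partition:   {msgs_per_partition:,.0f}")
print(f"vehicles per Redis shard: {vehicles_per_redis_shard:,}")
print(f"connections per WS node:  {conns_per_ws_server:,}")
```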

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
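
An availability target translates directly into an error budget. As a worked example for the 99.95% row, assuming a 30-day window:

```python
AVAILABILITY_TARGET = 0.9995
WINDOW_DAYS = 30  # assumed measurement window

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = window_minutes * (1 - AVAILABILITY_TARGET)

print(f"error budget: {error_budget_minutes:.1f} minutes per {WINDOW_DAYS} days")
```

That is roughly 21.6 minutes of user-visible downtime per 30 days before the budget is spent.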

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
