Design an ETA Calculation Service

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design ETA Calculation Service.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Real-time traffic data ingestion, Graph-based routing with live weights, ML-based prediction?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Real-time traffic data ingestion
Graph-based routing with live weights
ML-based prediction
Caching route segments
Capacity estimation with shown math

Out of scope (state explicitly)

Full payment processing (#24)
Turn-by-turn map rendering (#54)
Driver/rider identity verification and background checks

Assumptions

Single metro / region unless interviewer asks for multi-city
Mobile clients with intermittent connectivity — server is source of truth
Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable

Point-to-point ETA: Estimate travel time with current traffic
Multi-stop ETA: Routes with multiple waypoints
ETA for pickup: How long until a driver reaches a pickup point
Historical ETA: "How long does this trip usually take on Tuesday at 8 AM?"
Batch ETA: Compute for N origin-destination pairs efficiently
ETA updates: Continuously update as trip progresses
Multi-modal: Driving, walking, cycling, transit

Metric	Calculation	Value
ETA requests / sec	From ETA requests / day ÷ 86400 (+ peak factor in value)	500K peak
Single ETA latency target	Given (assumption documented in value)	< 100 ms
Batch ETA (20 pairs) target	Given (assumption documented in value)	< 500 ms
Active road segments tracked	Given (assumption documented in value)	50M globally
Traffic updates / sec	From Traffic updates / day ÷ 86400 (+ peak factor in value)	10M GPS data points
Road graph size (in-memory)	Given	~300 GB globally
ETA model features	Given	~50 per query
ETA cache hit rate	Given	~30%

Loading...

ETA Calculation Pipeline: Three-Layer Architecture

Layer 1: Route Engine ETA (physics-based)
  Find route via Contraction Hierarchies (< 1 ms)
  For each segment: time = distance / speed (min of speed_limit, traffic_speed)
  
Layer 2: Traffic-Adjusted ETA
  traffic_speed = Redis GET traffic:{segment_id} → current avg speed
  If no traffic data: use historical avg speed
  Add turn penalties (15-60s per major turn) and signal delays

Layer 3: ML Correction Model ⭐ (secret sauce)
  Gradient Boosted Trees (XGBoost/LightGBM)
  Features: route, traffic, temporal, weather, historical, spatial
  Output: correction_factor
  Final ETA = route_engine_ETA × correction_factor

Traffic Aggregation Pipeline

1. GPS Ingestion (10M points/sec) → Kafka
2. Map Matching (Flink): GPS point → road segment
3. Segment Speed Aggregation (Flink, 60-sec window):
   avg_speed = median(speeds), confidence = min(1.0, sample_count / 10)
4. Store in Redis: HSET traffic:{segment_id} speed {s} confidence {c}
   TTL: 120 seconds
5. Fallback hierarchy: real-time → recent → historical → speed_limit × 0.7

Batch ETA: Matching System Integration

Many-to-One Routing:
  Reverse Dijkstra from destination → find all 20 origins
  ~5 ms for 20 origins (vs 20 ms for separate queries)

ETA Matrix (pre-computed):
  H3 cells (resolution 9 ≈ 175m), pre-compute cell-to-cell ETAs within 30 km
  ~500M pairs × 4 bytes = 2 GB per city → O(1) lookup
  Accuracy ±2-3 min, good enough for initial matching

Refinement: After narrowing to top 3, compute exact ETA with full route + traffic

Event Bus Design (Kafka)

Topic: eta_calculation_service-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "eta_calculation_service-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: eta_calculation_service-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design an ETA Calculation Service: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Single ETA

HTTP

GET /api/v1/eta?origin_lat=37.7749&origin_lng=-122.4194&dest_lat=37.3382&dest_lng=-121.8863&mode=driving&departure_time=now
Response: 200 OK
{
  "eta_seconds": 1920,
  "eta_display": "32 min",
  "distance_meters": 72400,
  "confidence": 0.85,
  "traffic_level": "moderate",
  "breakdown": {
    "route_engine_eta": 1680,
    "traffic_adjustment": 180,
    "ml_correction": 60
  }
}

Batch ETA

HTTP

POST /api/v1/eta/batch
{
  "origins": [
    {"lat": 37.78, "lng": -122.41, "id": "driver-1"},
    {"lat": 37.77, "lng": -122.43, "id": "driver-2"}
  ],
  "destination": {"lat": 37.7749, "lng": -122.4194},
  "mode": "driving"
}
Response: 200 OK
{
  "etas": [
    {"id": "driver-1", "eta_seconds": 240, "distance_meters": 1200},
    {"id": "driver-2", "eta_seconds": 480, "distance_meters": 2100}
  ]
}

ETA Matrix

HTTP

GET /api/v1/eta/matrix?origin_h3=892830926cfffff&dest_h3=89283092e3fffff

Live ETA Update (WebSocket)

JSON

// Server pushes every 60 seconds:
{
  "type": "eta_update",
  "session_id": "nav-uuid",
  "remaining_eta_seconds": 1080,
  "remaining_distance_meters": 35200,
  "traffic_ahead": "moderate",
  "route_changed": false
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
440 Login Timeout: WebSocket session expired; reconnect required

Redis: Live Traffic + ETA Cache

traffic:{segment_id} → Hash { speed, confidence, updated_at } (TTL: 120s)
eta_matrix:{origin_h3}:{dest_h3} → INT (eta_seconds) (TTL: 300s)
eta_cache:{origin_geo6}:{dest_geo6}:{mode}:{hour} → INT (TTL: 300s)
hist_speed:{segment_id}:{dow}:{hour} → FLOAT (no TTL, updated weekly)

ClickHouse — Historical Trip Data

SQL

CREATE TABLE completed_trips (
    trip_id         UUID,
    origin_lat      Float64, origin_lng Float64,
    dest_lat        Float64, dest_lng Float64,
    origin_h3       UInt64, dest_h3 UInt64,
    distance_meters UInt32,
    predicted_eta   UInt32, actual_duration UInt32,
    departure_hour  UInt8, day_of_week UInt8,
    is_holiday      UInt8,
    weather         Enum8('clear'=0,'rain'=1,'snow'=2,'fog'=3),
    route_segments  Array(UInt64),
    mode            Enum8('driving'=0,'walking'=1,'cycling'=2),
    created_at      DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (origin_h3, dest_h3, departure_hour, day_of_week);

ML Model Store

Model: gradient_boosted_trees_v23.pkl
  50M trips trained, 50 features, ~50 MB in memory
  Inference time: < 1 ms per prediction
  Retrained weekly, A/B tested before deployment

Concern	Solution
Traffic service down	Fall back to historical speed patterns
ML model error	Circuit breaker: bypass ML if latency > 50ms or error > 5%
Route engine crash	Multiple replicas; fallback to pre-computed ETA matrix
Stale traffic data	TTL on Redis keys; use historical + confidence=LOW flag
GPS data pipeline lag	Flink checkpointing; if lag > 5 min, switch to historical
ETA cache thundering herd	Singleflight: only compute once for concurrent identical requests

Haversine vs Road Distance vs Routing ETA

Level	Method	Time	Accuracy	Use Case
1	Haversine	< 1 μs	50-200% off	Initial filter (5 km)
2	Road Distance (CH)	< 1 ms	Distance correct, time rough	Rank top 20
3	Routing ETA + Traffic + ML	10-50 ms	±10%	Final ETA to user

XGBoost vs Deep Learning for ETA

XGBoost ⭐:
  - Tabular features → GBT excels (often beats DL on structured data)
  - Inference < 1 ms (vs DL: 5-50 ms)
  - Interpretable feature importance
  - Trains in minutes on 50M samples

When DL wins: sequence modeling (LSTM on GPS trace), GNN for spatial dependencies
Google Maps: GBT for base ETA + GNN for spatial traffic prediction

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

ETA Calculation Pipeline: Three-Layer Architecture

Traffic Aggregation Pipeline

Batch ETA: Matching System Integration

Event Bus Design (Kafka)

Single ETA

Batch ETA

ETA Matrix

Live ETA Update (WebSocket)

Common Error Responses

Redis: Live Traffic + ETA Cache

ClickHouse — Historical Trip Data

ML Model Store

Interview Walkthrough

Live ETA Updates: Smoothing

Time-Dependent Routing

ETA Percentiles

Haversine vs Road Distance vs Routing ETA

XGBoost vs Deep Learning for ETA

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR