Design a Ride-Hailing Service (Uber) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	The geo + real-time matching problem. Nail S2/H3 indexing, supply/demand dispatch loop, ETA service, surge pricing inputs, and trip state machine with idempotent transitions.
Arch 50	Add driver location streaming at scale (1M updates/sec), batched matching optimization, and fraud (GPS spoofing) detection.
Arch 75	Staff: multi-city failover, marketplace fairness (driver starvation), and how to run surge without price gouging backlash while maintaining supply equilibrium.

Interview Prompt

Design a ride-hailing platform like Uber. Riders request trips, nearby drivers are matched, drivers navigate to pickup and dropoff. Support real-time ETAs, dynamic pricing (surge), and trip lifecycle from request to payment.

Clarifying Questions (ask before designing)

Question	Why it matters
What city scale — riders, drivers, concurrent active trips?	NYC-scale: 100K concurrent trips, 50K online drivers. Drives location update throughput and matching service QPS.
Matching model — nearest driver, or batch optimal assignment?	Greedy nearest is O(log n) with geo index. Batch (Hungarian/min-cost flow) is better globally but adds 2–5 sec latency — use for scheduled rides.
Surge — real-time multiplier or upfront guaranteed price?	Upfront price needs predicted supply/demand at trip end time. Real-time surge is simpler but surprises riders.
Trip state consistency — what happens if driver app crashes mid-trip?	State machine must be server-authoritative with client sync on reconnect; idempotent state transitions.

Scope

In scope

Driver location ingestion and geospatial indexing
Supply/demand matching and dispatch
ETA calculation (pickup and trip)
Surge pricing based on supply/demand ratio
Trip state machine and payment trigger

Out of scope (state explicitly)

Assumptions

Single metro: 500K DAU riders, 50K online drivers peak
Driver location update every 4 sec when online
Matching SLA: offer to driver within 5 sec of request
99.95% availability for active trip tracking

Rider: Request a ride by specifying pickup and dropoff locations
Driver: Go online/offline, accept/decline ride requests, navigate to pickup
Matching: Match riders with nearby available drivers in real-time
ETA: Show estimated time of arrival for pickup and trip
Pricing: Dynamic pricing (surge pricing) based on supply-demand
Real-time tracking: Both rider and driver see each other's live location on a map
Trip lifecycle: Request → Match → Pickup → In-Trip → Dropoff → Payment → Rating
Payment: Charge rider, pay driver (commission model)
Rating: Mutual rating (rider rates driver, driver rates rider)
Ride history: View past trips, receipts

Metric	Calculation	Value
Active drivers at any time	Given (assumption documented in value)	2M
Location updates / sec	2M ÷ 4s	500K
Rides / day	Given (assumption documented in value)	20M
Rides / sec	20M ÷ 86400	~230 (peak 1,000)
Location update size	Given (assumption documented in value)	100 bytes
Location data / day	500K × 100B × 86400	~4 TB

Loading...

Location Service: Handling 500K Location Updates/Sec

Ingestion: Driver app sends GPS coordinates every 4 seconds via WebSocket. Location Service receives and updates the in-memory geospatial index. Also publishes to Kafka driver-location topic for analytics and trip tracking.

Geospatial Index Options:

GeoHash + Redis: Encode lat/lng into a string. Nearby points share a prefix. Simple but Redis single-threaded bottleneck for 500K updates/sec.
QuadTree (⭐ recommended for Uber-scale): Recursively subdivide 2D space into 4 quadrants. Each leaf node contains max ~100 drivers. Entire QuadTree fits in memory (~2M drivers × 64 bytes = 128 MB). Shard by city.
S2 Geometry (Google's approach): Maps earth's surface onto a sphere, then to a cube, then to 6 squares. Each cell has a unique 64-bit Cell ID.
H3 (Uber's Actual Index): Hexagonal grid system. Each hex cell has a unique ID at various resolutions (0-15). Resolution 9 ≈ 0.1 km^2 per cell. Finding neighbors is O(1): hexagons tile evenly.

Matching Service: The Core Algorithm

Rider requests a ride at location (lat, lng)
Find nearby available drivers (within 5 km radius) via geospatial index
Rank candidates by: distance, ETA (accounting for traffic), driver rating, acceptance rate, vehicle type match
Send ride request to top-ranked driver
Driver has 15 seconds to accept/decline. If declined or timed out → send to next driver. If accepted → create trip, notify rider

Consistency: Use distributed lock on driver_id to prevent double-matching: Redis SET driver:lock:{driver_id} {trip_id} NX EX 30

Pricing Service: Surge Pricing

Surge multiplier based on local supply-demand: surge = demand_in_area / supply_in_area. Cap at 8x. Divide city into hexagonal cells (H3 resolution 7). Count riders requesting rides (demand) and available drivers (supply) in each cell. Compute surge per cell every 30 seconds. Store in Redis.

Fare Calculation: fare = base_fare + (per_minute × duration) + (per_mile × distance) + booking_fee + tolls, then multiplied by surge multiplier, floored at minimum_fare.

ETA Service

Uses routing engine (OSRM, Google Maps Directions API, or Uber's in-house)
Pre-computation: For each driver candidate, compute ETA to rider (road distance, not straight line)
Real-time traffic: Aggregate driver speed data across road segments → build live traffic model
ETA for trip: route from pickup to dropoff considering current traffic

Ride Service: Trip State Machine

Manages the full lifecycle: REQUESTED → MATCHED → DRIVER_EN_ROUTE → ARRIVED → IN_TRIP → COMPLETED → PAYMENT_DONE. Cancellation possible at any stage with appropriate refund/fee. Each state transition is a PostgreSQL transaction with optimistic locking. On each transition, publish to Kafka trip-events topic.

Loading...

Payment Service

Triggered on trip completion
Calculate final fare (actual distance × per-mile + actual time × per-min + surge + tolls + booking fee)
Charge rider's card via PSP with idempotency_key = trip_id
Driver payout: Batched weekly via ACH/bank transfer
Split fare: Multiple riders on same trip, split total ÷ N riders

Notification Service

Consumes events from Kafka trip-events and triggers push notifications (APNs for iOS, FCM for Android, SMS fallback)
Real-time tracking during DRIVER_EN_ROUTE and IN_TRIP uses WebSocket (continuous stream, not push notification)

Event Bus Design (Kafka)

Topic: ride_hailing_uber-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ride_hailing_uber-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: ride_hailing_uber-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a Ride-Hailing Service (Uber): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Request Ride

HTTP

POST /api/v1/rides
{
  "rider_id": "user-uuid",
  "pickup": {"lat": 37.7749, "lng": -122.4194, "address": "..."},
  "dropoff": {"lat": 37.7849, "lng": -122.4094, "address": "..."},
  "ride_type": "UberX"
}
Response: 201 Created
{
  "ride_id": "ride-uuid",
  "status": "matching",
  "estimated_fare": {"min": 12.50, "max": 16.00, "surge": 1.2},
  "estimated_pickup_eta": "4 min"
}

Driver Accept/Decline

HTTP

POST /api/v1/rides/{ride_id}/accept
POST /api/v1/rides/{ride_id}/decline

Update Driver Location (WebSocket)

JSON

{
  "type": "location_update",
  "driver_id": "driver-uuid",
  "lat": 37.7752,
  "lng": -122.4190,
  "heading": 45,
  "speed": 25,
  "timestamp": 1710320000
}

Get Ride Status

HTTP

GET /api/v1/rides/{ride_id}
Response: 200 OK
{
  "ride_id": "...",
  "status": "in_trip",
  "driver": {"name": "...", "vehicle": "...", "rating": 4.9, "location": {...}},
  "pickup": {...},
  "dropoff": {...},
  "eta_to_dropoff": "12 min"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
440 Login Timeout: WebSocket session expired; reconnect required

PostgreSQL: Trips (ACID required for financial data)

SQL

CREATE TABLE trips (
    trip_id         UUID PRIMARY KEY,
    rider_id        UUID NOT NULL,
    driver_id       UUID,
    status          ENUM('matching','driver_assigned','en_route_pickup',
                         'arrived','in_trip','completed','cancelled'),
    ride_type       VARCHAR(20),
    pickup_lat      DECIMAL(10,7),
    pickup_lng      DECIMAL(10,7),
    dropoff_lat     DECIMAL(10,7),
    dropoff_lng     DECIMAL(10,7),
    pickup_address  TEXT,
    dropoff_address TEXT,
    surge_multiplier DECIMAL(3,1),
    estimated_fare  DECIMAL(10,2),
    actual_fare     DECIMAL(10,2),
    distance_miles  DECIMAL(8,2),
    duration_minutes DECIMAL(8,2),
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    created_at      TIMESTAMP,
    INDEX idx_rider (rider_id, created_at DESC),
    INDEX idx_driver (driver_id, created_at DESC)
);

Redis: Driver Availability & Location

# Geospatial index
GEOADD drivers:available:{city} {lng} {lat} {driver_id}

# Driver state
Key:    driver:state:{driver_id}
Value:  Hash { status: "available|on_trip|offline", trip_id, lat, lng, heading, updated_at }

# Surge pricing
Key:    surge:{city}:{h3_cell_id}
Value:  1.5
TTL:    60 seconds (refreshed every 30s)

# Driver lock (prevent double-matching)
Key:    driver:lock:{driver_id}
Value:  trip_id
TTL:    30 seconds

Kafka Topics

Topic: driver-locations       (high throughput, partitioned by driver_id)
Topic: trip-events            (lifecycle events: created, matched, started, completed)
Topic: surge-updates          (pricing updates per cell)

Cassandra: Location History

SQL

CREATE TABLE trip_location_trail (
    trip_id     UUID,
    timestamp   TIMESTAMP,
    lat         DECIMAL,
    lng         DECIMAL,
    speed       FLOAT,
    PRIMARY KEY (trip_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)
  AND default_time_to_live = 2592000;  -- 30 days

Concern	Solution
Matching service failure	Retry matching; rider sees "looking for driver..." until timeout
Location service lag	Stale driver locations (within 30s) are still usable for matching
Payment failure	Retry with idempotency key; fallback to post-trip charging
Driver app crash	Trip state persisted in PostgreSQL; driver reconnects and resumes
Network partition	Driver/rider continue trip offline; sync state when reconnected
Double matching	Redis distributed lock with NX flag ensures atomic driver assignment

What Happens if a Matched Driver Goes Offline?

Driver doesn't send heartbeat for 60 seconds
System detects driver offline → trip status = "driver_unavailable"
Auto-reassign: Re-enter matching for rider with priority boost
Notify rider: "Finding a new driver..."
Original driver penalized (lower acceptance rate score)

Geospatial Index: QuadTree vs GeoHash vs H3 vs S2

Uber chose H3 because: hexagons tile uniformly (every neighbor is equidistant from center: circles approximate better with hexagons), multi-resolution (Res 9 for matching, Res 7 for surge zones), O(1) neighbor finding via h3.k_ring, and hierarchical containment for natural aggregation.

Why WebSocket for Driver Location (Not HTTP Polling)?

HTTP polling every 4 seconds wastes 120 GB/min for 2M drivers (4 KB overhead per poll). WebSocket uses ~100 bytes per update = 3 GB/min (40x less). WebSocket also enables real-time server-push for ride request delivery to drivers.

Matching Algorithm: Greedy vs Batch Optimization

Greedy assigns immediately to nearest driver → lowest latency but globally suboptimal. Batch Optimization collects requests in a 5-second window and solves the assignment problem (Hungarian algorithm) → globally optimal but higher latency. Uber's actual approach: hybrid: low demand uses greedy, high demand uses batch optimization with 2-second windows.

Why PostgreSQL for Trips (Not Cassandra/DynamoDB)?

Trips require ACID transactions (state machine transitions must be atomic), complex queries (JOINs, aggregations, window functions), foreign key relationships, and row-level locking for concurrent updates. PostgreSQL provides all of these. Cassandra has no transactions or JOINs. DynamoDB has limited transaction scope and query flexibility.

Surge Pricing: Ethical and Technical Considerations

Technically: demand/supply ratio per H3 cell, capped at 8x. Ethically: auto-cap at 1.0 during declared emergencies, show surge multiplier BEFORE booking, provide wait-time option for non-surge rides. Surge pricing is essential to incentivize drivers to high-demand areas and reduce demand to balance supply/demand.

Driver Location Update: Accuracy vs Battery Life vs Bandwidth

Every 4 seconds (⭐ Uber's actual interval): good balance: position error ~10-40m, battery ~5% per hour, bandwidth 90 KB/hour. Adaptive approach: stationary (30s), slow moving (10s), normal (4s), in active ride (2s).

SLOs & Error Budgets

Metric	Target	Rationale
Match time p99	< 5 sec	Rider waits — direct UX impact
ETA accuracy p90	± 2 min	Trust in upfront pricing
Active trip tracking availability	99.95%	Safety and billing depend on it
Location ingest lag p99	< 2 sec	Stale location → wrong match

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
GPS spoofing — fake drivers flooding a surge zone	Driver count in cell jumps 10× without corresponding trip completions; impossible movement speeds in location stream	Velocity sanity checks; IMU/accelerometer cross-validation on device; shadow-ban spoofed accounts; exclude from surge calculation
Matching service deadlock — all drivers show BUSY but trips completed	Available driver count near zero; trip completion events not releasing driver status	Reaper job: reset drivers stuck BUSY > 2 hours without active trip; fix state transition bug; manual ops dashboard to force-release
Surge multiplier stuck at 3× after event ends	Rider complaints; surge counter not decaying; supply/demand ratio shows equilibrium but multiplier high	TTL on surge keys (auto-expire after 5 min without refresh); manual zone reset via ops console; postmortem on Flink aggregation lag

Cost Drivers (Staff lens)

Location ingest: 50K drivers × 0.25 updates/sec × 200 bytes = ~2.5 GB/hr Kafka — manageable; storage for 90-day replay is costly
ETA graph queries: cache hit ratio is the cost lever — each 10% miss increase adds ~1K QPS to graph cluster
Push notifications for driver offers: 500K offers/day × $0.001/push — minor vs driver incentives

Multi-Region & DR

Each metro is a shard (natural isolation). Trip data stays in region for compliance. Cross-region only for roving riders (rare) — route to home region API. Active-active for read-heavy ETA cache; matching is single-region per request.