This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | The geo + real-time matching problem. Nail S2/H3 indexing, supply/demand dispatch loop, ETA service, surge pricing inputs, and trip state machine with idempotent transitions. |
| Arch 50 | Add driver location streaming at scale (1M updates/sec), batched matching optimization, and fraud (GPS spoofing) detection. |
| Arch 75 | Staff: multi-city failover, marketplace fairness (driver starvation), and how to run surge without price gouging backlash while maintaining supply equilibrium. |
Interview Prompt
Design a ride-hailing platform like Uber. Riders request trips, nearby drivers are matched, drivers navigate to pickup and dropoff. Support real-time ETAs, dynamic pricing (surge), and trip lifecycle from request to payment.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What city scale — riders, drivers, concurrent active trips? | NYC-scale: 100K concurrent trips, 50K online drivers. Drives location update throughput and matching service QPS. |
| Matching model — nearest driver, or batch optimal assignment? | Greedy nearest is O(log n) with geo index. Batch (Hungarian/min-cost flow) is better globally but adds 2–5 sec latency — use for scheduled rides. |
| Surge — real-time multiplier or upfront guaranteed price? | Upfront price needs predicted supply/demand at trip end time. Real-time surge is simpler but surprises riders. |
| Trip state consistency — what happens if driver app crashes mid-trip? | State machine must be server-authoritative with client sync on reconnect; idempotent state transitions. |
Scope
In scope
- Driver location ingestion and geospatial indexing
- Supply/demand matching and dispatch
- ETA calculation (pickup and trip)
- Surge pricing based on supply/demand ratio
- Trip state machine and payment trigger
Out of scope (state explicitly)
- Full payment processing (#24)
- In-app navigation map rendering (#54)
- Driver onboarding and background checks
- Food delivery multi-stop (#21)
Assumptions
- Single metro: 500K DAU riders, 50K online drivers peak
- Driver location update every 4 sec when online
- Matching SLA: offer to driver within 5 sec of request
- 99.95% availability for active trip tracking
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Rider: Request a ride by specifying pickup and dropoff locations
- Driver: Go online/offline, accept/decline ride requests, navigate to pickup
- Matching: Match riders with nearby available drivers in real-time
- ETA: Show estimated time of arrival for pickup and trip
- Pricing: Dynamic pricing (surge pricing) based on supply-demand
- Real-time tracking: Both rider and driver see each other's live location on a map
- Trip lifecycle: Request → Match → Pickup → In-Trip → Dropoff → Payment → Rating
- Payment: Charge rider, pay driver (commission model)
- Rating: Mutual rating (rider rates driver, driver rates rider)
- Ride history: View past trips, receipts
- Low Latency: Match rider with driver in < 5 seconds
- High Availability: 99.99%: downtime means stranded riders
- Real-time: Location updates every 3-5 seconds from all active drivers
- Scalability: 100M+ riders, 5M+ drivers, 20M+ rides/day
- Consistency: Ride-to-driver matching must be exactly one-to-one (no double booking)
- Geo-distributed: Multi-city, multi-country
- Fault Tolerant: A matched ride must never be lost
| Metric | Calculation | Value |
|---|---|---|
| Active drivers at any time | Given (assumption documented in value) | 2M |
| Location updates / sec | 2M ÷ 4s | 500K |
| Rides / day | Given (assumption documented in value) | 20M |
| Rides / sec | 20M ÷ 86400 | ~230 (peak 1,000) |
| Location update size | Given (assumption documented in value) | 100 bytes |
| Location data / day | 500K × 100B × 86400 | ~4 TB |
Location Service: Handling 500K Location Updates/Sec
Ingestion: Driver app sends GPS coordinates every 4 seconds via WebSocket. Location Service receives and updates the in-memory geospatial index. Also publishes to Kafka driver-location topic for analytics and trip tracking.
Geospatial Index Options:
- GeoHash + Redis: Encode lat/lng into a string. Nearby points share a prefix. Simple but Redis single-threaded bottleneck for 500K updates/sec.
- QuadTree (⭐ recommended for Uber-scale): Recursively subdivide 2D space into 4 quadrants. Each leaf node contains max ~100 drivers. Entire QuadTree fits in memory (~2M drivers × 64 bytes = 128 MB). Shard by city.
- S2 Geometry (Google's approach): Maps earth's surface onto a sphere, then to a cube, then to 6 squares. Each cell has a unique 64-bit Cell ID.
- H3 (Uber's Actual Index): Hexagonal grid system. Each hex cell has a unique ID at various resolutions (0-15). Resolution 9 ≈ 0.1 km^2 per cell. Finding neighbors is O(1): hexagons tile evenly.
Matching Service: The Core Algorithm
- Rider requests a ride at location (lat, lng)
- Find nearby available drivers (within 5 km radius) via geospatial index
- Rank candidates by: distance, ETA (accounting for traffic), driver rating, acceptance rate, vehicle type match
- Send ride request to top-ranked driver
- Driver has 15 seconds to accept/decline. If declined or timed out → send to next driver. If accepted → create trip, notify rider
Consistency: Use distributed lock on driver_id to prevent double-matching: Redis SET driver:lock:{driver_id} {trip_id} NX EX 30
Pricing Service: Surge Pricing
Surge multiplier based on local supply-demand: surge = demand_in_area / supply_in_area. Cap at 8x. Divide city into hexagonal cells (H3 resolution 7). Count riders requesting rides (demand) and available drivers (supply) in each cell. Compute surge per cell every 30 seconds. Store in Redis.
Fare Calculation: fare = base_fare + (per_minute × duration) + (per_mile × distance) + booking_fee + tolls, then multiplied by surge multiplier, floored at minimum_fare.
ETA Service
- Uses routing engine (OSRM, Google Maps Directions API, or Uber's in-house)
- Pre-computation: For each driver candidate, compute ETA to rider (road distance, not straight line)
- Real-time traffic: Aggregate driver speed data across road segments → build live traffic model
- ETA for trip: route from pickup to dropoff considering current traffic
Ride Service: Trip State Machine
Manages the full lifecycle: REQUESTED → MATCHED → DRIVER_EN_ROUTE → ARRIVED → IN_TRIP → COMPLETED → PAYMENT_DONE. Cancellation possible at any stage with appropriate refund/fee. Each state transition is a PostgreSQL transaction with optimistic locking. On each transition, publish to Kafka trip-events topic.
Payment Service
- Triggered on trip completion
- Calculate final fare (actual distance × per-mile + actual time × per-min + surge + tolls + booking fee)
- Charge rider's card via PSP with idempotency_key = trip_id
- Driver payout: Batched weekly via ACH/bank transfer
- Split fare: Multiple riders on same trip, split total ÷ N riders
Notification Service
- Consumes events from Kafka trip-events and triggers push notifications (APNs for iOS, FCM for Android, SMS fallback)
- Real-time tracking during DRIVER_EN_ROUTE and IN_TRIP uses WebSocket (continuous stream, not push notification)
Event Bus Design (Kafka)
Topic: ride_hailing_uber-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "ride_hailing_uber-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: ride_hailing_uber-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Ride-Hailing Service (Uber): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Request Ride
POST /api/v1/rides
{
"rider_id": "user-uuid",
"pickup": {"lat": 37.7749, "lng": -122.4194, "address": "..."},
"dropoff": {"lat": 37.7849, "lng": -122.4094, "address": "..."},
"ride_type": "UberX"
}
Response: 201 Created
{
"ride_id": "ride-uuid",
"status": "matching",
"estimated_fare": {"min": 12.50, "max": 16.00, "surge": 1.2},
"estimated_pickup_eta": "4 min"
}Driver Accept/Decline
POST /api/v1/rides/{ride_id}/accept
POST /api/v1/rides/{ride_id}/declineUpdate Driver Location (WebSocket)
{
"type": "location_update",
"driver_id": "driver-uuid",
"lat": 37.7752,
"lng": -122.4190,
"heading": 45,
"speed": 25,
"timestamp": 1710320000
}Get Ride Status
GET /api/v1/rides/{ride_id}
Response: 200 OK
{
"ride_id": "...",
"status": "in_trip",
"driver": {"name": "...", "vehicle": "...", "rating": 4.9, "location": {...}},
"pickup": {...},
"dropoff": {...},
"eta_to_dropoff": "12 min"
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 440 Login Timeout: WebSocket session expired; reconnect required
PostgreSQL: Trips (ACID required for financial data)
CREATE TABLE trips (
trip_id UUID PRIMARY KEY,
rider_id UUID NOT NULL,
driver_id UUID,
status ENUM('matching','driver_assigned','en_route_pickup',
'arrived','in_trip','completed','cancelled'),
ride_type VARCHAR(20),
pickup_lat DECIMAL(10,7),
pickup_lng DECIMAL(10,7),
dropoff_lat DECIMAL(10,7),
dropoff_lng DECIMAL(10,7),
pickup_address TEXT,
dropoff_address TEXT,
surge_multiplier DECIMAL(3,1),
estimated_fare DECIMAL(10,2),
actual_fare DECIMAL(10,2),
distance_miles DECIMAL(8,2),
duration_minutes DECIMAL(8,2),
started_at TIMESTAMP,
completed_at TIMESTAMP,
created_at TIMESTAMP,
INDEX idx_rider (rider_id, created_at DESC),
INDEX idx_driver (driver_id, created_at DESC)
);Redis: Driver Availability & Location
# Geospatial index
GEOADD drivers:available:{city} {lng} {lat} {driver_id}
# Driver state
Key: driver:state:{driver_id}
Value: Hash { status: "available|on_trip|offline", trip_id, lat, lng, heading, updated_at }
# Surge pricing
Key: surge:{city}:{h3_cell_id}
Value: 1.5
TTL: 60 seconds (refreshed every 30s)
# Driver lock (prevent double-matching)
Key: driver:lock:{driver_id}
Value: trip_id
TTL: 30 secondsKafka Topics
Topic: driver-locations (high throughput, partitioned by driver_id) Topic: trip-events (lifecycle events: created, matched, started, completed) Topic: surge-updates (pricing updates per cell)
Cassandra: Location History
CREATE TABLE trip_location_trail (
trip_id UUID,
timestamp TIMESTAMP,
lat DECIMAL,
lng DECIMAL,
speed FLOAT,
PRIMARY KEY (trip_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)
AND default_time_to_live = 2592000; -- 30 days| Concern | Solution |
|---|---|
| Matching service failure | Retry matching; rider sees "looking for driver..." until timeout |
| Location service lag | Stale driver locations (within 30s) are still usable for matching |
| Payment failure | Retry with idempotency key; fallback to post-trip charging |
| Driver app crash | Trip state persisted in PostgreSQL; driver reconnects and resumes |
| Network partition | Driver/rider continue trip offline; sync state when reconnected |
| Double matching | Redis distributed lock with NX flag ensures atomic driver assignment |
What Happens if a Matched Driver Goes Offline?
- Driver doesn't send heartbeat for 60 seconds
- System detects driver offline → trip status = "driver_unavailable"
- Auto-reassign: Re-enter matching for rider with priority boost
- Notify rider: "Finding a new driver..."
- Original driver penalized (lower acceptance rate score)
Geofencing
- Define virtual boundaries (airports, stadiums, city limits)
- When driver enters a geofence → auto-queue for airport rides
- When rider requests from a geofence → apply special pricing/rules
- H3 hexagons make geofence membership checks efficient
Ride Pooling (UberPool / Share)
- Match multiple riders with similar routes
- Algorithm: For each new request, check if existing in-progress rides have a detour < 5 minutes
- Computationally expensive: requires real-time route comparison
- Riders get discounted fare; driver gets paid for the full route
Safety Features
- Trip sharing: Rider shares trip status with trusted contacts
- Emergency button: Direct call to emergency services with live location
- Driver identity verification: Periodic selfie check
- Route deviation alerts: If driver deviates significantly from planned route
Analytics & Monitoring
- Real-time dashboards: rides in progress, available drivers per city, surge heatmap
- Demand forecasting: ML model predicts demand per area per hour → proactively incentivize drivers to position there
Interview Walkthrough
- Clarify the match flow: rider request → geospatial driver search → dispatch → trip tracking as distinct subsystems.
- Explain geospatial indexing (geohash, S2, or quadtree) for nearby-driver queries with Interview Patterns for read-heavy location data.
- Cover driver location updates via WebSocket at adaptive frequency — balance GPS accuracy against battery and bandwidth.
- Discuss surge pricing as a separate real-time supply/demand signal, not inline in the match path.
- Mention ETA calculation as a precomputed or ML-cached lookup decoupled from the dispatch critical path.
- Common pitfall: scanning all online drivers globally — geospatial partition by city/region is mandatory at scale.
Geospatial Index: QuadTree vs GeoHash vs H3 vs S2
Uber chose H3 because: hexagons tile uniformly (every neighbor is equidistant from center: circles approximate better with hexagons), multi-resolution (Res 9 for matching, Res 7 for surge zones), O(1) neighbor finding via h3.k_ring, and hierarchical containment for natural aggregation.
Why WebSocket for Driver Location (Not HTTP Polling)?
HTTP polling every 4 seconds wastes 120 GB/min for 2M drivers (4 KB overhead per poll). WebSocket uses ~100 bytes per update = 3 GB/min (40x less). WebSocket also enables real-time server-push for ride request delivery to drivers.
Matching Algorithm: Greedy vs Batch Optimization
Greedy assigns immediately to nearest driver → lowest latency but globally suboptimal. Batch Optimization collects requests in a 5-second window and solves the assignment problem (Hungarian algorithm) → globally optimal but higher latency. Uber's actual approach: hybrid: low demand uses greedy, high demand uses batch optimization with 2-second windows.
Why PostgreSQL for Trips (Not Cassandra/DynamoDB)?
Trips require ACID transactions (state machine transitions must be atomic), complex queries (JOINs, aggregations, window functions), foreign key relationships, and row-level locking for concurrent updates. PostgreSQL provides all of these. Cassandra has no transactions or JOINs. DynamoDB has limited transaction scope and query flexibility.
Surge Pricing: Ethical and Technical Considerations
Technically: demand/supply ratio per H3 cell, capped at 8x. Ethically: auto-cap at 1.0 during declared emergencies, show surge multiplier BEFORE booking, provide wait-time option for non-surge rides. Surge pricing is essential to incentivize drivers to high-demand areas and reduce demand to balance supply/demand.
Driver Location Update: Accuracy vs Battery Life vs Bandwidth
Every 4 seconds (⭐ Uber's actual interval): good balance: position error ~10-40m, battery ~5% per hour, bandwidth 90 KB/hour. Adaptive approach: stationary (30s), slow moving (10s), normal (4s), in active ride (2s).
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Single-city MVP
Monolith with PostGIS for driver locations (ST_DWithin queries). Simple nearest-driver greedy match. Fixed pricing. Trip states in PostgreSQL. Polling for location updates every 10 sec.
Key components: Monolith · PostGIS · PostgreSQL trip store · Greedy matching · Fixed pricing
Move to next phase when: PostGIS nearby query p99 > 500ms at 5K drivers; match timeout rate > 15%
Phase 2 — Real-time geo index
H3 cell index in Redis. WebSocket location streaming (4 sec interval). Dedicated Matching + ETA services. Kafka trip events. Surge pricing by zone. Push-based driver offers.
Key components: H3 + Redis geo index · WebSocket ingest · ETA graph service · Surge service · Kafka events
Move to next phase when: Surge needed during rush hour; PostGIS can't handle 12K location updates/sec
Phase 3 — Multi-city platform
City-sharded matching (each metro isolated). ML-enhanced ETA. Batch matching for high-demand events (stadium exit). Multi-region active-active for trip tracking. Driver repositioning nudges based on demand forecast.
Key components: City sharding · ML ETA · Batch matching · Demand forecasting · Multi-region trip store
Move to next phase when: Cross-city trip data residency requirements; stadium event with 5K simultaneous requests
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Match time p99 | < 5 sec | Rider waits — direct UX impact |
| ETA accuracy p90 | ± 2 min | Trust in upfront pricing |
| Active trip tracking availability | 99.95% | Safety and billing depend on it |
| Location ingest lag p99 | < 2 sec | Stale location → wrong match |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| GPS spoofing — fake drivers flooding a surge zone | Driver count in cell jumps 10× without corresponding trip completions; impossible movement speeds in location stream | Velocity sanity checks; IMU/accelerometer cross-validation on device; shadow-ban spoofed accounts; exclude from surge calculation |
| Matching service deadlock — all drivers show BUSY but trips completed | Available driver count near zero; trip completion events not releasing driver status | Reaper job: reset drivers stuck BUSY > 2 hours without active trip; fix state transition bug; manual ops dashboard to force-release |
| Surge multiplier stuck at 3× after event ends | Rider complaints; surge counter not decaying; supply/demand ratio shows equilibrium but multiplier high | TTL on surge keys (auto-expire after 5 min without refresh); manual zone reset via ops console; postmortem on Flink aggregation lag |
Cost Drivers (Staff lens)
- Location ingest: 50K drivers × 0.25 updates/sec × 200 bytes = ~2.5 GB/hr Kafka — manageable; storage for 90-day replay is costly
- ETA graph queries: cache hit ratio is the cost lever — each 10% miss increase adds ~1K QPS to graph cluster
- Push notifications for driver offers: 500K offers/day × $0.001/push — minor vs driver incentives
Multi-Region & DR
Each metro is a shard (natural isolation). Trip data stays in region for compliance. Cross-region only for roving riders (rare) — route to home region API. Active-active for read-heavy ETA cache; matching is single-region per request.