Design a Surge Pricing System like Uber or Lyft

Interview Prompt

Design a surge pricing system like the one used by Uber or Lyft.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Real-time supply/demand ratio, Dynamic multiplier, Zone-level pricing?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Real-time supply/demand ratio
Dynamic multiplier
Zone-level pricing
Price smoothing
Passenger communication
Capacity estimation with shown math

Out of scope (state explicitly)

Full payment processing (#24)
Turn-by-turn map rendering (#54)
Driver/rider identity verification and background checks

Assumptions

Clarify scale (DAU, QPS, data volume) for the surge pricing system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Dynamic pricing: Adjust ride prices based on the real-time ratio of demand (ride requests) to supply (available drivers)
Geospatial granularity: Different surge multipliers per geographic zone (H3 hexagons)
Real-time computation: Recalculate surge every 60 seconds
Surge display: Show multiplier before booking confirmation
Surge caps: Maximum 8x; emergency caps (1x during disasters)
Driver incentives: Show high-surge zone heatmap to attract supply
Predictive surge: Forecast upcoming demand spikes using ML
Smooth transitions: Gradual ramp up/down to avoid oscillation

Capacity Estimation

Metric	Calculation	Value
Geographic zones (H3 res 7)	H3 grid definition	~98.8M cells globally (oceans included)
Active zones (with activity)	Given	~500K
Recalculation frequency	Given	Every 60 seconds
Ride requests / sec	Assumed peak rate (daily total ÷ 86,400 × peak factor)	100K
Driver location updates / sec	Assumed peak rate (daily total ÷ 86,400 × peak factor)	2M
Surge lookups / sec	Assumed peak rate (daily total ÷ 86,400 × peak factor)	100K
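
The table values can be combined into a few derived rates that size the stream processor and the surge store. The sketch below uses only the table's numbers plus the 5-minute aggregation window from the surge algorithm; spreading recalculation evenly over the 60-second interval is an assumption.

```python
# Inputs copied from the capacity table.
ACTIVE_ZONES = 500_000
RECALC_INTERVAL_S = 60
RIDE_REQUESTS_PER_S = 100_000
DRIVER_UPDATES_PER_S = 2_000_000
SURGE_LOOKUPS_PER_S = 100_000
WINDOW_S = 5 * 60  # 5-minute aggregation window

# Zone multipliers written per second if recalculation is spread evenly.
zone_writes_per_s = ACTIVE_ZONES / RECALC_INTERVAL_S

# Events the stream processor ingests per second.
ingest_per_s = RIDE_REQUESTS_PER_S + DRIVER_UPDATES_PER_S

# Upper bound on events per window if raw events were buffered;
# incremental aggregation only needs per-zone counters.
events_per_window = ingest_per_s * WINDOW_S

# Reads per write on the surge store.
read_write_ratio = SURGE_LOOKUPS_PER_S / zone_writes_per_s

print(f"zone writes/s:     {zone_writes_per_s:,.0f}")
print(f"ingest events/s:   {ingest_per_s:,}")
print(f"events per window: {events_per_window:,}")
print(f"reads per write:   {read_write_ratio:.0f}")
```

This works out to about 8.3K zone writes per second, 2.1M ingested events per second, and roughly 12 surge reads per write, which points toward a read-optimized key-value store for lookups and incremental aggregation on the write side.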

Surge Calculation Algorithm

For each zone (H3 cell, resolution 7, ~5.16 km^2), every 60 seconds:

  demand = count(ride_requests in zone, last 5 minutes) / 5  (requests per minute)
  supply = avg(available_drivers in zone, sampled once per minute over the last 5 minutes)
  
  ratio = demand / max(supply, 1)
  
  surge = piecewise_function(ratio):
    ratio <= 1.0:        1.0 (no surge)
    1.0 < ratio <= 1.5:  1.0 + (ratio - 1.0) * 0.5
    1.5 < ratio <= 3.0:  1.25 + (ratio - 1.5) * 1.0
    ratio > 3.0:         min(2.75 + (ratio - 3.0) * 1.0, 8.0)   (continuous at ratio 3.0, capped at 8x)

  Smoothing (prevent oscillation):
    smoothed = 0.7 * previous_surge + 0.3 * calculated_surge
    
    Without smoothing:
      T=0: high demand -> surge 3x -> riders cancel -> demand drops -> surge 1x
      T=1: riders see 1x -> rush back -> surge 3x -> cycle repeats
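    With smoothing: each 60-second update closes 30% of the remaining gap, so a step change is about two-thirds absorbed after 3 minutes and over 80% after 5. Stable UX.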
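
The calculation and smoothing steps can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. Two choices are assumptions of the sketch: supply is taken as the mean of per-minute driver snapshots, and the top segment continues from 2.75 so the curve has no jump at ratio 3.0. The function names are hypothetical.

```python
from statistics import mean

SURGE_CAP = 8.0
KEEP = 0.7  # weight on the previous multiplier, from the smoothing formula


def surge_from_ratio(ratio: float) -> float:
    """Map a demand/supply ratio to a raw (unsmoothed) multiplier."""
    if ratio <= 1.0:
        return 1.0
    if ratio <= 1.5:
        return 1.0 + (ratio - 1.0) * 0.5
    if ratio <= 3.0:
        return 1.25 + (ratio - 1.5) * 1.0
    # Assumption: continue from 2.75 so there is no jump at ratio 3.0.
    return min(2.75 + (ratio - 3.0) * 1.0, SURGE_CAP)


def smooth(previous: float, calculated: float) -> float:
    """Exponential smoothing: keep 70% of the old value, take 30% of the new."""
    return KEEP * previous + (1.0 - KEEP) * calculated


def zone_surge(requests_last_5m: int, driver_snapshots: list[int], previous: float) -> float:
    demand = requests_last_5m / 5   # requests per minute
    supply = mean(driver_snapshots)  # average available drivers
    ratio = demand / max(supply, 1)
    return smooth(previous, surge_from_ratio(ratio))


# 60 requests in 5 minutes against ~6 available drivers, starting from 1.0x.
current = zone_surge(60, [5, 6, 6, 7, 6], previous=1.0)

# Step response: the raw surge jumps from 1.0 to 3.0 and stays there.
trajectory = []
value = 1.0
for _ in range(5):
    value = smooth(value, 3.0)
    trajectory.append(round(value, 3))
print(current, trajectory)
```

In the step-response loop the smoothed value reaches about 2.31 after three updates and about 2.66 after five, which is the gradual ramp the smoothing is meant to produce.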

H3 Hexagons & Boundary Blending

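Square grid: the 4 edge neighbors and the 4 corner neighbors sit at different distances from the cell center (d vs d*sqrt(2)).
H3 hexagons: all 6 neighbors are approximately equidistant, and a hexagon approximates a circle better than a square does.
  Resolution 7 (~5.16 km^2 per cell): surge zones
  Resolution 9 (~0.1 km^2 per cell, ~175 m edge): precise driver matching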

Boundary blending:
  rider_surge = 0.6 * zone_surge + 0.4 * avg(neighbor_surges)
  Prevents hard surge boundaries (crossing one street changes price dramatically).
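
As a rough sketch of the blending rule, the function below takes a plain mapping from each zone to its neighbors. In a real system that list would come from an H3 library's ring query. The zone IDs, the function name, and the 1.0x default for zones without data are assumptions of this example.

```python
from statistics import mean


def blended_surge(zone, surge_by_zone, neighbors_by_zone,
                  self_weight=0.6, default=1.0):
    """Blend a zone's surge with the average of its neighbors' surges."""
    own = surge_by_zone.get(zone, default)
    neighbor_values = [surge_by_zone.get(n, default)
                       for n in neighbors_by_zone.get(zone, [])]
    if not neighbor_values:
        return own  # isolated zone: nothing to blend with
    return self_weight * own + (1.0 - self_weight) * mean(neighbor_values)


# Hypothetical zone IDs: a 3.0x hotspot surrounded by six 1.5x neighbors.
surge_by_zone = {"center": 3.0, **{f"n{i}": 1.5 for i in range(6)}}
neighbors_by_zone = {"center": [f"n{i}" for i in range(6)]}

rider_surge = blended_surge("center", surge_by_zone, neighbors_by_zone)
print(round(rider_surge, 2))
```

Here the rider in the hotspot sees about 2.4x instead of 3.0x, while a rider just across the boundary is pulled upward by the hotspot, so the price step at the edge shrinks.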

Predictive Surge (ML)

Features: historical demand (same hour/day/week), weather, events (concert end time),
real-time trend (demand increasing?), time of day.

Model: XGBoost per-zone. Prediction horizon: 15-30 min.
Use case: "Zone X will have 3x surge in 15 min" -> show drivers incentive to head there.
Result: supply arrives BEFORE demand spike -> surge is lower -> better UX for everyone.
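
To make the feature list concrete, here is a minimal sketch of how one zone's inputs might be flattened into a numeric row for a gradient-boosted model. The field names, the sine/cosine encoding of time of day, and the -1 sentinel for "no event" are all assumptions of this example; model training is not shown.

```python
import math
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ZoneSnapshot:
    demand_same_hour_last_week: float      # historical requests per minute
    demand_now: float                      # current requests per minute
    demand_5_min_ago: float                # requests per minute, 5 minutes earlier
    is_raining: bool                       # weather signal
    minutes_to_event_end: Optional[float]  # None when no event is nearby


def feature_row(snapshot: ZoneSnapshot, now: datetime) -> list:
    """Flatten one zone's signals into a numeric feature vector."""
    # Encode time of day on a circle so 23:59 and 00:00 end up close together.
    hour_angle = 2 * math.pi * (now.hour + now.minute / 60) / 24
    event = snapshot.minutes_to_event_end
    return [
        snapshot.demand_same_hour_last_week,
        snapshot.demand_now - snapshot.demand_5_min_ago,  # real-time trend
        1.0 if snapshot.is_raining else 0.0,
        event if event is not None else -1.0,
        math.sin(hour_angle),
        math.cos(hour_angle),
        float(now.weekday()),  # 0 = Monday
    ]


row = feature_row(
    ZoneSnapshot(12.0, 18.0, 11.0, True, 20.0),
    datetime(2024, 1, 5, 18, 0),
)
print(row)
```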

Flink Window Semantics

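Tumbling window (5 min): one result per non-overlapping 5-minute window; counts start from zero in each new window.
  A burst at minute 4:59 is counted in one window and is entirely absent from the next, so surge can fall from high to 1x in a single step.
  Results also refresh only every 5 minutes. Sudden jumps at window boundaries = bad UX.

Sliding window (5 min window, 1 min slide):
  Each result covers the 5 minutes ending at that evaluation.
  Every 1 minute, Flink emits a new result with updated counts.
  Consecutive windows share 4 of their 5 minutes, so surge changes gradually with no sudden resets.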
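
The difference between the two window types can be seen in a small simulation. This is plain Python rather than Flink, and the event timestamps are invented for illustration: 40 requests arrive during the first five minutes and none afterwards.

```python
def window_counts(event_minutes, size, slide, horizon):
    """Count events in each window [end - size, end), keyed by window end."""
    return {
        end: sum(1 for t in event_minutes if end - size <= t < end)
        for end in range(slide, horizon + 1, slide)
    }


# Hypothetical burst: 10 requests in each of minutes 1 through 4.
events = [1.5] * 10 + [2.5] * 10 + [3.5] * 10 + [4.5] * 10

tumbling = window_counts(events, size=5, slide=5, horizon=10)
sliding = window_counts(events, size=5, slide=1, horizon=10)

print("tumbling:", tumbling)
print("sliding: ", sliding)
```

The tumbling result goes from 40 straight to 0 at the second boundary, while the sliding result steps down 40, 30, 20, 10, 0 over minutes 6 to 10, which is the gradual behavior the surge calculation needs.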

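Implementation in Flink (Java sketch; RideRequest, DemandSupplyAggregator, SurgeCalculator, RedisSink, and the eventTimeMillis field are application-defined, not Flink built-ins):
  import java.time.Duration;
  import org.apache.flink.api.common.eventtime.WatermarkStrategy;
  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
  import org.apache.flink.streaming.api.windowing.time.Time;
  DataStream<RideRequest> requests = ...  // driver availability events must reach the same keyed stream for supply counts
  requests
    .assignTimestampsAndWatermarks(WatermarkStrategy
        .<RideRequest>forBoundedOutOfOrderness(Duration.ofSeconds(10))  // tolerate 10 s of disorder
        .withTimestampAssigner((event, ts) -> event.eventTimeMillis))
    .keyBy(event -> event.zoneId)
    .window(SlidingEventTimeWindows.of(
        Time.minutes(5), Time.minutes(1)))      // 5-minute window, new result every minute
    .aggregate(new DemandSupplyAggregator())    // per-zone demand and supply counts
    .map(new SurgeCalculator())                 // ratio -> multiplier, then smoothing
    .addSink(new RedisSink());                  // placeholder for a configured Redis sink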

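Watermark strategy:
  Allow 10 seconds of out-of-orderness for late GPS events.
  Events that arrive after the watermark has passed their window (more than 10 seconds behind the newest event time seen) are dropped from that window (acceptable for surge accuracy).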
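
The effect of the watermark on late events can be approximated in plain Python. This is a simplified model in whole seconds, not Flink's implementation: the watermark trails the newest event time by the allowed out-of-orderness, and a window stops accepting events once the watermark reaches its end. The function names are hypothetical.

```python
def window_ends(event_ts, size_s=300, slide_s=60):
    """End times of every sliding window [end - size, end) containing event_ts."""
    first_end = (event_ts // slide_s + 1) * slide_s
    return [first_end + i * slide_s for i in range(size_s // slide_s)]


def watermark(max_event_ts_seen, out_of_orderness_s=10):
    """Bounded out-of-orderness: trail the newest event time by a fixed delay."""
    return max_event_ts_seen - out_of_orderness_s


def accepting_windows(event_ts, current_watermark):
    """Windows that have not fired yet and can still count this event."""
    return [end for end in window_ends(event_ts) if end > current_watermark]


wm = watermark(675)                       # newest event seen at t = 675 s
slightly_late = accepting_windows(670, wm)  # 5 s behind the newest event
late = accepting_windows(655, wm)           # 20 s behind
very_late = accepting_windows(300, wm)      # more than 6 minutes behind
print(wm, slightly_late, late, very_late)
```

With a sliding window, an event that is 20 seconds late misses only the window that already fired and is still counted in the four later windows that contain it. Only an event older than every one of its windows is lost completely.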

Failure Handling

Concern	Solution
Surge service down	Default to 1.0x: underprice rather than block rides
Stale surge data	TTL 120s; if expired, use historical pattern or 1.0x
Location data lag	Use last known positions; degrade gracefully
Emergency events	Admin override: SET surge_cap:{city} 1.0, which takes effect on the next lookup
Oscillation	Exponential smoothing prevents wild swings
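
The stale-data and service-down rows amount to a fallback chain on the read path. A minimal sketch follows, with in-memory dictionaries standing in for the surge cache and the historical-pattern store; the data shapes and the function name are assumptions.

```python
import time

TTL_S = 120
DEFAULT_SURGE = 1.0


def lookup_surge(zone, cache, historical, now=None):
    """Return (multiplier, source), falling back live -> historical -> 1.0x.

    cache maps zone -> (multiplier, written_at_epoch_seconds).
    historical maps zone -> typical multiplier for this time slot.
    """
    now = time.time() if now is None else now
    entry = cache.get(zone)
    if entry is not None:
        multiplier, written_at = entry
        if now - written_at <= TTL_S:
            return multiplier, "live"
    if zone in historical:
        return historical[zone], "historical"
    return DEFAULT_SURGE, "default"  # underprice rather than block rides


cache = {"z1": (2.0, 1000.0), "z2": (3.0, 800.0)}
historical = {"z2": 1.4}

fresh = lookup_surge("z1", cache, historical, now=1060.0)
stale = lookup_surge("z2", cache, historical, now=1060.0)
missing = lookup_surge("z3", cache, historical, now=1060.0)
print(fresh, stale, missing)
```

Returning the source alongside the multiplier makes it easy to emit a metric for how often lookups fall back, which is a useful early signal that the pipeline is lagging.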

Ethical Considerations

Surge during emergencies (hurricane, attack):
  Auto-detect: demand > 10x normal AND news API reports emergency -> cap at 1.0x
  Manual override: ops can cap any city instantly
  Regulatory: some cities mandate caps (NYC: 2.5x during emergencies)
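
The emergency rules above can be expressed as a small cap-selection function. This is a sketch under assumptions: the emergency flag is taken as an already-computed input (the news-API integration is not modeled), and the function names are hypothetical.

```python
DEFAULT_CAP = 8.0
EMERGENCY_CAP = 1.0


def effective_cap(demand_rate, normal_rate, emergency_reported, manual_cap=None):
    """Pick the surge cap for a city given emergency signals and ops overrides."""
    cap = DEFAULT_CAP
    # Auto-detect: demand above 10x normal AND an emergency has been reported.
    if emergency_reported and demand_rate > 10 * normal_rate:
        cap = EMERGENCY_CAP
    # Manual or regulatory override can only tighten the cap.
    if manual_cap is not None:
        cap = min(cap, manual_cap)
    return cap


def capped_surge(surge, cap):
    """Clamp the multiplier to [1.0, cap]."""
    return max(1.0, min(surge, cap))


print(capped_surge(4.2, effective_cap(1200, 100, True)))
print(capped_surge(4.2, effective_cap(200, 100, False, manual_cap=2.5)))
```

Requiring both signals for the automatic cap avoids suppressing surge during an ordinary demand spike such as a concert letting out.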

Transparency:
  Show surge BEFORE booking (rider chooses to accept or wait)
  "Wait and save" option: "Surge likely to decrease in ~10 minutes"
  Fare estimate with surge shown prominently (no surprise at end)

Why surge is necessary:
  1. Incentivizes drivers to high-demand areas (supply response)
  2. Reduces demand (riders who can wait, do wait)
  3. Without surge: available drivers are exhausted quickly in high-demand periods, so wait times spike and many requests go unfulfilled -> worse for everyone

Zone Size Trade-off

Large zones (10 km^2):
  ✓ More data points per zone -> more accurate demand/supply estimate
  ✗ Masks hyperlocal demand (airport vs nearby residential)

Small zones (0.5 km^2):
  ✓ Precise surge reflecting local conditions
  ✗ Fewer data points -> noisy, unreliable estimates
  ✗ "Surge boundary" problem: crossing one street changes price

Sweet spot: H3 resolution 7 (~5 km^2) with neighbor blending.
For airports/stadiums: use resolution 8 (~0.74 km^2) custom zones.
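
The "fewer data points -> noisy" point can be quantified with a rough model. If requests in a zone are assumed to arrive as a Poisson process, a count with mean N has a relative standard deviation of 1/sqrt(N). The request density below is a made-up figure for illustration.

```python
import math


def relative_noise(requests_per_km2, zone_area_km2):
    """Relative standard deviation of a Poisson count with the given mean."""
    expected_count = requests_per_km2 * zone_area_km2
    return 1.0 / math.sqrt(expected_count)


DENSITY = 4.0  # hypothetical: requests per km^2 per 5-minute window

for label, area in [("large (10 km^2)", 10.0),
                    ("H3 res 7 (5.16 km^2)", 5.16),
                    ("small (0.5 km^2)", 0.5)]:
    print(f"{label}: {relative_noise(DENSITY, area):.0%} noise")
```

At this density the large zone's demand estimate fluctuates by roughly 16% from sampling noise alone, the resolution 7 cell by about 22%, and the small zone by about 71%, which is why small zones need blending or longer windows.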

Driver Supply Response: Feedback Loop

Surge creates a feedback loop:

  High demand -> surge rises -> drivers see high-surge zone on heatmap
  -> drivers drive to that zone -> supply increases -> surge decreases

Measurement: "supply elasticity to surge"
  How many extra drivers appear per 1.0 increase in the surge multiplier?
  Illustrative figures (calibrate against real data): 1.5x surge attracts ~30% more drivers within 10 minutes;
  3.0x surge attracts ~100% more drivers within 15 minutes

Driver incentive push notification:
  When zone surge > 2.0x AND duration > 3 minutes:
    Push to drivers within 10 km:
    "High demand in Downtown! Earn 2.3x fares. Estimated $45 for next ride."
    
  Only push if driver is online, not in a ride, and hasn't been pushed in last 15 min
  (prevent notification fatigue)
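
The push rules can be written as a single eligibility check. The thresholds come from the rules above; the `Driver` fields and the function name are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SURGE = 2.0
MIN_DURATION_MIN = 3.0
MAX_DISTANCE_KM = 10.0
PUSH_COOLDOWN_MIN = 15.0


@dataclass
class Driver:
    online: bool
    in_ride: bool
    distance_km: float                       # distance to the surging zone
    minutes_since_last_push: Optional[float]  # None if never pushed


def should_push(zone_surge: float, surge_duration_min: float, driver: Driver) -> bool:
    """Apply the zone-level trigger, then the per-driver filters."""
    zone_qualifies = zone_surge > MIN_SURGE and surge_duration_min > MIN_DURATION_MIN
    if not zone_qualifies:
        return False
    if not driver.online or driver.in_ride:
        return False
    if driver.distance_km > MAX_DISTANCE_KM:
        return False
    last = driver.minutes_since_last_push
    # Cooldown prevents notification fatigue.
    return last is None or last >= PUSH_COOLDOWN_MIN


idle_nearby = Driver(online=True, in_ride=False, distance_km=4.0, minutes_since_last_push=None)
recently_pushed = Driver(online=True, in_ride=False, distance_km=4.0, minutes_since_last_push=5.0)
print(should_push(2.3, 4.0, idle_nearby), should_push(2.3, 4.0, recently_pushed))
```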

Surge decay on supply arrival:
  As drivers arrive -> supply increases -> smoothed surge decreases
  Important: decay must be gradual (smoothing keeps 0.7 of the previous value on each update)
  If decay is too fast: drivers arrive, surge drops, drivers leave, surge rises again
  (ping-pong effect)

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
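
The availability and error-rate targets translate directly into budgets. The arithmetic below is a small sketch; the 30-day window and the use of the 100K lookups/sec figure as the request rate are assumptions for illustration.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over the window."""
    return (1.0 - availability) * days * 24 * 60


def error_budget_per_second(requests_per_second: float, max_error_rate: float) -> float:
    """How many failed requests per second stay within the error-rate target."""
    return requests_per_second * max_error_rate


print(f"99.95% over 30 days: {downtime_budget_minutes(0.9995):.1f} min")
print(f"99.9%  over 30 days: {downtime_budget_minutes(0.999):.1f} min")
print(f"0.1% of 100K req/s:  {error_budget_per_second(100_000, 0.001):.0f} errors/s")
```

A 99.95% target leaves about 21.6 minutes of downtime per 30 days, half of what 99.9% allows, so a single slow manual failover can consume most of the monthly budget.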

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.