Interview Prompt
Design a surge pricing system like Uber's or Lyft's.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Real-time supply/demand ratio, Dynamic multiplier, Zone-level pricing? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Real-time supply/demand ratio
- Dynamic multiplier
- Zone-level pricing
- Price smoothing
- Passenger communication
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full payment processing (#24)
- Turn-by-turn map rendering (#54)
- Driver/rider identity verification and background checks
Assumptions
- Clarify scale (DAU, QPS, data volume) for the surge pricing system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Dynamic pricing: Adjust ride prices based on real-time supply (drivers) / demand (ride requests)
- Geospatial granularity: Different surge multipliers per geographic zone (H3 hexagons)
- Real-time computation: Recalculate surge every 60 seconds
- Surge display: Show multiplier before booking confirmation
- Surge caps: Maximum 8x; emergency caps (1x during disasters)
- Driver incentives: Show high-surge zone heatmap to attract supply
- Predictive surge: Forecast upcoming demand spikes using ML
- Smooth transitions: Gradual ramp up/down to avoid oscillation
- Low Latency: Surge lookup per zone in < 10 ms
- Freshness: Current conditions reflected within 2 minutes
- Scale: 500K active zones, 100K surge lookups/sec, 2M driver GPS/sec
- Availability: 99.99%, because surge lookup is in the critical path for every ride request
| Metric | Calculation | Value |
|---|---|---|
| Geographic zones (H3 res 7) | H3 resolution table | ~98.8M cells globally (land and ocean) |
| Active zones (with activity) | Given | ~500K |
| Recalculation frequency | Given (assumption) | Every 60 seconds |
| Ride requests / sec | Daily ride requests ÷ 86,400 s, scaled by a peak factor | 100K (peak) |
| Driver location updates / sec | Daily driver location updates ÷ 86,400 s, scaled by a peak factor | 2M (peak) |
| Surge lookups / sec | Daily surge lookups ÷ 86,400 s, scaled by a peak factor | 100K (peak) |
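The figures in the table imply a few derived rates that are worth stating out loud in an interview. The sketch below works them out; the 200-byte size per zone entry is an assumption made here for illustration, not a number from the table.

```python
# Back-of-envelope math from the capacity table.
ACTIVE_ZONES = 500_000
RECALC_INTERVAL_S = 60
GPS_UPDATES_PER_S = 2_000_000
SURGE_LOOKUPS_PER_S = 100_000
BYTES_PER_ZONE_ENTRY = 200  # assumption: small hash plus key overhead

# Every active zone is rewritten once per recalculation interval.
zone_writes_per_s = ACTIVE_ZONES / RECALC_INTERVAL_S

# Average GPS event rate each zone's aggregator has to absorb.
gps_updates_per_zone_per_s = GPS_UPDATES_PER_S / ACTIVE_ZONES

# Read-to-write ratio on the current-surge store.
lookups_per_write = SURGE_LOOKUPS_PER_S / zone_writes_per_s

# Working set of the current-surge store, in megabytes.
working_set_mb = ACTIVE_ZONES * BYTES_PER_ZONE_ENTRY / 1_000_000

print(f"zone writes/sec:        {zone_writes_per_s:,.0f}")
print(f"GPS updates/zone/sec:   {gps_updates_per_zone_per_s:.1f}")
print(f"lookups per zone write: {lookups_per_write:.0f}")
print(f"working set:            {working_set_mb:.0f} MB")
```

Under these assumptions the current-surge store is small and read-heavy (roughly 12 reads per write), which is why a single in-memory cache tier is a plausible fit; the heavy lifting is in aggregating the 2M GPS events per second upstream.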
Surge Calculation Algorithm
For each zone (H3 cell, resolution 7 ~ 5.16 km^2), every 60 seconds:
  demand = count(ride_requests in zone, last 5 minutes) / 5   (per-minute rate)
  supply = count(available_drivers in zone, last 5 minutes) / 5
  ratio  = demand / max(supply, 1)
  surge  = piecewise_function(ratio):
    ratio <= 1.0:   1.0 (no surge)
    ratio 1.0-1.5:  1.0 + (ratio - 1.0) * 0.5
    ratio 1.5-3.0:  1.25 + (ratio - 1.5) * 1.0
    ratio > 3.0:    min(ratio, 8.0)
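The piecewise mapping above can be sketched in a few lines of Python. The function names are hypothetical, and the breakpoints are implemented exactly as written, which means the multiplier steps from 2.75 to just above 3.0 as the ratio crosses 3.0; whether that small jump is acceptable is worth raising in the interview.

```python
def surge_from_ratio(ratio: float, cap: float = 8.0) -> float:
    """Map a demand/supply ratio to a surge multiplier (breakpoints as written)."""
    if ratio <= 1.0:
        return 1.0                           # no surge
    if ratio <= 1.5:
        return 1.0 + (ratio - 1.0) * 0.5     # gentle ramp: 1.0 -> 1.25
    if ratio <= 3.0:
        return 1.25 + (ratio - 1.5) * 1.0    # steeper ramp: 1.25 -> 2.75
    return min(ratio, cap)                   # tracks the ratio, capped at 8x


def zone_surge(requests_last_5_min: int, drivers_last_5_min: int) -> float:
    """Compute the raw (unsmoothed) surge for one zone from 5-minute counts."""
    demand = requests_last_5_min / 5         # per-minute rate
    supply = drivers_last_5_min / 5
    ratio = demand / max(supply, 1)          # guard against divide-by-zero
    return surge_from_ratio(ratio)


print(zone_surge(requests_last_5_min=225, drivers_last_5_min=100))
print(surge_from_ratio(3.0), surge_from_ratio(3.01), surge_from_ratio(12.0))
```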
Smoothing (prevent oscillation):
  smoothed = 0.7 * previous_surge + 0.3 * calculated_surge
Without smoothing:
  T=0: high demand -> surge 3x -> riders cancel -> demand drops -> surge 1x
  T=1: riders see 1x -> rush back -> surge 3x -> cycle repeats
With smoothing: gradual change over 3-5 minutes. Stable UX.
H3 Hexagons & Boundary Blending
Grid squares: corner cells have different distances from center than edges.
H3 hexagons: all 6 neighbors equidistant. Better circle approximation.
Resolution 7 (~5 km^2): surge zones
Resolution 9 (~175m): precise driver matching
Boundary blending:
  rider_surge = 0.6 * zone_surge + 0.4 * avg(neighbor_surges)
Prevents hard surge boundaries (crossing one street changes price dramatically).
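Both the exponential smoothing step and the boundary blend are one-line weighted averages. The sketch below, with hypothetical function names, shows how a raw jump from 1.0x to 3.0x is absorbed over several 60-second recalculations, and how a hot zone surrounded by calm neighbors is softened for the rider.

```python
def smooth(previous_surge: float, calculated_surge: float, alpha: float = 0.3) -> float:
    """Exponential smoothing: 0.7 * previous + 0.3 * calculated when alpha = 0.3."""
    return (1 - alpha) * previous_surge + alpha * calculated_surge


def blend(zone_surge: float, neighbor_surges: list, own_weight: float = 0.6) -> float:
    """Blend a zone's surge with the average of its neighbors' surges."""
    if not neighbor_surges:
        return zone_surge                    # isolated zone: nothing to blend with
    neighbor_avg = sum(neighbor_surges) / len(neighbor_surges)
    return own_weight * zone_surge + (1 - own_weight) * neighbor_avg


# Raw surge jumps from 1.0 to 3.0 and stays there for five recalculations.
surge = 1.0
trajectory = []
for _ in range(5):
    surge = smooth(surge, 3.0)
    trajectory.append(round(surge, 2))
print(trajectory)  # [1.6, 2.02, 2.31, 2.52, 2.66]

# A 3.0x zone whose six neighbors are all at 1.0x.
print(round(blend(3.0, [1.0] * 6), 2))  # 2.2
```

After five one-minute steps the smoothed value has covered about 83% of the jump, which matches the "gradual change over 3-5 minutes" behavior described above.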
Predictive Surge (ML)
Features: historical demand (same hour/day/week), weather, events (concert end time), real-time trend (demand increasing?), time of day.
Model: XGBoost per-zone.
Prediction horizon: 15-30 min.
Use case: "Zone X will have 3x surge in 15 min" -> show drivers incentive to head there.
Result: supply arrives BEFORE demand spike -> surge is lower -> better UX for everyone.
Flink Window Semantics
Tumbling window (5 min): counts reset every 5 min.
At minute 4:59 -> high surge. At minute 5:00 -> counter resets to 0 -> surge drops to 1x.
Sudden drops at window boundaries = bad UX.
Sliding window (5 min window, 1 min slide):
At any given second, the window covers the LAST 5 minutes.
Every 1 minute, Flink re-evaluates with updated counts.
No sudden resets. Smooth, continuous surge updates.
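The difference between the two window types is easy to see with a toy event stream. This is a plain-Python illustration, not Flink: it evaluates both counts at arbitrary instants, whereas Flink would emit sliding-window results once per slide. The helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right

WINDOW_S = 300  # 5-minute window


def tumbling_count(sorted_ts, now_s, window_s=WINDOW_S):
    """Events since the start of the current fixed window, up to now."""
    window_start = (now_s // window_s) * window_s
    return bisect_right(sorted_ts, now_s) - bisect_left(sorted_ts, window_start)


def sliding_count(sorted_ts, now_s, window_s=WINDOW_S):
    """Events in the trailing interval (now - window, now]."""
    return bisect_right(sorted_ts, now_s) - bisect_right(sorted_ts, now_s - window_s)


# One ride request every 10 seconds for 10 minutes: t = 0, 10, ..., 590.
requests = list(range(0, 600, 10))

for now in (299, 300):
    print(now, tumbling_count(requests, now), sliding_count(requests, now))
```

With steady demand, the tumbling count collapses from 30 to 1 when the clock crosses the 5-minute boundary, while the sliding count stays at 30 on both sides of it.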
Implementation in Flink:
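import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// RideRequest, DemandSupplyAggregator, SurgeCalculator, and RedisSink are
// application-defined classes, not part of Flink.
DataStream<RideRequest> requests = ...
requests
    .keyBy(event -> event.zoneId)                // one window state per H3 zone
    .window(SlidingEventTimeWindows.of(
        Time.minutes(5), Time.minutes(1)))       // 5-min window, 1-min slide
    .aggregate(new DemandSupplyAggregator())     // incremental counts per window
    .map(new SurgeCalculator())                  // counts -> surge multiplier
    .addSink(new RedisSink());                   // publish the per-zone result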
Watermark strategy:
Allow 10-second out-of-orderness for late GPS events.
Events arriving > 10 seconds late are dropped (acceptable for surge accuracy).
GET /api/v1/surge?lat=37.7749&lng=-122.4194
Response: 200 OK
{
"zone_id": "872830926cfffff",
"surge_multiplier": 2.3,
"estimated_fare": { "base": 15.00, "surged": 34.50 },
"message": "Prices are 2.3x due to high demand",
"updated_at": "2026-03-14T11:00:00Z"
}
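A minimal sketch of how the handler might assemble this response once the zone and multiplier are known. The function name and the no-surge message wording are assumptions made for illustration; resolving lat/lng to an H3 zone is left out.

```python
from datetime import datetime, timezone


def build_surge_response(zone_id: str, multiplier: float,
                         base_fare: float, updated_at: datetime) -> dict:
    """Assemble the surge lookup payload for one zone."""
    surged_fare = round(base_fare * multiplier, 2)   # fare shown before booking
    if multiplier > 1.0:
        message = f"Prices are {multiplier:g}x due to high demand"
    else:
        message = "Standard pricing"                 # hypothetical wording
    return {
        "zone_id": zone_id,
        "surge_multiplier": multiplier,
        "estimated_fare": {"base": base_fare, "surged": surged_fare},
        "message": message,
        "updated_at": updated_at.astimezone(timezone.utc)
                                .strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


response = build_surge_response(
    "872830926cfffff", 2.3, 15.00,
    datetime(2026, 3, 14, 11, 0, 0, tzinfo=timezone.utc),
)
print(response)
```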
GET /api/v1/surge/heatmap?ne_lat=37.82&ne_lng=-122.35&sw_lat=37.70&sw_lng=-122.52
Response: 200 OK
{ "zones": [{ "zone_id": "...", "surge": 2.3, "center": [37.78, -122.41] }, ...] }
Redis: Current Surge
surge:{zone_id} --> Hash { multiplier: 2.3, demand: 45, supply: 20, updated_at: ts }
TTL: 120 seconds (stale if not refreshed)
surge_cap:{city} --> FLOAT (admin override, e.g., 1.0 during emergency)
No TTL (manually removed)
ClickHouse: Historical Surge
CREATE TABLE surge_history (
    zone_id    String,
    multiplier Float32,
    demand     UInt32,
    supply     UInt32,
    timestamp  DateTime,
    date       Date MATERIALIZED toDate(timestamp)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (zone_id, timestamp);
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 402 Payment Required: insufficient funds
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 502 Bad Gateway: payment provider timeout; poll status endpoint
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
| Concern | Solution |
|---|---|
| Surge service down | Default to 1.0x: underprice rather than block rides |
| Stale surge data | TTL 120s; if expired, use historical pattern or 1.0x |
| Location data lag | Use last known positions; degrade gracefully |
| Emergency events | Admin override: SET surge_cap:{city} 1.0 (takes effect immediately) |
| Oscillation | Exponential smoothing prevents wild swings |
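The fallback order in the table (fresh value, then historical pattern, then 1.0x, with any city cap applied last) can be sketched as a single read-path function. The function name and the shape of the cache entry are assumptions for illustration; in production the entry would come from the cache and the cap from the admin override key.

```python
import time

STALE_AFTER_S = 120      # matches the 120-second TTL
DEFAULT_SURGE = 1.0      # underprice rather than block rides


def resolve_surge(entry, city_cap=None, historical=None, now=None):
    """Pick the multiplier to serve, degrading gracefully.

    entry: dict with 'multiplier' and 'updated_at' (epoch seconds), or None
           if the key is missing or the surge store is unreachable.
    """
    now = time.time() if now is None else now
    if entry is not None and now - entry["updated_at"] <= STALE_AFTER_S:
        surge = entry["multiplier"]          # fresh value
    elif historical is not None:
        surge = historical                   # stale or missing: historical pattern
    else:
        surge = DEFAULT_SURGE                # last resort: no surge
    if city_cap is not None:
        surge = min(surge, city_cap)         # admin override always wins
    return surge


fresh = {"multiplier": 2.3, "updated_at": 1_000.0}
print(resolve_surge(fresh, now=1_060.0))                   # fresh value served
print(resolve_surge(fresh, historical=1.4, now=1_500.0))   # stale -> historical
print(resolve_surge(None, now=1_500.0))                    # nothing -> default
print(resolve_surge(fresh, city_cap=1.0, now=1_060.0))     # emergency cap
```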
Interview Walkthrough
- Frame surge as a supply-demand feedback loop, not price gouging — explain how multiplier attracts drivers and dampens rider demand.
- Walk through H3 hex zones (resolution 7, ~5 km²) with neighbor blending to avoid cliff-edge pricing at zone boundaries.
- Explain the ratio: surge = f(demand/supply) with exponential smoothing (α ≈ 0.3) to prevent ping-pong oscillation.
- Cover real-time pipeline: ride requests + driver heartbeats → Flink windowed aggregation → Redis current multiplier per zone.
- Mention transparency: show multiplier before booking with a "wait and save" estimate when surge is decaying.
- Discuss ethical guardrails: auto-cap at 1.0× during detected emergencies, manual ops override, regulatory caps (NYC 2.5×).
- Common pitfall: updating surge instantly on every request — drivers chase a spike that vanishes before they arrive, causing supply whiplash.
Ethical Considerations
Surge during emergencies (hurricane, attack):
- Auto-detect: demand > 10x normal AND news API reports emergency -> cap at 1.0x
- Manual override: ops can cap any city instantly
- Regulatory: some cities mandate caps (NYC: 2.5x during emergencies)
Transparency:
- Show surge BEFORE booking (rider chooses to accept or wait)
- "Wait and save" option: "Surge likely to decrease in ~10 minutes"
- Fare estimate with surge shown prominently (no surprise at end)
Why surge is necessary:
1. Incentivizes drivers to high-demand areas (supply response)
2. Reduces demand (riders who can wait, do wait)
3. Without surge: high-demand periods have ZERO drivers -> worse for everyone
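The emergency guardrails combine an automatic rule with manual and global caps. A minimal sketch follows, assuming the news signal arrives as a boolean and that the tightest applicable cap wins; the function and constant names are hypothetical.

```python
GLOBAL_CAP = 8.0                # maximum multiplier under normal conditions
EMERGENCY_CAP = 1.0             # no surge during a detected emergency
EMERGENCY_DEMAND_FACTOR = 10    # demand must exceed 10x normal


def emergency_detected(current_demand, normal_demand, news_reports_emergency):
    """Both signals are required, so a concert ending does not trip the cap."""
    return (news_reports_emergency
            and current_demand > EMERGENCY_DEMAND_FACTOR * normal_demand)


def capped_surge(surge, current_demand, normal_demand,
                 news_reports_emergency, manual_cap=None):
    """Apply the tightest of the global, manual, and emergency caps."""
    cap = GLOBAL_CAP
    if manual_cap is not None:
        cap = min(cap, manual_cap)           # ops or regulatory override
    if emergency_detected(current_demand, normal_demand, news_reports_emergency):
        cap = min(cap, EMERGENCY_CAP)
    return min(surge, cap)


print(capped_surge(6.0, 500, 40, news_reports_emergency=True))    # capped to 1.0
print(capped_surge(6.0, 500, 40, news_reports_emergency=False))   # stays 6.0
print(capped_surge(6.0, 100, 40, False, manual_cap=2.5))          # capped to 2.5
```

Requiring both signals is a deliberate choice in this sketch: a demand spike alone is ordinary surge territory, and a news report alone may concern a different part of the city.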
Zone Size Trade-off
Large zones (10 km^2):
✓ More data points per zone -> more accurate demand/supply estimate
✗ Masks hyperlocal demand (airport vs nearby residential)
Small zones (0.5 km^2):
✓ Precise surge reflecting local conditions
✗ Fewer data points -> noisy, unreliable estimates
✗ "Surge boundary" problem: crossing one street changes price
Sweet spot: H3 resolution 7 (~5 km^2) with neighbor blending.
For airports/stadiums: use resolution 8 (~0.74 km^2) custom zones.
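The "fewer data points -> noisy" point can be made quantitative under a simple assumption: if requests arrive as a Poisson process, a count with expected value N has a relative standard deviation of 1/sqrt(N). The sketch below applies that to the zone sizes discussed above; the demand density of 4 requests per km^2 per window is a made-up figure for illustration.

```python
import math


def relative_noise(area_km2: float, requests_per_km2: float) -> float:
    """Relative standard deviation of a Poisson count over the zone."""
    expected_requests = area_km2 * requests_per_km2
    return 1 / math.sqrt(expected_requests)


DENSITY = 4.0  # assumption: requests per km^2 per 5-minute window

for label, area in [("large (10 km^2)", 10.0),
                    ("H3 res 7 (5.16 km^2)", 5.16),
                    ("small (0.5 km^2)", 0.5)]:
    print(f"{label:22s} noise ~ {relative_noise(area, DENSITY):.0%}")
```

At this assumed density the small zone's demand estimate fluctuates by roughly 70% from window to window, versus about 16% for the large zone, which is the statistical argument for resolution 7 plus neighbor blending.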
Driver Supply Response: Feedback Loop
Surge creates a feedback loop:
High demand -> surge rises -> drivers see high-surge zone on heatmap
-> drivers drive to that zone -> supply increases -> surge decreases
Measurement: "supply elasticity to surge"
How many extra drivers appear per 1x increase in surge?
Typical: 1.5x surge attracts 30% more drivers within 10 minutes
3.0x surge attracts 100% more drivers within 15 minutes
Driver incentive push notification:
When zone surge > 2.0x AND duration > 3 minutes:
Push to drivers within 10 km:
"High demand in Downtown! Earn 2.3x fares. Estimated $45 for next ride."
Only push if driver is online, not in a ride, and hasn't been pushed in last 15 min
(prevent notification fatigue)
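The push rule above is a conjunction of zone-level and driver-level conditions, which a short predicate makes explicit. The function and parameter names below are hypothetical; thresholds are the ones stated in the text.

```python
from typing import Optional

MIN_SURGE = 2.0               # zone surge must exceed 2.0x
MIN_DURATION_S = 3 * 60       # ... for more than 3 minutes
MAX_DISTANCE_KM = 10.0        # only drivers within 10 km
PUSH_COOLDOWN_S = 15 * 60     # at most one push per 15 minutes


def should_push(zone_surge: float, surge_duration_s: float,
                online: bool, in_ride: bool, distance_km: float,
                seconds_since_last_push: Optional[float]) -> bool:
    """Decide whether to send a high-demand notification to one driver."""
    zone_qualifies = zone_surge > MIN_SURGE and surge_duration_s > MIN_DURATION_S
    cooled_down = (seconds_since_last_push is None          # never pushed before
                   or seconds_since_last_push >= PUSH_COOLDOWN_S)
    return (zone_qualifies and online and not in_ride
            and distance_km <= MAX_DISTANCE_KM and cooled_down)


# Idle driver 4 km away, last pushed 20 minutes ago, zone at 2.3x for 5 minutes.
print(should_push(2.3, 300, online=True, in_ride=False,
                  distance_km=4.0, seconds_since_last_push=1200))
```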
Surge decay on supply arrival:
As drivers arrive -> supply increases -> smoothed surge decreases
Important: decay must be gradual (weight 0.7 on the previous value, i.e. α = 0.3)
If decay is too fast: drivers arrive, surge drops, drivers leave, surge rises again
(ping-pong effect)
Staff interviews expect you to articulate how the system evolves under real growth, not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services that prove the core surge pricing flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.99% | Surge lookup is on the critical path of every ride request; the budget must still absorb planned maintenance and unplanned failures without a user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
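An availability target only becomes actionable once it is converted into an error budget. The sketch below does that conversion for a 30-day window; the function name and the choice of a 30-day period are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of full outage the error budget permits over the period."""
    total_minutes = period_days * 24 * 60
    return (1 - availability_pct / 100) * total_minutes


for target in (99.9, 99.95, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.2f} min per 30 days")
```

Moving from 99.95% to 99.99% shrinks the monthly budget from about 21.6 minutes to about 4.3 minutes, which is less than a typical manual failover; that gap is the argument for automated rollback and a safe default multiplier.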
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.