This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Food Delivery Platform (DoorDash / Zomato).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What city scale — concurrent trips, drivers, location update rate? | Drives geo-index choice, matching QPS, and streaming ingestion throughput. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Three-sided marketplace (user, restaurant, driver)
- Order dispatch
- Real-time tracking
- Kitchen prep time estimation
- Batching orders
Out of scope (state explicitly)
- Full payment processing (#24)
- Turn-by-turn map rendering (#54)
- Driver/rider identity verification and background checks
Assumptions
- Single metro / region unless interviewer asks for multi-city
- Mobile clients with intermittent connectivity — server is source of truth
- Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Customer: Browse restaurants, view menus, place orders, track delivery in real-time
- Restaurant: Manage menu items, accept/reject orders, update order status (preparing → ready)
- Delivery Partner (Dasher): Go online/offline, accept delivery requests, navigate to pickup/dropoff
- Order lifecycle: Browse → Cart → Checkout → Restaurant Accept → Preparing → Ready → Picked Up → Delivered
- Search: Search restaurants by name, cuisine, location
- Ratings & Reviews: Rate restaurant and dasher after delivery
- Payment: Process payment, handle tips, restaurant payouts, dasher earnings
- Promotions: Coupons, discounts, free delivery offers
- ETA: Estimated delivery time considering prep time + pickup + travel
- High Availability: 99.99%: especially during meal rushes (lunch/dinner peaks)
- Low Latency: Search results < 200 ms, order placement < 1 s
- Real-time Tracking: Dasher location updates every 4 seconds
- Scalability: 30M+ orders/day, 1M+ restaurants, 5M+ dashers
- Consistency: Order and payment must be strongly consistent (no double charges, no lost orders)
- Fault Tolerant: Active orders must survive any single component failure
| Metric | Calculation | Value |
|---|---|---|
| Orders / day | Given | 30M |
| Orders / sec | 30M ÷ 86400 (+ peak factor) | ~350 (peak 1,500 during dinner) |
| Restaurants | Given | 1M |
| Active dashers at any time | Given | 500K |
| Dasher location updates / sec | 500K dashers ÷ 4s | 125K |
| Menu items total | Given | 50M |
| Search QPS | Given | 50K |
Restaurant & Menu Service
- Menu hierarchy: Restaurant → Menu Categories → Menu Items → Modifiers/Add-ons
- Dynamic menus: Items can be available/unavailable based on time of day, stock
- Search: Elasticsearch indexes restaurant name, cuisine type, menu items, tags
- Recommendation: "Popular near you", "Based on your past orders"
Order Service: The Core State Machine
- Idempotent transitions: Each state change has a unique event_id → replay-safe
- Saga pattern: Order placement involves multiple services (inventory, payment, restaurant, dasher). Use saga with compensating actions on failure
Dispatch / Assignment Service: Complex Optimization
Unlike Uber (1 rider → 1 driver), food delivery has a batching opportunity: one dasher can pick up orders from the same restaurant or nearby restaurants.
Matching Algorithm:
- When an order is ready (or near-ready), find candidate dashers:
- Available dashers within radius of restaurant
- Dashers with current deliveries whose route passes near the restaurant
- Score each candidate:
- Distance/ETA to restaurant
- Current delivery load (0, 1, or 2 active orders)
- Dasher's acceptance rate
- Customer priority (premium subscribers)
- Batching optimization: If two orders are from the same restaurant → assign to same dasher
- Assign top-scored dasher, send dispatch request (30-second timeout to accept)
Proactive dispatching: Start finding a dasher BEFORE the food is ready → dasher arrives at restaurant just as food is ready (reduces total delivery time)
- ETA model:
prep_time_estimate + travel_to_restaurant→ dispatch whenprep_time_remaining ≈ travel_to_restaurant
ETA Service
- Components of delivery ETA:
total_eta = prep_time + dasher_to_restaurant_time + restaurant_to_customer_time + buffer
- Prep time estimation: ML model trained on historical data per restaurant
- Features: restaurant, day of week, time of day, order complexity, current pending orders
- Travel time: Routing engine (OSRM/Google Maps) with real-time traffic
- Buffer: 5-minute padding for pickup logistics
Payment Service
- Payment flow: Authorization on order placement → Capture on delivery confirmation
- Split payment: Customer pays → platform takes commission → restaurant gets food amount → dasher gets delivery fee + tip
- Refunds: For missing items, late delivery, quality issues → automated rules + manual review
Dasher Service (Location + Geospatial Index)
- GPS ingestion: Dasher app sends location every 4 seconds via WebSocket → Dasher Service updates the geospatial index
- Geospatial index: Redis
GEOADD dashers:available {lng} {lat} {dasher_id}for finding nearby dashers →GEORADIUSqueries during dispatch - Dasher state: Redis hash
dasher:id→{"status": "available|picking_up|delivering", "active_orders": [order_ids], "lat", "lng", "last_updated"} - Kafka publishing: Every location update → Kafka
dasher-locationtopic → consumed by Real-Time Tracking Service and Analytics - Dasher availability management: When dasher goes online/offline, update availability index. When dispatched, set status to
picking_up→ remove from available pool
Real-Time Tracking Service
- Purpose: Stream live dasher location to the customer who is waiting for their order
- How:
- Consume dasher location events from Kafka
dasher-locationtopic - Filter: only forward locations for dashers with active deliveries
- Look up customer WebSocket connection for that order
- Push location update to customer app via WebSocket (every 4 seconds)
- Consume dasher location events from Kafka
- Customer UX: Map shows dasher's live position, estimated time of arrival, and distance
- Location trail storage: Write dasher route to Cassandra
delivery_trailtable (partition key = order_id, clustering key = timestamp): used for dispute resolution and analytics (TTL 30 days)
Notification Service
- Consumes order events from Kafka and triggers notifications to all parties:
- Customer: "Order confirmed", "Restaurant is preparing your food", "Dasher is on the way", "Arriving in 2 minutes", "Delivered! Rate your experience"
- Restaurant: "New order received" (with sound alert on tablet), "Dasher arriving for pickup"
- Dasher: "New delivery available" (with pickup/dropoff details and earnings estimate)
- Channels: Push notification (APNs/FCM), in-app alerts, SMS (for delivery confirmation to customers who don't have the app open)
- Urgency-based: "New order" to restaurant = high priority (immediate sound alert). "Rate your experience" = low priority (silent push, delayed 30 minutes after delivery)
Search Restaurants
GET /api/v1/restaurants?lat=37.77&lng=-122.42&cuisine=italian&sort=rating&radius=5kmPlace Order
POST /api/v1/orders
{
"restaurant_id": "rest-uuid",
"items": [
{"item_id": "item-1", "quantity": 2, "modifiers": ["extra cheese"]},
{"item_id": "item-2", "quantity": 1}
],
"delivery_address": {"lat": 37.77, "lng": -122.42, "address": "..."},
"payment_method_id": "pm-uuid",
"tip_amount": 5.00,
"promo_code": "SAVE20"
}
Response: 201 Created
{
"order_id": "order-uuid",
"status": "placed",
"estimated_delivery_at": "2026-03-13T11:15:00Z",
"total": {"subtotal": 28.00, "delivery_fee": 3.99, "tax": 2.52, "tip": 5.00, "discount": -5.60, "total": 33.91}
}Restaurant Accept/Reject
POST /api/v1/orders/{order_id}/accept
{ "estimated_prep_time_minutes": 20 }
POST /api/v1/orders/{order_id}/reject
{ "reason": "too_busy" }Track Order (Real-time)
GET /api/v1/orders/{order_id}/track
WebSocket: /ws/v1/orders/{order_id}/trackCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 440 Login Timeout: WebSocket session expired; reconnect required
PostgreSQL: Orders (ACID required)
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
restaurant_id UUID NOT NULL,
dasher_id UUID,
status VARCHAR(20),
delivery_address JSONB,
subtotal DECIMAL(10,2),
delivery_fee DECIMAL(10,2),
tax DECIMAL(10,2),
tip DECIMAL(10,2),
discount DECIMAL(10,2),
total DECIMAL(10,2),
promo_code VARCHAR(32),
estimated_prep_min INT,
estimated_delivery_at TIMESTAMP,
actual_delivered_at TIMESTAMP,
placed_at TIMESTAMP,
payment_status VARCHAR(20),
INDEX idx_customer (customer_id, placed_at DESC),
INDEX idx_restaurant (restaurant_id, placed_at DESC),
INDEX idx_dasher (dasher_id, placed_at DESC)
);
CREATE TABLE order_items (
order_item_id UUID PRIMARY KEY,
order_id UUID REFERENCES orders,
menu_item_id UUID,
item_name VARCHAR(256),
quantity INT,
unit_price DECIMAL(10,2),
modifiers JSONB,
special_notes TEXT
);MySQL: Restaurant & Menu
CREATE TABLE restaurants (
restaurant_id UUID PRIMARY KEY,
name VARCHAR(256),
cuisine_type VARCHAR(64),
address TEXT,
lat DECIMAL(10,7),
lng DECIMAL(10,7),
rating DECIMAL(2,1),
price_range TINYINT,
avg_prep_time INT,
is_active BOOLEAN,
hours JSON
);
CREATE TABLE menu_items (
item_id UUID PRIMARY KEY,
restaurant_id UUID,
category VARCHAR(64),
name VARCHAR(256),
description TEXT,
price DECIMAL(10,2),
image_url TEXT,
is_available BOOLEAN,
modifiers JSON,
INDEX idx_restaurant (restaurant_id)
);Redis: Cart, Dasher Geo, Dispatch
# Shopping cart (session-based)
Key: cart:{customer_id}
Value: Hash { restaurant_id, items: JSON, updated_at }
TTL: 3600
# Dasher geospatial
GEOADD dashers:available:{city} {lng} {lat} {dasher_id}
# Active orders per dasher
Key: dasher:orders:{dasher_id}
Value: Set of order_ids (max 2)Kafka Topics
Topic: order-events (placed, confirmed, preparing, ready, picked_up, delivered, cancelled) Topic: dasher-locations (high throughput) Topic: dispatch-requests (matching assignments)
| Concern | Solution |
|---|---|
| Order placed but payment fails | Saga: compensate by cancelling order, releasing restaurant capacity |
| Restaurant tablet offline | SMS/phone call fallback; auto-cancel after 5 min if not confirmed |
| Dasher goes offline mid-delivery | Detect via heartbeat timeout → reassign to new dasher |
| Duplicate order submission | Idempotency key on order placement API |
| Payment double charge | Payment authorization with idempotent capture |
| Peak load (dinner rush) | Auto-scale order service; queue orders in Kafka; circuit breaker on payment |
Saga Pattern for Order Placement
1. Create Order (Order Service) → Compensate: Delete Order 2. Authorize Payment (Payment Service) → Compensate: Release Authorization 3. Notify Restaurant (Restaurant Svc) → Compensate: Cancel Order at Restaurant 4. Dispatch Dasher (Dispatch Service) → Compensate: Cancel Dispatch If step 3 fails → execute compensations for steps 2, 1 (reverse order)
Dynamic Delivery Fees
delivery_fee = base_fee + distance_fee + surge_fee
base_fee = $2.99
distance_fee = $0.50 per mile beyond 3 miles
surge_fee = demand_multiplier * $1.00 (peak hours)Restaurant Capacity Management
- Restaurants can set max concurrent orders (e.g., 15)
- When at capacity → mark as "busy" in search results, increase estimated prep time
- Prevent overloading the kitchen during peak hours
Batched Delivery
- One dasher picks up from Restaurant A + Restaurant B (nearby) → delivers both
- Routing optimization: minimize total route distance
- Customer sees a slight delay but cheaper delivery (shared fee)
Fraud Detection
- Fake deliveries (dasher marks delivered without delivery)
- Multiple refund abuse by customers
- Detect via: GPS verification at delivery address, photo proof of delivery, ML anomaly detection
Dispatch Matching Algorithm: The Core Challenge
Given: 100 ready orders + 80 available dashers. Assign optimally. Naive: nearest dasher to each restaurant. Problem: globally suboptimal. Order A assigns Dasher X (closest). Order B's only option was also Dasher X. Now Order B waits much longer. Optimal: Hungarian Algorithm (bipartite matching, O(n³)): Build cost matrix: C[i][j] = cost of assigning dasher j to order i Cost = α * distance_to_restaurant + β * estimated_delivery_time + γ * dasher_utilization Solve for minimum total cost assignment. At 100 orders × 80 dashers: 100³ = 1M operations → < 10 ms. Fast enough. Run every 30 seconds to batch assignments (not per-order — batching is better globally). Practical: at scale (10K orders × 5K dashers in a city), Hungarian is too slow. Use: greedy approximation with local search refinement. Or: LP relaxation (linear programming) solved with simplex method. Real-time constraints: Food goes cold: max 5 min from "ready" to pickup → hard deadline Dasher at capacity: max 2 concurrent deliveries → constraint Dasher preferences: some dashers prefer short trips → soft constraint
Interview Walkthrough
- Split order lifecycle (state machine: placed → confirmed → picked up → delivered) from real-time tracking (location stream).
- Explain restaurant menu/catalog as read-heavy with Caching Patterns vs order writes requiring transactional consistency.
- Cover dispatch: match courier to order using geospatial search + ETA from a precomputed routing service.
- Discuss live tracking via WebSocket push of courier GPS — don't poll the API every second.
- Mention idempotent order placement and inventory reservation to prevent double-charging on retries.
- Common pitfall: coupling menu browse and order placement in one monolith DB — catalog reads starve checkout writes.
Race Conditions in Dispatch
Scenario: Two dispatch workers process different orders simultaneously.
Both score Dasher X as the best candidate for their respective orders.
Both attempt to assign Dasher X → one order gets the dasher, other waits.
Solution 1: Distributed Lock (Redis SETNX) ⭐
Before sending dispatch request:
SET dasher:lock:{dasher_id} {order_id} NX EX 30
NX = only if not exists (atomic)
EX = expires in 30 seconds (prevents deadlock on crash)
If SET succeeds → we "own" this dasher → send dispatch request
If SET fails → dasher already assigned → try next candidate
On dasher accept/decline → release lock
✓ Simple, fast (< 1ms Redis call)
✗ False exclusion: dasher locked for 30s even if first order is declined
Mitigation: 15-second accept window + immediate lock release on declineSolution 2: Optimistic Concurrency in DB
UPDATE dashers
SET current_order_count = current_order_count + 1,
version = version + 1
WHERE dasher_id = ? AND current_order_count < 2 AND version = ?
If rows_affected = 1 → success (dasher assigned)
If rows_affected = 0 → concurrent assignment happened → retry with next dasher
✓ No external lock service
✗ Higher latency than Redis (DB round-trip)
Solution 3: Centralized Dispatch Queue per Zone ⭐⭐
Partition city into zones (e.g., H3 resolution-7 hexagons)
Each zone has ONE dispatch worker (single-writer)
Worker consumes orders from Kafka partition for that zone
Single-threaded per zone → NO race conditions by design
✓ Zero lock contention
✓ Can do batch optimization (collect orders for 5 seconds, then assign)
✗ Zone boundary: order near boundary might miss dashers in adjacent zone
Mitigation: zones overlap by 1 km; dispatch worker checks neighboring zones
Recommended: Solution 3 for production (DoorDash's actual approach),
Solution 1 for simpler deploymentsOrder State Machine Failure Scenarios
Scenario 1: Restaurant tablet goes offline after CONFIRMED
Detection: No heartbeat from tablet for 2 minutes
Action:
T+2min: Send SMS to restaurant phone number
T+5min: Auto-call restaurant landline (IVR: "Press 1 to confirm order")
T+8min: If still no response → auto-cancel order
→ Compensate: refund customer, release dasher assignment
→ Notify customer: "Restaurant unreachable, order cancelled. Here are similar restaurants."
Edge case: Restaurant was preparing food but tablet died
→ When tablet reconnects, check if order was auto-cancelled
→ If food was prepared → restaurant eats the cost (SLA violation)
→ Platform may compensate restaurant for confirmed-then-cancelled orders
Scenario 2: Dasher app crashes during PICKED_UP
Detection: No GPS heartbeat for 3 minutes
Action:
T+3min: Send push notification + SMS to dasher
T+5min: Call dasher's phone
T+8min: If unreachable → reassign to new dasher
→ Challenge: new dasher doesn't have the food!
→ If dasher had the food and went offline → customer gets refund
→ If dasher reconnects → resume delivery (update ETA)
Prevention: "Proof of pickup" — dasher confirms pickup code from restaurant
→ System knows food was picked up → different handling than pre-pickup crash
Scenario 3: Payment capture fails after delivery
Flow: Payment authorized at order placement, captured on delivery confirmation
If capture fails (expired card, bank decline):
→ Retry capture 3 times with exponential backoff
→ If all retries fail → queue for manual payment recovery
→ Restaurant and dasher STILL get paid (platform absorbs the loss)
→ Never re-charge customer for same order (idempotency_key on capture)
→ Flag customer account for future orders (require upfront payment)
Scenario 4: Customer cancels after food is being prepared (PREPARING state)
Cancellation policy based on state:
PLACED (not confirmed): Full refund, no penalty
CONFIRMED (not preparing): Full refund, restaurant gets small fee ($2)
PREPARING: Partial refund (50%), restaurant keeps food cost
READY / PICKED_UP: No refund (food is ready, dasher is en route)
Who pays for the food?
Platform absorbs cost for CONFIRMED cancellations (acquisition cost)
Customer pays for PREPARING+ cancellations (food already in progress)
Restaurant always receives payment for food they've started preparingReal-Time Tracking Data Flow
GPS Optimization (Adaptive Frequency)
- Dasher stationary (speed = 0): Update every 30 seconds (saves battery).
- Dasher walking to restaurant (< 5 km/h): Update every 10 seconds.
- Dasher driving (> 10 km/h): Update every 4 seconds.
- Dasher approaching customer (< 500m): Update every 2 seconds for highly precise final ETA.
GPS Jitter & Noise Handling (Kalman Filtering)
Raw mobile GPS suffers from a ±10–30 meter inaccuracy due to atmospheric distortions and urban canyons (multipath reflection off skyscrapers). Without smoothing, the UI would show a dasher "jumping" between buildings or crossing rivers erratically.
We apply a Kalman Filter to predict the next logical coordinates using the dasher's previous speed, heading, and current reading, heavily discounting GPS measurements when the sensor reports low confidence.
Map Snapping
After Kalman filtering, the coordinate stream is snapped onto the nearest road segment utilizing the local street map graph database (e.g., OpenStreetMap or Google Roads API) ensuring the vehicle is cleanly displayed driving down active streets.
Dynamic ETA Updates
Delivery ETAs are not simple linear estimates. We periodically recalculate the remaining trip length by querying real-time traffic segments on the route, adding a 2–3 minute buffer for parking logistics, elevator transits, and building access control.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core food delivery flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.