This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design Order Management System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Order state machine, Fulfillment orchestration, Return/refund flow? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Order state machine
- Fulfillment orchestration
- Return/refund flow
- Event-driven status updates
- Capacity estimation with shown math
Out of scope (state explicitly)
- Recommendation engine (#48)
- Review/rating system (#70)
- Warehouse management (WMS) internals
Assumptions
- Strong consistency required on money/inventory paths — clarify idempotency early
- External PSP or bank APIs exist; design integration boundaries only
- 99.99% availability target for the commit/authorize path
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Place order: Create order from cart with shipping address, payment method, and delivery preferences
- Order lifecycle: Track states: placed, payment_confirmed, processing, shipped, out_for_delivery, delivered, returned, cancelled
- Multi-item orders: Orders with items from multiple sellers/warehouses (split shipments)
- Order tracking: Real-time shipment tracking with carrier integration (FedEx, UPS, USPS)
- Returns & refunds: Initiate return, generate return label, process refund on receipt
- Order history: View past orders with search, filter, and reorder capability
- Notifications: Email/SMS/push at each state transition
- Invoice generation: Generate PDF invoices for each order
- Cancellation: Cancel order before shipment; partial cancellation for multi-item orders
- Order modification: Change shipping address or delivery date before processing cutoff
Non-Functional Requirements
- Strong Consistency: Order state transitions must be ACID: no lost orders, no double charges
- Availability: 99.99%: order placement is revenue-critical
- Low Latency: Order placement in < 2 seconds (including payment)
- Scalability: 10M+ orders/day; 50M+ order status checks/day
- Idempotent: Duplicate order submission must not create two orders
- Auditability: Every state change immutably logged
- Durability: Order data retained for 7+ years (regulatory compliance)
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Orders / day | Given (assumption documented in value) | 10M |
| Orders / sec | 10M ÷ 86400 | ~115 (peak 10K during sales) |
| Order status checks / day | Given (assumption documented in value) | 50M |
| Avg items per order | Given (typical workload assumption) | 3 |
| Order record size | Given | ~5 KB (with items, addresses, payment) |
| Storage / day | 10M x 5 KB | 50 GB |
| Storage / year | 50 GB x 365 | ~18 TB |
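The arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch that uses only the stated assumptions (10M orders/day, 50M status checks/day, ~5 KB per order, 7-year retention); the variable names are illustrative and decimal units are assumed (1 GB = 10^6 KB).

```python
# Inputs are the stated assumptions from the capacity table.
ORDERS_PER_DAY = 10_000_000
STATUS_CHECKS_PER_DAY = 50_000_000
ORDER_RECORD_KB = 5
SECONDS_PER_DAY = 86_400
RETENTION_YEARS = 7

# Average request rates (peaks are a separate assumption, not derived here).
avg_orders_per_sec = ORDERS_PER_DAY / SECONDS_PER_DAY
avg_status_checks_per_sec = STATUS_CHECKS_PER_DAY / SECONDS_PER_DAY
read_write_ratio = STATUS_CHECKS_PER_DAY / ORDERS_PER_DAY

# Storage, using decimal units: 1 GB = 1e6 KB, 1 TB = 1e3 GB.
storage_gb_per_day = ORDERS_PER_DAY * ORDER_RECORD_KB / 1_000_000
storage_tb_per_year = storage_gb_per_day * 365 / 1_000
storage_tb_retained = storage_tb_per_year * RETENTION_YEARS

print(f"orders/sec (avg):        {avg_orders_per_sec:.1f}")
print(f"status checks/sec (avg): {avg_status_checks_per_sec:.1f}")
print(f"read:write ratio:        {read_write_ratio:.0f}:1")
print(f"storage/day:             {storage_gb_per_day:.0f} GB")
print(f"storage/year:            {storage_tb_per_year:.2f} TB")
print(f"storage over retention:  {storage_tb_retained:.2f} TB")
```

The retained-storage figure covers raw order rows only; indexes, replicas, and the state log would add to it.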
Order State Machine
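One way to make the state machine concrete is a transition table checked on every state change. The sketch below is illustrative only: the state names come from the order lifecycle list and the saga steps in this sheet, but the specific allowed transitions are assumptions, and `transition`, `can_transition`, and the log record shape are hypothetical helpers.

```python
from enum import Enum


class OrderState(str, Enum):
    PLACED = "placed"
    PAYMENT_CONFIRMED = "payment_confirmed"
    PAYMENT_FAILED = "payment_failed"
    PROCESSING = "processing"
    SHIPPED = "shipped"
    OUT_FOR_DELIVERY = "out_for_delivery"
    DELIVERED = "delivered"
    RETURNED = "returned"
    CANCELLED = "cancelled"


S = OrderState

# Assumed transition table; terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    S.PLACED: {S.PAYMENT_CONFIRMED, S.PAYMENT_FAILED, S.CANCELLED},
    S.PAYMENT_CONFIRMED: {S.PROCESSING, S.CANCELLED},
    S.PROCESSING: {S.SHIPPED, S.CANCELLED},
    S.SHIPPED: {S.OUT_FOR_DELIVERY},
    S.OUT_FOR_DELIVERY: {S.DELIVERED},
    S.DELIVERED: {S.RETURNED},
    S.PAYMENT_FAILED: set(),
    S.RETURNED: set(),
    S.CANCELLED: set(),
}


class InvalidTransition(Exception):
    pass


def can_transition(current, target):
    """Return True if the assumed table allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]


def transition(current, target, state_log, actor="system"):
    """Validate the move, append an audit record, and return the new state."""
    if not can_transition(current, target):
        raise InvalidTransition(f"{current.value} -> {target.value}")
    state_log.append(
        {"from_state": current.value, "to_state": target.value, "actor": actor}
    )
    return target


# Usage: walk an order through two transitions and keep the audit trail.
state_log = []
state = transition(S.PLACED, S.PAYMENT_CONFIRMED, state_log)
state = transition(state, S.PROCESSING, state_log)
```

In a real service the audit record would be written in the same database transaction as the status update, so the log cannot drift from the order row.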
Event Bus Design (Kafka)
Topic: order_management_system-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "order_management_system-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: order_management_system-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
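The consumer-side guarantees (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without any Kafka client. Everything below is a hypothetical in-memory stand-in: `EventProcessor`, the `seen_event_ids` set (a durable store in practice), and the `dlq` list are illustrative names, and "3 retries" is read as one initial attempt plus three retries.

```python
MAX_RETRIES = 3  # retries after the first attempt, per the DLQ rule


class EventProcessor:
    """Idempotent handler wrapper: dedup by event_id, dead-letter on repeated failure."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()  # stand-in for a durable dedup store
        self.dlq = []                # stand-in for the DLQ topic

    def process(self, event):
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"  # redelivery of an already-handled event
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # a real consumer would back off before retrying
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dlq.append(event)  # poison message: park it for inspection
        return "dead_lettered"


# Usage: a healthy handler sees each event once; a failing one lands in the DLQ.
handled = []
ok_processor = EventProcessor(lambda e: handled.append(e["event_id"]))
first = ok_processor.process({"event_id": "e1"})
second = ok_processor.process({"event_id": "e1"})


def always_fails(event):
    raise ValueError("cannot parse event")


bad_processor = EventProcessor(always_fails)
third = bad_processor.process({"event_id": "e2"})
```

Marking the event as seen only after the handler succeeds keeps the at-least-once property: a crash mid-handler leads to a retry rather than a silently dropped event.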
Order Placement — Saga Pattern
Placing an order involves multiple services, so it uses the SAGA pattern rather than 2PC. Step 1 creates the order, Step 2 reserves inventory, Step 3 charges payment, and Step 4 confirms. On failure, compensating transactions roll back the completed steps in reverse order:
Step 1: Create order (Order Service)
INSERT order with status = 'PLACED'
Step 2: Reserve inventory (Inventory Service)
For each item: reserve stock at selected warehouse
If ANY item out of stock --> compensate: cancel order, release other reservations
Step 3: Charge payment (Payment Service)
Authorize + capture payment
If payment fails --> compensate: release all inventory reservations, mark order PAYMENT_FAILED
Step 4: Confirm order (Order Service)
Update status = 'PAYMENT_CONFIRMED'
Publish order-confirmed event
Step 5 (async): Generate invoice, send confirmation email, update analytics
Saga compensation (rollback):
If Step 3 fails (payment declined):
- Undo Step 2: release inventory reservations
- Undo Step 1: mark order as PAYMENT_FAILED
- Notify user: "Payment failed. Your items are still in your cart."
If Step 2 fails (out of stock):
- Undo Step 1: mark order as CANCELLED
- Notify user: "Sorry, [item] is no longer available."
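The steps and compensations above can be sketched as a small orchestrator. This is a simplified illustration, not a production saga engine: `run_saga`, `StepFailed`, and the in-memory `order` dict are hypothetical, and real steps would be remote calls with persisted saga state.

```python
class StepFailed(Exception):
    pass


def run_saga(steps):
    """Run (name, action, compensate) steps; on failure undo completed ones in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except StepFailed:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo(name)  # the compensation is told which step failed
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, undo_for(compensate)))
    return True, log


def undo_for(compensate):
    """Small adapter so every compensation has the same call shape."""
    return lambda failed_step: compensate(failed_step)


# In-memory stand-in for the order row and the inventory reservation.
order = {"status": None, "reserved": False}


def create_order():
    order["status"] = "PLACED"


def undo_create(failed_step):
    # Matches the rules above: payment failure vs out-of-stock.
    order["status"] = "PAYMENT_FAILED" if failed_step == "charge_payment" else "CANCELLED"


def reserve_inventory():
    order["reserved"] = True


def release_inventory(failed_step):
    order["reserved"] = False


def charge_payment():
    raise StepFailed("card declined")  # simulate the Step 3 failure


def confirm_order():
    order["status"] = "PAYMENT_CONFIRMED"


ok, saga_log = run_saga([
    ("create_order", create_order, undo_create),
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, lambda failed_step: None),
    ("confirm_order", confirm_order, lambda failed_step: None),
])
```

With the simulated payment decline, the inventory release runs before the order is marked PAYMENT_FAILED, which is the reverse of the order in which the steps completed.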
Why SAGA (not distributed transaction / 2PC)?
2PC: requires all services to hold locks simultaneously --> blocks everything if one is slow
SAGA: each step commits independently; compensating transactions undo on failure
At e-commerce scale, SAGA is the industry standard (Amazon, Shopify, etc.)
Split Shipments
Example: An order with 3 items sourced from different warehouses is automatically split:
- Item A: only available at WH-NYC
- Item B: only available at WH-LAX
- Item C: available at both warehouses, allocated to WH-NYC (proximity)
Result: 2 independent shipments:
- Shipment 1 (WH-NYC): Item A + Item C → FedEx tracking #12345
- Shipment 2 (WH-LAX): Item B → UPS tracking #67890
Each shipment has its own tracking, carrier, state machine, and EDD. Parent order state is derived:
- If any shipment SHIPPED → order = PARTIALLY_SHIPPED
- All shipments DELIVERED → order = DELIVERED
- Partial cancellation cancels one shipment independently.
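The split and the derived parent state can be sketched as two small functions. This is an illustration under assumptions: `split_by_warehouse` and `derive_order_state` are hypothetical names, the allocation decision (which warehouse serves each item) is taken as input, and the SHIPPED and PROCESSING outcomes for cases the rules above do not cover are assumed.

```python
def split_by_warehouse(allocations):
    """Group items into shipments, given an item -> warehouse allocation."""
    shipments = {}
    for item, warehouse in allocations.items():
        shipments.setdefault(warehouse, []).append(item)
    return shipments


def derive_order_state(shipment_states):
    """Derive the parent order state from its shipments' states."""
    active = [s for s in shipment_states if s != "CANCELLED"]
    if not active:
        return "CANCELLED"  # every shipment was cancelled
    if all(s == "DELIVERED" for s in active):
        return "DELIVERED"
    in_motion = [s in ("SHIPPED", "DELIVERED") for s in active]
    if all(in_motion):
        return "SHIPPED"  # assumption: all shipments have left the warehouse
    if any(in_motion):
        return "PARTIALLY_SHIPPED"
    return "PROCESSING"  # assumption: nothing has shipped yet


# Usage with the example above: Item C was allocated to WH-NYC.
shipments = split_by_warehouse({"A": "WH-NYC", "B": "WH-LAX", "C": "WH-NYC"})
partial = derive_order_state(["SHIPPED", "PENDING"])
delivered = derive_order_state(["DELIVERED", "DELIVERED"])
after_partial_cancel = derive_order_state(["DELIVERED", "CANCELLED"])
```

Ignoring cancelled shipments when deriving the parent state is what lets a partial cancellation leave the rest of the order to complete normally.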
Idempotent Order Placement
Problem: User clicks "Place Order" and network times out. User clicks again.
Without idempotency: two identical orders created, charged twice.
Solution:
Client generates idempotency_key (UUID) on checkout page load.
Both clicks send same idempotency_key.
Server:
1. Check Redis: GET idempotency:{key}
If exists --> return cached order (already processed)
2. Check PostgreSQL: SELECT order_id FROM orders WHERE idempotency_key = ?
If exists --> return existing order
3. If neither exists --> proceed with order creation (the UNIQUE constraint on idempotency_key rejects a concurrent duplicate insert)
4. After creation: SET idempotency:{key} {order_id} in Redis (TTL 24h)
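The server-side steps above can be sketched with in-memory stand-ins. This is a simplified illustration: `OrderStore` is hypothetical, its `cache` dict plays the role of Redis (TTL handling omitted), and `orders_by_key` plays the role of the UNIQUE idempotency_key column in PostgreSQL.

```python
import uuid


class OrderStore:
    """In-memory stand-ins for the Redis cache and the orders table."""

    def __init__(self):
        self.cache = {}          # idempotency key -> order_id (Redis stand-in)
        self.orders_by_key = {}  # idempotency key -> order_id (UNIQUE column stand-in)
        self.orders = {}         # order_id -> payload

    def place_order(self, idempotency_key, payload):
        """Return (order_id, created). created is False for a replayed request."""
        # Step 1: fast path, the key was seen recently.
        cached = self.cache.get(idempotency_key)
        if cached is not None:
            return cached, False
        # Step 2: cache miss (e.g. expired or evicted), fall back to the database.
        existing = self.orders_by_key.get(idempotency_key)
        if existing is not None:
            self.cache[idempotency_key] = existing
            return existing, False
        # Step 3: first time we see this key, create the order.
        order_id = str(uuid.uuid4())
        self.orders_by_key[idempotency_key] = order_id
        self.orders[order_id] = payload
        # Step 4: remember the key so retries short-circuit.
        self.cache[idempotency_key] = order_id
        return order_id, True


# Usage: two clicks with the same key yield one order.
store = OrderStore()
first_id, first_created = store.place_order("order-abc-123", {"total": 109.97})
retry_id, retry_created = store.place_order("order-abc-123", {"total": 109.97})

# Simulate cache loss: the database lookup still prevents a duplicate.
store.cache.clear()
late_id, late_created = store.place_order("order-abc-123", {"total": 109.97})
```

The cache is only an optimization here; correctness rests on the database lookup and, under concurrency, on the uniqueness constraint.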
Result: exactly one order, regardless of retries.
Place Order
POST /api/v1/orders
Idempotency-Key: "order-abc-123"
{
"items": [
{"sku_id": "SKU-123", "quantity": 2, "price": 29.99},
{"sku_id": "SKU-456", "quantity": 1, "price": 49.99}
],
"shipping_address": { "street": "...", "city": "...", "zip": "...", "country": "US" },
"payment_method_id": "pm-uuid",
"delivery_preference": "standard"
}
Response: 201 Created
{
"order_id": "order-uuid",
"status": "payment_pending",
"estimated_delivery": "2026-03-18",
"total": 109.97,
"shipments": [
{"shipment_id": "ship-1", "items": ["SKU-123","SKU-456"], "warehouse": "WH-NYC"}
]
}
Get Order Status
GET /api/v1/orders/{order_id}
Response: 200 OK
{
"order_id": "order-uuid",
"status": "shipped",
"placed_at": "2026-03-14T10:00:00Z",
"items": [...],
"shipments": [
{"shipment_id": "ship-1", "status": "in_transit", "carrier": "FedEx",
"tracking_number": "794644790132", "estimated_delivery": "2026-03-18",
"tracking_url": "https://fedex.com/track?id=794644790132"}
],
"payment": { "method": "Visa ending 4242", "amount": 109.97, "status": "captured" }
}
Cancel Order
POST /api/v1/orders/{order_id}/cancel
{ "reason": "changed_mind" }
Response: 200 OK
{ "status": "cancelled", "refund_amount": 109.97, "refund_status": "processing" }
Initiate Return
POST /api/v1/orders/{order_id}/return
{
"items": [{"sku_id": "SKU-123", "quantity": 1, "reason": "defective"}],
"return_method": "mail"
}
Response: 200 OK
{
"return_id": "ret-uuid",
"return_label_url": "https://s3.../return-label.pdf",
"refund_amount": 29.99,
"refund_status": "pending_return_receipt"
}
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 402 Payment Required: insufficient funds
- 502 Bad Gateway: payment provider timeout; poll status endpoint
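A client can turn the retry guidance in this list into a small policy function. The sketch below is one possible reading, not a prescribed client: `retry_delay`, the base and cap values, and the use of full jitter are assumptions; 502 is deliberately not retried because the list says to poll the status endpoint instead.

```python
import random

# Status codes the list above marks as safe to retry (with the same idempotency key).
RETRYABLE = {429, 500, 503}


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Return seconds to wait before retry number `attempt` (0-based), or None to stop."""
    if status not in RETRYABLE:
        return None  # 4xx business errors and 502 are not blindly retried
    if status == 429 and retry_after is not None:
        return float(retry_after)  # honor the server's Retry-After value
    backoff = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return backoff * rng()  # full jitter spreads out synchronized retries


# Usage with a fixed rng so the numbers are easy to read.
no_jitter = lambda: 1.0
first_wait = retry_delay(503, attempt=0, rng=no_jitter)
fourth_wait = retry_delay(503, attempt=3, rng=no_jitter)
capped_wait = retry_delay(500, attempt=10, rng=no_jitter)
rate_limited_wait = retry_delay(429, attempt=0, retry_after=7)
```

Every retry should reuse the original Idempotency-Key so that a request which actually succeeded server-side is not applied twice.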
PostgreSQL: Source of Truth
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
user_id UUID NOT NULL,
status VARCHAR(30) NOT NULL DEFAULT 'placed',
subtotal DECIMAL(10,2),
tax DECIMAL(10,2),
shipping_cost DECIMAL(10,2),
total DECIMAL(10,2),
currency CHAR(3) DEFAULT 'USD',
shipping_address JSONB,
payment_method_id VARCHAR(64),
payment_status VARCHAR(20),
idempotency_key VARCHAR(64) UNIQUE,
placed_at TIMESTAMPTZ DEFAULT NOW(),
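updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- PostgreSQL does not accept inline INDEX clauses; indexes are created separately.
CREATE INDEX idx_orders_user ON orders (user_id, placed_at DESC);
CREATE INDEX idx_orders_status ON orders (status);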
CREATE TABLE order_items (
item_id BIGSERIAL PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(order_id),
sku_id VARCHAR(50) NOT NULL,
quantity INT NOT NULL,
unit_price DECIMAL(10,2),
total_price DECIMAL(10,2),
shipment_id UUID,
status VARCHAR(20) DEFAULT 'active'
);
CREATE INDEX idx_order_items_order ON order_items (order_id);
CREATE TABLE shipments (
shipment_id UUID PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(order_id),
warehouse_id VARCHAR(50),
carrier VARCHAR(20),
tracking_number VARCHAR(64),
status VARCHAR(30) DEFAULT 'pending',
shipped_at TIMESTAMPTZ,
delivered_at TIMESTAMPTZ
);
CREATE INDEX idx_shipments_order ON shipments (order_id);
CREATE INDEX idx_shipments_tracking ON shipments (tracking_number);
CREATE TABLE order_state_log (
log_id BIGSERIAL PRIMARY KEY,
order_id UUID NOT NULL,
from_state VARCHAR(30),
to_state VARCHAR(30) NOT NULL,
actor VARCHAR(50),
reason TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_order_state_log_order ON order_state_log (order_id, created_at);
CREATE TABLE returns (
return_id UUID PRIMARY KEY,
order_id UUID NOT NULL,
user_id UUID NOT NULL,
status VARCHAR(20) DEFAULT 'initiated',
refund_amount DECIMAL(10,2),
reason TEXT,
return_label_url TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_returns_order ON returns (order_id);
Redis: Caching + Idempotency
# Order status cache (avoid DB hit for frequent polling)
order_status:{order_id} --> Hash { status, tracking, updated_at }
TTL: 300
# Idempotency (prevent duplicate order placement)
idempotency:{key} --> order_id (if already processed)
TTL: 86400
# User's recent orders (for quick display)
user_orders:{user_id} --> List of order_ids (last 20)
TTL: 3600
Kafka Topics
Topic: order-events (placed, confirmed, shipped, delivered -- all state transitions)
Topic: payment-events (payment success/failure -- consumed by order service)
Topic: shipment-events (tracking updates from carrier webhooks)
Topic: return-events (return initiated, received, refunded)
| Concern | Solution |
|---|---|
| Duplicate order | Idempotency key (unique per checkout attempt); DB UNIQUE constraint |
| Payment succeeds but order update fails | Saga with compensation: payment service publishes event; order service consumes and updates; if missed, reconciliation job retries |
| Inventory reserved but payment fails | Compensation step releases inventory; TTL-based reservation auto-expires |
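| State machine corruption | Valid transitions enforced in code; DB CHECK constraint restricts status to known values (a CHECK cannot see the previous state, so enforcing transitions in the DB would need a trigger); state_log is immutable audit trail |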
| Carrier webhook missed | Poll carrier API every 30 min for orders in SHIPPED state; reconcile |
| DB failover | PostgreSQL synchronous replication; promote standby with zero data loss |
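The "TTL-based reservation auto-expires" safeguard from the table can be sketched as follows. This is an in-memory illustration under assumptions: `ReservationBook` is hypothetical, the 600-second TTL is an arbitrary example value, each order holds a single SKU for brevity, and time is passed in explicitly to keep the example deterministic.

```python
class ReservationBook:
    """Tracks available stock and time-limited reservations per order."""

    def __init__(self, stock, ttl_seconds=600):
        self.stock = dict(stock)   # sku -> available units
        self.ttl = ttl_seconds
        self.reservations = {}     # order_id -> (sku, qty, expires_at)

    def expire(self, now):
        """Return stock from reservations whose TTL has passed."""
        expired = [oid for oid, (_, _, exp) in self.reservations.items() if exp <= now]
        for order_id in expired:
            self.release(order_id)

    def reserve(self, order_id, sku, qty, now):
        self.expire(now)  # lazily reclaim abandoned reservations first
        if self.stock.get(sku, 0) < qty:
            return False
        self.stock[sku] -= qty
        self.reservations[order_id] = (sku, qty, now + self.ttl)
        return True

    def release(self, order_id):
        """Compensation step: put reserved units back (no-op if already gone)."""
        reservation = self.reservations.pop(order_id, None)
        if reservation is not None:
            sku, qty, _ = reservation
            self.stock[sku] += qty

    def confirm(self, order_id):
        """Payment succeeded: drop the reservation, stock stays decremented."""
        self.reservations.pop(order_id, None)


# Usage: order o1 reserves everything, never pays, and its hold expires.
book = ReservationBook({"SKU-123": 2}, ttl_seconds=600)
o1_reserved = book.reserve("o1", "SKU-123", 2, now=0)
o2_too_early = book.reserve("o2", "SKU-123", 1, now=100)
o2_after_expiry = book.reserve("o2", "SKU-123", 1, now=600)
```

Because `release` is a no-op for an unknown order, the explicit compensation and the TTL expiry can both fire without double-counting stock.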
Interview Walkthrough
- Lead with the order state machine: PLACED → PAYMENT_CONFIRMED → PROCESSING → SHIPPED → DELIVERED. Each transition is an event, not a silent UPDATE.
- Explain saga orchestration over 2PC: reserve inventory → charge payment → create shipment, with compensating transactions on failure.
- Cover idempotency keys on place-order — at 10M orders/day even 1% retries means 100K potential duplicates without dedup.
- Walk through PostgreSQL as write path (ACID state transitions) and Elasticsearch as read path (customer-service search via CDC).
- Mention Kafka events for downstream: warehouse pick lists, email notifications, analytics — none block the synchronous order response.
- Discuss cancellation rules: only allow cancel before SHIPPED; release inventory reservation and initiate refund as compensating steps.
- Common pitfall: using 2PC across inventory, payment, and shipping — one slow service blocks all participants and the coordinator becomes a SPOF.
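The cancellation rule from the walkthrough (cancel only before SHIPPED, then release inventory and refund) can be expressed as a small planning function. This is a sketch under assumptions: `plan_cancellation` and the action names are hypothetical, and the set of pre-shipment states is inferred from the order lifecycle list.

```python
# States assumed to precede shipment, taken from the order lifecycle list.
CANCELLABLE_STATES = {"placed", "payment_confirmed", "processing"}


def plan_cancellation(status, payment_captured):
    """Return the compensating steps for a cancel request, or None if not allowed."""
    if status not in CANCELLABLE_STATES:
        return None  # shipped or later: the customer must use the return flow
    actions = ["release_inventory"]
    if payment_captured:
        actions.append("initiate_refund")  # nothing to refund if never charged
    return actions


# Usage
before_payment = plan_cancellation("placed", payment_captured=False)
before_shipping = plan_cancellation("processing", payment_captured=True)
after_shipping = plan_cancellation("shipped", payment_captured=True)
```

Returning a plan rather than performing the side effects keeps the rule easy to test; an orchestrator would execute the listed steps and record the state change.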
Saga vs 2PC for Distributed Order Placement
2PC (Two-Phase Commit):
Coordinator asks all services: "Can you commit?"
All say yes --> "Commit." All commit atomically.
Problems at e-commerce scale:
- Coordinator is SPOF
- All services hold locks during prepare phase --> latency, blocking
- If any service is slow --> ALL are blocked
- Not supported across heterogeneous systems (different DBs, services)
Saga (Choreography or Orchestration):
Each step commits locally. Failures trigger compensating transactions.
Choreography: each service publishes event; next service reacts
Order created --> Inventory listens, reserves stock --> Payment listens, charges
Loose coupling but hard to debug (no central view of saga progress)
Orchestration (recommended): central orchestrator coordinates steps
Order Service calls Inventory, then Payment, then Shipping
On failure: orchestrator calls compensating actions in reverse order
Easy to monitor, debug, and modify
Trade-off:
2PC: strong consistency, but fragile and slow at scale
Saga: eventual consistency during saga execution, but resilient and fast
For e-commerce: Saga is the clear winner. Brief inconsistency during
the 2-second order placement window is acceptable.
Order Search: Why Elasticsearch
Customer service needs: "Find all orders for user X with product Y shipped in March"
PostgreSQL can do this, but:
- Full-text search across product names, addresses is slow
- Complex compound filters with pagination are expensive
- At 10M orders/day, queries slow down without heavy indexing
Elasticsearch:
- Index order data (denormalized): order_id, user, items, status, dates
- Support full-text search + filters + aggregations
- Sub-200ms response for complex queries
CDC pipeline: PostgreSQL --> Debezium --> Kafka --> ES consumer --> Elasticsearch
Lag: < 5 seconds from order placed to searchable in ES
Use PostgreSQL for: order placement, state transitions (ACID)
Use Elasticsearch for: search, customer service dashboard, analytics queries
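The "ES consumer" stage of the CDC pipeline mostly flattens normalized rows into one searchable document per order. The sketch below shows only that transformation; `build_search_document` is a hypothetical helper, the input dicts mirror the column names in the PostgreSQL schema, and no Elasticsearch client calls are shown.

```python
def build_search_document(order, items, shipments):
    """Flatten an order row plus its item and shipment rows into one document."""
    return {
        "order_id": order["order_id"],
        "user_id": order["user_id"],
        "status": order["status"],
        "placed_at": order["placed_at"],
        "total": order["total"],
        # De-duplicated and sorted so repeated CDC events produce identical documents.
        "sku_ids": sorted({item["sku_id"] for item in items}),
        "item_count": sum(item["quantity"] for item in items),
        "carriers": sorted({s["carrier"] for s in shipments if s.get("carrier")}),
        "tracking_numbers": [
            s["tracking_number"] for s in shipments if s.get("tracking_number")
        ],
    }


# Usage with rows shaped like the schema above.
doc = build_search_document(
    order={
        "order_id": "order-uuid",
        "user_id": "user-uuid",
        "status": "shipped",
        "placed_at": "2026-03-14T10:00:00Z",
        "total": 109.97,
    },
    items=[
        {"sku_id": "SKU-123", "quantity": 2},
        {"sku_id": "SKU-456", "quantity": 1},
    ],
    shipments=[
        {"carrier": "FedEx", "tracking_number": "794644790132"},
        {"carrier": None, "tracking_number": None},  # not yet handed to a carrier
    ],
)
```

Using order_id as the document ID and rebuilding the whole document on each change keeps indexing idempotent, which matters because the pipeline delivers events at least once.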
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core order management system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
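An availability target translates directly into an error budget. The sketch below does that arithmetic; the function names and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    allowed_fraction = (100 - slo_percent) / 100
    return allowed_fraction * window_days * 24 * 60


def burn_rate(observed_error_rate, slo_percent):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    allowed_error_rate = (100 - slo_percent) / 100
    return observed_error_rate / allowed_error_rate


# Usage: the table's 99.95% target versus the 99.99% order-placement target.
budget_9995 = round(error_budget_minutes(99.95), 2)
budget_9999 = round(error_budget_minutes(99.99), 2)
# A sustained 0.1% error rate against a 99.95% SLO spends budget twice as fast as allowed.
current_burn = round(burn_rate(0.001, 99.95), 2)
```

Over 30 days this gives roughly 21.6 minutes of budget at 99.95% and about 4.3 minutes at 99.99%, which is why the stricter target usually rules out manual failover.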
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.