Design an Order Management System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design an Order Management System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Order state machine, Fulfillment orchestration, Return/refund flow?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Order state machine
Fulfillment orchestration
Return/refund flow
Event-driven status updates
Capacity estimation with shown math

Out of scope (state explicitly)

Recommendation engine (#48)
Review/rating system (#70)
Warehouse management (WMS) internals

Assumptions

Strong consistency required on money/inventory paths — clarify idempotency early
External PSP or bank APIs exist; design integration boundaries only
99.99% availability target for the commit/authorize path

Place order: Create order from cart with shipping address, payment method, and delivery preferences
Order lifecycle: Track states: placed, payment_confirmed, processing, shipped, out_for_delivery, delivered, returned, cancelled
Multi-item orders: Orders with items from multiple sellers/warehouses (split shipments)
Order tracking: Real-time shipment tracking with carrier integration (FedEx, UPS, USPS)
Returns & refunds: Initiate return, generate return label, process refund on receipt
Order history: View past orders with search, filter, and reorder capability
Notifications: Email/SMS/push at each state transition
Invoice generation: Generate PDF invoices for each order
Cancellation: Cancel order before shipment; partial cancellation for multi-item orders
Order modification: Change shipping address or delivery date before processing cutoff

Metric	Calculation	Value
Orders / day	Given (assumption)	10M
Orders / sec	10M ÷ 86,400 s	~116 average (assume 10K peak during sales)
Order status checks / day	Given (assumption)	50M
Avg items per order	Given (typical workload assumption)	3
Order record size	Given (assumption)	~5 KB (with items, addresses, payment)
Storage / day	10M x 5 KB	50 GB
Storage / year	50 GB x 365	~18 TB
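The arithmetic behind the table can be reproduced in a few lines. This is a sketch that uses only the stated assumptions (10M orders/day, 50M status checks/day, 5 KB per order, 10K peak orders/sec) and decimal units (1 GB = 1,000,000 KB); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the table above (all of them are assumptions).
orders_per_day = 10_000_000
status_checks_per_day = 50_000_000
order_record_kb = 5
peak_orders_per_sec = 10_000

# Derived values.
orders_per_sec = orders_per_day / SECONDS_PER_DAY
status_checks_per_sec = status_checks_per_day / SECONDS_PER_DAY
storage_gb_per_day = orders_per_day * order_record_kb / 1_000_000  # KB -> GB
storage_tb_per_year = storage_gb_per_day * 365 / 1_000             # GB -> TB
peak_to_average = peak_orders_per_sec / orders_per_sec

print(f"orders/sec (avg):        {orders_per_sec:.0f}")
print(f"status checks/sec (avg): {status_checks_per_sec:.0f}")
print(f"storage/day:             {storage_gb_per_day:.0f} GB")
print(f"storage/year:            {storage_tb_per_year:.2f} TB")
print(f"peak / average writes:   {peak_to_average:.1f}x")
```

The peak-to-average ratio (roughly 86x) is the number worth calling out in an interview: the write path has to be sized for sale events, not for the daily average.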

Order State Machine
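A minimal sketch of the state machine as a transition table. The state names come from the order lifecycle list plus the PAYMENT_FAILED status used in the saga; the allowed transitions themselves are an assumption (for example, that cancellation is only possible before shipment), not a definitive specification.

```python
# Allowed transitions (assumed): key = current state, value = valid next states.
ALLOWED_TRANSITIONS = {
    "placed": {"payment_confirmed", "payment_failed", "cancelled"},
    "payment_confirmed": {"processing", "cancelled"},
    "processing": {"shipped", "cancelled"},
    "shipped": {"out_for_delivery"},
    "out_for_delivery": {"delivered"},
    "delivered": {"returned"},
    # Terminal states have no outgoing transitions.
    "payment_failed": set(),
    "returned": set(),
    "cancelled": set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the transition table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Usage: `transition("processing", "cancelled")` returns `"cancelled"`, while `transition("shipped", "cancelled")` raises `InvalidTransition`. In a real service the same check would run inside the transaction that updates the order row and appends to the state log.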

Event Bus Design (Kafka)

Topic: order_management_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "order_management_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: order_management_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
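The consumer-side rules above (at-least-once delivery, dedup by event_id, DLQ for poison messages) can be sketched without a broker. Everything here is an in-memory stand-in: `seen_event_ids` would be a durable store and `dlq` would be the DLQ topic. The sketch assumes "3 retries" means three attempts in total; adjust `MAX_ATTEMPTS` if it means three attempts after the first.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: three attempts in total before dead-lettering


@dataclass
class Consumer:
    handler: Callable[[dict], None]
    seen_event_ids: set = field(default_factory=set)  # stand-in for a durable dedup store
    dlq: list = field(default_factory=list)           # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        # At-least-once delivery means redeliveries happen; skip known events.
        if event_id in self.seen_event_ids:
            return "duplicate"
        for _attempt in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue  # retry; a real consumer would back off between attempts
            self.seen_event_ids.add(event_id)
            return "processed"
        # Every attempt failed: park the message so the partition keeps moving.
        self.dlq.append(event)
        return "dead_lettered"
```

Recording the event_id only after the handler succeeds keeps a crashed attempt retryable; the handler itself still has to be safe to run more than once.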

Order Placement — Saga Pattern

Placing an order spans several services, so it runs as a saga rather than a two-phase commit (2PC). Step 1 creates the order, Step 2 reserves inventory, Step 3 charges payment, and Step 4 confirms. On failure, compensating transactions undo the completed steps in reverse order:

Step 1: Create order (Order Service)
  INSERT order with status = 'PLACED'
  
Step 2: Reserve inventory (Inventory Service)
  For each item: reserve stock at selected warehouse
  If ANY item out of stock --> compensate: cancel order, release other reservations
  
Step 3: Charge payment (Payment Service)
  Authorize + capture payment
  If payment fails --> compensate: release all inventory reservations, mark order PAYMENT_FAILED
  
Step 4: Confirm order (Order Service)
  Update status = 'PAYMENT_CONFIRMED'
  Publish order-confirmed event
  
Step 5 (async): Generate invoice, send confirmation email, update analytics

Saga compensation (rollback):
  If Step 3 fails (payment declined):
    - Undo Step 2: release inventory reservations
    - Undo Step 1: mark order as PAYMENT_FAILED
    - Notify user: "Payment failed. Your items are still in your cart."

  If Step 2 fails (out of stock):
    - Undo Step 1: mark order as CANCELLED
    - Notify user: "Sorry, [item] is no longer available."

Why SAGA (not distributed transaction / 2PC)?
  2PC: requires all services to hold locks simultaneously --> blocks everything if one is slow
  SAGA: each step commits independently; compensating transactions undo on failure
  At e-commerce scale, sagas are a common choice for multi-service order workflows.
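The steps and compensations above can be sketched as a small orchestrator. This is an illustration under stated assumptions, not a production implementation: the step functions are hypothetical in-process stand-ins for calls to the Order, Inventory, and Payment services, and the shared `ctx` dict stands in for persisted saga state.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all three are hypothetical stand-ins.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


class SagaFailed(Exception):
    pass


def run_saga(steps: List[Step], ctx: dict) -> None:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except Exception as exc:
            for _done_name, done_compensate in reversed(completed):
                done_compensate(ctx)
            raise SagaFailed(f"step '{name}' failed: {exc}") from exc
        completed.append((name, compensate))


def create_order(ctx): ctx["status"] = "PLACED"
def mark_order_failed(ctx): ctx["status"] = ctx.get("failure_status", "CANCELLED")
def reserve_inventory(ctx): ctx["reserved"] = True
def release_inventory(ctx): ctx["reserved"] = False

def charge_payment(ctx):
    if not ctx.get("card_ok", True):
        ctx["failure_status"] = "PAYMENT_FAILED"
        raise RuntimeError("card declined")
    ctx["charged"] = True

def refund_payment(ctx): ctx["charged"] = False
def confirm_order(ctx): ctx["status"] = "PAYMENT_CONFIRMED"
def no_compensation(ctx): pass


ORDER_SAGA: List[Step] = [
    ("create_order", create_order, mark_order_failed),
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, refund_payment),
    ("confirm_order", confirm_order, no_compensation),
]
```

Running `run_saga(ORDER_SAGA, {"card_ok": False})` raises `SagaFailed` after releasing the reservation and marking the order PAYMENT_FAILED. A real orchestrator would also persist progress after each step so that it can resume or compensate after a crash.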

Split Shipments

Example:

An order with 3 items sourced from different warehouses is automatically split:
- Item A — only available at WH-NYC
- Item B — only available at WH-LAX
- Item C — available at both warehouses, allocated to WH-NYC (proximity)

Result: 2 independent shipments:
- Shipment 1 (WH-NYC): Item A + Item C → FedEx tracking #12345
- Shipment 2 (WH-LAX): Item B → UPS tracking #67890

Each shipment has its own tracking, carrier, state machine, and EDD.
Parent order state is derived:
- At least one shipment SHIPPED while others are still pending → order = PARTIALLY_SHIPPED
- All shipments DELIVERED → order = DELIVERED
- Partial cancellation cancels one shipment independently.
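Deriving the parent state from its shipments might look like the sketch below. The shipment state names (`PENDING`, `SHIPPED`, `OUT_FOR_DELIVERY`, `DELIVERED`, `CANCELLED`) and the extra rules are assumptions made for illustration: cancelled shipments are ignored, an order whose remaining shipments have all left the warehouse is `SHIPPED`, and `PARTIALLY_SHIPPED` applies only while some shipments are still waiting.

```python
from typing import Iterable

# Shipment states that mean the parcel has left the warehouse (assumed names).
LEFT_WAREHOUSE = {"SHIPPED", "OUT_FOR_DELIVERY", "DELIVERED"}


def derive_order_state(shipment_states: Iterable[str]) -> str:
    """Compute the parent order state from its shipments' states."""
    # Partial cancellation: cancelled shipments do not hold the order back.
    active = [s for s in shipment_states if s != "CANCELLED"]
    if not active:
        return "CANCELLED"
    if all(s == "DELIVERED" for s in active):
        return "DELIVERED"
    if all(s in LEFT_WAREHOUSE for s in active):
        return "SHIPPED"
    if any(s in LEFT_WAREHOUSE for s in active):
        return "PARTIALLY_SHIPPED"
    return "PROCESSING"
```

For the example above, `derive_order_state(["SHIPPED", "PENDING"])` gives `PARTIALLY_SHIPPED` once the WH-NYC parcel leaves, and `DELIVERED` only after both shipments arrive. Because the parent state is a pure function of the shipments, it can be recomputed on every shipment event rather than updated separately.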

Idempotent Order Placement

Problem: User clicks "Place Order" and network times out. User clicks again.
  Without idempotency: two identical orders created, charged twice.

Solution:
  Client generates idempotency_key (UUID) on checkout page load.
  Both clicks send same idempotency_key.
  
  Server:
    1. Check Redis: GET idempotency:{key}
       If exists --> return cached order (already processed)
    2. Check PostgreSQL: SELECT order_id FROM orders WHERE idempotency_key = ?
       If exists --> return existing order
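    3. If neither exists --> INSERT the order; the UNIQUE constraint on idempotency_key rejects a concurrent duplicate, in which case return the existing order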
    4. After creation: SET idempotency:{key} {order_id} in Redis (TTL 24h)
    
  Result: exactly one order, regardless of retries.
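The lookup-then-create flow can be sketched with in-memory stand-ins: a plain dict plays the role of Redis and `OrderStore` plays the role of the orders table with its unique idempotency_key. Both names, and the `insert_if_absent` helper, are hypothetical. The point of the sketch is that the store, not the cache, decides the winner, so two racing requests cannot both create an order.

```python
import uuid


class OrderStore:
    """In-memory stand-in for the orders table with UNIQUE(idempotency_key)."""

    def __init__(self) -> None:
        self._order_id_by_key = {}

    def insert_if_absent(self, idempotency_key: str, order_id: str) -> str:
        # Mirrors an insert that does nothing on a key conflict, followed by
        # a read of whichever order_id is stored for that key.
        return self._order_id_by_key.setdefault(idempotency_key, order_id)


def place_order(cache: dict, store: OrderStore, idempotency_key: str) -> str:
    cache_key = f"idempotency:{idempotency_key}"
    # Fast path: the key was already processed and is still cached.
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    # Slow path: let the unique key in the store pick a single order_id.
    candidate_id = str(uuid.uuid4())
    order_id = store.insert_if_absent(idempotency_key, candidate_id)
    cache[cache_key] = order_id  # with real Redis this entry would carry a 24h TTL
    return order_id
```

Calling `place_order` twice with the same key returns the same order_id, even if the cache entry has expired in between, because the store still holds the key.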

Place Order

POST /api/v1/orders
Idempotency-Key: "order-abc-123"
{
  "items": [
    {"sku_id": "SKU-123", "quantity": 2, "price": 29.99},
    {"sku_id": "SKU-456", "quantity": 1, "price": 49.99}
  ],
  "shipping_address": { "street": "...", "city": "...", "zip": "...", "country": "US" },
  "payment_method_id": "pm-uuid",
  "delivery_preference": "standard"
}
Response: 201 Created
{
  "order_id": "order-uuid",
  "status": "placed",
  "estimated_delivery": "2026-03-18",
  "total": 109.97,
  "shipments": [
    {"shipment_id": "ship-1", "items": ["SKU-123","SKU-456"], "warehouse": "WH-NYC"}
  ]
}
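A quick check of the sample total, as a sketch. It assumes prices are handled as decimal strings rather than binary floats (the request above shows JSON numbers, so the string form is an assumption), and it ignores tax and shipping, as the sample response does. The `order_total` helper is hypothetical.

```python
from decimal import Decimal


def order_total(items: list) -> Decimal:
    """Sum quantity x unit price using exact decimal arithmetic."""
    return sum(
        (Decimal(item["price"]) * item["quantity"] for item in items),
        Decimal("0"),
    )


items = [
    {"sku_id": "SKU-123", "quantity": 2, "price": "29.99"},
    {"sku_id": "SKU-456", "quantity": 1, "price": "49.99"},
]

print(order_total(items))  # 109.97
```

In practice the server would look up unit prices itself rather than trust the price sent by the client; the client-supplied value is only useful for detecting a price change since the cart was built.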

Get Order Status

GET /api/v1/orders/{order_id}
Response: 200 OK
{
  "order_id": "order-uuid",
  "status": "shipped",
  "placed_at": "2026-03-14T10:00:00Z",
  "items": [...],
  "shipments": [
    {"shipment_id": "ship-1", "status": "in_transit", "carrier": "FedEx",
     "tracking_number": "794644790132", "estimated_delivery": "2026-03-18",
     "tracking_url": "https://fedex.com/track?id=794644790132"}
  ],
  "payment": { "method": "Visa ending 4242", "amount": 109.97, "status": "captured" }
}

Cancel Order

POST /api/v1/orders/{order_id}/cancel
{ "reason": "changed_mind" }
Response: 200 OK
{ "status": "cancelled", "refund_amount": 109.97, "refund_status": "processing" }

Initiate Return

POST /api/v1/orders/{order_id}/return
{
  "items": [{"sku_id": "SKU-123", "quantity": 1, "reason": "defective"}],
  "return_method": "mail"
}
Response: 200 OK
{
  "return_id": "ret-uuid",
  "return_label_url": "https://s3.../return-label.pdf",
  "refund_amount": 29.99,
  "refund_status": "pending_return_receipt"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
402 Payment Required: insufficient funds
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
502 Bad Gateway: payment provider timeout; poll status endpoint
503 Service Unavailable: dependency down or overloaded; use exponential backoff

PostgreSQL: Source of Truth

CREATE TABLE orders (
    order_id        UUID PRIMARY KEY,
    user_id         UUID NOT NULL,
    status          VARCHAR(30) NOT NULL DEFAULT 'placed',
    subtotal        DECIMAL(10,2),
    tax             DECIMAL(10,2),
    shipping_cost   DECIMAL(10,2),
    total           DECIMAL(10,2),
    currency        CHAR(3) DEFAULT 'USD',
    shipping_address JSONB,
    payment_method_id VARCHAR(64),
    payment_status  VARCHAR(20),
    idempotency_key VARCHAR(64) UNIQUE,
    placed_at       TIMESTAMPTZ DEFAULT NOW(),
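    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
-- PostgreSQL has no inline INDEX clause, so indexes are created separately.
CREATE INDEX idx_orders_user ON orders (user_id, placed_at DESC);
CREATE INDEX idx_orders_status ON orders (status);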

CREATE TABLE order_items (
    item_id         BIGSERIAL PRIMARY KEY,
    order_id        UUID NOT NULL REFERENCES orders(order_id),
    sku_id          VARCHAR(50) NOT NULL,
    quantity        INT NOT NULL,
    unit_price      DECIMAL(10,2),
    total_price     DECIMAL(10,2),
    shipment_id     UUID,
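    status          VARCHAR(20) DEFAULT 'active'
);
-- Index names are unique per schema in PostgreSQL, so each is prefixed with its table.
CREATE INDEX idx_order_items_order ON order_items (order_id);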

CREATE TABLE shipments (
    shipment_id     UUID PRIMARY KEY,
    order_id        UUID NOT NULL REFERENCES orders(order_id),
    warehouse_id    VARCHAR(50),
    carrier         VARCHAR(20),
    tracking_number VARCHAR(64),
    status          VARCHAR(30) DEFAULT 'pending',
    shipped_at      TIMESTAMPTZ,
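    delivered_at    TIMESTAMPTZ
);
-- Separate statements: PostgreSQL does not accept INDEX inside CREATE TABLE.
CREATE INDEX idx_shipments_order ON shipments (order_id);
CREATE INDEX idx_shipments_tracking ON shipments (tracking_number);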

CREATE TABLE order_state_log (
    log_id          BIGSERIAL PRIMARY KEY,
    order_id        UUID NOT NULL,
    from_state      VARCHAR(30),
    to_state        VARCHAR(30) NOT NULL,
    actor           VARCHAR(50),
    reason          TEXT,
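    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Supports reading one order's transition history in time order.
CREATE INDEX idx_order_state_log_order ON order_state_log (order_id, created_at);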

CREATE TABLE returns (
    return_id       UUID PRIMARY KEY,
    order_id        UUID NOT NULL,
    user_id         UUID NOT NULL,
    status          VARCHAR(20) DEFAULT 'initiated',
    refund_amount   DECIMAL(10,2),
    reason          TEXT,
    return_label_url TEXT,
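    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Separate statement: PostgreSQL does not accept INDEX inside CREATE TABLE.
CREATE INDEX idx_returns_order ON returns (order_id);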

Redis: Caching + Idempotency

# Order status cache (avoid DB hit for frequent polling)
order_status:{order_id}   --> Hash { status, tracking, updated_at }
TTL: 300 s (5 min)

# Idempotency (prevent duplicate order placement)
idempotency:{key}         --> order_id (if already processed)
TTL: 86400 s (24 h)

# User's recent orders (for quick display)
user_orders:{user_id}     --> List of order_ids (last 20)
TTL: 3600 s (1 h)
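The order status cache is a read-through (cache-aside) pattern. The sketch below uses a small in-memory `TTLCache` as a stand-in for Redis so that it runs on its own; `TTLCache`, `get_order_status`, and `load_from_db` are hypothetical names, and the injectable clock exists only to make expiry testable.

```python
import time
from typing import Callable, Optional

ORDER_STATUS_TTL_SECONDS = 300  # matches the order_status TTL above


class TTLCache:
    """In-memory stand-in for Redis keys with an expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._entries = {}
        self._clock = clock

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        self._entries[key] = (value, self._clock() + ttl_seconds)


def get_order_status(cache: TTLCache, load_from_db: Callable[[str], dict],
                     order_id: str) -> dict:
    key = f"order_status:{order_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    status = load_from_db(order_id)  # cache miss: read the source of truth
    cache.set(key, status, ORDER_STATUS_TTL_SECONDS)
    return status
```

With 50M status checks per day, most polls for a given order land inside the 5-minute window and never reach PostgreSQL. A consumer of order events could also delete the key on each state transition so users do not see a stale status for the full TTL.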

Kafka Topics

Topic: order-events       (placed, confirmed, shipped, delivered -- all state transitions)
Topic: payment-events     (payment success/failure -- consumed by order service)
Topic: shipment-events    (tracking updates from carrier webhooks)
Topic: return-events      (return initiated, received, refunded)

Concern	Solution
Duplicate order	Idempotency key (unique per checkout attempt); DB UNIQUE constraint
Payment succeeds but order update fails	Saga with compensation: payment service publishes event; order service consumes and updates; if missed, reconciliation job retries
Inventory reserved but payment fails	Compensation step releases inventory; TTL-based reservation auto-expires
State machine corruption	Valid transitions enforced in code; DB CHECK constraint restricts status to known values; order_state_log is an append-only audit trail
Carrier webhook missed	Poll carrier API every 30 min for orders in SHIPPED state; reconcile
DB failover	PostgreSQL synchronous replication; promote standby with zero data loss

Saga vs 2PC for Distributed Order Placement

2PC (Two-Phase Commit):
  Coordinator asks all services: "Can you commit?"
  All say yes --> "Commit." All commit atomically.
  
  Problems at e-commerce scale:
  - Coordinator is SPOF
  - All services hold locks during prepare phase --> latency, blocking
  - If any service is slow --> ALL are blocked
  - Poorly supported across heterogeneous systems (external PSP APIs and many datastores expose no prepare/commit protocol)

Saga (Choreography or Orchestration):
  Each step commits locally. Failures trigger compensating transactions.
  
  Choreography: each service publishes event; next service reacts
    Order created --> Inventory listens, reserves stock --> Payment listens, charges
    Loose coupling but hard to debug (no central view of saga progress)
  
  Orchestration (recommended): central orchestrator coordinates steps
    Order Service calls Inventory, then Payment, then Shipping
    On failure: orchestrator calls compensating actions in reverse order
    Easy to monitor, debug, and modify

  Trade-off:
    2PC: strong consistency, but fragile and slow at scale
    Saga: eventual consistency during saga execution, but resilient and fast
    For e-commerce: Saga is the clear winner. Brief inconsistency during
    the 2-second order placement window is acceptable.

Order Search: Why Elasticsearch

Customer service needs: "Find all orders for user X with product Y shipped in March"

PostgreSQL can do this, but:
  - Full-text search across product names, addresses is slow
  - Complex compound filters with pagination are expensive
  - At 10M orders/day, queries slow down without heavy indexing

Elasticsearch:
  - Index order data (denormalized): order_id, user, items, status, dates
  - Support full-text search + filters + aggregations
  - Target: sub-200 ms response for complex queries
  
  CDC pipeline: PostgreSQL --> Debezium --> Kafka --> ES consumer --> Elasticsearch
  Target lag: < 5 seconds from order placed to searchable in ES
  
  Use PostgreSQL for: order placement, state transitions (ACID)
  Use Elasticsearch for: search, customer service dashboard, analytics queries
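The ES consumer at the end of the CDC pipeline mostly translates row changes into index operations. The sketch below assumes a simplified Debezium-style envelope with `op` (`c` create, `u` update, `d` delete, `r` snapshot read), `before`, and `after` fields. The output format and the `to_index_action` helper are hypothetical, and a real denormalized document would also join in the order items.

```python
# Fields copied from the orders row into the search document (assumed subset).
INDEXED_FIELDS = ("order_id", "user_id", "status", "placed_at")


def to_index_action(change: dict) -> dict:
    """Translate one row-level change event into a search index operation."""
    if change["op"] == "d":
        # Deletes carry the old row in "before"; "after" is empty.
        return {"action": "delete", "id": change["before"]["order_id"]}
    row = change["after"]
    doc = {name: row.get(name) for name in INDEXED_FIELDS}
    # Using order_id as the document id makes replays overwrite, not duplicate.
    return {"action": "index", "id": row["order_id"], "doc": doc}
```

Keying documents by order_id is what makes at-least-once delivery safe here: a redelivered change event rewrites the same document with the same content.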

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
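An availability target is easier to reason about as minutes of allowed downtime. This sketch converts a target into an error budget; the 30-day window is an assumption, and `error_budget_minutes` is an illustrative helper.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


for target in (0.9995, 0.9999):
    budget = error_budget_minutes(target)
    print(f"{target:.2%} over 30 days -> {budget:.2f} minutes of budget")
```

At 99.95% the budget is about 21.6 minutes per 30 days; at the 99.99% assumed for the commit/authorize path it shrinks to about 4.3 minutes, which leaves little room for a manual database failover.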

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.