This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Classic commerce flow. Nail catalog/search separation, cart with inventory reservation TTL, order state machine, facet search, and payment orchestration with saga/compensation. |
| Arch 50 | Add flash sale hot-SKU handling (#68), multi-warehouse inventory, and eventual consistency between search index and catalog. |
| Arch 75 | Staff: global inventory split-brain prevention, idempotent payment reconciliation, and how to run Black Friday without overselling. |
Interview Prompt
Design an e-commerce platform like Amazon. Users browse a product catalog, search with filters, add items to cart, checkout, and pay. Support inventory management so overselling doesn't occur.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Catalog size and search QPS? | 100M SKUs with 50K search QPS drives Elasticsearch cluster sizing vs SQL full-text. |
| Inventory model — single warehouse or multi-warehouse? | Multi-warehouse adds reservation per location and ship-from-nearest logic at checkout. |
| Cart reservation TTL — how long hold inventory? | 15 min standard; too short = checkout anxiety; too long = inventory locked during flash sales. |
| Payment — authorize at checkout or capture at ship? | Auth-hold at checkout + capture at ship is standard; affects order state machine and compensation on cancel. |
Scope
In scope
- Product catalog and category hierarchy
- Search with facets/filters
- Shopping cart with inventory reservation
- Order state machine (create → pay → fulfill → ship)
- Payment orchestration with failure compensation
Out of scope (state explicitly)
- Recommendation engine (#48)
- Review/rating system (#70)
- Warehouse management / WMS internals
- Fraud detection (#75) — assume basic rules exist
Assumptions
- 100M SKUs, 10M DAU, 100K orders/day peak
- 50K search QPS peak, 10:1 browse-to-buy ratio
- 15-min cart reservation TTL
- Single payment provider with auth + capture API
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Product Catalog: Browse, search, filter products with details, images, reviews
- Shopping Cart: Add/remove items, persist across sessions
- Checkout & Payment: Place orders with multiple payment methods
- Order Management: Track order status (placed → shipped → delivered → returned)
- Inventory Management: Real-time stock tracking, prevent overselling
- Search: Full-text search with filters (category, price range, rating, brand)
- Recommendations: "Customers who bought this also bought...", "Frequently bought together"
- Reviews & Ratings: Write/read product reviews
- Seller Management: Sellers list products, manage inventory, view sales dashboards
- Notifications: Order confirmation, shipping updates, delivery alerts
- High Availability: 99.99%: downtime = lost revenue
- Low Latency: Search results < 200 ms, page loads < 500 ms
- Scalability: 500M+ products, 100M+ DAU, 10M+ orders/day
- Strong Consistency: Inventory and payments MUST be strongly consistent
- Eventual Consistency: Product reviews, recommendations can be eventually consistent
- Flash Sale Support: Handle 100× traffic spikes during sales events (Prime Day, Black Friday)
- Idempotent Payments: No double-charging
| Metric | Calculation | Value |
|---|---|---|
| Products | Given | 500M |
| DAU | Given | 100M |
| Search QPS | Given | 200K |
| Orders / day | Given | 10M |
| Orders / sec | 10M ÷ 86400 (+ peak factor) | ~115 (peak 10K during flash sales) |
| Product page views / day | 100M DAU × ~50 views | 5B |
| Cart operations / day | 100M DAU × ~5 ops | 500M |
| Avg order value | Given | $50 |
| Daily GMV | Given | $500M |
Product Catalog Service
- Why MongoDB/DynamoDB: Products have highly variable schemas (clothing has size/color; electronics has specs; books have ISBN). Document DB handles schema flexibility well
- Caching: Hot products in Redis (product pages accessed 1000s of times per minute)
- Product data: title, description, images (CDN URLs), price, seller, attributes, category tree
Search Service (Elasticsearch)
- Index: Product title, description, brand, category, tags
- Filters (aggregations): category facets, price range, rating, availability, brand, seller
- Ranking: BM25 text relevance × sales velocity × rating × sponsored boost
- Autocomplete: Prefix matching on product names and brands
Cart Service
- Logged-in users: Cart stored in Redis (fast, persistent with AOF): Key:
cart:{user_id}, TTL: 30 days - Guest users: Cart stored in browser localStorage; merged on login
- Price consistency: Cart stores the price at add-time; on checkout, re-validate current prices
- Stock validation: On checkout, verify items are still in stock
Inventory Service: Critical for Correctness
Solution: Pessimistic locking with database transactions
BEGIN TRANSACTION;
SELECT quantity FROM inventory WHERE product_id = ? AND seller_id = ? FOR UPDATE;
UPDATE inventory SET quantity = quantity - ? WHERE product_id = ? AND seller_id = ?;
COMMIT;For flash sales (extreme concurrency): Pre-load stock into Redis; DECR inventory:{product_id} is atomic → no race conditions.
Order Service: Saga-Based Orchestration
1. Validate Cart → Cart Service 2. Reserve Inventory → Inventory Service (lock stock) 3. Calculate Total → Pricing Service (apply coupons, tax, shipping) 4. Process Payment → Payment Service (authorize) 5. Create Order → Order Service (persist order) 6. Confirm Inventory → Inventory Service (deduct stock) 7. Notify → Notification Service (email, push) If step 4 fails → Compensate: Release Inventory (step 2) If step 5 fails → Compensate: Refund Payment + Release Inventory
Payment Service
- Integrates with payment gateways (Stripe, PayPal, RazorPay)
- Idempotency: Every payment attempt has a unique
idempotency_key→ retry-safe - Two-phase: Authorize on order placement → Capture on shipment confirmation
Recommendation Service
- Collaborative filtering (batch Spark job): Mine co-purchase patterns from order history
- Content-based: Similar products by attributes (brand, category, price range)
- Session-based: Real-time recommendations from current session
- Pre-computed recommendations stored in Redis → served in < 50ms
Async Event Pipeline (Kafka)
- Topics:
order-events,product-events,inventory-events,user-events,click-events - Consumers: ES Indexer, Analytics Pipeline (ClickHouse), Recommendation Pipeline, Fraud Detection, Notification Service, Data Lake (S3)
- Decouples services: if Notification Service is down, orders still process
Event Bus Design (Kafka)
Topic: ecommerce_platform-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "ecommerce_platform-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: ecommerce_platform-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design an E-Commerce Platform (Amazon / Flipkart): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Search Products
GET /api/v1/products/search?q=wireless+headphones&category=electronics&price_min=20&price_max=200&sort=relevance&page=1Get Product
GET /api/v1/products/{product_id}
Response: 200 OK
{
"product_id": "...",
"title": "Sony WH-1000XM5",
"price": 349.99,
"discount_price": 279.99,
"rating": 4.7,
"in_stock": true,
"attributes": {"color": "Black", "connectivity": "Bluetooth 5.2"}
}Place Order
POST /api/v1/orders
{
"shipping_address_id": "addr-uuid",
"payment_method_id": "pm-uuid",
"promo_code": "SAVE10"
}
Response: 201 Created
{
"order_id": "order-uuid",
"status": "placed",
"total": 299.99
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
MongoDB: Product Catalog
{
"_id": "product-uuid",
"title": "Sony WH-1000XM5 Headphones",
"category_path": ["Electronics", "Audio", "Headphones"],
"brand": "Sony",
"price": 349.99,
"discount_price": 279.99,
"attributes": {"color": "Black", "connectivity": "Bluetooth 5.2"},
"rating": 4.7,
"review_count": 12543
}PostgreSQL: Orders (ACID required)
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
user_id UUID NOT NULL,
status VARCHAR(20),
subtotal DECIMAL(12,2),
total DECIMAL(12,2),
payment_status VARCHAR(20),
placed_at TIMESTAMP,
INDEX idx_user (user_id, placed_at DESC)
);Redis: Cart & Flash Sale
# Cart
Key: cart:{user_id}
Value: Hash { product_id → JSON{quantity, price, seller_id} }
TTL: 2592000 (30 days)
# Flash sale inventory (atomic operations)
Key: flash:stock:{product_id}
Value: integer (remaining stock)| Concern | Solution |
|---|---|
| Overselling | Database-level CHECK (quantity >= 0) + Redis atomic DECR for flash sales |
| Order saga failure | Compensating transactions roll back in reverse order |
| Payment failure | Retry with idempotency key; if persistent → cancel order, release inventory |
| Inventory sync (Redis ↔ DB) | Kafka event on every inventory change; periodic reconciliation job |
| Search index lag | Acceptable for new products (appear within minutes) |
| Flash sale stampede | Redis handles stock check; rate limit orders per user; queue overflow to Kafka |
Flash Sale Architecture
- Pre-warming: Load limited stock (e.g., 1000 units) into Redis
- Rate limiting: Max 1 purchase per user per product
- Queue-based: Frontend shows "You're in the queue" → background processes orders sequentially
- Redis atomic: DECR flash:stock:{product_id} -> if result >= 0, proceed; else, sold out
- Async order processing: Order details queued in Kafka → order workers process
Multi-Seller / Marketplace Model
- One order can contain items from multiple sellers: each shipped independently
- Payment split: Platform takes 15-30% commission; rest goes to seller
Warehouse & Fulfillment
- Multiple warehouses per region → route order to nearest warehouse with stock
- Fulfillment service tracks: pick → pack → ship → in-transit → delivered
Price and Promotion Engine
- Dynamic pricing: Competitor monitoring, demand-based pricing
- Coupon types: flat discount, percentage, free shipping, buy-one-get-one
- Promo validation on checkout: verify eligibility, expiry, usage limits
Interview Walkthrough
- Open with catalog browse (massively read-heavy) vs checkout (write-heavy, consistency-critical) — two traffic shapes, two paths.
- Explain CQRS: write model for orders/inventory, read-optimized replicas or search index for product discovery.
- Cover inventory reservation with Distributed Lock Manager or DB row lock during checkout — overselling is the failure mode interviewers probe.
- Discuss search via Elasticsearch synced by CDC pipeline, not SQL LIKE queries on the primary DB.
- Mention cart as session-scoped Redis state with async persistence, decoupled from order placement.
- Common pitfall: single monolithic database for catalog, cart, and orders — peak sale traffic on reads kills checkout writes.
Why MongoDB/DynamoDB for Product Catalog (Not MySQL)?
Product data is inherently heterogeneous:
- A T-shirt has: size, color, fabric, fit
- A laptop has: RAM, CPU, screen_size, GPU, battery_life
- A book has: ISBN, author, publisher, pages, language
MySQL approach:
Option A: One giant table with 200+ nullable columns → sparse, ugly, slow
Option B: EAV (Entity-Attribute-Value) pattern → terrible query performance,
no type safety, complex JOINs for every attribute
Option C: JSON column → better, but indexing JSON is limited in MySQL
MongoDB / DynamoDB:
- Each product is a document with ONLY the attributes it needs
- No schema migration when adding a new product category
- Rich querying on nested attributes: db.products.find({"attributes.RAM": "16GB"})
- Secondary indexes on frequently filtered attributes
// T-shirt document
{ "product_id": "...", "title": "...", "category": "clothing",
"attributes": { "size": ["S","M","L","XL"], "color": "Blue", "fabric": "Cotton" }}
// Laptop document
{ "product_id": "...", "title": "...", "category": "electronics",
"attributes": { "RAM": "16GB", "CPU": "M3 Pro", "screen": "14-inch", "GPU": "10-core" }}
When MySQL IS needed:
- Orders (ACID transactions, relational: order → items → payments)
- Inventory (strict consistency, CHECK constraints)
- User accounts (relational: user → addresses → payment methods)
→ Use MySQL/PostgreSQL for transactional data, MongoDB for catalogInventory Management: The Hardest Problem
The overselling problem:
100 users click "Buy" simultaneously for the last item in stock
Without proper handling → 100 orders created, 99 customers disappointed
Approach 1: Database Lock (Pessimistic)
BEGIN; SELECT quantity FROM inventory WHERE product_id=? FOR UPDATE;
-- check quantity >= 1
UPDATE inventory SET quantity = quantity - 1 WHERE product_id=?;
COMMIT;
✓ Guaranteed correctness
✗ Lock contention: 100 concurrent requests serialize on one row
✗ DB connection pool exhaustion under high load
✗ Deadlocks possible with multi-item orders
Approach 2: Atomic Decrement (Optimistic) ⭐
UPDATE inventory SET quantity = quantity - 1
WHERE product_id = ? AND quantity >= 1;
-- If rows_affected = 1 → success
-- If rows_affected = 0 → out of stock
✓ No explicit lock (DB handles atomically)
✓ Higher throughput than pessimistic
✓ No deadlocks
✗ Still hits DB for every request
Approach 3: Redis Pre-Decrement (Flash Sales) ⭐⭐
Pre-load stock into Redis: SET stock:{product_id} 1000
On purchase: result = DECR stock:{product_id}
If result >= 0 → proceed with order (async write to DB)
If result < 0 → INCR stock:{product_id}; return "sold out"
✓ Sub-millisecond response
✓ Handles 100K+ concurrent requests
✓ Perfect for flash sales / limited drops
✗ Redis and DB can diverge → need reconciliation
✗ If Redis crashes before DB write → sold items "reappear"
Reconciliation:
Kafka event on every Redis DECR → async DB update
Periodic job: compare Redis stock vs DB stock → fix discrepancies
Recommended: Approach 2 for normal operations. Approach 3 for flash sales.Saga Pattern vs 2PC for Distributed Transactions
Order placement spans multiple services:
1. Inventory Service (reserve stock)
2. Payment Service (charge customer)
3. Order Service (create order)
4. Notification Service (send confirmation)
Two-Phase Commit (2PC):
Coordinator asks all participants: "Can you commit?"
All say YES → Coordinator: "COMMIT"
Any says NO → Coordinator: "ABORT"
✗ Blocking: if coordinator crashes after "prepare", ALL participants hang
✗ High latency (multiple round trips)
✗ Tight coupling between services
✗ Not practical for microservices across different databases
Saga Pattern ⭐ (Recommended):
Each step has a COMPENSATING action for rollback:
Step 1: Reserve Inventory ← Compensate: Release Inventory
Step 2: Authorize Payment ← Compensate: Void Authorization
Step 3: Create Order ← Compensate: Cancel Order
Step 4: Send Notification ← (no compensation needed)
If Step 2 fails:
→ Execute Compensate Step 1 (release inventory)
→ Return error to user
Orchestration (centralized coordinator) vs Choreography (event-driven):
Orchestration ⭐ (recommended for order flow):
Order Service is the orchestrator
Calls each service sequentially, handles failures
✓ Clear flow, easy to debug, centralized error handling
✗ Orchestrator is a single point of coordination
Choreography (event-driven):
Each service publishes events, next service reacts
✓ Loosely coupled
✗ Hard to trace the full flow, "spaghetti events"
✗ Harder to implement compensationSearch: Elasticsearch vs Database Full-Text Search
MySQL FULLTEXT index:
✓ Simple, no additional infrastructure
✗ Limited relevance scoring
✗ No faceted search (filter by brand AND price AND rating simultaneously)
✗ No fuzzy matching ("iphne" → "iPhone")
✗ No autocomplete suggestions
✗ Becomes slow with 500M+ products
Elasticsearch ⭐:
✓ BM25 relevance scoring (better than TFIDF)
✓ Faceted search: aggregations on brand, category, price range, rating
✓ Fuzzy matching, synonyms, stemming ("running shoes" matches "run shoe")
✓ Autocomplete via completion suggester
✓ Horizontally scalable (sharding across nodes)
✓ Near real-time indexing (< 1 second from DB write to searchable)
✗ Additional infrastructure cost and complexity
✗ Not a source of truth (must sync from primary DB)
Sync pattern:
Product created/updated in MongoDB → publish to Kafka →
ES consumer reads event → indexes in Elasticsearch
Lag: typically < 2 secondsCart Design: Redis vs Database vs Client-Side
Client-Side (localStorage):
✓ Zero server load
✓ Works offline
✗ Cart lost on device switch or browser clear
✗ No cross-device sync
✗ Can be manipulated (price tampering)
Best for: Guest users, lightweight apps
Database (PostgreSQL/MySQL):
✓ Persistent across sessions and devices
✓ Reliable, ACID
✗ DB hit on every cart operation (add/remove/update)
✗ Overkill for ephemeral data (most carts are abandoned)
Best for: When cart data feeds into analytics or ML
Redis ⭐ (Recommended):
✓ Sub-millisecond operations
✓ Persistent with AOF (survives restart)
✓ TTL: auto-expire abandoned carts after 30 days
✓ Hash data structure: HSET cart:{user_id} {product_id} {quantity}
✗ If Redis loses data → cart is lost (acceptable: user re-adds)
Best for: Logged-in users at scale
Hybrid (Amazon's approach):
- Guest: localStorage → merged into Redis on login
- Logged-in: Redis (primary) + periodic backup to DB
- Checkout: Re-validate all prices and stock from source of truthWhy Separate Read and Write Databases (CQRS)?
Without CQRS:
One DB handles both product browsing (reads) and order processing (writes)
Problem: Black Friday → orders spike → DB under write pressure →
product pages slow down → conversion drops
With CQRS (Command Query Responsibility Segregation):
Write Model: PostgreSQL for orders, inventory (optimized for transactions)
Read Model: Elasticsearch for search, Redis for product cache,
MongoDB read replicas for catalog browsing
Sync: Write events → Kafka → update read models asynchronously
✓ Reads and writes scale independently
✓ Each model optimized for its access pattern
✓ Write DB under heavy load doesn't affect read performance
✗ Eventual consistency between write and read models
✗ More complex architecture
The consistency gap:
Product price updated → Kafka → Elasticsearch updated (1-2 second delay)
During this gap, a user might see an old price
Mitigation: On checkout, ALWAYS re-fetch price from write DB (source of truth)Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Monolith commerce
Single Rails/Java monolith: catalog + cart + orders + inventory in one PostgreSQL. SQL LIKE search. Synchronous payment. Simple status enum on orders.
Key components: Monolith · PostgreSQL · SQL search · Sync payment
Move to next phase when: Search p99 > 2 sec at 1M SKUs; oversell incident during promotion
Phase 2 — Service split + search index
Extract Catalog, Inventory, Order, Payment services. Elasticsearch for search/facets. Redis cart + reservation TTL. Kafka for order events. Payment saga with auth/capture split.
Key components: Microservices · Elasticsearch · Redis reservations · Kafka events · Payment saga
Move to next phase when: Black Friday: 10× order volume; inventory contention causes 503s
Phase 3 — Scale + multi-warehouse
Multi-warehouse inventory with ship-from-nearest at checkout. Redis hot-SKU counters for flash sales. CockroachDB for global inventory. Temporal for durable sagas. Real-time stock sync to ES via CDC.
Key components: Multi-warehouse inventory · Redis hot counters · Temporal sagas · CDC to ES · Global DB
Move to next phase when: International expansion requires inventory split across regions
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Search p99 latency | < 200ms | Browse experience — direct revenue impact |
| Checkout success rate | 99.5% | Includes payment + reservation — lost cart = lost revenue |
| Oversell rate | < 0.01% | Trust and fulfillment cost |
| Order event processing lag | < 30 sec | Downstream fulfillment trigger |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Elasticsearch cluster yellow — facet counts wrong | Search latency 3× baseline; missing facet buckets; unassigned shards alert | Route search to replica index; disable non-critical facets; accelerate replica recovery; product pages still serve from catalog cache |
| Reservation leak — inventory counts drift to zero without sales | Available stock = 0 but no pending orders; reservation table has expired rows not released | Run release job manually; fix TTL reaper bug; reconcile Redis counters from DB; pause sales on affected SKUs |
| Payment double-charge on retry | Customer support tickets; payment provider shows duplicate auth for same order_id | Idempotency keys enforced at gateway; void duplicate auths; root cause in client retry without idempotency header |
Cost Drivers (Staff lens)
Multi-Region & DR
Catalog replicated globally (read-heavy). Inventory partitioned by warehouse region — checkout routes to regional inventory service. Orders pinned to region of ship-from warehouse. Payment in local currency via regional provider. Search index per region (locale-specific ranking).