Design an E-Commerce Platform (Amazon / Flipkart) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Classic commerce flow. Nail catalog/search separation, cart with inventory reservation TTL, order state machine, facet search, and payment orchestration with saga/compensation.
Arch 50	Add flash sale hot-SKU handling (#68), multi-warehouse inventory, and eventual consistency between search index and catalog.
Arch 75	Staff: global inventory split-brain prevention, idempotent payment reconciliation, and how to run Black Friday without overselling.

Interview Prompt

Design an e-commerce platform like Amazon. Users browse a product catalog, search with filters, add items to cart, checkout, and pay. Support inventory management so overselling doesn't occur.

Clarifying Questions (ask before designing)

Question	Why it matters
Catalog size and search QPS?	100M SKUs with 50K search QPS drives Elasticsearch cluster sizing vs SQL full-text.
Inventory model — single warehouse or multi-warehouse?	Multi-warehouse adds reservation per location and ship-from-nearest logic at checkout.
Cart reservation TTL — how long hold inventory?	15 min standard; too short = checkout anxiety; too long = inventory locked during flash sales.
Payment — authorize at checkout or capture at ship?	Auth-hold at checkout + capture at ship is standard; affects order state machine and compensation on cancel.

Scope

In scope

Product catalog and category hierarchy
Search with facets/filters
Shopping cart with inventory reservation
Order state machine (create → pay → fulfill → ship)
Payment orchestration with failure compensation

Out of scope (state explicitly)

Recommendation engine (#48)
Review/rating system (#70)
Warehouse management / WMS internals
Fraud detection (#75) — assume basic rules exist

Assumptions

100M SKUs, 10M DAU, 100K orders/day peak
50K search QPS peak, 10:1 browse-to-buy ratio
15-min cart reservation TTL
Single payment provider with auth + capture API

Product Catalog: Browse, search, filter products with details, images, reviews
Shopping Cart: Add/remove items, persist across sessions
Checkout & Payment: Place orders with multiple payment methods
Order Management: Track order status (placed → shipped → delivered → returned)
Inventory Management: Real-time stock tracking, prevent overselling
Search: Full-text search with filters (category, price range, rating, brand)
Recommendations: "Customers who bought this also bought...", "Frequently bought together"
Reviews & Ratings: Write/read product reviews
Seller Management: Sellers list products, manage inventory, view sales dashboards
Notifications: Order confirmation, shipping updates, delivery alerts

Metric	Calculation	Value
Products	Given	500M
DAU	Given	100M
Search QPS	Given	200K
Orders / day	Given	10M
Orders / sec	10M ÷ 86400 (+ peak factor)	~115 (peak 10K during flash sales)
Product page views / day	100M DAU × ~50 views	5B
Cart operations / day	100M DAU × ~5 ops	500M
Avg order value	Given	$50
Daily GMV	Given	$500M

Loading...

Product Catalog Service

Why MongoDB/DynamoDB: Products have highly variable schemas (clothing has size/color; electronics has specs; books have ISBN). Document DB handles schema flexibility well
Caching: Hot products in Redis (product pages accessed 1000s of times per minute)
Product data: title, description, images (CDN URLs), price, seller, attributes, category tree

Search Service (Elasticsearch)

Index: Product title, description, brand, category, tags
Filters (aggregations): category facets, price range, rating, availability, brand, seller
Ranking: BM25 text relevance × sales velocity × rating × sponsored boost
Autocomplete: Prefix matching on product names and brands

Cart Service

Logged-in users: Cart stored in Redis (fast, persistent with AOF): Key: cart:{user_id}, TTL: 30 days
Guest users: Cart stored in browser localStorage; merged on login
Price consistency: Cart stores the price at add-time; on checkout, re-validate current prices
Stock validation: On checkout, verify items are still in stock

Inventory Service: Critical for Correctness

Solution: Pessimistic locking with database transactions

SQL

BEGIN TRANSACTION;
  SELECT quantity FROM inventory WHERE product_id = ? AND seller_id = ? FOR UPDATE;
  UPDATE inventory SET quantity = quantity - ? WHERE product_id = ? AND seller_id = ?;
COMMIT;

For flash sales (extreme concurrency): Pre-load stock into Redis; DECR inventory:{product_id} is atomic → no race conditions.

Order Service: Saga-Based Orchestration

Loading...

1. Validate Cart       → Cart Service
2. Reserve Inventory   → Inventory Service (lock stock)
3. Calculate Total     → Pricing Service (apply coupons, tax, shipping)
4. Process Payment     → Payment Service (authorize)
5. Create Order        → Order Service (persist order)
6. Confirm Inventory   → Inventory Service (deduct stock)
7. Notify              → Notification Service (email, push)

If step 4 fails → Compensate: Release Inventory (step 2)
If step 5 fails → Compensate: Refund Payment + Release Inventory

Payment Service

Integrates with payment gateways (Stripe, PayPal, RazorPay)
Idempotency: Every payment attempt has a unique idempotency_key → retry-safe
Two-phase: Authorize on order placement → Capture on shipment confirmation

Recommendation Service

Collaborative filtering (batch Spark job): Mine co-purchase patterns from order history
Content-based: Similar products by attributes (brand, category, price range)
Session-based: Real-time recommendations from current session
Pre-computed recommendations stored in Redis → served in < 50ms

Async Event Pipeline (Kafka)

Loading...

Topics: order-events, product-events, inventory-events, user-events, click-events
Consumers: ES Indexer, Analytics Pipeline (ClickHouse), Recommendation Pipeline, Fraud Detection, Notification Service, Data Lake (S3)
Decouples services: if Notification Service is down, orders still process

Event Bus Design (Kafka)

Topic: ecommerce_platform-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ecommerce_platform-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: ecommerce_platform-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design an E-Commerce Platform (Amazon / Flipkart): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Search Products

HTTP

GET /api/v1/products/search?q=wireless+headphones&category=electronics&price_min=20&price_max=200&sort=relevance&page=1

Get Product

HTTP

GET /api/v1/products/{product_id}
Response: 200 OK
{
  "product_id": "...",
  "title": "Sony WH-1000XM5",
  "price": 349.99,
  "discount_price": 279.99,
  "rating": 4.7,
  "in_stock": true,
  "attributes": {"color": "Black", "connectivity": "Bluetooth 5.2"}
}

Place Order

HTTP

POST /api/v1/orders
{
  "shipping_address_id": "addr-uuid",
  "payment_method_id": "pm-uuid",
  "promo_code": "SAVE10"
}
Response: 201 Created
{
  "order_id": "order-uuid",
  "status": "placed",
  "total": 299.99
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

MongoDB: Product Catalog

JSON

{
  "_id": "product-uuid",
  "title": "Sony WH-1000XM5 Headphones",
  "category_path": ["Electronics", "Audio", "Headphones"],
  "brand": "Sony",
  "price": 349.99,
  "discount_price": 279.99,
  "attributes": {"color": "Black", "connectivity": "Bluetooth 5.2"},
  "rating": 4.7,
  "review_count": 12543
}

PostgreSQL: Orders (ACID required)

SQL

CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    status VARCHAR(20),
    subtotal DECIMAL(12,2),
    total DECIMAL(12,2),
    payment_status VARCHAR(20),
    placed_at TIMESTAMP,
    INDEX idx_user (user_id, placed_at DESC)
);

Redis: Cart & Flash Sale

# Cart
Key:    cart:{user_id}
Value:  Hash { product_id → JSON{quantity, price, seller_id} }
TTL:    2592000 (30 days)

# Flash sale inventory (atomic operations)
Key:    flash:stock:{product_id}
Value:  integer (remaining stock)

Concern	Solution
Overselling	Database-level CHECK (quantity >= 0) + Redis atomic DECR for flash sales
Order saga failure	Compensating transactions roll back in reverse order
Payment failure	Retry with idempotency key; if persistent → cancel order, release inventory
Inventory sync (Redis ↔ DB)	Kafka event on every inventory change; periodic reconciliation job
Search index lag	Acceptable for new products (appear within minutes)
Flash sale stampede	Redis handles stock check; rate limit orders per user; queue overflow to Kafka

Flash Sale Architecture

Pre-warming: Load limited stock (e.g., 1000 units) into Redis
Rate limiting: Max 1 purchase per user per product
Queue-based: Frontend shows "You're in the queue" → background processes orders sequentially
Redis atomic: DECR flash:stock:{product_id} -> if result >= 0, proceed; else, sold out
Async order processing: Order details queued in Kafka → order workers process

Why MongoDB/DynamoDB for Product Catalog (Not MySQL)?

Product data is inherently heterogeneous:
  - A T-shirt has: size, color, fabric, fit
  - A laptop has: RAM, CPU, screen_size, GPU, battery_life
  - A book has: ISBN, author, publisher, pages, language
  
MySQL approach:
  Option A: One giant table with 200+ nullable columns → sparse, ugly, slow
  Option B: EAV (Entity-Attribute-Value) pattern → terrible query performance, 
            no type safety, complex JOINs for every attribute
  Option C: JSON column → better, but indexing JSON is limited in MySQL
  
MongoDB / DynamoDB:
  - Each product is a document with ONLY the attributes it needs
  - No schema migration when adding a new product category
  - Rich querying on nested attributes: db.products.find({"attributes.RAM": "16GB"})
  - Secondary indexes on frequently filtered attributes

  // T-shirt document
  { "product_id": "...", "title": "...", "category": "clothing",
    "attributes": { "size": ["S","M","L","XL"], "color": "Blue", "fabric": "Cotton" }}
  
  // Laptop document  
  { "product_id": "...", "title": "...", "category": "electronics",
    "attributes": { "RAM": "16GB", "CPU": "M3 Pro", "screen": "14-inch", "GPU": "10-core" }}

When MySQL IS needed:
  - Orders (ACID transactions, relational: order → items → payments)
  - Inventory (strict consistency, CHECK constraints)
  - User accounts (relational: user → addresses → payment methods)
  → Use MySQL/PostgreSQL for transactional data, MongoDB for catalog

Inventory Management: The Hardest Problem

The overselling problem:
  100 users click "Buy" simultaneously for the last item in stock
  Without proper handling → 100 orders created, 99 customers disappointed

Approach 1: Database Lock (Pessimistic)
  BEGIN; SELECT quantity FROM inventory WHERE product_id=? FOR UPDATE;
  -- check quantity >= 1
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id=?;
  COMMIT;
  
  ✓ Guaranteed correctness
  ✗ Lock contention: 100 concurrent requests serialize on one row
  ✗ DB connection pool exhaustion under high load
  ✗ Deadlocks possible with multi-item orders

Approach 2: Atomic Decrement (Optimistic) ⭐
  UPDATE inventory SET quantity = quantity - 1 
  WHERE product_id = ? AND quantity >= 1;
  -- If rows_affected = 1 → success
  -- If rows_affected = 0 → out of stock
  
  ✓ No explicit lock (DB handles atomically)
  ✓ Higher throughput than pessimistic
  ✓ No deadlocks
  ✗ Still hits DB for every request

Approach 3: Redis Pre-Decrement (Flash Sales) ⭐⭐
  Pre-load stock into Redis: SET stock:{product_id} 1000
  On purchase: result = DECR stock:{product_id}
  If result >= 0 → proceed with order (async write to DB)
  If result < 0 → INCR stock:{product_id}; return "sold out"
  
  ✓ Sub-millisecond response
  ✓ Handles 100K+ concurrent requests
  ✓ Perfect for flash sales / limited drops
  ✗ Redis and DB can diverge → need reconciliation
  ✗ If Redis crashes before DB write → sold items "reappear"
  
  Reconciliation: 
    Kafka event on every Redis DECR → async DB update
    Periodic job: compare Redis stock vs DB stock → fix discrepancies

Recommended: Approach 2 for normal operations. Approach 3 for flash sales.

Saga Pattern vs 2PC for Distributed Transactions

Order placement spans multiple services:
  1. Inventory Service (reserve stock)
  2. Payment Service (charge customer)
  3. Order Service (create order)
  4. Notification Service (send confirmation)

Two-Phase Commit (2PC):
  Coordinator asks all participants: "Can you commit?"
  All say YES → Coordinator: "COMMIT"
  Any says NO → Coordinator: "ABORT"
  
  ✗ Blocking: if coordinator crashes after "prepare", ALL participants hang
  ✗ High latency (multiple round trips)
  ✗ Tight coupling between services
  ✗ Not practical for microservices across different databases

Saga Pattern ⭐ (Recommended):
  Each step has a COMPENSATING action for rollback:
  
  Step 1: Reserve Inventory      ← Compensate: Release Inventory
  Step 2: Authorize Payment      ← Compensate: Void Authorization  
  Step 3: Create Order           ← Compensate: Cancel Order
  Step 4: Send Notification      ← (no compensation needed)
  
  If Step 2 fails:
    → Execute Compensate Step 1 (release inventory)
    → Return error to user
  
  Orchestration (centralized coordinator) vs Choreography (event-driven):
  
  Orchestration ⭐ (recommended for order flow):
    Order Service is the orchestrator
    Calls each service sequentially, handles failures
    ✓ Clear flow, easy to debug, centralized error handling
    ✗ Orchestrator is a single point of coordination
  
  Choreography (event-driven):
    Each service publishes events, next service reacts
    ✓ Loosely coupled
    ✗ Hard to trace the full flow, "spaghetti events"
    ✗ Harder to implement compensation

Search: Elasticsearch vs Database Full-Text Search

MySQL FULLTEXT index:
  ✓ Simple, no additional infrastructure
  ✗ Limited relevance scoring
  ✗ No faceted search (filter by brand AND price AND rating simultaneously)
  ✗ No fuzzy matching ("iphne" → "iPhone")
  ✗ No autocomplete suggestions
  ✗ Becomes slow with 500M+ products

Elasticsearch ⭐:
  ✓ BM25 relevance scoring (better than TFIDF)
  ✓ Faceted search: aggregations on brand, category, price range, rating
  ✓ Fuzzy matching, synonyms, stemming ("running shoes" matches "run shoe")
  ✓ Autocomplete via completion suggester
  ✓ Horizontally scalable (sharding across nodes)
  ✓ Near real-time indexing (< 1 second from DB write to searchable)
  ✗ Additional infrastructure cost and complexity
  ✗ Not a source of truth (must sync from primary DB)
  
  Sync pattern: 
    Product created/updated in MongoDB → publish to Kafka → 
    ES consumer reads event → indexes in Elasticsearch
    Lag: typically < 2 seconds

Cart Design: Redis vs Database vs Client-Side

Client-Side (localStorage):
  ✓ Zero server load
  ✓ Works offline
  ✗ Cart lost on device switch or browser clear
  ✗ No cross-device sync
  ✗ Can be manipulated (price tampering)
  Best for: Guest users, lightweight apps

Database (PostgreSQL/MySQL):
  ✓ Persistent across sessions and devices
  ✓ Reliable, ACID
  ✗ DB hit on every cart operation (add/remove/update)
  ✗ Overkill for ephemeral data (most carts are abandoned)
  Best for: When cart data feeds into analytics or ML

Redis ⭐ (Recommended):
  ✓ Sub-millisecond operations
  ✓ Persistent with AOF (survives restart)
  ✓ TTL: auto-expire abandoned carts after 30 days
  ✓ Hash data structure: HSET cart:{user_id} {product_id} {quantity}
  ✗ If Redis loses data → cart is lost (acceptable: user re-adds)
  Best for: Logged-in users at scale

Hybrid (Amazon's approach):
  - Guest: localStorage → merged into Redis on login
  - Logged-in: Redis (primary) + periodic backup to DB
  - Checkout: Re-validate all prices and stock from source of truth

Why Separate Read and Write Databases (CQRS)?

Without CQRS:
  One DB handles both product browsing (reads) and order processing (writes)
  Problem: Black Friday → orders spike → DB under write pressure → 
           product pages slow down → conversion drops

With CQRS (Command Query Responsibility Segregation):
  Write Model: PostgreSQL for orders, inventory (optimized for transactions)
  Read Model: Elasticsearch for search, Redis for product cache, 
              MongoDB read replicas for catalog browsing
  
  Sync: Write events → Kafka → update read models asynchronously
  
  ✓ Reads and writes scale independently
  ✓ Each model optimized for its access pattern
  ✓ Write DB under heavy load doesn't affect read performance
  ✗ Eventual consistency between write and read models
  ✗ More complex architecture
  
  The consistency gap:
    Product price updated → Kafka → Elasticsearch updated (1-2 second delay)
    During this gap, a user might see an old price
    Mitigation: On checkout, ALWAYS re-fetch price from write DB (source of truth)

SLOs & Error Budgets

Metric	Target	Rationale
Search p99 latency	< 200ms	Browse experience — direct revenue impact
Checkout success rate	99.5%	Includes payment + reservation — lost cart = lost revenue
Oversell rate	< 0.01%	Trust and fulfillment cost
Order event processing lag	< 30 sec	Downstream fulfillment trigger

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Elasticsearch cluster yellow — facet counts wrong	Search latency 3× baseline; missing facet buckets; unassigned shards alert	Route search to replica index; disable non-critical facets; accelerate replica recovery; product pages still serve from catalog cache
Reservation leak — inventory counts drift to zero without sales	Available stock = 0 but no pending orders; reservation table has expired rows not released	Run release job manually; fix TTL reaper bug; reconcile Redis counters from DB; pause sales on affected SKUs
Payment double-charge on retry	Customer support tickets; payment provider shows duplicate auth for same order_id	Idempotency keys enforced at gateway; void duplicate auths; root cause in client retry without idempotency header

Cost Drivers (Staff lens)

Elasticsearch: 100M docs × 3 KB avg × 3 replicas ≈ 900 GB RAM cluster — dominant search cost
Payment fees: 100K orders × $50 AOV × 2.9% ≈ $145K/day
Kafka order events: low volume vs catalog CDC — CDC at 10K changes/sec during reindex is the spike

Multi-Region & DR

Catalog replicated globally (read-heavy). Inventory partitioned by warehouse region — checkout routes to regional inventory service. Orders pinned to region of ship-from warehouse. Payment in local currency via regional provider. Search index per region (locale-specific ranking).